The Darwin Dataset

Designing a System for National Rail’s Darwin Push Port Feed

Over the past few months, I have been designing a system for ingesting data from National Rail’s “Darwin Push Port Feed”. This feed contains live information about train movements, scheduling, and service alterations across the UK rail network. In this article, I want to walk through the final structure of the data so that new users can quickly understand and work with the dataset.

Overview of the Dataset

The dataset contains two separate streams of data, stored in S3 in Parquet format: one encodes information about services, the other about locations. While the data received from the Darwin feed is heavily nested JSON, the aim of this dataset was to split entities up as much as possible so they are easy to query via SQL. On ingestion into the system, each message is parsed and split into rows across both tables, as the sketch below illustrates.
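To make the one-to-many split concrete, here is a sketch query (using the services_v1 and locations_v1 tables described below; the date values are illustrative) that counts how many location rows each ingested message produced:

-- Count the location rows produced by each ingested message.
-- Joining on the partition columns keeps the scan to one day's data.
SELECT s.update_id, COUNT(l.location_id) AS n_locations
FROM services_v1 s
LEFT JOIN locations_v1 l
  ON l.service_update_id = s.update_id
  AND l.year = s.year
  AND l.month = s.month
  AND l.day = s.day
WHERE s.year = '2024'
  AND s.month = '12'
  AND s.day = '28'
GROUP BY s.update_id;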


The Services Table

The services table encodes information about the service itself: identifiers, timestamps, and other metadata. For each message received, a new row is written to the services table, capturing the metadata carried by that message. Each row is therefore best read as a snapshot of a service’s metadata at a given moment.

Note: Different message types carry different metadata, so some columns are only populated for certain message types (see the Message Type column below).

Schema for the Services Table

| Column Name | Datatype | Message Type | Comment |
| --- | --- | --- | --- |
| rid | string | All messages | The unique identifier for a service, given as YYYYMMDD{uid} |
| uid | string | All messages | A unique ID for a service on a given day |
| ts | string | All messages | The timestamp of the message on ingestion |
| passenger | boolean | Schedule only | Denotes whether a service is a passenger service |
| update_id | string | All messages | A UUID uniquely identifying this message |
| toc | string | Schedule only | The identifier for the train operating company |
| train_id | string | Schedule only | The headcode of a given service |
| year | string | All messages | The year the message was ingested (partition column) |
| month | string | All messages | The month the message was ingested (partition column) |
| day | string | All messages | The day the message was ingested (partition column) |
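Because each row is a snapshot rather than a canonical record, consumers will typically want to deduplicate. As a minimal sketch, assuming ts sorts lexicographically as a timestamp string, the most recent snapshot per service on a given day can be recovered like this:

-- Most recent snapshot per service on one day.
-- Assumes ts sorts correctly as a timestamp string.
SELECT rid, uid, ts, update_id
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY rid ORDER BY ts DESC) AS rn
  FROM services_v1
  WHERE year = '2024' AND month = '12' AND day = '28'
) AS snapshots
WHERE rn = 1;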

The Locations Table

The locations table contains one row for each stop a service is scheduled to make, estimated to make, or actually makes over the course of its journey. These rows are joined back to services via service_update_id, which corresponds to update_id in the services table.

Schema for the Locations Table

| Column Name | Datatype | Comment |
| --- | --- | --- |
| tpl | string | The TIPLOC (timing point location) code for the stop |
| type | string | Whether the row is an arrival (ARR) or departure (DEP) |
| time_type | string | Whether the timing is scheduled (SCHED), estimated (EST), or actual (ACT) |
| time | string | The time at the location, encoded as HH:MM:SS |
| location_id | string | A UUID uniquely identifying this location record |
| service_update_id | string | The service update this row came from (foreign key to update_id in the services table) |
| year | string | The year the message was ingested (partition column) |
| month | string | The month the message was ingested (partition column) |
| day | string | The day the message was ingested (partition column) |
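A single calling point can therefore appear several times for one service, once per time_type. As an illustration, this hypothetical query pulls only the actual recorded arrivals at one location ('PADTON' is an example TIPLOC, not a value taken from the dataset):

-- Actual recorded arrivals at one location on a given day.
-- 'PADTON' is an illustrative TIPLOC value.
SELECT service_update_id, time
FROM locations_v1
WHERE year = '2024'
  AND month = '12'
  AND day = '28'
  AND tpl = 'PADTON'
  AND type = 'ARR'
  AND time_type = 'ACT'
ORDER BY time;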

Query Examples

Retrieve All Service IDs for a Given Day

SELECT rid, uid
FROM services_v1
WHERE year = '2024'
  AND month = '12'
  AND day = '28'
GROUP BY rid, uid;

Retrieve All Locations for a Specific Service ID

SELECT tpl, type, time_type, time
FROM services_v1
LEFT JOIN locations_v1
  ON services_v1.update_id = locations_v1.service_update_id
  AND locations_v1.year = services_v1.year
  AND locations_v1.month = services_v1.month
  AND locations_v1.day = services_v1.day
WHERE services_v1.year = '2024'
  AND services_v1.month = '12'
  AND services_v1.day = '28'
  AND services_v1.rid = '202412288013192'
ORDER BY time;

Appendix: Data Indexing and Partitioning

To run these queries, I have indexed the data via AWS Glue using standard Glue crawlers. This created the two tables described above. You can do the same!
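If you would rather not run a crawler, the table definition it produces looks roughly like the following Athena DDL. This is only a sketch: the S3 location is a placeholder, and a crawler may infer slightly different settings.

-- Sketch of the services table DDL; the S3 path is a placeholder.
CREATE EXTERNAL TABLE services_v1 (
  rid string,
  uid string,
  ts string,
  passenger boolean,
  update_id string,
  toc string,
  train_id string
)
PARTITIONED BY (year string, month string, day string)
STORED AS PARQUET
LOCATION 's3://your-bucket/darwin/services/';

After creating the table, running MSCK REPAIR TABLE services_v1; will register the existing Hive-style partitions so Athena can see them.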

Tip: The data is partitioned in Hive style using year, month, and day. Using these partitions in your queries will drastically reduce query costs.