9. Storing Feed Data in a Database

This chapter will demonstrate how to save data from a GTFS-realtime feed into an SQLite database. In order to look up trips, stops, routes and other data from the GTFS-realtime feed, this SQLite database will also contain the data from the corresponding GTFS feed.

To do this, the GtfsToSql and GtfsRealTimeToSql tools which have been written for the purpose of this book and the preceding book, The Definitive Guide to GTFS (http://gtfsbook.com), will be used.

GTFS and GTFS-realtime feeds from the MBTA in Boston (https://openmobilitydata.org/p/mbta) will also be used.

Storing GTFS Data in an SQLite Database

The Definitive Guide to GTFS demonstrated how to populate an SQLite database using GtfsToSql. Here is an abbreviated version of the steps required to do so.

First, download GtfsToSql from https://github.com/OpenMobilityData/GtfsToSql. This is a Java command-line application to import a GTFS feed into an SQLite database.

The pre-compiled GtfsToSql Java archive can be downloaded from its GitHub repository at https://github.com/OpenMobilityData/GtfsToSql/tree/master/dist.

Next, download the MBTA GTFS feed, available from http://www.mbta.com/uploadedfiles/MBTA_GTFS.zip.

$ curl http://www.mbta.com/uploadedfiles/MBTA_GTFS.zip -o gtfs.zip

$ unzip gtfs.zip -d mbta/

To create an SQLite database from this feed, the following command can be used:

$ java -jar GtfsToSql.jar -s jdbc:sqlite:./db.sqlite -g ./mbta -o

Note: The -o flag enables the recording of additional useful data in the database. For instance, each entry in the trips table will contain the departure and arrival time (which would otherwise only be available by looking up the stop_times table).

This may take a minute or two to complete (you will see progress as it imports the feed and then creates indexes), and at the end you will have a GTFS database in a file called db.sqlite. You can then query this database with the command-line sqlite3 tool, as shown in the following example:

$ sqlite3 db.sqlite

sqlite> SELECT agency_id, agency_name, agency_url, agency_lang FROM agency;

2|Massport|http://www.massport.com|EN

1|MBTA|[http://www.mbta.com](http://www.mbta.com/)|EN

Note: For more information about storing and querying GTFS data, please refer to The Definitive Guide to GTFS, available from http://gtfsbook.com.

Storing GTFS-realtime Data in an SQLite Database

Once the GTFS data has been imported into the SQLite database, you can then start importing GTFS-realtime data. For this you can use the GtfsRealTimeToSql tool specifically written to import data into an SQLite database.

Note: Technically you do not need the GTFS data present in the database as well, but having it makes it far simpler to resolve trip, route and stop data.

The GTFS data will change infrequently (perhaps every few weeks or months), while the realtime data can change several times per minute. GtfsRealTimeToSql will frequently update a feed, according to the refresh time specified. Each time it updates, the data saved on the previous iteration is deleted, as that data is no longer the most up-to-date data.

To get started, download GtfsRealTimeToSql from its GitHub repository at https://github.com/OpenMobilityData/GtfsRealTimeToSql. Just like GtfsToSql, this is a Java command-line application. The pre-compiled GtfsRealTimeToSql Java archive can be downloaded from https://github.com/OpenMobilityData/GtfsRealTimeToSql/tree/master/dist.

Since the db.sqlite database contains the MBTA GTFS feed, you can now run GtfsRealTimeToSql with one of MBTA's GTFS-realtime feeds. For instance, their vehicle positions feed is located at http://developer.mbta.com/lib/gtrtfs/Vehicles.pb.

java -jar GtfsRealTimeToSql.jar \
  -u "http://developer.mbta.com/lib/gtrtfs/Vehicles.pb" \
  -s jdbc:sqlite:./db.sqlite \
  -refresh 15

When you run this command, the vehicle positions feed will be retrieved every 15 seconds (specified by the -refresh parameter), and the data will be saved into the db.sqlite SQLite database.

MBTA also have a trip updates feed and a service alerts feed. In order to load each of these feeds you will need to run GtfsRealTimeToSql three separate times. To allow it to run in the background, add the -d parameter (to make it run as a server daemon), and background it using &.

To load the vehicle positions in the background, stop the previous command, then run the following command instead.

java -jar GtfsRealTimeToSql.jar \
  -u "http://developer.mbta.com/lib/gtrtfs/Vehicles.pb" \
  -s jdbc:sqlite:./db.sqlite \
  -refresh 15 \
  -d &

Now you can also load the trip updates into the same db.sqlite file. MBTA's trip updates feed is located at http://developer.mbta.com/lib/gtrtfs/Passages.pb. The following command reloads the trip updates every 30 seconds:

java -jar GtfsRealTimeToSql.jar \
  -u "http://developer.mbta.com/lib/gtrtfs/Passages.pb" \
  -s jdbc:sqlite:./db.sqlite \
  -refresh 30 \
  -d &

Note: You may prefer more frequent updates (such as 15 seconds), or even less frequent (such as 60 seconds). The more frequently you update, the greater your server utilization will be.

Finally, to load the service alerts feed, use the same command, but now with the feed located at http://developer.mbta.com/lib/GTRTFS/Alerts/Alerts.pb. Generally, service alerts are updated infrequently by providers, so you can use a refresh time such as five or ten minutes (300 or 600 seconds).

java -jar GtfsRealTimeToSql.jar \
  -u "http://developer.mbta.com/lib/gtrtfs/Alerts/Alerts.pb" \
  -s jdbc:sqlite:./db.sqlite \
  -refresh 300 \
  -d &

These three feeds will continue to be reloaded until you terminate their respective processes.

Querying Vehicle Positions

When using GtfsRealTimeToSql, vehicle positions are stored in a table called gtfs_rt_vehicles. If you want to retrieve positions for, say, the route with an ID of 742, you can run the following query:

$ sqlite3 db.sqlite

sqlite> SELECT route_id, trip_id, trip_date, trip_time, trip_sr, latitude, longitude
  FROM gtfs_rt_vehicles
  WHERE route_id = 742;

This query returns the trip descriptor and GPS coordinates for all vehicles on the given route. The following table shows sample results from this query.

route_idtrip_idtrip_datetrip_timetrip_srlatitudelongitude
7422590086020150315042.347309-71.040359
74242.331505-71.065590

Note: The trip_sr column corresponds to the trip's schedule relationship value.

This particular snapshot of vehicle position data shows two different trips for route 742, which corresponds to the SL2 Silver Line.

sqlite> SELECT route_short_name, route_long_name FROM routes WHERE
route_id = 742;

SL2|Silver Line SL2

The first five columns retrieved are the trip identifier fields, which are used to link a vehicle position with a trip from the GTFS feed. The first row is simple: the trip_id value can be used to look up the row from the trips table.

sqlite> SELECT service_id, trip_headsign, direction_id, block_id

FROM trips

WHERE trip_id = '25900860';

The results from this query are as follows.

service_idtrip_headsigndirection_idblock_id
BUSS12015-hbs15017-Sunday-02South Station1S742-61

Note: To see all fields available in this or any other table (including gtfs_rt_vehicles), use the .schema command in SQLite. For instance, enter the command .schema gtfs_rt_vehicles.

One helpful thing MBTA does is to include the trip starting date when they include the trip ID. This helps to disambiguate trips that may start or finish around midnight or later.

The second vehicle position is another matter. This record does not include any trip descriptor information other than the route ID. This means you cannot reliably link this vehicle position with a specific trip from the GTFS feed.

However, knowing the route ID and the vehicle's coordinates may be enough to present useful information to the user: "A bus on the Silver Line SL2 route is at this location." You cannot show them if the vehicle is running late, early or on-time, but you can show the user where the vehicle is.

Querying Trip Updates

The GtfsRealTimeToSql tool stores trip update data in two tables: one for the main trip update data (such as the trip descriptor), and another to store each individual stop time event that belongs within each update.

For example, to retrieve all trip updates for the route with ID 742 once again, you could use the following query:

SELECT route_id, trip_id, trip_date, trip_time, trip_sr, vehicle_id, vehicle_label
  FROM gtfs_rt_trip_updates
  WHERE route_id = '742';

This query will return data similar to the following table:

route_idtrip_idtrip_datetrip_timetrip_srvehicle_idvehicle_label
74225900860201503150y11061106
74225900856201503150
74225900858201503150

Each of these returned records has corresponding records in the gtfs_rt_trip_updates_stoptimes table that includes the arrival/departure estimates for various stops on the trip.

When using GtfsRealTimeToSql, an extra column called update_id is included on both tables so they can be linked together. For example, to retrieve all updates for the route ID 742, you can join the tables together as follows:

SELECT trip_id, arrival_time, arrival_delay, departure_time, departure_delay, stop_id, stop_sequence
  FROM gtfs_rt_trip_updates_stoptimes
  JOIN gtfs_rt_trip_updates USING (update_id)
  WHERE route_id = '742';

MBTA's trip updates feed only includes a stop time prediction for the next stop. This means that each trip in the results only has one corresponding record. You must then apply the delay to subsequent stop times for the given trip.

trip_idarrival_timearrival_delaydeparture_timedeparture_delaystop_idstop_sequence
25900860307461710
259008560746111
259008580312551

There are several things to note in these results. Firstly, MBTA do not provide a timestamp for arrival_time and departure_time; they only provide the delay offsets that can be compared to the scheduled time in the GTFS feed.

Secondly, the second and third trips listed have not yet commenced. You can deduce this just by looking at this table, since the next stop has stop sequence of 1 (and therefore there are no stops before it).

The first trip is delayed by 30 seconds. In other words, it will arrive 30 seconds later than it was scheduled. The data received from the GTFS-realtime feed does not actually indicate the arrival timestamp, but you can look up the corresponding stop time from the GTFS feed to determine this.

The following query demonstrates how to look up the arrival time:

SELECT s.stop_name, st.arrival_time
  FROM stop_times st, trips t, stops s
  WHERE st.trip_index = t.trip_index
  AND st.stop_index = s.stop_index
  AND s.stop_id = '74617'
  AND t.trip_id = '25900860'
  AND st.stop_sequence = 10;

Note: The GtfsToSql tool used to import the GTFS feed adds fields such as trip_index and stop_index in order to speed up data searching. There is more discussion on the rationale of this in The Definitive Guide to GTFS, available from http://gtfsbook.com.

This query joins the stop_times table to both the trips and stops table in order to look up the corresponding arrival time. The final three rows in the query contain the values returned above (the stop_id, trip_id and the stop_sequence), to find the following record:

South Station Silver Line - Inbound|21:17:00

This means the scheduled arrival time for South Station Silver Line is 9:17:00 PM. Since the estimate indicates a delay of 30 seconds, the new arrival time is 9:17:30 PM.

Note: Conversely, if the delay was -30 instead of 30, the vehicle would be early and arrive at 9:16:30 PM.

One thing not touched upon in this example is the stop time's schedule relationship. The gtfs_rt_trip_updates_stoptimes also includes a column called rship, which is used to indicate the schedule relationship for the given stop. If this value was skipped (a value of 1), then it means the vehicle will stop here.

Determining Predictions Using Blocks

In GTFS, the block_id value in trips.txt is used to indicate a series of one or more trips undertaken by a single vehicle. In other words, once it gets to the end of one trip, it starts a new trip from that location. If a given trip is very short, a vehicle may perform up to fifty or one hundred trips in a single day.

This can be useful for determining estimates for future trips that may not be included in the GTFS feed. For instance, if a trip is running 30 minutes late, then it is highly likely that subsequent trips on that block will also be running late.

Referring back to the trip data returned in the above example, the following data can be retrieved from the GTFS feed:

SELECT trip_id, block_id, service_id, departure_time, arrival_time
  FROM trips
  WHERE trip_id IN ('25900860', '25900856', '25900858')
  ORDER BY departure_time;

This returns the following data.

trip_idblock_idservice_iddeparture_timearrival_time
25900860S742-61BUSS12015-hbs15017-Sunday-0221:04:0021:17:00
25900856S742-61BUSS12015-hbs15017-Sunday-0221:18:0021:28:00
25900858S742-61BUSS12015-hbs15017-Sunday-0221:35:0021:48:00

Each of these trips have the same values for block_id and service_id, meaning the same vehicle will complete all three trips. The data indicates that the first trip is running 30 seconds late, so its arrival time will be 21:17:30.

Because the second trip is scheduled to depart at 21:18:00, there is a buffer time to catch up (in other words, the first trip is only 30 seconds late, so hopefully it will not impact the second trip).

If, for instance, the second trip ran 10 minutes late (and therefore arrived at 21:38:00 instead of 21:28:00), you could reasonably assume the third trip would begin at about 21:38:00 instead of 21:35:00 (about three minutes late).

Querying Service Alerts

When using GtfsRealTimeToSql to store service alerts, data is stored in three tables. The main service alert information is stored in gtfs_rt_alerts. Since there can be any number of time ranges or affected entities for any given alert, there is a table to hold time ranges (gtfs_rt_alerts_timeranges) and one to hold affected entities (gtfs_rt_alerts_entities).

In order to link this data together, each alert has an alert_id value (created by GtfsRealTimeToSql) which is also present for any corresponding time range and affected entities records.

To retrieve service alerts from the database, you can use the following query:

SELECT alert_id, header, description, cause, effect FROM gtfs_rt_alerts;

At the time of writing, this yields the following results. The description has been shortened as they are quite long and descriptive in the original feed.

alert_idheaderdescriptioncauseeffect
6Route 11 detourRoute 11 outbound detoured due to ...14
11Ruggles elevator unavailable... Commuter Rail platform to the lobby is unavailable ...97
33Extra Franklin Line service15

**Note: **A cause value of 1 corresponds to UNKNOWN_CAUSE, while 9 corresponds to MAINTENANCE. An effect value of 4 corresponds to DETOUR, 7 corresponds to OTHER_EFFECT, while 5 corresponds to ADDITIONAL_SERVICE.

To determine the timing of these alerts, look up the gtfs_rt_alerts_timeranges for the given alerts.

$ export TZ=America/New_York

$ sqlite3 db.sqlite

sqlite> SELECT alert_id, datetime(start, 'unixepoch', 'localtime'), datetime(finish, 'unixepoch', 'localtime')
  FROM gtfs_rt_alerts_timeranges
  WHERE alert_id IN (6, 11, 33);

Note: In this example, the system timezone has been temporarily changed to America/New_York (the timezone specified in MBTA's agency.txt file) so the timestamps are formatted correctly. You may prefer instead to format the start and finish timestamps using your programming language of choice.

alert_idstart (formatted)finish (formatted)
62015-02-17 15:56:52
112015-03-18 06:00:002015-03-18 16:00:00
112015-03-19 06:00:002015-03-19 16:00:00
332015-03-16 10:58:382015-03-18 02:30:00

These dates indicate the following:

  • The alert with an ID of 6 began on February 17 and has no specified end date.
  • The alert with an ID of 11 will occur over two days between 6 AM and 4 PM. *The alert with an ID of 33 will last for almost two days.

Finally, to determine which entities (routes, trips, stops) are affected by these alerts, query the gtfs_rt_alerts_entities table.

SELECT alert_id, agency_id, route_id, route_type, stop_id, trip_id, trip_start_date, trip_start_time, trip_rship
  FROM gtfs_rt_alerts_entities WHERE alert_id IN (6, 11, 33);

This results in the following data, describing only an entity for the final service alert.

alert_idagency_idroute_idroute_typestop_idtrip_idtrip_start_datetrip_start_timetrip_rship
61CR-Franklin2

Using the route_id value, you can find the route from the GTFS feed:

SELECT agency_id, route_type, route_long_name, route_desc
  FROM routes WHERE route_id = 'CR-Franklin';

This query will result in the following data:

agency_idroute_typeroute_long_nameroute_desc
12Franklin LineCommuter Rail

In effect, this means that if you have a web site or app displaying the schedule for the Franklin line, this alert should be displayed so the people who use the line are aware of the change.

Note: Even though the agency_id and route_type values are included, you do not really need these since the route_id can only refer to one row in the GTFS feed. If instead you wanted to refer to ALL rail lines, the alert would have the route_id value blank but keep the route_type value.