This chapter will demonstrate how to save data from a GTFS-realtime feed into an SQLite database. In order to look up trips, stops, routes and other data from the GTFS-realtime feed, this SQLite database will also contain the data from the corresponding GTFS feed.
To do this, the
GtfsRealTimeToSql tools which have
been written for the purpose of this book and the preceding book, The
Definitive Guide to GTFS (http://gtfsbook.com),
will be used.
GTFS and GTFS-realtime feeds from the MBTA in Boston (https://openmobilitydata.org/p/mbta) will also be used.
The Definitive Guide to GTFS demonstrated how to populate an SQLite
GtfsToSql. Here is an abbreviated version of the
steps required to do so.
https://github.com/OpenMobilityData/GtfsToSql. This is a Java command-line
application to import a GTFS feed into an SQLite database.
GtfsToSql Java archive can be downloaded from its
GitHub repository at
Next, download the MBTA GTFS feed, available from http://www.mbta.com/uploadedfiles/MBTA_GTFS.zip.
$ curl http://www.mbta.com/uploadedfiles/MBTA_GTFS.zip -o gtfs.zip $ unzip gtfs.zip -d mbta/
To create an SQLite database from this feed, the following command can be used:
$ java -jar GtfsToSql.jar -s jdbc:sqlite:./db.sqlite -g ./mbta -o
Note: The -o flag enables the recording of additional useful data in
the database. For instance, each entry in the trips table will contain
the departure and arrival time (which would otherwise only be available
by looking up the
This may take a minute or two to complete (you will see progress as it
imports the feed and then creates indexes), and at the end you will have
a GTFS database in a file called
db.sqlite. You can then query this
database with the command-line
sqlite3 tool, as shown in the
$ sqlite3 db.sqlite sqlite> SELECT agency_id, agency_name, agency_url, agency_lang FROM agency; 2|Massport|http://www.massport.com|EN 1|MBTA|[http://www.mbta.com](http://www.mbta.com/)|EN
Note: For more information about storing and querying GTFS data, please refer to The Definitive Guide to GTFS, available from http://gtfsbook.com.
Once the GTFS data has been imported into the SQLite database, you can
then start importing GTFS-realtime data. For this you can use the
GtfsRealTimeToSql tool specifically written to import data into an
Note: Technically you do not need the GTFS data present in the database as well, but having it makes it far simpler to resolve trip, route and stop data.
The GTFS data will change infrequently (perhaps every few weeks or
months), while the realtime data can change several times per minute.
GtfsRealTimeToSql will frequently update a feed, according to the
refresh time specified. Each time it updates, the data saved on the
previous iteration is deleted, as that data is no longer the most
To get started, download
GtfsRealTimeToSql from its GitHub
repository at https://github.com/OpenMobilityData/GtfsRealTimeToSql. Just
GtfsToSql, this is a Java command-line application. The
GtfsRealTimeToSql Java archive can be downloaded from
db.sqlite database contains the MBTA GTFS feed, you can
GtfsRealTimeToSql with one of MBTA's GTFS-realtime feeds.
For instance, their vehicle positions feed is located at
java -jar GtfsRealTimeToSql.jar \ -u "http://developer.mbta.com/lib/gtrtfs/Vehicles.pb" \ -s jdbc:sqlite:./db.sqlite \ -refresh 15
When you run this command, the vehicle positions feed will be retrieved
every 15 seconds (specified by the
-refresh parameter), and the data
will be saved into the
db.sqlite SQLite database.
MBTA also have a trip updates feed and a service alerts feed. In order
to load each of these feeds you will need to run
three separate times. To allow it to run in the background, add the
-d parameter (to make it run as a server daemon), and background it
To load the vehicle positions in the background, stop the previous command, then run the following command instead.
java -jar GtfsRealTimeToSql.jar \ -u "http://developer.mbta.com/lib/gtrtfs/Vehicles.pb" \ -s jdbc:sqlite:./db.sqlite \ -refresh 15 \ -d &
Now you can also load the trip updates into the same
MBTA's trip updates feed is located at
http://developer.mbta.com/lib/gtrtfs/Passages.pb. The following
command reloads the trip updates every 30 seconds:
java -jar GtfsRealTimeToSql.jar \ -u "http://developer.mbta.com/lib/gtrtfs/Passages.pb" \ -s jdbc:sqlite:./db.sqlite \ -refresh 30 \ -d &
Note: You may prefer more frequent updates (such as 15 seconds), or even less frequent (such as 60 seconds). The more frequently you update, the greater your server utilization will be.
Finally, to load the service alerts feed, use the same command, but now with the feed located at http://developer.mbta.com/lib/GTRTFS/Alerts/Alerts.pb. Generally, service alerts are updated infrequently by providers, so you can use a refresh time such as five or ten minutes (300 or 600 seconds).
java -jar GtfsRealTimeToSql.jar \ -u "http://developer.mbta.com/lib/gtrtfs/Alerts/Alerts.pb" \ -s jdbc:sqlite:./db.sqlite \ -refresh 300 \ -d &
These three feeds will continue to be reloaded until you terminate their respective processes.
GtfsRealTimeToSql, vehicle positions are stored in a
gtfs_rt_vehicles. If you want to retrieve positions
for, say, the route with an ID of 742, you can run the following query:
$ sqlite3 db.sqlite sqlite> SELECT route_id, trip_id, trip_date, trip_time, trip_sr, latitude, longitude FROM gtfs_rt_vehicles WHERE route_id = 742;
This query returns the trip descriptor and GPS coordinates for all vehicles on the given route. The following table shows sample results from this query.
Note: The trip_sr column corresponds to the trip's schedule relationship value.
This particular snapshot of vehicle position data shows two different trips for route 742, which corresponds to the SL2 Silver Line.
sqlite> SELECT route_short_name, route_long_name FROM routes WHERE route_id = 742; SL2|Silver Line SL2
The first five columns retrieved are the trip identifier fields, which
are used to link a vehicle position with a trip from the GTFS feed. The
first row is simple: the
trip_id value can be used to look up the
row from the
sqlite> SELECT service_id, trip_headsign, direction_id, block_id FROM trips WHERE trip_id = '25900860';
The results from this query are as follows.
Note: To see all fields available in this or any other table
gtfs_rt_vehicles), use the .schema command in SQLite. For
instance, enter the command
One helpful thing MBTA does is to include the trip starting date when they include the trip ID. This helps to disambiguate trips that may start or finish around midnight or later.
The second vehicle position is another matter. This record does not include any trip descriptor information other than the route ID. This means you cannot reliably link this vehicle position with a specific trip from the GTFS feed.
However, knowing the route ID and the vehicle's coordinates may be enough to present useful information to the user: "A bus on the Silver Line SL2 route is at this location." You cannot show them if the vehicle is running late, early or on-time, but you can show the user where the vehicle is.
GtfsRealTimeToSql tool stores trip update data in two tables:
one for the main trip update data (such as the trip descriptor), and
another to store each individual stop time event that belongs within
For example, to retrieve all trip updates for the route with ID 742 once again, you could use the following query:
SELECT route_id, trip_id, trip_date, trip_time, trip_sr, vehicle_id, vehicle_label FROM gtfs_rt_trip_updates WHERE route_id = '742';
This query will return data similar to the following table:
Each of these returned records has corresponding records in the
gtfs_rt_trip_updates_stoptimes table that includes the
arrival/departure estimates for various stops on the trip.
GtfsRealTimeToSql, an extra column called
is included on both tables so they can be linked together. For example,
to retrieve all updates for the route ID 742, you can join the tables
together as follows:
SELECT trip_id, arrival_time, arrival_delay, departure_time, departure_delay, stop_id, stop_sequence FROM gtfs_rt_trip_updates_stoptimes JOIN gtfs_rt_trip_updates USING (update_id) WHERE route_id = '742';
MBTA's trip updates feed only includes a stop time prediction for the next stop. This means that each trip in the results only has one corresponding record. You must then apply the delay to subsequent stop times for the given trip.
There are several things to note in these results. Firstly, MBTA do not
provide a timestamp for
only provide the delay offsets that can be compared to the scheduled
time in the GTFS feed.
Secondly, the second and third trips listed have not yet commenced. You
can deduce this just by looking at this table, since the next stop has
stop sequence of
1 (and therefore there are no stops before it).
The first trip is delayed by 30 seconds. In other words, it will arrive 30 seconds later than it was scheduled. The data received from the GTFS-realtime feed does not actually indicate the arrival timestamp, but you can look up the corresponding stop time from the GTFS feed to determine this.
The following query demonstrates how to look up the arrival time:
SELECT s.stop_name, st.arrival_time FROM stop_times st, trips t, stops s WHERE st.trip_index = t.trip_index AND st.stop_index = s.stop_index AND s.stop_id = '74617' AND t.trip_id = '25900860' AND st.stop_sequence = 10;
GtfsToSql tool used to import the GTFS feed adds fields
such as trip_index and stop_index in order to speed up data searching.
There is more discussion on the rationale of this in The Definitive
Guide to GTFS, available from
This query joins the
stop_times table to both the
stops table in order to look up the corresponding arrival time. The
final three rows in the query contain the values returned above (the
trip_id and the
stop_sequence), to find the
South Station Silver Line - Inbound|21:17:00
This means the scheduled arrival time for South Station Silver Line is 9:17:00 PM. Since the estimate indicates a delay of 30 seconds, the new arrival time is 9:17:30 PM.
Note: Conversely, if the delay was -30 instead of 30, the vehicle would be early and arrive at 9:16:30 PM.
One thing not touched upon in this example is the stop time's schedule
gtfs_rt_trip_updates_stoptimes also includes a
rship, which is used to indicate the schedule
relationship for the given stop. If this value was skipped (a value of
1), then it means the vehicle will stop here.
In GTFS, the
block_id value in trips.txt is used to indicate a
series of one or more trips undertaken by a single vehicle. In other
words, once it gets to the end of one trip, it starts a new trip from
that location. If a given trip is very short, a vehicle may perform up
to fifty or one hundred trips in a single day.
This can be useful for determining estimates for future trips that may not be included in the GTFS feed. For instance, if a trip is running 30 minutes late, then it is highly likely that subsequent trips on that block will also be running late.
Referring back to the trip data returned in the above example, the following data can be retrieved from the GTFS feed:
SELECT trip_id, block_id, service_id, departure_time, arrival_time FROM trips WHERE trip_id IN ('25900860', '25900856', '25900858') ORDER BY departure_time;
This returns the following data.
Each of these trips have the same values for
service_id, meaning the same vehicle will complete all three trips.
The data indicates that the first trip is running 30 seconds late, so
its arrival time will be 21:17:30.
Because the second trip is scheduled to depart at 21:18:00, there is a buffer time to catch up (in other words, the first trip is only 30 seconds late, so hopefully it will not impact the second trip).
If, for instance, the second trip ran 10 minutes late (and therefore arrived at 21:38:00 instead of 21:28:00), you could reasonably assume the third trip would begin at about 21:38:00 instead of 21:35:00 (about three minutes late).
GtfsRealTimeToSql to store service alerts, data is stored
in three tables. The main service alert information is stored in
gtfs_rt_alerts. Since there can be any number of time ranges or
affected entities for any given alert, there is a table to hold time
gtfs_rt_alerts_timeranges) and one to hold affected entities
In order to link this data together, each alert has an
value (created by
GtfsRealTimeToSql) which is also present for any
corresponding time range and affected entities records.
To retrieve service alerts from the database, you can use the following query:
SELECT alert_id, header, description, cause, effect FROM gtfs_rt_alerts;
At the time of writing, this yields the following results. The description has been shortened as they are quite long and descriptive in the original feed.
|6||Route 11 detour||Route 11 outbound detoured due to ...||1||4|
|11||Ruggles elevator unavailable||... Commuter Rail platform to the lobby is unavailable ...||9||7|
|33||Extra Franklin Line service||1||5|
**Note: **A cause value of 1 corresponds to UNKNOWN_CAUSE, while 9 corresponds to MAINTENANCE. An effect value of 4 corresponds to DETOUR, 7 corresponds to OTHER_EFFECT, while 5 corresponds to ADDITIONAL_SERVICE.
To determine the timing of these alerts, look up the
gtfs_rt_alerts_timeranges for the given alerts.
$ export TZ=America/New_York $ sqlite3 db.sqlite sqlite> SELECT alert_id, datetime(start, 'unixepoch', 'localtime'), datetime(finish, 'unixepoch', 'localtime') FROM gtfs_rt_alerts_timeranges WHERE alert_id IN (6, 11, 33);
Note: In this example, the system timezone has been temporarily changed to America/New_York (the timezone specified in MBTA's agency.txt file) so the timestamps are formatted correctly. You may prefer instead to format the start and finish timestamps using your programming language of choice.
|11||2015-03-18 06:00:00||2015-03-18 16:00:00|
|11||2015-03-19 06:00:00||2015-03-19 16:00:00|
|33||2015-03-16 10:58:38||2015-03-18 02:30:00|
These dates indicate the following:
- The alert with an ID of 6 began on February 17 and has no specified end date.
- The alert with an ID of 11 will occur over two days between 6 AM and 4 PM. *The alert with an ID of 33 will last for almost two days.
Finally, to determine which entities (routes, trips, stops) are affected
by these alerts, query the
SELECT alert_id, agency_id, route_id, route_type, stop_id, trip_id, trip_start_date, trip_start_time, trip_rship FROM gtfs_rt_alerts_entities WHERE alert_id IN (6, 11, 33);
This results in the following data, describing only an entity for the final service alert.
route_id value, you can find the route from the GTFS feed:
SELECT agency_id, route_type, route_long_name, route_desc FROM routes WHERE route_id = 'CR-Franklin';
This query will result in the following data:
|1||2||Franklin Line||Commuter Rail|
In effect, this means that if you have a web site or app displaying the schedule for the Franklin line, this alert should be displayed so the people who use the line are aware of the change.
Note: Even though the
route_type values are included,
you do not really need these since the
route_id can only refer to one
row in the GTFS feed. If instead you wanted to refer to ALL rail lines,
the alert would have the
route_id value blank but keep the