## 9. Storing Feed Data in a Database This chapter will demonstrate how to save data from a GTFS-realtime feed into an SQLite database. In order to look up trips, stops, routes and other data from the GTFS-realtime feed, this SQLite database will also contain the data from the corresponding GTFS feed. To do this, the `GtfsToSql` and `GtfsRealTimeToSql` tools which have been written for the purpose of this book and the preceding book, *The Definitive Guide to GTFS* ([http://gtfsbook.com](http://gtfsbook.com/)), will be used. GTFS and GTFS-realtime feeds from the MBTA in Boston () will also be used. ### Storing GTFS Data in an SQLite Database *The Definitive Guide to GTFS* demonstrated how to populate an SQLite database using `GtfsToSql`. Here is an abbreviated version of the steps required to do so. First, download `GtfsToSql` from . This is a Java command-line application to import a GTFS feed into an SQLite database. The pre-compiled `GtfsToSql` Java archive can be downloaded from its GitHub repository at . Next, download the MBTA GTFS feed, available from . ``` $ curl http://www.mbta.com/uploadedfiles/MBTA_GTFS.zip -o gtfs.zip $ unzip gtfs.zip -d mbta/ ``` To create an SQLite database from this feed, the following command can be used: ``` $ java -jar GtfsToSql.jar -s jdbc:sqlite:./db.sqlite -g ./mbta -o ``` ***Note:** The -o flag enables the recording of additional useful data in the database. For instance, each entry in the trips table will contain the departure and arrival time (which would otherwise only be available by looking up the `stop_times` table).* This may take a minute or two to complete (you will see progress as it imports the feed and then creates indexes), and at the end you will have a GTFS database in a file called `db.sqlite`. You can then query this database with the command-line `sqlite3` tool, as shown in the following example: ``` $ sqlite3 db.sqlite sqlite> SELECT agency_id, agency_name, agency_url, agency_lang FROM agency; 2|Massport|http://www.massport.com|EN 1|MBTA|[http://www.mbta.com](http://www.mbta.com/)|EN ``` ***Note:** For more information about storing and querying GTFS data, please refer to *The Definitive Guide to GTFS*, available from [http://gtfsbook.com](http://gtfsbook.com/).* ### Storing GTFS-realtime Data in an SQLite Database Once the GTFS data has been imported into the SQLite database, you can then start importing GTFS-realtime data. For this you can use the `GtfsRealTimeToSql` tool specifically written to import data into an SQLite database. ***Note:** Technically you do not need the GTFS data present in the database as well, but having it makes it far simpler to resolve trip, route and stop data.* The GTFS data will change infrequently (perhaps every few weeks or months), while the realtime data can change several times per minute. `GtfsRealTimeToSql` will frequently update a feed, according to the refresh time specified. Each time it updates, the data saved on the previous iteration is deleted, as that data is no longer the most up-to-date data. To get started, download `GtfsRealTimeToSql` from its GitHub repository at . Just like `GtfsToSql`, this is a Java command-line application. The pre-compiled `GtfsRealTimeToSql` Java archive can be downloaded from . Since the `db.sqlite` database contains the MBTA GTFS feed, you can now run `GtfsRealTimeToSql` with one of MBTA's GTFS-realtime feeds. For instance, their vehicle positions feed is located at . ``` java -jar GtfsRealTimeToSql.jar \ -u "http://developer.mbta.com/lib/gtrtfs/Vehicles.pb" \ -s jdbc:sqlite:./db.sqlite \ -refresh 15 ``` When you run this command, the vehicle positions feed will be retrieved every 15 seconds (specified by the `-refresh` parameter), and the data will be saved into the `db.sqlite` SQLite database. MBTA also have a trip updates feed and a service alerts feed. In order to load each of these feeds you will need to run `GtfsRealTimeToSql` three separate times. To allow it to run in the background, add the `-d` parameter (to make it run as a server daemon), and background it using **&**. To load the vehicle positions in the background, stop the previous command, then run the following command instead. ``` java -jar GtfsRealTimeToSql.jar \ -u "http://developer.mbta.com/lib/gtrtfs/Vehicles.pb" \ -s jdbc:sqlite:./db.sqlite \ -refresh 15 \ -d & ``` Now you can also load the trip updates into the same `db.sqlite` file. MBTA's trip updates feed is located at . The following command reloads the trip updates every 30 seconds: ``` java -jar GtfsRealTimeToSql.jar \ -u "http://developer.mbta.com/lib/gtrtfs/Passages.pb" \ -s jdbc:sqlite:./db.sqlite \ -refresh 30 \ -d & ``` ***Note:** You may prefer more frequent updates (such as 15 seconds), or even less frequent (such as 60 seconds). The more frequently you update, the greater your server utilization will be.* Finally, to load the service alerts feed, use the same command, but now with the feed located at . Generally, service alerts are updated infrequently by providers, so you can use a refresh time such as five or ten minutes (300 or 600 seconds). ``` java -jar GtfsRealTimeToSql.jar \ -u "http://developer.mbta.com/lib/gtrtfs/Alerts/Alerts.pb" \ -s jdbc:sqlite:./db.sqlite \ -refresh 300 \ -d & ``` These three feeds will continue to be reloaded until you terminate their respective processes. ### Querying Vehicle Positions When using `GtfsRealTimeToSql`, vehicle positions are stored in a table called `gtfs_rt_vehicles`. If you want to retrieve positions for, say, the route with an ID of 742, you can run the following query: ``` $ sqlite3 db.sqlite sqlite> SELECT route_id, trip_id, trip_date, trip_time, trip_sr, latitude, longitude FROM gtfs_rt_vehicles WHERE route_id = 742; ``` This query returns the trip descriptor and GPS coordinates for all vehicles on the given route. The following table shows sample results from this query. | `route_id` | `trip_id` | `trip_date` | `trip_time` | `trip_sr` | `latitude` | `longitude` | | :--------- | :--------- | :----------- | :---------- | :-------- | :--------- | :---------- | | 742 | 25900860 | 20150315 | | 0 | 42.347309 | -71.040359 | | 742 | | | | | 42.331505 | -71.065590 | ***Note:** The trip_sr column corresponds to the trip's schedule relationship value.* This particular snapshot of vehicle position data shows two different trips for route 742, which corresponds to the SL2 Silver Line. ``` sqlite> SELECT route_short_name, route_long_name FROM routes WHERE route_id = 742; SL2|Silver Line SL2 ``` The first five columns retrieved are the trip identifier fields, which are used to link a vehicle position with a trip from the GTFS feed. The first row is simple: the `trip_id` value can be used to look up the row from the `trips` table. ``` sqlite> SELECT service_id, trip_headsign, direction_id, block_id FROM trips WHERE trip_id = '25900860'; ``` The results from this query are as follows. | `service_id` | `trip_headsign` | `direction_id` | `block_id` | | :--------------------------- | :-------------- | :------------- | :--------- | | BUSS12015-hbs15017-Sunday-02 | South Station | 1 | S742-61 | ***Note:** To see all fields available in this or any other table (including `gtfs_rt_vehicles`), use the .schema command in SQLite. For instance, enter the command `.schema gtfs_rt_vehicles`.* One helpful thing MBTA does is to include the trip starting date when they include the trip ID. This helps to disambiguate trips that may start or finish around midnight or later. The second vehicle position is another matter. This record does not include any trip descriptor information other than the route ID. This means you cannot reliably link this vehicle position with a specific trip from the GTFS feed. However, knowing the route ID and the vehicle's coordinates may be enough to present useful information to the user: "A bus on the Silver Line SL2 route is at this location." You cannot show them if the vehicle is running late, early or on-time, but you can show the user where the vehicle is. ### Querying Trip Updates The `GtfsRealTimeToSql` tool stores trip update data in two tables: one for the main trip update data (such as the trip descriptor), and another to store each individual stop time event that belongs within each update. For example, to retrieve all trip updates for the route with ID 742 once again, you could use the following query: ```sql SELECT route_id, trip_id, trip_date, trip_time, trip_sr, vehicle_id, vehicle_label FROM gtfs_rt_trip_updates WHERE route_id = '742'; ``` This query will return data similar to the following table: | `route_id` | `trip_id` | `trip_date` | `trip_time` | `trip_sr` | `vehicle_id` | `vehicle_label` | | :--------- | :-------- | :---------- | :---------- | :-------- | :----------- | :-------------- | | 742 | 25900860 | 20150315 | | 0 | y1106 | 1106 | | 742 | 25900856 | 20150315 | | 0 | | | | 742 | 25900858 | 20150315 | | 0 | | | Each of these returned records has corresponding records in the `gtfs_rt_trip_updates_stoptimes` table that includes the arrival/departure estimates for various stops on the trip. When using `GtfsRealTimeToSql`, an extra column called `update_id` is included on both tables so they can be linked together. For example, to retrieve all updates for the route ID 742, you can join the tables together as follows: ```sql SELECT trip_id, arrival_time, arrival_delay, departure_time, departure_delay, stop_id, stop_sequence FROM gtfs_rt_trip_updates_stoptimes JOIN gtfs_rt_trip_updates USING (update_id) WHERE route_id = '742'; ``` MBTA's trip updates feed only includes a stop time prediction for the next stop. This means that each trip in the results only has one corresponding record. You must then apply the delay to subsequent stop times for the given trip. | `trip_id` | `arrival_time` | `arrival_delay` | `departure_time` | `departure_delay` | `stop_id` | `stop_sequence` | | :-------- | :------------- | :-------------- | :--------------- | :---------------- | :-------- | :-------------- | | 25900860 | | 30 | | | 74617 | 10 | | 25900856 | | | | 0 | 74611 | 1 | | 25900858 | | | | 0 | 31255 | 1 | There are several things to note in these results. Firstly, MBTA do not provide a timestamp for `arrival_time` and `departure_time`; they only provide the delay offsets that can be compared to the scheduled time in the GTFS feed. Secondly, the second and third trips listed have not yet commenced. You can deduce this just by looking at this table, since the next stop has stop sequence of `1` (and therefore there are no stops before it). The first trip is delayed by 30 seconds. In other words, it will arrive 30 seconds later than it was scheduled. The data received from the GTFS-realtime feed does not actually indicate the arrival timestamp, but you can look up the corresponding stop time from the GTFS feed to determine this. The following query demonstrates how to look up the arrival time: ```sql SELECT s.stop_name, st.arrival_time FROM stop_times st, trips t, stops s WHERE st.trip_index = t.trip_index AND st.stop_index = s.stop_index AND s.stop_id = '74617' AND t.trip_id = '25900860' AND st.stop_sequence = 10; ``` ***Note:** The `GtfsToSql` tool used to import the GTFS feed adds fields such as trip_index and stop_index in order to speed up data searching. There is more discussion on the rationale of this in *The Definitive Guide to GTFS*, available from [http://gtfsbook.com](http://gtfsbook.com/).* This query joins the `stop_times` table to both the `trips` and `stops` table in order to look up the corresponding arrival time. The final three rows in the query contain the values returned above (the `stop_id`, `trip_id` and the `stop_sequence`), to find the following record: ``` South Station Silver Line - Inbound|21:17:00 ``` This means the scheduled arrival time for South Station Silver Line is 9:17:00 PM. Since the estimate indicates a delay of 30 seconds, the new arrival time is 9:17:30 PM. ***Note:** Conversely, if the delay was -30 instead of 30, the vehicle would be early and arrive at 9:16:30 PM.* One thing not touched upon in this example is the stop time's schedule relationship. The `gtfs_rt_trip_updates_stoptimes` also includes a column called `rship`, which is used to indicate the schedule relationship for the given stop. If this value was skipped (a value of `1`), then it means the vehicle will stop here. ### Determining Predictions Using Blocks In GTFS, the `block_id` value in **trips.txt** is used to indicate a series of one or more trips undertaken by a single vehicle. In other words, once it gets to the end of one trip, it starts a new trip from that location. If a given trip is very short, a vehicle may perform up to fifty or one hundred trips in a single day. This can be useful for determining estimates for future trips that may not be included in the GTFS feed. For instance, if a trip is running 30 minutes late, then it is highly likely that subsequent trips on that block will also be running late. Referring back to the trip data returned in the above example, the following data can be retrieved from the GTFS feed: ```sql SELECT trip_id, block_id, service_id, departure_time, arrival_time FROM trips WHERE trip_id IN ('25900860', '25900856', '25900858') ORDER BY departure_time; ``` This returns the following data. | `trip_id` | `block_id` | `service_id` | `departure_time` | `arrival_time` | | :-------- | :--------- | :--------------------------- | :--------------- | :------------- | | 25900860 | S742-61 | BUSS12015-hbs15017-Sunday-02 | 21:04:00 | 21:17:00 | | 25900856 | S742-61 | BUSS12015-hbs15017-Sunday-02 | 21:18:00 | 21:28:00 | | 25900858 | S742-61 | BUSS12015-hbs15017-Sunday-02 | 21:35:00 | 21:48:00 | Each of these trips have the same values for `block_id` and `service_id`, meaning the same vehicle will complete all three trips. The data indicates that the first trip is running 30 seconds late, so its arrival time will be 21:17:30. Because the second trip is scheduled to depart at 21:18:00, there is a buffer time to catch up (in other words, the first trip is only 30 seconds late, so hopefully it will not impact the second trip). If, for instance, the second trip ran 10 minutes late (and therefore arrived at 21:38:00 instead of 21:28:00), you could reasonably assume the third trip would begin at about 21:38:00 instead of 21:35:00 (about three minutes late). ### Querying Service Alerts When using `GtfsRealTimeToSql` to store service alerts, data is stored in three tables. The main service alert information is stored in `gtfs_rt_alerts`. Since there can be any number of time ranges or affected entities for any given alert, there is a table to hold time ranges (`gtfs_rt_alerts_timeranges`) and one to hold affected entities (`gtfs_rt_alerts_entities`). In order to link this data together, each alert has an `alert_id` value (created by `GtfsRealTimeToSql`) which is also present for any corresponding time range and affected entities records. To retrieve service alerts from the database, you can use the following query: ```sql SELECT alert_id, header, description, cause, effect FROM gtfs_rt_alerts; ``` At the time of writing, this yields the following results. The description has been shortened as they are quite long and descriptive in the original feed. | `alert_id` | `header` | `description` | `cause` | `effect` | | :--------- | :--------------------------- | :--------------------------------------------------------- | :------- | :------- | | 6 | Route 11 detour | Route 11 outbound detoured due to ... | 1 | 4 | | 11 | Ruggles elevator unavailable | ... Commuter Rail platform to the lobby is unavailable ... | 9 | 7 | | 33 | Extra Franklin Line service | | 1 | 5 | **Note: **A cause value of 1 corresponds to UNKNOWN_CAUSE, while 9 corresponds to MAINTENANCE. An effect value of 4 corresponds to DETOUR, 7 corresponds to OTHER_EFFECT, while 5 corresponds to ADDITIONAL_SERVICE. To determine the timing of these alerts, look up the `gtfs_rt_alerts_timeranges` for the given alerts. ``` $ export TZ=America/New_York $ sqlite3 db.sqlite sqlite> SELECT alert_id, datetime(start, 'unixepoch', 'localtime'), datetime(finish, 'unixepoch', 'localtime') FROM gtfs_rt_alerts_timeranges WHERE alert_id IN (6, 11, 33); ``` ***Note:** In this example, the system timezone has been temporarily changed to America/New_York (the timezone specified in MBTA's agency.txt file) so the timestamps are formatted correctly. You may prefer instead to format the start and finish timestamps using your programming language of choice.* | `alert_id` | `start` (formatted) | `finish` (formatted) | | :--------- | :------------------ | :------------------- | | 6 | 2015-02-17 15:56:52 | | | 11 | 2015-03-18 06:00:00 | 2015-03-18 16:00:00 | | 11 | 2015-03-19 06:00:00 | 2015-03-19 16:00:00 | | 33 | 2015-03-16 10:58:38 | 2015-03-18 02:30:00 | These dates indicate the following: * The alert with an ID of 6 began on February 17 and has no specified end date. * The alert with an ID of 11 will occur over two days between 6 AM and 4 PM. *The alert with an ID of 33 will last for almost two days. Finally, to determine which entities (routes, trips, stops) are affected by these alerts, query the `gtfs_rt_alerts_entities` table. ```sql SELECT alert_id, agency_id, route_id, route_type, stop_id, trip_id, trip_start_date, trip_start_time, trip_rship FROM gtfs_rt_alerts_entities WHERE alert_id IN (6, 11, 33); ``` This results in the following data, describing only an entity for the final service alert. | `alert_id` | `agency_id` | `route_id` | `route_type` | `stop_id` | `trip_id` | `trip_start_date` | `trip_start_time` | `trip_rship` | | :--------- | :---------- | :---------- | :----------- | :-------- | :-------- | :---------------- | :---------------- | :----------- | | 6 | 1 | CR-Franklin | 2 | | | | | | Using the `route_id` value, you can find the route from the GTFS feed: ```sql SELECT agency_id, route_type, route_long_name, route_desc FROM routes WHERE route_id = 'CR-Franklin'; ``` This query will result in the following data: | `agency_id` | `route_type` | `route_long_name` | `route_desc` | | :---------- | :----------- | :---------------- | :------------ | | 1 | 2 | Franklin Line | Commuter Rail | In effect, this means that if you have a web site or app displaying the schedule for the Franklin line, this alert should be displayed so the people who use the line are aware of the change. ***Note:** Even though the `agency_id` and `route_type` values are included, you do not really need these since the `route_id` can only refer to one row in the GTFS feed. If instead you wanted to refer to ALL rail lines, the alert would have the `route_id` value blank but keep the `route_type` value.*