aboutsummaryrefslogtreecommitdiff
path: root/.idea/gtfs-realtime-book/ch-09-database-storage.md
blob: e9b42f2a55b8a7bd30a76c8d45eef981546d5c61 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
## 9. Storing Feed Data in a Database

This chapter will demonstrate how to save data from a GTFS-realtime feed
into an SQLite database. In order to look up trips, stops, routes and
other data from the GTFS-realtime feed, this SQLite database will also
contain the data from the corresponding GTFS feed.

To do this, the `GtfsToSql` and `GtfsRealTimeToSql` tools which have
been written for the purpose of this book and the preceding book, *The
Definitive Guide to GTFS* ([http://gtfsbook.com](http://gtfsbook.com/)),
will be used.

GTFS and GTFS-realtime feeds from the MBTA in Boston
(<https://openmobilitydata.org/p/mbta>) will also be used.

### Storing GTFS Data in an SQLite Database

*The Definitive Guide to GTFS* demonstrated how to populate an SQLite
database using `GtfsToSql`. Here is an abbreviated version of the
steps required to do so.

First, download `GtfsToSql` from
<https://github.com/OpenMobilityData/GtfsToSql>. This is a Java command-line
application to import a GTFS feed into an SQLite database.

The pre-compiled `GtfsToSql` Java archive can be downloaded from its
GitHub repository at
<https://github.com/OpenMobilityData/GtfsToSql/tree/master/dist>.

Next, download the MBTA GTFS feed, available from
<http://www.mbta.com/uploadedfiles/MBTA_GTFS.zip>.

```
$ curl http://www.mbta.com/uploadedfiles/MBTA_GTFS.zip -o gtfs.zip

$ unzip gtfs.zip -d mbta/
```

To create an SQLite database from this feed, the following command can
be used:

```
$ java -jar GtfsToSql.jar -s jdbc:sqlite:./db.sqlite -g ./mbta -o
```

***Note:** The -o flag enables the recording of additional useful data in
the database. For instance, each entry in the trips table will contain
the departure and arrival time (which would otherwise only be available
by looking up the `stop_times` table).*

This may take a minute or two to complete (you will see progress as it
imports the feed and then creates indexes), and at the end you will have
a GTFS database in a file called `db.sqlite`. You can then query this
database with the command-line `sqlite3` tool, as shown in the
following example:

```
$ sqlite3 db.sqlite

sqlite> SELECT agency_id, agency_name, agency_url, agency_lang FROM agency;

2|Massport|http://www.massport.com|EN

1|MBTA|[http://www.mbta.com](http://www.mbta.com/)|EN
```

***Note:** For more information about storing and querying GTFS data,
please refer to *The Definitive Guide to GTFS*, available from
[http://gtfsbook.com](http://gtfsbook.com/).*

### Storing GTFS-realtime Data in an SQLite Database

Once the GTFS data has been imported into the SQLite database, you can
then start importing GTFS-realtime data. For this you can use the
`GtfsRealTimeToSql` tool specifically written to import data into an
SQLite database.

***Note:** Technically you do not need the GTFS data present in the
database as well, but having it makes it far simpler to resolve trip,
route and stop data.*

The GTFS data will change infrequently (perhaps every few weeks or
months), while the realtime data can change several times per minute.
`GtfsRealTimeToSql` will frequently update a feed, according to the
refresh time specified. Each time it updates, the data saved on the
previous iteration is deleted, as that data is no longer the most
up-to-date data.

To get started, download `GtfsRealTimeToSql` from its GitHub
repository at <https://github.com/OpenMobilityData/GtfsRealTimeToSql>. Just
like `GtfsToSql`, this is a Java command-line application. The
pre-compiled `GtfsRealTimeToSql` Java archive can be downloaded from
<https://github.com/OpenMobilityData/GtfsRealTimeToSql/tree/master/dist>.

Since the `db.sqlite` database contains the MBTA GTFS feed, you can
now run `GtfsRealTimeToSql` with one of MBTA's GTFS-realtime feeds.
For instance, their vehicle positions feed is located at
<http://developer.mbta.com/lib/gtrtfs/Vehicles.pb>.

```
java -jar GtfsRealTimeToSql.jar \
  -u "http://developer.mbta.com/lib/gtrtfs/Vehicles.pb" \
  -s jdbc:sqlite:./db.sqlite \
  -refresh 15
```

When you run this command, the vehicle positions feed will be retrieved
every 15 seconds (specified by the `-refresh` parameter), and the data
will be saved into the `db.sqlite` SQLite database.

MBTA also have a trip updates feed and a service alerts feed. In order
to load each of these feeds you will need to run `GtfsRealTimeToSql`
three separate times. To allow it to run in the background, add the
`-d` parameter (to make it run as a server daemon), and background it
using **&**.

To load the vehicle positions in the background, stop the previous
command, then run the following command instead.

```
java -jar GtfsRealTimeToSql.jar \
  -u "http://developer.mbta.com/lib/gtrtfs/Vehicles.pb" \
  -s jdbc:sqlite:./db.sqlite \
  -refresh 15 \
  -d &
```

Now you can also load the trip updates into the same `db.sqlite` file.
MBTA's trip updates feed is located at
<http://developer.mbta.com/lib/gtrtfs/Passages.pb>. The following
command reloads the trip updates every 30 seconds:

```
java -jar GtfsRealTimeToSql.jar \
  -u "http://developer.mbta.com/lib/gtrtfs/Passages.pb" \
  -s jdbc:sqlite:./db.sqlite \
  -refresh 30 \
  -d &
```

***Note:** You may prefer more frequent updates (such as 15 seconds), or
even less frequent (such as 60 seconds). The more frequently you update,
the greater your server utilization will be.*

Finally, to load the service alerts feed, use the same command, but now
with the feed located at
<http://developer.mbta.com/lib/GTRTFS/Alerts/Alerts.pb>. Generally,
service alerts are updated infrequently by providers, so you can use a
refresh time such as five or ten minutes (300 or 600 seconds).

```
java -jar GtfsRealTimeToSql.jar \
  -u "http://developer.mbta.com/lib/gtrtfs/Alerts/Alerts.pb" \
  -s jdbc:sqlite:./db.sqlite \
  -refresh 300 \
  -d &
```

These three feeds will continue to be reloaded until you terminate their
respective processes.

### Querying Vehicle Positions

When using `GtfsRealTimeToSql`, vehicle positions are stored in a
table called `gtfs_rt_vehicles`. If you want to retrieve positions
for, say, the route with an ID of 742, you can run the following query:

```
$ sqlite3 db.sqlite

sqlite> SELECT route_id, trip_id, trip_date, trip_time, trip_sr, latitude, longitude
  FROM gtfs_rt_vehicles
  WHERE route_id = 742;
```

This query returns the trip descriptor and GPS coordinates for all
vehicles on the given route. The following table shows sample results
from this query.

| `route_id` |  `trip_id` |  `trip_date` | `trip_time` | `trip_sr` | `latitude` | `longitude` |
| :--------- | :--------- | :----------- | :---------- | :-------- | :--------- | :---------- |
| 742        |  25900860  |  20150315    |             | 0         | 42.347309  | -71.040359  |
| 742        |            |              |             |           | 42.331505  | -71.065590  |

***Note:** The trip_sr column corresponds to the trip's schedule
relationship value.*

This particular snapshot of vehicle position data shows two different
trips for route 742, which corresponds to the SL2 Silver Line.

```
sqlite> SELECT route_short_name, route_long_name FROM routes WHERE
route_id = 742;

SL2|Silver Line SL2
```

The first five columns retrieved are the trip identifier fields, which
are used to link a vehicle position with a trip from the GTFS feed. The
first row is simple: the `trip_id` value can be used to look up the
row from the `trips` table.

```
sqlite> SELECT service_id, trip_headsign, direction_id, block_id

FROM trips

WHERE trip_id = '25900860';
```

The results from this query are as follows.

| `service_id`                 | `trip_headsign` | `direction_id` | `block_id` |
| :--------------------------- | :-------------- | :------------- | :--------- |
| BUSS12015-hbs15017-Sunday-02 | South Station   | 1              | S742-61    |

***Note:** To see all fields available in this or any other table
(including `gtfs_rt_vehicles`), use the .schema command in SQLite. For
instance, enter the command `.schema gtfs_rt_vehicles`.*

One helpful thing MBTA does is to include the trip starting date when
they include the trip ID. This helps to disambiguate trips that may
start or finish around midnight or later.

The second vehicle position is another matter. This record does not
include any trip descriptor information other than the route ID. This
means you cannot reliably link this vehicle position with a specific
trip from the GTFS feed.

However, knowing the route ID and the vehicle's coordinates may be
enough to present useful information to the user: "A bus on the Silver
Line SL2 route is at this location." You cannot show them if the vehicle
is running late, early or on-time, but you can show the user where the
vehicle is.

### Querying Trip Updates

The `GtfsRealTimeToSql` tool stores trip update data in two tables:
one for the main trip update data (such as the trip descriptor), and
another to store each individual stop time event that belongs within
each update.

For example, to retrieve all trip updates for the route with ID 742 once
again, you could use the following query:

```sql
SELECT route_id, trip_id, trip_date, trip_time, trip_sr, vehicle_id, vehicle_label
  FROM gtfs_rt_trip_updates
  WHERE route_id = '742';
```

This query will return data similar to the following table:

| `route_id` | `trip_id` | `trip_date` | `trip_time` | `trip_sr` | `vehicle_id` | `vehicle_label` |
| :--------- | :-------- | :---------- | :---------- | :-------- | :----------- | :-------------- |
| 742        | 25900860  | 20150315    |             | 0         | y1106        | 1106            |
| 742        | 25900856  | 20150315    |             | 0         |              |                 |
| 742        | 25900858  | 20150315    |             | 0         |              |                 |

Each of these returned records has corresponding records in the
`gtfs_rt_trip_updates_stoptimes` table that includes the
arrival/departure estimates for various stops on the trip.

When using `GtfsRealTimeToSql`, an extra column called `update_id`
is included on both tables so they can be linked together. For example,
to retrieve all updates for the route ID 742, you can join the tables
together as follows:

```sql
SELECT trip_id, arrival_time, arrival_delay, departure_time, departure_delay, stop_id, stop_sequence
  FROM gtfs_rt_trip_updates_stoptimes
  JOIN gtfs_rt_trip_updates USING (update_id)
  WHERE route_id = '742';
```

MBTA's trip updates feed only includes a stop time prediction for the
next stop. This means that each trip in the results only has one
corresponding record. You must then apply the delay to subsequent stop
times for the given trip.

| `trip_id` | `arrival_time` | `arrival_delay` | `departure_time` | `departure_delay` | `stop_id` | `stop_sequence` |
| :-------- | :------------- | :-------------- | :--------------- | :---------------- | :-------- | :-------------- |
| 25900860  |                | 30              |                  |                   | 74617     | 10              |
| 25900856  |                |                 |                  | 0                 | 74611     | 1               |
| 25900858  |                |                 |                  | 0                 | 31255     | 1               |

There are several things to note in these results. Firstly, MBTA do not
provide a timestamp for `arrival_time` and `departure_time`; they
only provide the delay offsets that can be compared to the scheduled
time in the GTFS feed.

Secondly, the second and third trips listed have not yet commenced. You
can deduce this just by looking at this table, since the next stop has
stop sequence of `1` (and therefore there are no stops before it).

The first trip is delayed by 30 seconds. In other words, it will arrive
30 seconds later than it was scheduled. The data received from the
GTFS-realtime feed does not actually indicate the arrival timestamp, but
you can look up the corresponding stop time from the GTFS feed to
determine this.

The following query demonstrates how to look up the arrival time:

```sql
SELECT s.stop_name, st.arrival_time
  FROM stop_times st, trips t, stops s
  WHERE st.trip_index = t.trip_index
  AND st.stop_index = s.stop_index
  AND s.stop_id = '74617'
  AND t.trip_id = '25900860'
  AND st.stop_sequence = 10;
```

***Note:** The `GtfsToSql` tool used to import the GTFS feed adds fields
such as trip_index and stop_index in order to speed up data searching.
There is more discussion on the rationale of this in *The Definitive
Guide to GTFS*, available from
[http://gtfsbook.com](http://gtfsbook.com/).*

This query joins the `stop_times` table to both the `trips` and
`stops` table in order to look up the corresponding arrival time. The
final three rows in the query contain the values returned above (the
`stop_id`, `trip_id` and the `stop_sequence`), to find the
following record:

```
South Station Silver Line - Inbound|21:17:00
```

This means the scheduled arrival time for South Station Silver Line is
9:17:00 PM. Since the estimate indicates a delay of 30 seconds, the new
arrival time is 9:17:30 PM.

***Note:** Conversely, if the delay was -30 instead of 30, the vehicle
would be early and arrive at 9:16:30 PM.*

One thing not touched upon in this example is the stop time's schedule
relationship. The `gtfs_rt_trip_updates_stoptimes` also includes a
column called `rship`, which is used to indicate the schedule
relationship for the given stop. If this value was skipped (a value of
`1`), then it means the vehicle will stop here.

### Determining Predictions Using Blocks

In GTFS, the `block_id` value in **trips.txt** is used to indicate a
series of one or more trips undertaken by a single vehicle. In other
words, once it gets to the end of one trip, it starts a new trip from
that location. If a given trip is very short, a vehicle may perform up
to fifty or one hundred trips in a single day.

This can be useful for determining estimates for future trips that may
not be included in the GTFS feed. For instance, if a trip is running 30
minutes late, then it is highly likely that subsequent trips on that
block will also be running late.

Referring back to the trip data returned in the above example, the
following data can be retrieved from the GTFS feed:

```sql
SELECT trip_id, block_id, service_id, departure_time, arrival_time
  FROM trips
  WHERE trip_id IN ('25900860', '25900856', '25900858')
  ORDER BY departure_time;
```

This returns the following data.

| `trip_id` | `block_id` | `service_id`                 | `departure_time` | `arrival_time` |
| :-------- | :--------- | :--------------------------- | :--------------- | :------------- |
| 25900860  | S742-61    | BUSS12015-hbs15017-Sunday-02 | 21:04:00         | 21:17:00       |
| 25900856  | S742-61    | BUSS12015-hbs15017-Sunday-02 | 21:18:00         | 21:28:00       |
| 25900858  | S742-61    | BUSS12015-hbs15017-Sunday-02 | 21:35:00         | 21:48:00       |

Each of these trips have the same values for `block_id` and
`service_id`, meaning the same vehicle will complete all three trips.
The data indicates that the first trip is running 30 seconds late, so
its arrival time will be 21:17:30.

Because the second trip is scheduled to depart at 21:18:00, there is a
buffer time to catch up (in other words, the first trip is only 30
seconds late, so hopefully it will not impact the second trip).

If, for instance, the second trip ran 10 minutes late (and therefore
arrived at 21:38:00 instead of 21:28:00), you could reasonably assume
the third trip would begin at about 21:38:00 instead of 21:35:00 (about
three minutes late).

### Querying Service Alerts

When using `GtfsRealTimeToSql` to store service alerts, data is stored
in three tables. The main service alert information is stored in
`gtfs_rt_alerts`. Since there can be any number of time ranges or
affected entities for any given alert, there is a table to hold time
ranges (`gtfs_rt_alerts_timeranges`) and one to hold affected entities
(`gtfs_rt_alerts_entities`).

In order to link this data together, each alert has an `alert_id`
value (created by `GtfsRealTimeToSql`) which is also present for any
corresponding time range and affected entities records.

To retrieve service alerts from the database, you can use the following
query:

```sql
SELECT alert_id, header, description, cause, effect FROM gtfs_rt_alerts;
```

At the time of writing, this yields the following results. The
description has been shortened as they are quite long and descriptive in
the original feed.

| `alert_id` | `header`                     | `description`                                              |  `cause` | `effect` |
| :--------- | :--------------------------- | :--------------------------------------------------------- | :------- | :------- |
| 6          | Route 11 detour              | Route 11 outbound detoured due to ...                      |  1       | 4        |
| 11         | Ruggles elevator unavailable | ... Commuter Rail platform to the lobby is unavailable ... |  9       | 7        |
| 33         | Extra Franklin Line service  |                                                            |  1       | 5        |

**Note: **A cause value of 1 corresponds to UNKNOWN_CAUSE, while 9
corresponds to MAINTENANCE. An effect value of 4 corresponds to DETOUR,
7 corresponds to OTHER_EFFECT, while 5 corresponds to
ADDITIONAL_SERVICE.

To determine the timing of these alerts, look up the
`gtfs_rt_alerts_timeranges` for the given alerts.

```
$ export TZ=America/New_York

$ sqlite3 db.sqlite

sqlite> SELECT alert_id, datetime(start, 'unixepoch', 'localtime'), datetime(finish, 'unixepoch', 'localtime')
  FROM gtfs_rt_alerts_timeranges
  WHERE alert_id IN (6, 11, 33);
```

***Note:** In this example, the system timezone has been temporarily
changed to America/New_York (the timezone specified in MBTA's
agency.txt file) so the timestamps are formatted correctly. You may
prefer instead to format the start and finish timestamps using your
programming language of choice.*

| `alert_id` | `start` (formatted) | `finish` (formatted) |
| :--------- | :------------------ | :------------------- |
| 6          | 2015-02-17 15:56:52 |                      |
| 11         | 2015-03-18 06:00:00 | 2015-03-18 16:00:00  |
| 11         | 2015-03-19 06:00:00 | 2015-03-19 16:00:00  |
| 33         | 2015-03-16 10:58:38 | 2015-03-18 02:30:00  |

These dates indicate the following:

* The alert with an ID of 6 began on February 17 and has no specified end date.
* The alert with an ID of 11 will occur over two days between 6 AM and 4 PM.
  *The alert with an ID of 33 will last for almost two days.

Finally, to determine which entities (routes, trips, stops) are affected
by these alerts, query the `gtfs_rt_alerts_entities` table.

```sql
SELECT alert_id, agency_id, route_id, route_type, stop_id, trip_id, trip_start_date, trip_start_time, trip_rship
  FROM gtfs_rt_alerts_entities WHERE alert_id IN (6, 11, 33);
```

This results in the following data, describing only an entity for the
final service alert.

| `alert_id` | `agency_id` | `route_id`  | `route_type` | `stop_id` | `trip_id` | `trip_start_date` | `trip_start_time` | `trip_rship` |
| :--------- | :---------- | :---------- | :----------- | :-------- | :-------- | :---------------- | :---------------- | :----------- |
| 6          | 1           | CR-Franklin | 2            |           |           |                   |                   |              |

Using the `route_id` value, you can find the route from the GTFS feed:

```sql
SELECT agency_id, route_type, route_long_name, route_desc
  FROM routes WHERE route_id = 'CR-Franklin';
```

This query will result in the following data:

| `agency_id` | `route_type` | `route_long_name` | `route_desc`  |
| :---------- | :----------- | :---------------- | :------------ |
| 1           | 2            | Franklin Line     | Commuter Rail |

In effect, this means that if you have a web site or app displaying the
schedule for the Franklin line, this alert should be displayed so the
people who use the line are aware of the change.

***Note:** Even though the `agency_id` and `route_type` values are included,
you do not really need these since the `route_id` can only refer to one
row in the GTFS feed. If instead you wanted to refer to ALL rail lines,
the alert would have the `route_id` value blank but keep the `route_type`
value.*