aboutsummaryrefslogtreecommitdiff
path: root/.idea/gtfs-book/ch-17-searching-for-trips.md
blob: dd0a90169ca11498f69d1992ac955667a194c521 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
## 17. Searching for Trips**

This section shows you how to search for trips in a GTFS feed based on
specified times and stops.

The three scenarios covered are:

* Finding all stops departing from a given stop after a certain time
* Finding all stops arriving at a given stop before a certain time
* Finding all trips between two stops after a given time

The first and second scenarios are the simplest, because they only rely
on a single end of each trip. The third scenario is more complex because
you have to ensure that each trip returned visits both the start and
finish stops.

When searching for trips, the first thing you need to know is which
services are running for the given search time.

### Finding Service IDs

The first step in searching for trips is to determine which services are
running. To begin with, you need to find service IDs for a given date.
You then need to handle exceptions accordingly. That is, you need to add
service IDs and remove service IDs based on the rules in
`calendar_dates.txt`.

***Note:** In *Scheduling Past Midnight*, you were shown how
GTFS works with times past midnight. The key takeaway from this is that
you have to search for trips for two sets of service IDs. This is
covered as this chapter progresses.*

In Australia, 27 January 2014 (Monday) was the holiday for Australia
Day. This is used as an example to demonstrate how to retrieve service
IDs.

Firstly, you need the main set of service IDs for the day. The following
SQL query achieves this.

```sql
SELECT service_id FROM calendar
  WHERE start_date <= '20140127' AND end_date >= '20140127'
  AND monday = 1;
 
# Result: 1, 2, 6, 9, 18, 871, 7501
```

Next you need to find service IDs that are to be excluded, as achieved
by the following SQL query.

```sql
SELECT service_id FROM calendar_dates
  WHERE date = '20140127' AND exception_type = 2;

# Result: 1, 2, 6, 9, 18, 871, 7501
```

Finally, you need to find service IDs that are to be added. This query
is identical to the previous query, except for the different
`exception_type` value.

```sql
SELECT service_id FROM calendar_dates
  WHERE date = '20140127' AND exception_type = 1;

# Result: 12, 874, 4303, 7003
```

You can combine these three queries all into a single query in SQLite
using `EXCEPT` and `UNION`, as shown in the following SQL query.

```sql
SELECT service_id FROM calendar
  WHERE start_date <= '20140127' AND end_date >= '20140127'
  AND monday = 1
  
UNION
  
SELECT service_id FROM calendar_dates
  WHERE date = '20140127' AND exception_type = 1
  
EXCEPT
  
SELECT service_id FROM calendar_dates
  WHERE date = '20140127' AND exception_type = 2;

# Result: 12, 874, 4304, 7003
```

Now when you search for trips, only trips that have a matching
`service_id` value are included.

### Finding Trips Departing a Given Stop

In order to determine the list of services above, a base timestamp on
which to search is needed. For the purposes of this example, assume that
timestamp is 27 January 2014 at 1 PM (`13:00:00` when using GTFS).

This example searches for all services departing from Adelaide Railway
Station, which has stop ID `6665` in the Adelaide Metro GTFS feed. To
find all matching stop times, the following query can be performed.

```sql
SELECT * FROM stop_times
  WHERE stop_id = '6665'
  AND departure_time >= '13:00:00'
  AND pickup_type = 0
  ORDER BY departure_time;
```

This returns a series of stop times that match the given criteria. The
only problem is it does not yet take into account valid service IDs.

***Note:** This query may also return stop times that are the final stop
on a trip, which is not useful for somebody trying to find departures.
You may want to modify your database importer to override the final stop
time of each trip so its `pickup_type` has a value of `1` (no pick-up) and
its first stop time so it has a `drop_off_type` of `1` (no drop-off).*

To make sure only the correct trips are returned, join `stop_times`
with `trips` using `trip_id`, and then include the list of service
IDs. For the purposes of this example the service IDs, stop ID and
departure time are being hard-coded. You can either embed a sub-query,
or include the service IDs via code.

```sql
SELECT t.*, st.* FROM stop_times st, trips t
  WHERE st.stop_id = '6665'
  AND st.trip_id = t.trip_id
  AND t.service_id IN ('12', '874', '4304', '7003')
  AND st.departure_time >= '13:00:00'
  AND st.pickup_type = 0
  ORDER BY st.departure_time;
```

This gives you the final list of stop times matching the desired
criteria. You can then decide specifically which data you need to
retrieve; you now have the `trip_id`, meaning you can find all stop
times for a given trip if required.

If you need to restrict the results to only those that occur after the
starting stop, you can retrieve stop times with only a `stop_sequence`
larger than that of the stop time returned in the above query.

### Finding Trips Arriving at a Given Stop

In order to find the trips arriving at a given stop before a specified
time, it is just a matter of making slight modifications to the above
query. Firstly, check the `arrival_time` instead of
`departure_time`. Also, check the `drop_off_type` value instead of
`pickup_type`.

```sql
SELECT t.*, st.* FROM stop_times st, trips t
  WHERE st.stop_id = '6665'
  AND st.trip_id = t.trip_id
  AND t.service_id IN ('12', '874', '4304', '7003')
  AND st.arrival_time <= '13:00:00'
  AND st.drop_off_type = 0
  ORDER BY st.arrival_time DESC;
```

For this particular data set, there are trips from four different routes
returned. If you want to restrict this to a particular route, you can
filter on the `t.route_id` value.

```sql
SELECT t.*, st.* FROM stop_times st, trips t
  WHERE st.stop_id = '6665'
  AND st.trip_id = t.trip_id
  AND t.service_id IN ('12', '874', '4304', '7003')
  AND t.route_id = 'BEL'
  AND st.arrival_time <= '13:00:00'
  AND st.drop_off_type = 0
  ORDER BY st.arrival_time DESC;
```

### Performance of Searching Text-Based Times

These examples search using text-based arrival/departure times (such as
`13:00:00`). This works because the GTFS specification mandates that
all times are `HH:MM:SS` format (although `H:MM:SS` is allowed for times
earlier than 10 AM).

Doing this kind of comparison (especially if you are scanning millions
of rows) is quite slow and expensive. It is more efficient to convert
all times stored in the database to integers that represent the number
of seconds since midnight.

***Note:** The GTFS specification states that arrival and departure times
are "noon minus 12 hours" in order to account for daylight savings time.
This is effectively midnight, except for the days that daylight savings
starts or finishes.*

In order to achieve this, you can convert the text-based time to an
integer with `H * 3600 + M * 60 + S`. For example, `13:35:21` can
be converted using the following steps.

```
  (13 * 3600) + (35 * 60) + (21)
= 46800 + 2100 + 21
= 48921
```

You can then convert back to hours, minutes and seconds in order to
generate timestamps in your application as shown in the following
algorithm.

```
H = floor( 48921 / 3600 )
  = floor( 13.59 )
  = 13
  
M = floor( 48921 / 60 ) % 60
  = floor( 815.35 ) % 60
  = 815 % 60
  = 35
  
S = 48921 % 60
  = 21
```

### Finding Trips Between Two Stops

Now that you know how to look up a trip from or to a given stop, the
previous query can be expanded so both the start and finish stop are
specified. The following example finds trips that depart after 1 PM. The
search only returns trips departing from *Adelaide Railway Station*
(stop ID `6665`) are shown. Additionally, only trips that then visit
*Blackwood Railway Station* (stop IDs `6670` and `101484`) are
included.

In order to achieve this, the following changes must be made to the
previous examples.

* **Join against stop_times twice.** Once for the departure stop time
  and once for the arrival stop time.
* **Allow for multiple stop IDs at one end.** The destination in this
  example has two platforms, so you need to check both of them.
* **Ensure the departure time is earlier than the arrival time.
  **Otherwise trips heading in the opposite direction may also be
  returned.

The following query demonstrates how this is achieved.

```sql
SELECT t.*, st1.*, st2.*
  FROM trips t, stop_times st1, stop_times st2
  WHERE st1.trip_id = t.trip_id
  AND st2.trip_id = t.trip_id
  AND st1.stop_id = '6665'
  AND st2.stop_id IN ('6670', '101484')
  AND t.service_id IN ('12', '874', '4304', '7003')
  AND st1.departure_time >= '13:00:00'
  AND st1.pickup_type = 0
  AND st2.drop_off_type = 0
  AND st1.departure_time < st2.arrival_time
  ORDER BY st1.departure_time;
```

In this example, the table alias `st1` is used for the departure stop
time. Once again, the stop ID must match, as well as the departure time
and pick-up type.

For the arrival stop time the alias `st2` is used. This table also
joins the `trips` table using `trip_id`. Since the destination has
multiple stop IDs, the SQL `IN` construct is used. The arrival time is
not important in this example, so only the departure is checked.

The final thing to check is that the departure occurs before the
arrival. If you do not perform this step, then trips traveling in the
opposite direction may also be returned.

***Note:** Technically, there may be multiple results returned for the
same trip. For some transit agencies, a single trip may visit the a stop
more than once. If this is the case, you should also check the trip
duration (arrival time minus departure time) and use the shortest trip
when the same trip is returned multiple times.*

### Accounting for Midnight

As discussed previously, GTFS has the ability for the trips to depart or
arrive after midnight for a given service day without having to specify
it as part of the next service day. Consequently, while the queries
above are correct, they do not necessarily paint the full picture.

In reality, when performing a trip search, you need to take into account
trips that have wrapped times (for instance, where 12:30 AM is specified
as `24:30:00`). If you want to find trips that depart after 12:30 AM
on a given day, you need to check for trips departing after
`00:30:00` on that day, as well as for trips departing at
`24:30:00` on the previous day.

This means that for each trip search you are left with two sets of
trips, which you must then merge and present as appropriate.

***Note:** In reality, agencies generally do not have trips that overlap
from multiple service days, so technically you often only need one query
(for example, a train service might end on 12:30 AM then restart on the
next service day at 4:30 AM). If your app / web site only uses a single
feed where you can tune your queries manually based on how the agency
operates, then you can get away with only querying a single service day.
On the other hand, if you are building a scalable system that works with
data from many agencies, then you need to check both days.*

To demonstrate how this works in practice, the following example
searches for all trips that depart after 12:30:00 AM on 14 March 2014.
The examples earlier in this chapter showed how to find the service IDs
for a given date. To account for midnight, service IDs for both March 14
(the "main" service date) and March 13 (the overlapping date) need to be
determined.

Assume that March 13 has a service ID of `C1` and March 14 has a
service ID of `C2`. First you need to find the departures for March
14, as shown in the following query.

```sql
SELECT t.*, st.* FROM stop_times st, trips t
  WHERE st.stop_id = 'S1'
  AND st.trip_id = t.trip_id
  AND t.service_id IN ('C1')
  AND st.departure_time >= '00:30:00'
  AND st.pickup_type = 0
  ORDER BY st.departure_time;
```

The resultant list needs to be combined with the trips that depart after
midnight from the March 13 service. To check this, it is just a matter
of swapping in the right service IDs, then adding 24 hours to the search
time.

```sql
SELECT t.*, st.* FROM stop_times st, trips t
  WHERE st.stop_id = 'S1'
  AND st.trip_id = t.trip_id
  AND t.service_id IN ('C2')
  AND st.departure_time >= '24:30:00'
  AND st.pickup_type = 0
  ORDER BY st.departure_time;
```

In order to get your final list of trips you must combine the results
from both of these queries. If you are generating complete timestamps in
order to present the options to your users, just remember to account for
the results from the second query being 24 hours later.