aboutsummaryrefslogtreecommitdiff
path: root/gtfs-book/ch-14-switching-to-integer-ids.md
blob: 4d3914a3b8cee243776b35ef6481aa05ddba9924 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
## 14. Switching to Integer IDs

There are a number of instances in a GTFS feed where IDs are used, such
as to identify routes, trips, stops and shapes. There are no specific
guidelines in GTFS as to the type of data or length an ID can be. As
such, IDs in some GTFS feeds maybe anywhere up to 30 or 40 characters
long.

Using long strings as IDs is extremely inefficient as they make the size
of a database much larger than it needs to be, as well as making
querying the data much slower.

To demonstrate, consider a GTFS feed where trip IDs are 30 characters
long. If there are 10,000 trips, each with an average of 30 stops in
`stop_times.txt`, then the IDs alone take up 9.3 MB of storage.
Realistically speaking, you need to index the `trip_id` field in order
to look up a trip's stop times quickly, which uses even more space.

The following SQL statements show how you might represent GTFS without
optimizing the identifiers. For brevity, not all fields from the GTFS
feed are included here.

```sql
CREATE TABLE trips (
  trip_id TEXT,
  route_id TEXT,
  service_id TEXT
);

CREATE INDEX trips_trip_id ON trips (trip_id);

CREATE INDEX trips_route_id ON trips (route_id);

CREATE INDEX trips_service_id ON trips (service_id);

CREATE TABLE stop_times (
  trip_id TEXT,
  stop_id TEXT,
  stop_sequence INTEGER
);

CREATE INDEX stop_times_trip_id ON stop_times (trip_id);

CREATE INDEX stop_times_stop_id on stop_times (stop_id);
```

If you were to add an integer column to `trips` called, say,
`trip_index`, then you can reference that value from `stop_times`
instead of `trip_id`. The following SQL statements show this.

```sql
CREATE TABLE trips (
  trip_id TEXT,
  trip_index INTEGER,
  route_id TEXT,
  service_id TEXT
);

CREATE INDEX trips_trip_id ON trips (trip_id);
CREATE INDEX trips_trip_index ON trips (trip_index);
CREATE INDEX trips_route_id ON trips (route_id);
CREATE INDEX trips_service_id ON trips (service_id);

CREATE TABLE stop_times (
  trip_index INTEGER,
  stop_id TEXT,
  stop_sequence INTEGER
);

CREATE INDEX stop_times_trip_index ON stop_times (trip_index);

CREATE INDEX stop_times_stop_id on stop_times (stop_id);
```

This results in a significant space saving (when you consider how large
`stop_times` can be), as well as being far quicker to look up stop
times based on a trip ID. Note that the original `trip_id` value is
retained so it can be referenced if required.

Without adding `trip_index`, you would use the following query to find
stop times given a trip ID.

```sql
SELECT * FROM stop_times
  WHERE trip_id = 'SOME_LONG_TRIP_ID'
  ORDER BY stop_sequence;
```

With the addition of `trip_index`, you need to first find the record
in `trips`. This can be achieved using the following query. This is a
small sacrifice compared to performing string comparison on all stop
times.

```sql
SELECT * FROM stop_times
  WHERE trip_index = (
    SELECT trip_index FROM trips WHERE trip_id = 'SOME_LONG_TRIP_ID'
  )
  ORDER BY stop_sequence;
```

You can make the same change for the other IDs in the feed, such as
`route_id` and `stop_id`. For these columns you still keep (and
index) the original values in `routes` and `stops` respectively,
since you may still need to look up records based on these values.

***Note:** Even though this book recommends optimizing feeds in this
manner, the remainder of examples in this book only use their original
IDs, in order to simplify the examples and to ensure compatibility with
the `GtfsToSql` tool introduced previously.*