aboutsummaryrefslogtreecommitdiff
path: root/gtfs-book/ch-14-switching-to-integer-ids.md
diff options
context:
space:
mode:
Diffstat (limited to 'gtfs-book/ch-14-switching-to-integer-ids.md')
-rw-r--r--gtfs-book/ch-14-switching-to-integer-ids.md111
1 files changed, 111 insertions, 0 deletions
diff --git a/gtfs-book/ch-14-switching-to-integer-ids.md b/gtfs-book/ch-14-switching-to-integer-ids.md
new file mode 100644
index 0000000..4d3914a
--- /dev/null
+++ b/gtfs-book/ch-14-switching-to-integer-ids.md
@@ -0,0 +1,111 @@
+## 14. Switching to Integer IDs
+
+There are a number of instances in a GTFS feed where IDs are used, such
+as to identify routes, trips, stops and shapes. There are no specific
+guidelines in GTFS as to the type of data or length an ID can be. As
+such, IDs in some GTFS feeds maybe anywhere up to 30 or 40 characters
+long.
+
+Using long strings as IDs is extremely inefficient as they make the size
+of a database much larger than it needs to be, as well as making
+querying the data much slower.
+
+To demonstrate, consider a GTFS feed where trip IDs are 30 characters
+long. If there are 10,000 trips, each with an average of 30 stops in
+`stop_times.txt`, then the IDs alone take up 9.3 MB of storage.
+Realistically speaking, you need to index the `trip_id` field in order
+to look up a trip's stop times quickly, which uses even more space.
+
+The following SQL statements show how you might represent GTFS without
+optimizing the identifiers. For brevity, not all fields from the GTFS
+feed are included here.
+
+```sql
+CREATE TABLE trips (
+ trip_id TEXT,
+ route_id TEXT,
+ service_id TEXT
+);
+
+CREATE INDEX trips_trip_id ON trips (trip_id);
+
+CREATE INDEX trips_route_id ON trips (route_id);
+
+CREATE INDEX trips_service_id ON trips (service_id);
+
+CREATE TABLE stop_times (
+ trip_id TEXT,
+ stop_id TEXT,
+ stop_sequence INTEGER
+);
+
+CREATE INDEX stop_times_trip_id ON stop_times (trip_id);
+
+CREATE INDEX stop_times_stop_id on stop_times (stop_id);
+```
+
+If you were to add an integer column to `trips` called, say,
+`trip_index`, then you can reference that value from `stop_times`
+instead of `trip_id`. The following SQL statements show this.
+
+```sql
+CREATE TABLE trips (
+ trip_id TEXT,
+ trip_index INTEGER,
+ route_id TEXT,
+ service_id TEXT
+);
+
+CREATE INDEX trips_trip_id ON trips (trip_id);
+CREATE INDEX trips_trip_index ON trips (trip_index);
+CREATE INDEX trips_route_id ON trips (route_id);
+CREATE INDEX trips_service_id ON trips (service_id);
+
+CREATE TABLE stop_times (
+ trip_index INTEGER,
+ stop_id TEXT,
+ stop_sequence INTEGER
+);
+
+CREATE INDEX stop_times_trip_index ON stop_times (trip_index);
+
+CREATE INDEX stop_times_stop_id on stop_times (stop_id);
+```
+
+This results in a significant space saving (when you consider how large
+`stop_times` can be), as well as being far quicker to look up stop
+times based on a trip ID. Note that the original `trip_id` value is
+retained so it can be referenced if required.
+
+Without adding `trip_index`, you would use the following query to find
+stop times given a trip ID.
+
+```sql
+SELECT * FROM stop_times
+ WHERE trip_id = 'SOME_LONG_TRIP_ID'
+ ORDER BY stop_sequence;
+```
+
+With the addition of `trip_index`, you need to first find the record
+in `trips`. This can be achieved using the following query. This is a
+small sacrifice compared to performing string comparison on all stop
+times.
+
+```sql
+SELECT * FROM stop_times
+ WHERE trip_index = (
+ SELECT trip_index FROM trips WHERE trip_id = 'SOME_LONG_TRIP_ID'
+ )
+ ORDER BY stop_sequence;
+```
+
+You can make the same change for the other IDs in the feed, such as
+`route_id` and `stop_id`. For these columns you still keep (and
+index) the original values in `routes` and `stops` respectively,
+since you may still need to look up records based on these values.
+
+***Note:** Even though this book recommends optimizing feeds in this
+manner, the remainder of examples in this book only use their original
+IDs, in order to simplify the examples and to ensure compatibility with
+the `GtfsToSql` tool introduced previously.*
+