There are a number of instances in a GTFS feed where IDs are used, such as to identify routes, trips, stops and shapes. There are no specific guidelines in GTFS as to the type of data or length an ID can be. As such, IDs in some GTFS feeds maybe anywhere up to 30 or 40 characters long.
Using long strings as IDs is extremely inefficient as they make the size of a database much larger than it needs to be, as well as making querying the data much slower.
To demonstrate, consider a GTFS feed where trip IDs are 30 characters
long. If there are 10,000 trips, each with an average of 30 stops in
stop_times.txt, then the IDs alone take up 9.3 MB of storage.
Realistically speaking, you need to index the
trip_id field in order
to look up a trip's stop times quickly, which uses even more space.
The following SQL statements show how you might represent GTFS without optimizing the identifiers. For brevity, not all fields from the GTFS feed are included here.
CREATE TABLE trips ( trip_id TEXT, route_id TEXT, service_id TEXT ); CREATE INDEX trips_trip_id ON trips (trip_id); CREATE INDEX trips_route_id ON trips (route_id); CREATE INDEX trips_service_id ON trips (service_id); CREATE TABLE stop_times ( trip_id TEXT, stop_id TEXT, stop_sequence INTEGER ); CREATE INDEX stop_times_trip_id ON stop_times (trip_id); CREATE INDEX stop_times_stop_id on stop_times (stop_id);
If you were to add an integer column to
trips called, say,
trip_index, then you can reference that value from
trip_id. The following SQL statements show this.
CREATE TABLE trips ( trip_id TEXT, trip_index INTEGER, route_id TEXT, service_id TEXT ); CREATE INDEX trips_trip_id ON trips (trip_id); CREATE INDEX trips_trip_index ON trips (trip_index); CREATE INDEX trips_route_id ON trips (route_id); CREATE INDEX trips_service_id ON trips (service_id); CREATE TABLE stop_times ( trip_index INTEGER, stop_id TEXT, stop_sequence INTEGER ); CREATE INDEX stop_times_trip_index ON stop_times (trip_index); CREATE INDEX stop_times_stop_id on stop_times (stop_id);
This results in a significant space saving (when you consider how large
stop_times can be), as well as being far quicker to look up stop
times based on a trip ID. Note that the original
trip_id value is
retained so it can be referenced if required.
trip_index, you would use the following query to find
stop times given a trip ID.
SELECT * FROM stop_times WHERE trip_id = 'SOME_LONG_TRIP_ID' ORDER BY stop_sequence;
With the addition of
trip_index, you need to first find the record
trips. This can be achieved using the following query. This is a
small sacrifice compared to performing string comparison on all stop
SELECT * FROM stop_times WHERE trip_index = ( SELECT trip_index FROM trips WHERE trip_id = 'SOME_LONG_TRIP_ID' ) ORDER BY stop_sequence;
You can make the same change for the other IDs in the feed, such as
stop_id. For these columns you still keep (and
index) the original values in
since you may still need to look up records based on these values.
Note: Even though this book recommends optimizing feeds in this
manner, the remainder of examples in this book only use their original
IDs, in order to simplify the examples and to ensure compatibility with
GtfsToSql tool introduced previously.