The Definitive Guide to GTFS

Consuming open public transportation data with the General Transit Feed Specification

Originally written by Quentin Zervaas.

About This Book

This book is a comprehensive guide to GTFS -- the General Transit Feed Specification. It is comprised of two main sections.

The first section describes what GTFS is and provides details about the specification itself. In addition to this it also provides various discussion points and things to consider for each of the files in the specification.

The second section covers a number of topics that relate to actually using GTFS feeds, such as how to calculate fares, how to search for trips, how to optimize feed data and more.

This book is written for developers that are using transit data for web sites, mobile applications and more. It aims to be as language-agnostic as possible, but uses SQL to demonstrate concepts of extracting data from a GTFS feed.

About The Author

Quentin Zervaas is a software developer from Adelaide, Australia.

Quentin was the founder of TransitFeeds (now <OpenMobilityData.org>), a web site that provides a comprehensive listing of public transportation data available around the world. This site is referenced various times throughout this book.

Credits

First Edition. Published in February 2014.

Technical Reviewer

Rupert Hanson

Copy Editor

Miranda Little

Disclaimer

The information in this book is distributed on an "as is" basis, without warranty. Although every precaution has been taken in the preparation of this work, the author shall not be liable to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in this book.

Note

This work is licensed under the Creative Commons Attribution 4.0 International License, and was published at https://github.com/MobilityData/GTFS-books.

1. Introduction to GTFS

GTFS (General Transit Feed Specification) is a data standard developed by Google used to describe a public transportation system. Its primary purpose was to enable public transit agencies to upload their schedules to Google Transit so that users of Google Maps could easily figure out which bus, train, ferry or otherwise to catch.

A GTFS feed is a ZIP file that contains a series of CSV files that list routes, stops and trips in a public transportation system.

This book examines GTFS in detail, including which data from a public transportation system can be represented, how to extract data, and explores some more advanced techniques for optimizing and querying data.

The official GTFS specification has been referenced a number of times in this book. It is strongly recommended you are familiar with it. You can view at the following URL:

https://developers.google.com/transit/gtfs/reference

Structure of a GTFS Feed

A GTFS feed is a series of CSV files, which means that it is trivial to include additional files in a feed. Additionally, files required as part of the specification can also include additional columns. For this reason, feeds from different agencies generally include different levels of detail.

Note: The files in a GTFS feed are CSV files, but use a file extension of .txt.

A GTFS feed can be described as follows:

A GTFS feed has one or more routes. Each route (routes.txt) has one or more trips (trips.txt). Each trip visits a series of stops (stops.txt) at specified times (stop_times.txt). Trips and stop times only contain time of day information; the calendar is used to determine on which days a given trip runs (calendar.txt and calendar_dates.txt).

The following chapters cover the main files that are included in all GTFS feeds. For each file, the main columns are covered, as well as optional columns that can be included. This book also covers some of the unofficial columns that some agencies choose to include.

2. Agencies (agency.txt)

This file is required to be included in GTFS feeds.

The agency.txt file is used to represent the agencies that provide data for this feed. While its presence is optional, if there are routes from multiple agencies included, then records in routes.txt make reference to agencies in this file.

FieldRequired?Description
agency_idOptionalAn ID that uniquely identifies a single transit agency in the feed. If a feed only contains routes for a single agency then this value is optional.
agency_nameRequiredThe full name of the transit agency.
agency_urlRequiredThe URL of the transit agency. Must be a complete URL only, beginning with http:// or https://.
agency_timezoneRequiredTime zone of agency. All times in stop_times.txt use this time zone, unless overridden by its corresponding stop. All agencies in a single feed must use the same time zone. Example: America/New_York (See http://en.wikipedia.org/wiki/List_of_tz_database_time_zones for more examples)
agency_langRequiredContains a two-letter ISO-639-1 code (such as en or EN for English) for the language used in this feed.
agency_phoneOptionalA single voice telephone number for the agency that users can dial if required.
agency_fare_urlOptionalA URL that describes fare information for the agency. Must be a complete URL only, beginning with http:// or https://.

Sample Data

The following extract is taken from the GTFS feed of TriMet (Portland, USA), located at https://openmobilitydata.org/p/trimet.

agency_nameagency_urlagency_timezoneagency_langagency_phone
TriMet[https://trimets.org](https://trimet.org/)America/Los_Angelesen(503) 238-7433

In this example, the agency_id column is included, but as there is only a single entry the value can be empty. This means the agency_id column in routes.txt also is not required.

Discussion

The data in this file is typically used to provide additional information to users of your app or web site in case schedules derived from the rest of this feed are not sufficient (or in the case of agency_fare_url, an easy way to provide a reference point to users if the fare information in the feed is not being used).

If you refer to the following screenshot, taken from Google Maps, you can see the information from agency.txt represented in the lower-left corner as an example of how it can be used.

GTFS agency

3. Stops & Stations (stops.txt)

This file is required to be included in GTFS feeds.

The individual locations where vehicles pick up or drop off passengers are represented by stops.txt. Records in this file are referenced in stop_times.txt. A record in this file can be either a stop or a station. A station has one or more child stops, as indicated using the parent_station value. Entries that are marked as stations may not appear in stop_times.txt.

FieldRequired?Description
stop_idRequiredAn ID to uniquely identify a stop or station.
stop_codeOptionalA number or short string used to identify a stop to passengers. This is typically displayed at the physical stop or on printed schedules.
stop_nameRequiredThe name of the stop as passengers know it by.
stop_descOptionalA description of the stop. If provided, this should provide additional information to the stop_name value.
stop_latRequiredThe latitude of the stop (a number in the range of -90 to 90).
stop_lonRequiredThe longitude of the stop (a number in the range of -180 to 180).
zone_idOptionalThis is an identifier used to calculate fares. A single zone ID may appear in multiple stops, but is ignored if the stop is marked as a station.
stop_urlOptionalA URL that provides information about this stop. It should be specific to this stop and not simply link to the agency's web site.
location_typeOptionalIndicates if a record is a stop or station. 0 or blank means a stop, 1 means a station.
parent_stationOptionalIf a record is marked as a stop and has a parent station, this contains the ID of the parent (the parent must have a location_type of 1).
stop_timezoneOptionalIf a stop is located in a different time zone to the one specified in agency.txt, then it can be overridden here.
wheelchair_boardingOptionalA value of 1 indicates it is possible for passengers in wheelchairs to board or alight. A value of 2 means the stop is not wheelchair accessible, while 0 or an empty value means no information is available. If the stop has a parent station, then 0 or an empty value means to inherit from its parent.

Sample Data

The following extract is taken from the TriMet GTFS feed (https://openmobilitydata.org/p/trimet).

stop_idstop_codestop_namestop_latstop_lonstop_url
22A Ave & Chandler45.420595-122.675676 <http://trimet.org/arrivals/tracker?locationID=2>
33A Ave & Second St45.419386-122.665341 <http://trimet.org/arrivals/tracker?locationID=3>
44A Ave & 10th St45.420703-122.675152 <http://trimet.org/arrivals/tracker?locationID=4>
66A Ave & 8th St45.420217-122.67307 <http://trimet.org/arrivals/tracker?locationID=6>

The following diagram shows how these points look if you plot them onto a map.

Stops

In this extract, TriMet use the same value for stop IDs and stop codes. This is useful, because it means the stop IDs are stable (that is, they do not change between feed versions). This means that if you want to save a particular stop (for instance, if a user wants to save a "favorite stop") you can trust that saving the ID will get the job done.

Note: This is not always the case though, which means you may have to save additional information if you want to save a stop. For instance, you may need to save the coordinates or the list of routes a stop serves so you can find it again if the stop ID has changed in a future version of the feed.

Stops & Stations

Specifying an entry in this file as a station is typically used when there are many stops located within a single physical entity, such as a train station or bus depot. While many feeds do not offer this information, some large train stations may have up to 20 or 30 platforms.

Knowing the platform for a specific trip is extremely useful, but if a passenger wants to select a starting point for their trip, showing them a list of platforms may be confusing.

Passenger:

"I want to travel from Central Station to Airport Station."

Web site / App:

"Board at Central Station platform 5, disembark at Airport Station platform 1."

In this example, the passenger selects the parent station, but they are presented with the specific stop so they know exactly where within the station they need to embark or disembark.

Wheelchair Accessibility

If you are showing wheelchair accessibility information, it is important to differentiate between "no access" and "no information", as knowing a stop is not accessible is as important as knowing it is.

If a stop is marked as being wheelchair accessible, you must check that trips that visit the stop are also accessible (using the wheelchair_accessible field in trips.txt). If the value in trips.txt is blank, 0 or 1 then it is safe to assume the trip can be accessed. If the stop is accessible and the trip is not, then passengers in wheelchairs cannot use the trip.

Stop Features

One of the proposed changes to GTFS is the addition of a file called stop_features.txt. This is used to define characteristics about stops. The great thing about this file is that it allows you to indicate to users when a stop has a ticket machine, bike storage, lighting, or an electronic display with real-time information.

TriMet is one of the few agencies including this file. The following is a sample of this file.

stop_idfeature_type
614110
612310
615200

This data indicate that stop 61 (NE Alberta & 24th) has a Printed Schedule Display (4110), a Bike Rack (2310) and a Street Light (5200).

For more information about this proposal and a list of values and their meanings, refer to https://sites.google.com/site/gtfschanges/proposals/stop-amenity.

4. Routes (routes.txt)

This file is required to be included in GTFS feeds.

A route is a group of trips that are displayed to riders as a single service.

FieldRequired?Description
route_idRequiredAn ID that uniquely identifies the route.
agency_idOptionalThe ID of the agency a route belongs to, as it appears in agency.txt. Only required if there are multiple agencies in the feed.
route_short_nameRequiredA nickname or code to represent this service. If this is left empty then the route_long_name must be included.
route_long_nameRequiredThe route full name. If this is left empty then the route_short_name must be included.
route_descOptionalA description of the route, such as where and when the route operates.
route_typeRequiredThe type of transportation used on a route (such as bus, train or ferry). See below for more information.
route_urlOptionalA URL of a web page that describes this particular route.
route_colorOptionalIf applicable, a route can have a color assigned to it. This is useful for systems that use colors to identify routes. This value is a six-character hexadecimal number (for example, FF0000 is red).
route_text_colorOptionalFor routes that specify the route_color, a corresponding text color should also be specified.

Sample Data

The following extract is taken from the TriMet GTFS feed (https://openmobilitydata.org/p/trimet).

route_idroute_short_nameroute_long_nameroute_type
11Vermont3
44Division / Fessenden3
66Martin Luther King Jr Blvd3

This sample shows three different bus routes for the greater Portland area. The route_type value of 3 indicates they are buses. See the next section for more information about route types in GTFS.

There is no agency ID value in this feed, as TriMet is the only agency represented in the feed.

The other thing to note about this data is that TriMet use the same value for both route_id and route_short_name. This is very useful, because it means if you have a user that wants to save information about a particular route you can trust the route_id value. Unfortunately, this is not the case in all GTFS feeds. Sometimes, the route_id value may change with every version of a feed (or at least, semi-frequently). Additionally, some feeds may also have multiple routes with the same route_short_name. This can present challenges when trying to save user data.

Route Types

To indicate a route's mode of transport, the route_type column is used.

ValueDescription
0Tram / Light Rail
1Subway / Metro
2Rail
3Bus
4Ferry
5Cable Car
6Gondola
7Funicular

Agencies may interpret the meaning of these route types differently. For instance, some agencies specify their subway service as rail (value of 2 instead of 1), while some specify their trains as light rail (0 instead of 2).

These differences between agencies occur mainly because of the vague descriptions for each of these route types. If you use Google Transit to find directions, you may notice route types referenced that are different to those listed above. This is because Google Transit also supports additional route types. You can read more about these additional route types at https://support.google.com/transitpartners/answer/3520902?hl=en.

Very few GTFS feeds made available to third-party developers actually make use of these values, but it is useful to know in case you come across one that does. For instance, Sydney Buses include their school buses with a route type of 712, while other buses in the feed have route type 700.

5. Trips (trips.txt)

This file is required to be included in GTFS feeds.

The trips.txt file contains trips for each route. The specific stop times are specified in stop_times.txt, and the days each trip runs on are specified in calendar.txt and calendar_dates.txt.

FieldRequired?Description
route_idRequiredThe ID of the route a trip belongs to as it appears in routes.txt.
service_idRequiredThe ID of the service as it appears in calendar.txt or calendar_dates.txt, which identifies the dates on which a trip runs.
trip_idRequiredA unique identifier for a trip in this file. This value is referenced in stop_times.txt when specifying individual stop times.
trip_headsignOptionalThe text that appears to passengers as the destination or description of the trip. Mid-trip changes to the headsign can be specified in stop_times.txt.
trip_short_nameOptionalA short name or code to identify the particular trip, different to the route's short name. This may identify a particular train number or a route variation.
direction_idOptionalIndicates the direction of travel for a trip, such as to differentiate between an inbound and an outbound trip.
block_idOptionalA block is a series of trips conducted by the same vehicle. This ID is used to group 2 or more trips together.
shape_idOptionalThis value references a value from shapes.txt to define a shape for a trip.
wheelchair_accessibleOptional0 or blank indicates unknown, while 1 indicates the vehicle can accommodate at least one wheelchair passenger. A value of 2 indicates no wheelchairs can be accommodated.

Sample Data

Consider the following extract, taken from the trips.txt file of the TriMet GTFS feed (https://openmobilitydata.org/p/trimet).

route_idservice_idtrip_iddirection_idblock_idshape_id
1W.37842822570103185327
1W.37842822560101185327
1W.37842822550102185327
1W.37842822540103185327

This data describes four individual trips for the "Vermont" bus route (this was determined by looking up the route_id value in routes.txt). While the values for trip_id, block_id and shape_id are all integers in this particular instance, this is not a requirement. Just like service_id, there may be non-numeric characters.

As each of these trips is for the same route and runs in the same direction (based on direction_id) they can all be represented by the same shape. Note however that this is not always be the case as some agencies may start or finish trips for a single route at different locations depending on the time of the day. If this were the case, then the trip's shape would differ slightly (and therefore have a different shape to represent it in shapes.txt).

Although this example does not include trip_headsign, many feeds do include this value. This is useful for indicating to a passenger where the trip is headed. When the trip headsign is not provided in the feed, you can determine the destination by using the final stop in a trip.

Tip: If you are determining the destination based on the final stop, you can either present the stop name to the user, or you can reverse-geocode the stop's coordinates to determine the its locality.

Blocks

In the preceding example, each trip has a value for block_id. The first and the last trips here both have a block_id value of 103. This indicates that the same physical vehicle completes both of these trips. As each of these trips go in the same direction, it is likely that they start at the same location.

This means there is probably another trip in the feed for the same block that exists between the trips listed here. It would likely travel from the finishing point of the first trip (4282257) to the starting point of the other trip (4282254). If you dig a little deeper in the feed you will find the trip shown in the following table.

route_idservice_idtrip_iddirection_idblock_idshape_id
1W.37842822701103185330

This is a trip traveling in the opposite direction for the same block. It has a different shape ID because it is traveling in the opposite direction; a shape's points must advance in the same direction a trip's stop times do.

Note: You should perform some validation when grouping trips together using block IDs. For instance, if trips share a block_id value then they should also have the same service_id value. You should also check that the times do not overlap; otherwise the same vehicle would be unable to service both trips.

If you dig even further in this feed, there are actually seven different trips all using block 103 for the W.378 service period. This roughly represents a full day's work for a single vehicle.

For more discussion on blocks and how to utilize them effectively, refer to Working With Trip Blocks.

Wheelchair Accessibility

Similar to stops.txt, you can specify the wheelchair accessibility of a specific trip using the wheelchair_accessible field. While many feeds do not provide this information (often because vehicles in a fleet can be changed at the last minute, so agencies do not want to guarantee this information), your wheelchair-bound users will love you if you can provide this information.

As mentioned in the section on stops.txt, it is equally important to tell a user that a specific vehicle cannot accommodate wheelchairs as to when it can. Additionally, if the stops in a feed also have wheelchair information, then both the stop and trip must be wheelchair accessible for a passenger to be able to access a trip at the given stop.

Trip Direction

One of the optional fields in trips.txt is direction_id, which is used to indicate the general direction a vehicle is traveling. At present the only possible values are 0 to represent "inbound" and 1 to represent "outbound". There are no specific guidelines as to what each value means, but the intention is that an inbound trip on a particular route should be traveling in the opposite direction to an outbound trip.

Many GTFS feeds do not provide this information. In fact, there are a handful of feeds that include two entries in routes.txt for each route (one for each direction).

One of the drawbacks of direction_id is that there are many routes for which "inbound" or "outbound" do not actually mean anything. Many cities have loop services that start and finish each trip at the same location. Some cities have one or more loops that travel in both directions (generally indicated by "clockwise loop" and "counter-clockwise loop", or words to that effect). In these instances, the direction_id can be used to determine which direction the route is traveling.

Trip Short Name

The trip_short_name field that appears in trips.txt is used to provide a vehicle-specific code to a particular trip on a route. Based on GTFS feeds that are currently in publication, it appears there are two primary use-cases for this field:

  • Specifying a particular train number for all trains on a route
  • Specifying a route "sub-code" for a route variant.

Specifying a Train Number

For certain commuter rail systems, such as SEPTA in Philadelphia or MBTA in Boston, each train has a specific number associated with it. This number is particularly meaningful to passengers as trains on the same route may have different stopping patterns or even different features (for instance, only a certain train may have air-conditioning or wheelchair access).

Consider the following extract from SEPTA's rail feed (https://openmobilitydata.org/p/septa).

route_idservice_idtrip_idtrip_headsignblock_idtrip_short_name
AIRS5AIR_1404_V25Center City Philadelphia14041404
AIRS1AIR_402_V5Center City Philadelphia402402
AIRS1AIR_404_V5Center City Philadelphia404404
AIRS5AIR_406_V25Center City Philadelphia406406

In this data, there are four different trains all heading to the same destination. The trip_short_name is a value that can safely be presented to users as it has meaning to them. In this case, you could present the first trip to passengers as:

"Train 1404 on the Airport line heading to Center City Philadelphia."

In this particular feed, SEPTA use the same value for trip_short_name and for block_id, because the train number belongs to a specific train. This means after it completes the trip to Center City Philadelphia it continues on. In this particular feed, the following trip also exists:

"Train 1404 on the Warminster line heading to Glenside."

You can therefore think of the trip_short_name value as a "user-facing" version of block_id.

Specifying a Route Sub-Code

The other use-case for trip_short_name is for specifying a route sub-code. For instance, consider an agency that has a route with short name 100 that travels from stop S1 to stop S2. At night the agency only has a limited number of routes running, so they extend this route to also visit stop S3 (so it travels from S1 to S2 then to S3). As it is a minor variation of the main path, the agency calls this trip 100A.

The agency could either create a completely separate entry in routes.txt (so they would have 100 and 100A), or they can override the handful of trips in the evening by setting the trip_short_name to 100A. The following table shows how this example might be represented.

route_idtrip_idservice_idtrip_short_name
100T1C1
100T2C1
100T3C1
100T4C1100A

In this example the trip_short_name does not need to be set for the first three trips as they use the route_short_name value from routes.txt.

Specifying Bicycle Permissions

A common field that appears in many GTFS fields is trip_bikes_allowed, which is used to indicate whether or not passengers are allowed to take bicycles on board. This is useful for automated trip planning when bicycle options can be included in the results.

The way this field works is similar to the wheelchair information; 0 or empty means no information provided; 1 means no bikes allowed; while 2 means at least one bike can be accommodated.

Note: Unfortunately, this value is backwards when you compare it to wheelchair accessibility fields. For more discussion on this matter, refer to the topic on the Google Group for GTFS Changes (https://groups.google.com/d/topic/gtfs-changes/rEiSeKNc4cs/discussion).

6. Stop Times (stop_times.txt)

This file is required to be included in GTFS feeds.

The stop_times.txt file specifies individual stop arrivals and departures for each trip. This file is typically the largest in a GTFS feed as it contains many records that correspond to each entry in trips.txt.

FieldRequired?Description
trip_idRequiredReferences a trip from trips.txt. This ID is referenced for every stop in a trip.
arrival_timeRequiredThe arrival time in HH:MM:SS format. Can be left blank, except for at least the first and last stop time in a trip. This value is typically be the same as departure_time.
departure_timeRequiredThe departure time in HH:MM:SS format. Can be left blank, except for at least the first and last stop time in a trip. This value is typically be the same as arrival_time.
stop_idRequiredReferences a single stop from stops.txt.
stop_sequenceRequiredA unique number for a given trip to indicate the stopping order. Typically these values appear in order and increment by 1 for each stop time, but this is not always the case.
stop_headsignOptionalThis is text that appears to passengers at this stop to identify the trip's destination. It should only be used to override the trip_headsign value from trips.txt.
pickup_typeOptionalIndicates if passengers can be picked up at this stop. Sometimes a stop is drop-off only.
drop_off_typeOptionalIndicates if passengers can be dropped off at this stop. Sometimes a stop is pick-up only.
shape_dist_traveledOptionalIf a trip has an associated shape, this value indicates how far along that shape the vehicle has traveled when at this stop. Values in this file and shapes.txt must use the same unit.

Sample Data

Consider the following extract, taken from stop_times.txt in the TriMet GTFS feed (https://openmobilitydata.org/p/trimet). This represents the first ten stops of a trip for bus route 1 ("Vermont") in Portland, as covered in Sample Data for trips.txt.

trip_idarrival_timedeparture_timestop_idstop_sequenceshape_dist_traveled
428224706:47:0006:47:001317010.0
428224706:48:1806:48:1876312867.5
428224706:50:1306:50:13762532154.9
428224706:52:0706:52:07761243425.5
428224706:53:4206:53:42761654491.1
428224706:55:1606:55:161049165536.2
428224706:57:0606:57:06758876767.1
428224706:58:0006:58:00759187364.4
428224706:58:3206:58:3217598618.7
428224706:58:5006:58:50198109283.8

If you were to plot the full trip on a map (including using its shape file, as specified in shapes.txt and referenced in trips.txt), it would look like the following diagram. The first stop is selected.

Stop Times

Arrival Time vs. Departure Time

The first thing to notice about this data is that the values in arrival_time and departure_time are the same. In reality, most of the time these values are the same. The situation where these values differ is typically when a vehicle is required to wait for a period of time before departing. For instance:

"The train departs the domestic airport at 6:30am, arrives at the international terminal at 6:35am. It waits for 10 minutes for passengers who have just landed to board, then departs for the city at 6:45am."

While this is not typically something you need to worry about, be aware that some feeds differ and the holdover time could be large. A person rushing to a train wants to know the time it departs, while a husband waiting at a stop to meet his wife wants to know what time it arrives.

Note: In a situation where the difference between the arrival time and departure is small, you may be better off always displaying the earlier time to the user. The driver may view a one or two minute holdover as an opportunity to keep on time, whereas a ten or fifteen minute holdover is unlikely to be ignored as doing so would significantly alter the schedule.

Scheduling Past Midnight

One of the most important concepts to understand about stop_times.txt is that times later than midnight can be specified. For example, if a trip starts at 11:45 PM and takes an hour to complete, its finishing time is 12:45 AM the next day.

The departure_time value for the first stop is be 23:45:00, while the arrival_time value for the final stop is 24:45:00. If you were to specify the final arrival time as 00:45:00, it would be referencing 12:45 AM prior to the trip's starting time.

While you could use the stop_sequence to determine which day the trip fell on, it would be impossible to do a quick search purely based on the given final stop.

A trip may start and finish after midnight without being considered as part of the next day's service. Many transit systems shut down overnight, but may have a few services that run after midnight until about 1 AM or 2 AM. It is logical to group them all together with the same service as earlier trips, which this functionality allows you to do.

However, this has implications when searching for trips. For instance, if you want to find all trips between 12:00 AM and 1:00 AM on 30 January 2014, then you need to search:

  • Between 00:00:00 and 01:00:00 for trips with service on 20140130
  • Between 24:00:00 and 25:00:00 for trips with service on 20140129

In Searching for Trips you can see how to apply this to your trip searches.

Time Points

Most GTFS feeds provide arrival/departure times for every single stop. In reality, most agencies do not have known times (or at least, they do not publish times) for many of their stops.

Typically, for routes that describe trains, subways or ferries that make relatively few stops in a trip, all stops have a specified time. However, often for bus routes that may make many stops in a single trip, generally only the main stops have times shown on printed schedules.

Typically, the bus drivers are required to meet these time points; if they are ahead of schedule they might wait at a time point until the scheduled time; if they are running late they might try to catch up in order to adhere to the specified time point.

This means that the intermediate points are likely estimates that have been interpolated based on the amount of time between time points. If there are multiple stops in-between the time points then the distance between stops may also be used to calculate the estimate.

In actual fact, GTFS feeds do not have to specify times for all stops. The data in the following table is perfectly valid for a trip.

trip_idarrival_timestop_idstop_sequenceshape_dist_traveled
T110:00:00S110
T1S221500
T1S333000
T110:12:00S446000

Based on this data, without taking into account the distance traveled, you may estimate that the second stop arrives at 10:04 AM while the third stop arrives at 10:08.

If you consider the distance traveled, you might conclude the second stop arrives at 10:03 AM while the third stop arrives at 10:06 AM.

Some agencies include an additional column in stop_times.txt called timepoint. This is used when they specify the times for all stops but also want to indicate if only certain stops are guaranteed times.

The following table shows how this would look using the previous data as its basis.

trip_idarrival_timestop_idstop_sequenceshape_dist_traveledtimepoint
T110:00:00S1101
T110:03:00S2215000
T110:06:00S3330000
T110:12:00S4460001

This can be especially useful if you want to highlight these time points so as to represent the printed schedules accurately, or even if you are a transit agency just using the data for internal reporting.

7. Trip Schedules (calendar.txt & calendar_dates.txt)

Each of these files are optional in a GTFS feed, but at least one of them is required.

The calendar.txt file is used to indicate the range of dates on which trips are running. It works by including a start date and a finish date (typically a range of 3-6 months), then a marker for each day of the week on which it operates. If there are single-day scheduling changes that occur during this period, then the calendar_dates.txt file can be used to override the schedule for each of these days.

The following table shows the specification for calendar.txt.

FieldRequired?Description
service_idRequiredA unique ID for a single service. This value is referenced by trips in trips.txt.
start_dateRequiredThis indicates the start date for a given service, in YYYYMMDD format.
end_dateRequiredThis indicates the end date for a given service, in YYYYMMDD format.
mondayRequiredContains 1 if trips run on Mondays between the start and end dates, 0 or empty if not.
tuesdayRequiredContains 1 if trips run on Tuesdays between the start and end dates, 0 or empty if not.
wednesdayRequiredContains 1 if trips run on Wednesdays between the start and end dates, 0 or empty if not.
thursdayRequiredContains 1 if trips run on Thursdays between the start and end dates, 0 or empty if not.
fridayRequiredContains 1 if trips run on Fridays between the start and end dates, 0 or empty if not.
saturdayRequiredContains 1 if trips run on Saturdays between the start and end dates, 0 or empty if not.
sundayRequiredContains 1 if trips run on Sundays between the start and end dates, 0 or empty if not.

As mentioned above, the calendar_dates.txt file is used to define exceptions to entries in calendar.txt. For instance, if a 3-month service is specified in calendar.txt and a holiday lies on a Monday during this period, then you can use calendar_dates.txt to override this single date.

If the weekend schedule were used for a holiday, then you would add a record to remove the regular schedule for the holiday date, and another record to add the weekend schedule for the holiday date.

Some feeds choose only to include calendar_dates.txt and not calendar.txt, in which case there is an "add service" record for every service and every date in this file.

The following table shows the specification for calendar_dates.txt.

FieldRequired?Description
service_idRequiredThe service ID that an exception is being defined for. This is referenced in both calendar.txt and in trips.txt. Unlike calendar.txt, it is possible for a service_id value to appear multiple times in this file.
dateRequiredThe date for which the exception is occurring, in YYYYMMDD format.
exception_typeRequiredThis indicates whether the exception is denoting an added service (1) or a removed service (2).

Sample Data

The following is an extract from the calendar.txt file in Adelaide Metro's GTFS feed (https://openmobilitydata.org/p/adelaide-metro). This extract includes schedules from the start of 2014 until the end of March 2014.

service_idmondaytuesdaywednesdaythursdayfridaysaturdaysundaystart_dateend_date
111111002014010220140331
1100000102014010220140331
1200000012014010220140331

Any trip in the corresponding trips.txt file with a service_id value of 1 runs from Monday to Friday. Trips with a service_id of 11 run only on Saturday, while those with 12 run only on Sunday.

Now consider an extract from calendar_dates.txt from the same feed, as shown in the following table.

service_iddateexception_type
1201401272
1201403102
12201401271
12201403101

The first two rows mean that on January 27 and March 10 trips with service_id of 1 are not running. The final two rows mean that on those same dates trips with service_id of 12 are running. This has the following meaning:

"On 27 January and 10 March, use the Sunday timetable instead of the Monday-Friday timetable."

In Adelaide, these two dates are holidays (Australia Day and Labour Day). It is Adelaide Metro's policy to run their Sunday timetable on public holidays, which is reflected by the above records in their calendar_dates.txt file.

Structuring Services

The case described above is the ideal case for specifying services in a GTFS feed (dates primarily specified in calendar.txt with a handful of exceptions in calendar_dates.txt).

Be aware that there are two other major ways that services are specified in feeds.

  1. Using only calendar_dates.txt and expressly including every single date within the service range. Each of these is included as "service added" (an exception_type value of 1). The following table shows how this might look.
service_iddateexception_type
1201401021
1201401031
11201401041
12201401051
  1. Not using calendar_dates.txt, but creating many records in calendar.txt instead to span various dates. The following table shows how you can represent Monday-Friday from the sample data in this fashion.
service_idmondaytuesdaywednesdaythursdayfridaysaturdaysundaystart_dateend_date
1a11111002014010220140126
holiday110000002014012720140127
1b11111002014012820140309
holiday210000002014031020140310
1c11111002014031120140331

In this example, each holiday has its own row in calendar.txt that runs for a single day only.

Refer to Finding Service IDs to see how to determine services that are running for a given day.

Service Name

There are a number of feeds that specify a column in calendar.txt called service_name. This is used to give a descriptive name to each service. For example, the Sedona Roadrunner in Arizona (https://openmobilitydata.org/p/sedona-roadrunner) has services called "Weekday Service", "Weekend Service" and "New Year's Eve Service".

8. Fare Definitions (fare_attributes.txt & fare_rules.txt)

These files are optional in a GTFS feed, but any rules specified must reference a fare attributes record.

These two files define the types of fares that exist in a system, including their price and transfer information. The attributes of a particular fare exist in fare_attributes.txt, which has the following columns.

FieldRequired?Description
fare_idRequiredA value that uniquely identifies a fare listed in this file. It must only appear once in this file.
priceRequiredThis field specifies the cost of the fare. For instance, if a trip costs $2 USD, then this value should be either 2.00 or 2, and the currency_type should be USD.
currency_typeRequiredThis is the currency code that price is specified in, such as USD for United States Dollar.
payment_methodRequiredThis indicates when the fare is to be paid. 0 means it can be paid on board, while 1 means it must be paid before boarding.
transfersRequiredThe number of transfers that may occur. This must either be empty (unlimited) transfers, or the number of transfers.
transfer_durationOptionalThis is the number of seconds a transfer is valid for.

The following table shows the specification for fare_rules.txt, which defines the rules used to apply a fare to a particular trip.

FieldRequired?Description
fare_idRequiredThis is the ID of the fare that a rule applies to as it appears in fare_attributes.txt.
route_idOptionalThe ID of a route as it appears in routes.txt for which this rule applies. If there are several routes with the same fare attributes, there may be a row in fare_rules.txt for each route.
origin_idOptionalThis value corresponds to a zone_id value from stops.txt. If specified, this means a trip must begin in the given zone in order to qualify for this fare.
destination_idOptionalThis value corresponds to a zone_id value from stops.txt. If specified, this means a trip must begin in the given zone in order to qualify for this fare.
contains_idOptionalThis value corresponds to a zone_id value from stops.txt. If specified, this means a trip must pass through every contains_id zone for the given fare (in other words, several rules may need to be checked).

Note that aside from fare_id, all fields are optional in this file. This means some very complex rules can be made (especially when transfers come into the calculation). The following URL has discussion about different rules and some complex fare examples:

https://code.google.com/p/googletransitdatafeed/wiki/FareExamples

Refer to Calculating Fares for discussion about the algorithm for calculating fares for trips both with and without transfers.

Sample Data

The following data is taken from the TriMet GTFS feed (https://openmobilitydata.org/p/trimet). Firstly, the data from the fare_attributes.txt file.

fare_idpricecurrencypayment_methodtransferstransfer_duration
B2.5USD07200
R2.5USD17200
BR2.5USD07200
RB2.5USD17200
SC1USD10
AT4USD10
VT0USD10

The data from the fare_rules.txt file is shown in the following table.

fare_idroute_idorigin_iddestination_idcontains_id
BBB
RRR
BRBB
BRBR
RBRB
RBRR
SC193
SC194
AT208
VT250

In this sample data, TriMet have named some of their fares the same as the zones specified in stops.txt. In this particular feed, bus stops have a zone_id of B, while rail stops have R.

The fares in this file are as follows:

  • Fare B. If you start at a bus stop (zone_id value of B), you can buy your ticket on board (payment_method of 0). You may transfer an unlimited number (empty transfers value) for 2 hours (transfer_duration of 7200). The cost is $2.50 USD.
  • Fare R. If you start at a rail stop, you must pre-purchase your ticket. You may transfer an unlimited number of times to other rail services for up to 2 hours. The cost is $2.50 USD.

The BR fare describes a trip that begins on a bus then transfers to a rail service (while RB is the opposite). This fare is not be matched if the passenger does not travel on both, as all contains_id values must be matched in order to apply a fare.

The other fares (SC, AT and VT) all apply to their respective route_id values, regardless of start and finish stops. Tickets must be pre-purchased, and transfers are not allowed. The VT fare (which corresponds to TriMet's Vintage Trolley) is free to ride since it has a price of 0.

Assigning Fares to Agencies

One of the extensions available to fare_attributes.txt is to include an agency_id column. This is to limit a specific fare to only routes from the specified agency, in the case where a feed has multiple agencies.

This is useful because there may be two agencies in a feed that define fares with no specific rules (in other words, the fare applies to all trips). If the price differs, then GTFS dictates that the cheapest fare is always applied. Using agency_id means these fares can be differentiated accordingly.

For more information about this extension, refer to the Google Transit GTFS Extensions page at https://support.google.com/transitpartners/answer/2450962.

9. Trip Shapes (shapes.txt)

This file is optional in a GTFS feed.

Each trip in trips.txt can have a shape associated with it. The shapes.txt file defines the points that make up an individual shape in order to plot a trip on a map. Two or more records in shapes.txt with the same shape_id value define a shape.

The amount of data stored in this file can be quite large. In Optimizing Shapes there are some strategies to efficiently reduce the amount of shape data.

FieldRequired?Description
shape_idRequiredAn ID to uniquely identify a shape. Every point for a shape contains the same value.
shape_pt_latRequiredThe latitude for a given point in the range of -90 to 90.
shape_pt_lonRequiredThe longitude for a given point in the range of -180 to 180.
shape_pt_sequenceRequiredA non-negative number that defines the ordering for points in a shape. A value must not be repeated within a single shape.
shape_dist_traveledOptionalThis value represents how far along a shape a particular point exists. This is a distance in a unit such as feet or kilometers. This unit must be the same as that used in stop_times.txt.

Sample Data

The following table shows a portion of a shape from the TriMet GTFS feed. It is a portion of the shape that corresponds to the sample data in the stop_times.txt section.

shape_idshape_pt_latshape_pt_lonshape_pt_sequenceshape_dist_traveled
18532845.52291-122.67737210.0
18532845.522921-122.6773723.7
18532845.522991-122.677432334.0
18532845.522992-122.677246481.5
18532845.523002-122.6765675255.7
18532845.523004-122.6764866276.4
18532845.523007-122.6763867302.0
18532845.523024-122.6753868558.4
18532845.522962-122.675389581.0

In this sample data, the shape_dist_traveled is listed in feet. There is no way to specify in a GTFS feed which units are used for this column -- it could be feet, miles, meters, kilometers. In actual fact, it does not really matter, just as long as the units are the same as in stop_times.txt.

If you need to present a distance to your users (such as how far you need to travel on a bus), you can calculate it instead by adding up the distance between each point and formatting it based on the user's locale settings.

Point Sequences

In most GTFS feeds the shape_pt_sequence value starts at 1 and increments by 1 for every subsequent point. Additionally, points are typically listed in order of their sequence.

You should not rely on these two statements though, as this is not a requirement of GTFS. Many transit agencies have automated systems that export their GTFS from a separate system, which can sometimes result in an unpredictable output format.

For instance, a trip that has stop times listed with the sequences 1, 2, 9, 18, 7, 3 is perfectly valid.

Distance Travelled

The shape_dist_traveled column is used so you can programmatically determine how much of a shape to draw when showing a map to users of your web site or app. If you use techniques in Optimizing Shapes to reduce the file size of shape data, then it becomes difficult to use this value.

Alternatively, you can calculate portions of shapes by determining which point in a shape travels closest to the start and finish points of a trip.

10. Repeating Trips (frequencies.txt)

This file is optional in a GTFS feed.

In some cases a route may repeat a particular stopping pattern every few minutes (picture a subway line that runs every 5 minutes). Rather than including entries in trips.txt and stop_times.txt for every single occurrence, you can include the trip once then define rules for it to repeat for a period of time.

Having a trip repeat only works in the case where the timing between stops remains consistent for all stops. Using frequencies.txt, you use the relative times between stops alongside a calculated starting time for the trip in order to determine the specific stop times.

FieldRequired?Description
trip_idRequiredThe ID of the trip as it appears in trips.txt that is being repeated. A single trip can appear multiple times for different time ranges.
start_timeRequiredThe time at which a given trip starts repeating, in HH:MM:SS format.
end_timeRequiredThe time at which a given trip stops repeating, in HH:MM:SS format.
headway_secsRequiredThe time in seconds between departures from a given stop during the time range.
exact_timesOptionalWhether or not repeating trips should be exactly scheduled. See below for discussion.

Sample Data

The following sample data is taken from Société de transport de Montréal (STM) in Montreal (https://openmobilitydata.org/p/societe-de-transport-de-montreal).

trip_idstart_timeend_timeheadway_secs
13S_13S_F1_1_2_0.2652805:30:0007:25:30630
13S_13S_F1_1_6_0.3416707:25:3008:40:10560
13S_13S_F1_1_10_0.4250008:40:1012:19:00505
13S_13S_F1_1_7_0.5875012:19:0015:00:00460
13S_13S_F1_1_11_0.6687515:00:0018:23:00420
13S_13S_F1_1_5_0.7888918:23:0021:36:35505

Each of the trips listed here have corresponding entries in trips.txt and stop_times.txt (more on that shortly). This data can be interpreted as follows.

  • The first trip runs every 10m 30s from 5:30am until 7:25am.
  • The second trip runs every 9m 20s from 7:25am until 8:40am, and so on.

The following table shows some of the stop times for the first trip (departure_time is omitted here for brevity, since it is identical to arrival_time).

trip_idstop_idarrival_timestop_sequence
13S_13S_F1_1_2_0.265281806:22:001
13S_13S_F1_1_2_0.265281906:22:592
13S_13S_F1_1_2_0.265282006:24:003
13S_13S_F1_1_2_0.265282106:26:004

As this trip runs to the specified frequency, the specific times do not matter. Instead, the differences are used. For the above stop times, there is a 59 second gap between the first and second time, a 61 second gap between the second and third, and a 120 second gap between the third and fourth.

The stop times for the first frequency record (10.5 minutes apart) can be calculated as follows.

  • 05:30:00, 05:30:59, 05:32:00, 05:34:00
  • 05:40:30, 05:41:29, 05:42:30, 05:44:30
  • 05:51:00, 05:51:59, 05:53:00, 05:55:00
  • ...
  • 07:25:30, 07:26:29, 07:27:30, 07:29:30

Specifying Exact Times

In the file definition at the beginning of this chapter there is an optional field called exact_times. It may not be immediately clear what this field means, so to explain it better, consider the frequency definitions in the following table.

trip_idstart_timeend_timeheadway_secsexact_times
T109:00:0010:00:003000
T209:00:0010:00:003001

These two frequencies are the same, with only the exact_times value different. The first (T1) should be presented in a manner such as:

"Between 9 AM and 10 AM this trip departs every 5 minutes."

The second trip (T2) should be presented as follows:

"This trip departs at 9 AM, 9:05 AM, 9:10 AM, ..."

While ultimately the meaning is the same, this difference is used in order to allow agencies to represent their schedules more accurately. Often, schedules that convey to passengers that they will not have to wait more than five minutes do so without having to explicitly list every departure time.

11. Stop Transfers (transfers.txt)

This file is optional in a GTFS feed.

To define how passengers can transfer between routes at specific stops feed providers can include transfers.txt. This does not mean passengers cannot transfer elsewhere, but it does indicate if a transfer is not possible between certain stops, or a minimum time required if transfer is possible.

FieldRequired?Description
from_stop_idRequiredThe ID of stop as it appears in stops.txt where the connection begins. If this references a station, then this rule applies to all stops within the station.
to_stop_idRequiredThe ID of the stop as it appears in stops.txt where the connection between trips ends. If this references a station, then this rule applies to all stops within the station.
transfer_typeRequired0 or blank means the recommended transfer point, 1 means the secondary vehicle will wait for the first, 2 means a minimum amount of time is required, 3 means transfer is not possible.
min_transfer_timeOptionalIf the transfer_type value is 2 then this value must be specified. It indicates the number of seconds required to transfer between the given stops.

It is also possible that records in this file are specified for ticketing reasons. For instance, some train stations are set up so that passengers can transfer between routes without needing to validate their ticket again or buy a transfer. Other stations that are shared between those same routes might not have this open transfer area, thereby requiring you to exit one route fully before buying another ticket to access the second.

Sample Data

The following table shows some sample transfer rules from TriMet in Portland's GTFS feed (https://openmobilitydata.org/p/trimet).

from_stop_idto_stop_idtransfer_typemin_transfer_time
780750200
780776340
780776400

These rules indicate that if you are transferring from a route that visits stop 7807 to any route that visits the other stops (5020, 7634 or 7640), then this is the ideal place to do it.

In other words, if there are other locations along the first route where you could transfer to the second route, then those stops should not be used. These rules say this is the best place to transfer.

Consider the transfer rule in the following table, taken from the New York City Subway GTFS feed (https://openmobilitydata.org/p/mta/79).

from_stop_idto_stop_idtransfer_typemin_transfer_time
1211212180

In this data, the MTA specifies how long it takes to transfer to different platforms within the same station. The stop with ID 121 refers to the 86th St station (as specified in stops.txt). It has a location_type of 1 and two stops within it (121N and 121S). The above transfer rule says that if you need to transfer from 121N to 121S (or vice-versa) then a minimum time of 3 minutes (180 seconds) must be allocated.

If you were to calculate the time taken to transfer using the coordinates of each of these platforms, it would only take a few seconds as they are physically close to each other. In reality though, you must exit one platform then walk around and enter the other platform (often having to use stairs).

12. Feed Information (feed_info.txt)

This file is optional in a GTFS feed.

Feed providers can include additional information about a feed using feed_info.txt. It should only ever have a single row (other than the CSV header row).

FieldRequired?Description
feed_publisher_nameRequiredThe name of the organization that publishes the feed. This may or may not be the same as any agency in agency.txt.
feed_publisher_urlRequiredThe URL of the feed publisher's web site.
feed_langRequiredThis specifies the language used in the feed. If an agency also has a language specified, then the agency's value should override this value.
feed_start_dateOptionalThis value is a date in YYYYMMDD format that asserts the data in this feed is valid from this date. If specified, it typically matches up with the earliest date in calendar.txt or calendar_dates.txt, but if it is earlier, this is explicitly saying there are no services running between this date and the earliest service date.
feed_end_dateOptionalThis value is a date in YYYYMMDD format that asserts the data in this feed is valid until this date. If specified, it typically matches up with the latest date in calendar.txt or calendar_dates.txt, but if it is earlier, this is explicitly saying there are no services running between the latest service date and this date.
feed_versionOptionalA string that indicates the version of this feed. This can be useful to let feed publishers know whether the latest version of their feed has been incorporated.

Sample Data

The following sample data is taken from the GTFS feed of TriMet in Portland (https://openmobilitydata.org/p/trimet).

feed_publisher_namefeed_publisher_urlfeed_langfeed_start_datefeed_end_datefeed_version
TriMethttp://trimet.orgen20140121-20140421

In this example, TriMet do not include the start or end dates, meaning you should derive the dates this feed is active for by the dates in calendar_dates.txt (this particular feed does not have a calendar.txt file).

TriMet use date stamps to indicate the feed version. This feed was published on 21 January 2014 and includes data up until 21 April 2014, so it appears they use the first/last dates as a way to specify their version. Each agency has its own method.

13. Importing a GTFS Feed to SQL

One of the great things about GTFS is that it is already in a format conducive to being used in an SQL database. The presence of various IDs in each of the different files makes it easy to join the tables in order to extract the data you require.

To try this yourself, download GtfsToSql (https://github.com/OpenMobilityData/GtfsToSql). This is a Java command-line application that imports a GTFS feed to an SQLite database. This application also supports PostgreSQL, but the examples used here are for SQLite.

The pre-compiled GtfsToSql Java archive can be downloaded from its GitHub repository at https://github.com/OpenMobilityData/GtfsToSql/tree/master/dist.

To use GtfsToSql, all you need is an extracted GTFS feed. The following instructions demonstrate how you to import the TriMet feed that has been referenced throughout this book.

Firstly, download and extract the feed. The following commands use curl to download the file, then unzip to extract the file to a sub-directory called trimet.

$ curl http://developer.trimet.org/schedule/gtfs.zip > gtfs.zip

$ unzip gtfs.zip -d trimet/

To create an SQLite database from this feed, the following command can be used.

$ java -jar GtfsToSql.jar -s jdbc:sqlite:./db.sqlite -g ./trimet

This may take a minute or two to complete (you will see progress as it imports the feed and then creates indexes), and at the end you will have a GTFS database in a file called db.sqlite. You can then query this database with the command-line sqlite3 tool, as shown in the following example.

$ sqlite3 db.sqlite
sqlite> SELECT * FROM agency;
|TriMet|http://trimet.org|America/Los_Angeles|en|503-238-7433

sqlite> SELECT * FROM routes WHERE route_type = 0;
90|||MAX Red Line||0|http://trimet.org/schedules/r090.htm||
100|||MAX Blue Line||0|http://trimet.org/schedules/r100.htm||
190|||MAX Yellow Line||0|http://trimet.org/schedules/r190.htm||
193||Portland Streetcar|NS Line||0|http://trimet.org/schedules/r193.htm||
194||Portland Streetcar|CL Line||0|http://trimet.org/schedules/r194.htm||
200|||MAX Green Line||0|http://trimet.org/schedules/r200.htm||
250|||Vintage Trolley||0|http://trimet.org/schedules/r250.htm||

The first query above finds all agencies stored in the database, while the second finds all routes marked as Light Rail (route_type of 0).

Note: In the following chapters there are more SQL examples. All of these examples are geared towards running on an SQLite database that has been created in this manner.

All tables in this database match up with the corresponding GTFS filename (so for agency.txt, the table name is agency, while for stop_times.txt the table name is stop_times). The columns in SQL have the same name as the value in the corresponding GTFS file.

Note: All data imported using this tool is stored as text in the database. This means you may need to be careful when querying integer data. For example, ordering stop times by stop_sequence may not produce expected results (for instance, 29 as a string comes before 3). Although it is a performance hit, you can change this behavior by casting the value to integer, such as: ORDER BY stop_sequence + 0. The reason GtfsToSql works in this way is because it is intended as a lightweight tool to be able to quickly query GTFS data. I recommend rolling your own importer to treat data exactly as you need it, especially in conjunction with some of the optimization techniques recommended later in this book.

File Encodings

The GTFS specification does not indicate whether files should be encoded using UTF-8, ISO-8859-1 or otherwise. Since a GTFS feed is not necessarily in English, you must be willing to handle an extended character set.

The GtfsToSql tool introduced above automatically detects the encoding of each file using the juniversalchardet Java library (https://code.google.com/p/juniversalchardet/).

I recommend you take some time looking at the source code of GtfsToSql to further understand this so you are aware of handling encodings correctly if you write your own parser.

Optimizing GTFS Feeds

If you are creating a database that is to be distributed onto a mobile device such as an iPhone or Android phone, then disk space and computational power is at a premium. Even if you are setting up a database to be queried on a server only, then making the database perform as quickly as possible is still important.

In the following chapters are techniques for optimizing GTFS feeds. There are many techniques that can be applied to improve the performance of GTFS, such as:

  • Using integer identifiers rather than string identifiers (for route IDs, trip IDs, stop IDs, etc.) and creating appropriate indexes
  • Removing redundant shape points and encoding shapes
  • Deleting unused data
  • Reusing repeating trip patterns.

Changing the data to use integer IDs makes the greatest improvement to performance, but the other techniques also help significantly.

Depending on your needs, there are other optimizations that can be made to reduce file size and speed up querying of the data, but the ease of implementing them may depend on your database storage system and the programming language used to query the data. The above list is a good starting point.

14. Switching to Integer IDs

There are a number of instances in a GTFS feed where IDs are used, such as to identify routes, trips, stops and shapes. There are no specific guidelines in GTFS as to the type of data or length an ID can be. As such, IDs in some GTFS feeds maybe anywhere up to 30 or 40 characters long.

Using long strings as IDs is extremely inefficient as they make the size of a database much larger than it needs to be, as well as making querying the data much slower.

To demonstrate, consider a GTFS feed where trip IDs are 30 characters long. If there are 10,000 trips, each with an average of 30 stops in stop_times.txt, then the IDs alone take up 9.3 MB of storage. Realistically speaking, you need to index the trip_id field in order to look up a trip's stop times quickly, which uses even more space.

The following SQL statements show how you might represent GTFS without optimizing the identifiers. For brevity, not all fields from the GTFS feed are included here.

CREATE TABLE trips (
  trip_id TEXT,
  route_id TEXT,
  service_id TEXT
);

CREATE INDEX trips_trip_id ON trips (trip_id);

CREATE INDEX trips_route_id ON trips (route_id);

CREATE INDEX trips_service_id ON trips (service_id);

CREATE TABLE stop_times (
  trip_id TEXT,
  stop_id TEXT,
  stop_sequence INTEGER
);

CREATE INDEX stop_times_trip_id ON stop_times (trip_id);

CREATE INDEX stop_times_stop_id on stop_times (stop_id);

If you were to add an integer column to trips called, say, trip_index, then you can reference that value from stop_times instead of trip_id. The following SQL statements show this.

CREATE TABLE trips (
  trip_id TEXT,
  trip_index INTEGER,
  route_id TEXT,
  service_id TEXT
);

CREATE INDEX trips_trip_id ON trips (trip_id);
CREATE INDEX trips_trip_index ON trips (trip_index);
CREATE INDEX trips_route_id ON trips (route_id);
CREATE INDEX trips_service_id ON trips (service_id);

CREATE TABLE stop_times (
  trip_index INTEGER,
  stop_id TEXT,
  stop_sequence INTEGER
);

CREATE INDEX stop_times_trip_index ON stop_times (trip_index);

CREATE INDEX stop_times_stop_id on stop_times (stop_id);

This results in a significant space saving (when you consider how large stop_times can be), as well as being far quicker to look up stop times based on a trip ID. Note that the original trip_id value is retained so it can be referenced if required.

Without adding trip_index, you would use the following query to find stop times given a trip ID.

SELECT * FROM stop_times
  WHERE trip_id = 'SOME_LONG_TRIP_ID'
  ORDER BY stop_sequence;

With the addition of trip_index, you need to first find the record in trips. This can be achieved using the following query. This is a small sacrifice compared to performing string comparison on all stop times.

SELECT * FROM stop_times
  WHERE trip_index = (
    SELECT trip_index FROM trips WHERE trip_id = 'SOME_LONG_TRIP_ID'
  )
  ORDER BY stop_sequence;

You can make the same change for the other IDs in the feed, such as route_id and stop_id. For these columns you still keep (and index) the original values in routes and stops respectively, since you may still need to look up records based on these values.

Note: Even though this book recommends optimizing feeds in this manner, the remainder of examples in this book only use their original IDs, in order to simplify the examples and to ensure compatibility with the GtfsToSql tool introduced previously.

15. Optimizing Shapes

Shape data in a GTFS feed (that is, the records from shapes.txt) represents a large amount of data. There are a number of ways to reduce this data, which can help to:

  • Speed up data retrieval
  • Reduce the amount of data to transmit to app / web site users
  • Speed up rendering of the shape onto a map (such as a native mobile map or a JavaScript map).

Two ways to reduce shape data are as follows:

  • Reducing the number of points in a shape. The shapes included in GTFS are often very precise and include a number of redundant points. Many of these can be removed without a noticeable loss of shape quality using the Douglas-Peucker Algorithm.
  • Encoding all points in a shape into a single value. The Encoded Polyline Algorithm used in the Google Maps JavaScript API can also be used with GTFS shapes. This reduces the amount of storage required and also makes looking up all points in a shape far quicker.

Reducing Points in a Shape

Many of the shapes you find in GTFS feeds are extremely detailed. They often follow the exact curvature of the road and may consist of hundreds or thousands of points for a trip that might have only 30 or 40 stops.

While this level of detail is useful, the sheer amount of data required to be rendered on a map can be a massive performance hit from the perspective of retrieving the data as well as rendering on a map. Realistically, shapes do not need this much detail in order to convey their message to your users.

Consider the following shape from Portland that has been rendered using Google Maps. The total shape consists of 1913 points.

Original Shape

Compare this now to the same shape that has had redundant points removed. The total number of points in this shape is 175, which represents about a 90% reduction.

Reduced Shape

If you look closely, you can see some minor loss of detail, but for the most part, the shapes are almost identical.

This reduction in points can be achieved using the Douglas-Peucker Algorithm. It does so by discarding points that do not deviate significantly between its surrounding points.

The Douglas-Peucker Algorithm works as follows:

  • Begin with the first and last points in the path (A and B). These are always kept.
  • Find the point between the first and last that is furthest away from the line joining the first and last line (the orthogonal distance -- see the figure below).
  • If this point is greater than the allowed distance (the tolerance level), the point is kept (call it X).
  • Repeat this algorithm twice: once using A as the first point and X as the last point, then again using X as the first point and B as the last point.

This algorithm is recursive, and continues until all points have been checked.

Note: The tolerance level determines how aggressively points are removed. A higher tolerance value is less aggressive and discards less data, while a lower tolerance discards more data.

The following diagram shows what orthogonal distance means.

Orthogonal Distance

The following resources provide more information about the Douglas-Peucker Algorithm and describe how to implement it in your own systems:

You can often discard about 80-90% of all shape data before seeing a significant loss of line detail.

Encoding Shape Points

A single entry in shapes.txt corresponds to a single point in a single shape. Each entry includes a shape ID, a latitude and longitude.

Note: The shape_dist_traveled field is also included, but you do not strictly need to use this field (nor the corresponding field in stop_times.txt). The technique described in this section will not work if you intend to use shape_dist_traveled.

This means if you want to look up a shape by its ID, you may need to retrieve several hundreds of rows from a database. Using the Encoded Polyline Algorithm you can change your GTFS database so each shape is represented by a single row in a database. This means the shape can be found much more quickly and much less data needs to be processed to determine the shape.

Consider the following data, taken from TriMet's shapes.txt file. This data represents the first five points of a shape.

shape_idshape_pt_latshape_pt_lonshape_pt_sequenceshape_dist_traveled
18532845.52291 -122.67737210.0
18532845.522921-122.6773723.7
18532845.522991-122.677432334.0
18532845.522992-122.677246481.5
18532845.523002-122.6765675255.7

If you apply the Encoded Polyline Algorithm to this data, the coordinates can be represented using the following string.

eeztGrlwkVAAML?e@AgC

To learn how to arrive at this value, you can read up on the Encoded Polyline Algorithm at https://developers.google.com/maps/documentation/utilities/polylinealgorithm.

Instead of having every single shape point in a single table, you can create a table that has one record per shape. The following SQL statement is a way you could achieve this.

CREATE TABLE shapes (
  shape_id TEXT,
  encoded_shape TEXT
);

The following table shows how this data could be represented in a database.

shape_idencoded_shape
185328eeztGrlwkVAAML?e@AgC

Storing the shape in this manner means you can retrieve an entire shape by looking up only one database row and running it through your decoder.

To further demonstrate how both the encoding and decoding works, try out the polyline utility at https://developers.google.com/maps/documentation/utilities/polylineutility.

You can find implementations for encoding and decoding points for various languages at the following locations:

16. Deleting Unused Data**

Once you have imported a GTFS feed into a database, it is possible for there to be a lot of redundant data. This can ultimately slow down any querying of that data as well as bloating the size of the database. If you are making an app where the GTFS database is queried on the device then disk space and computational time are at a premium, so you must do what you can to reduce resource usage.

The first thing to check for is expired services. You can do this by searching calendar.txt for entries that expire before today's date. Be aware though, you also need to ensure there are no calendar_dates.txt entries overriding these services (a service could have an end_date of, say, 20140110, but also have a calendar_dates.txt entry for 20140131).

Firstly, find service IDs in calendar_dates.txt that are still active using the following query.

SELECT service_id FROM calendar_dates WHERE date >= '20140110';

Note: In order to improve performance of searching for dates, you should import the date field in calendar_dates.txt as an integer, as well as start_date and end_date in calendar.txt.

Any services matched in this query should not be removed. You can then find service IDs in calendar.txt with the following SQL query.

SELECT * FROM calendar WHERE end_date < '20140110'
  AND service_id NOT IN (
    SELECT service_id FROM calendar_dates WHERE date >= '20140110'
  );

Before deleting these services, corresponding trips and stop times must be removed since you need to know the service ID in order to delete a trip. Likewise, stop times must be deleted before trips since you need to know the trip IDs to be removed.

DELETE FROM stop_times WHERE trip_id IN (
   SELECT trip_id FROM trips WHERE service_id IN (
     SELECT service_id FROM calendar WHERE end_date < '20140110'
       AND service_id NOT IN (
         SELECT service_id FROM calendar_dates WHERE date >= '20140110'
       )
    )
  );

Now there may be a series of trips with no stop times. Rather than repeating the above sub-queries, a more thorough way of removing trips is to remove trips with no stop times.

DELETE FROM trips WHERE trip_id NOT IN (
  SELECT DISTINCT trip_id FROM stop_times
);

With all service_id references removed, you can remove the expired rows from calendar.txt using the following SQL query.

DELETE FROM calendar WHERE end_date < '20140110'
  AND service_id NOT IN (
    SELECT DISTINCT service_id FROM calendar_dates WHERE date >= '20140110'
  );

The expired rows in calendar_dates.txt can also be removed, which can be achieved using the following query.

DELETE FROM calendar_dates WHERE date < '20140110';

There may now be some stops that are not used by any trips. These can be removed using the following query.

DELETE FROM stops WHERE stop_id NOT IN (
  SELECT DISTINCT stop_id FROM stop_times
);

Additionally, you can remove unused shapes and routes using the following queries.

DELETE FROM shapes WHERE shape_id NOT IN (
  SELECT DISTINCT shape_id FROM trips
);

DELETE FROM routes WHERE route_id NOT IN (
  SELECT DISTINCT route_id FROM trips
);

There are other potential rows that can be removed (such as records in transfers.txt that reference non-existent stops), but hopefully you get the idea from the previous queries.

17. Searching for Trips**

This section shows you how to search for trips in a GTFS feed based on specified times and stops.

The three scenarios covered are:

  • Finding all stops departing from a given stop after a certain time
  • Finding all stops arriving at a given stop before a certain time
  • Finding all trips between two stops after a given time

The first and second scenarios are the simplest, because they only rely on a single end of each trip. The third scenario is more complex because you have to ensure that each trip returned visits both the start and finish stops.

When searching for trips, the first thing you need to know is which services are running for the given search time.

Finding Service IDs

The first step in searching for trips is to determine which services are running. To begin with, you need to find service IDs for a given date. You then need to handle exceptions accordingly. That is, you need to add service IDs and remove service IDs based on the rules in calendar_dates.txt.

Note: In Scheduling Past Midnight, you were shown how GTFS works with times past midnight. The key takeaway from this is that you have to search for trips for two sets of service IDs. This is covered as this chapter progresses.

In Australia, 27 January 2014 (Monday) was the holiday for Australia Day. This is used as an example to demonstrate how to retrieve service IDs.

Firstly, you need the main set of service IDs for the day. The following SQL query achieves this.

SELECT service_id FROM calendar
  WHERE start_date <= '20140127' AND end_date >= '20140127'
  AND monday = 1;
 
# Result: 1, 2, 6, 9, 18, 871, 7501

Next you need to find service IDs that are to be excluded, as achieved by the following SQL query.

SELECT service_id FROM calendar_dates
  WHERE date = '20140127' AND exception_type = 2;

# Result: 1, 2, 6, 9, 18, 871, 7501

Finally, you need to find service IDs that are to be added. This query is identical to the previous query, except for the different exception_type value.

SELECT service_id FROM calendar_dates
  WHERE date = '20140127' AND exception_type = 1;

# Result: 12, 874, 4303, 7003

You can combine these three queries all into a single query in SQLite using EXCEPT and UNION, as shown in the following SQL query.

SELECT service_id FROM calendar
  WHERE start_date <= '20140127' AND end_date >= '20140127'
  AND monday = 1
  
UNION
  
SELECT service_id FROM calendar_dates
  WHERE date = '20140127' AND exception_type = 1
  
EXCEPT
  
SELECT service_id FROM calendar_dates
  WHERE date = '20140127' AND exception_type = 2;

# Result: 12, 874, 4304, 7003

Now when you search for trips, only trips that have a matching service_id value are included.

Finding Trips Departing a Given Stop

In order to determine the list of services above, a base timestamp on which to search is needed. For the purposes of this example, assume that timestamp is 27 January 2014 at 1 PM (13:00:00 when using GTFS).

This example searches for all services departing from Adelaide Railway Station, which has stop ID 6665 in the Adelaide Metro GTFS feed. To find all matching stop times, the following query can be performed.

SELECT * FROM stop_times
  WHERE stop_id = '6665'
  AND departure_time >= '13:00:00'
  AND pickup_type = 0
  ORDER BY departure_time;

This returns a series of stop times that match the given criteria. The only problem is it does not yet take into account valid service IDs.

Note: This query may also return stop times that are the final stop on a trip, which is not useful for somebody trying to find departures. You may want to modify your database importer to override the final stop time of each trip so its pickup_type has a value of 1 (no pick-up) and its first stop time so it has a drop_off_type of 1 (no drop-off).

To make sure only the correct trips are returned, join stop_times with trips using trip_id, and then include the list of service IDs. For the purposes of this example the service IDs, stop ID and departure time are being hard-coded. You can either embed a sub-query, or include the service IDs via code.

SELECT t.*, st.* FROM stop_times st, trips t
  WHERE st.stop_id = '6665'
  AND st.trip_id = t.trip_id
  AND t.service_id IN ('12', '874', '4304', '7003')
  AND st.departure_time >= '13:00:00'
  AND st.pickup_type = 0
  ORDER BY st.departure_time;

This gives you the final list of stop times matching the desired criteria. You can then decide specifically which data you need to retrieve; you now have the trip_id, meaning you can find all stop times for a given trip if required.

If you need to restrict the results to only those that occur after the starting stop, you can retrieve stop times with only a stop_sequence larger than that of the stop time returned in the above query.

Finding Trips Arriving at a Given Stop

In order to find the trips arriving at a given stop before a specified time, it is just a matter of making slight modifications to the above query. Firstly, check the arrival_time instead of departure_time. Also, check the drop_off_type value instead of pickup_type.

SELECT t.*, st.* FROM stop_times st, trips t
  WHERE st.stop_id = '6665'
  AND st.trip_id = t.trip_id
  AND t.service_id IN ('12', '874', '4304', '7003')
  AND st.arrival_time <= '13:00:00'
  AND st.drop_off_type = 0
  ORDER BY st.arrival_time DESC;

For this particular data set, there are trips from four different routes returned. If you want to restrict this to a particular route, you can filter on the t.route_id value.

SELECT t.*, st.* FROM stop_times st, trips t
  WHERE st.stop_id = '6665'
  AND st.trip_id = t.trip_id
  AND t.service_id IN ('12', '874', '4304', '7003')
  AND t.route_id = 'BEL'
  AND st.arrival_time <= '13:00:00'
  AND st.drop_off_type = 0
  ORDER BY st.arrival_time DESC;

Performance of Searching Text-Based Times

These examples search using text-based arrival/departure times (such as 13:00:00). This works because the GTFS specification mandates that all times are HH:MM:SS format (although H:MM:SS is allowed for times earlier than 10 AM).

Doing this kind of comparison (especially if you are scanning millions of rows) is quite slow and expensive. It is more efficient to convert all times stored in the database to integers that represent the number of seconds since midnight.

Note: The GTFS specification states that arrival and departure times are "noon minus 12 hours" in order to account for daylight savings time. This is effectively midnight, except for the days that daylight savings starts or finishes.

In order to achieve this, you can convert the text-based time to an integer with H * 3600 + M * 60 + S. For example, 13:35:21 can be converted using the following steps.

  (13 * 3600) + (35 * 60) + (21)
= 46800 + 2100 + 21
= 48921

You can then convert back to hours, minutes and seconds in order to generate timestamps in your application as shown in the following algorithm.

H = floor( 48921 / 3600 )
  = floor( 13.59 )
  = 13
  
M = floor( 48921 / 60 ) % 60
  = floor( 815.35 ) % 60
  = 815 % 60
  = 35
  
S = 48921 % 60
  = 21

Finding Trips Between Two Stops

Now that you know how to look up a trip from or to a given stop, the previous query can be expanded so both the start and finish stop are specified. The following example finds trips that depart after 1 PM. The search only returns trips departing from Adelaide Railway Station (stop ID 6665) are shown. Additionally, only trips that then visit Blackwood Railway Station (stop IDs 6670 and 101484) are included.

In order to achieve this, the following changes must be made to the previous examples.

  • Join against stop_times twice. Once for the departure stop time and once for the arrival stop time.
  • Allow for multiple stop IDs at one end. The destination in this example has two platforms, so you need to check both of them.
  • **Ensure the departure time is earlier than the arrival time. **Otherwise trips heading in the opposite direction may also be returned.

The following query demonstrates how this is achieved.

SELECT t.*, st1.*, st2.*
  FROM trips t, stop_times st1, stop_times st2
  WHERE st1.trip_id = t.trip_id
  AND st2.trip_id = t.trip_id
  AND st1.stop_id = '6665'
  AND st2.stop_id IN ('6670', '101484')
  AND t.service_id IN ('12', '874', '4304', '7003')
  AND st1.departure_time >= '13:00:00'
  AND st1.pickup_type = 0
  AND st2.drop_off_type = 0
  AND st1.departure_time < st2.arrival_time
  ORDER BY st1.departure_time;

In this example, the table alias st1 is used for the departure stop time. Once again, the stop ID must match, as well as the departure time and pick-up type.

For the arrival stop time the alias st2 is used. This table also joins the trips table using trip_id. Since the destination has multiple stop IDs, the SQL IN construct is used. The arrival time is not important in this example, so only the departure is checked.

The final thing to check is that the departure occurs before the arrival. If you do not perform this step, then trips traveling in the opposite direction may also be returned.

Note: Technically, there may be multiple results returned for the same trip. For some transit agencies, a single trip may visit the a stop more than once. If this is the case, you should also check the trip duration (arrival time minus departure time) and use the shortest trip when the same trip is returned multiple times.

Accounting for Midnight

As discussed previously, GTFS has the ability for the trips to depart or arrive after midnight for a given service day without having to specify it as part of the next service day. Consequently, while the queries above are correct, they do not necessarily paint the full picture.

In reality, when performing a trip search, you need to take into account trips that have wrapped times (for instance, where 12:30 AM is specified as 24:30:00). If you want to find trips that depart after 12:30 AM on a given day, you need to check for trips departing after 00:30:00 on that day, as well as for trips departing at 24:30:00 on the previous day.

This means that for each trip search you are left with two sets of trips, which you must then merge and present as appropriate.

Note: In reality, agencies generally do not have trips that overlap from multiple service days, so technically you often only need one query (for example, a train service might end on 12:30 AM then restart on the next service day at 4:30 AM). If your app / web site only uses a single feed where you can tune your queries manually based on how the agency operates, then you can get away with only querying a single service day. On the other hand, if you are building a scalable system that works with data from many agencies, then you need to check both days.

To demonstrate how this works in practice, the following example searches for all trips that depart after 12:30:00 AM on 14 March 2014. The examples earlier in this chapter showed how to find the service IDs for a given date. To account for midnight, service IDs for both March 14 (the "main" service date) and March 13 (the overlapping date) need to be determined.

Assume that March 13 has a service ID of C1 and March 14 has a service ID of C2. First you need to find the departures for March 14, as shown in the following query.

SELECT t.*, st.* FROM stop_times st, trips t
  WHERE st.stop_id = 'S1'
  AND st.trip_id = t.trip_id
  AND t.service_id IN ('C1')
  AND st.departure_time >= '00:30:00'
  AND st.pickup_type = 0
  ORDER BY st.departure_time;

The resultant list needs to be combined with the trips that depart after midnight from the March 13 service. To check this, it is just a matter of swapping in the right service IDs, then adding 24 hours to the search time.

SELECT t.*, st.* FROM stop_times st, trips t
  WHERE st.stop_id = 'S1'
  AND st.trip_id = t.trip_id
  AND t.service_id IN ('C2')
  AND st.departure_time >= '24:30:00'
  AND st.pickup_type = 0
  ORDER BY st.departure_time;

In order to get your final list of trips you must combine the results from both of these queries. If you are generating complete timestamps in order to present the options to your users, just remember to account for the results from the second query being 24 hours later.

18. Working With Trip Blocks**

One of the more complex aspects of GTFS is how to properly use the block_id field in trips.txt. The concept is simple, but incorporating this information in a manner that is simple to understand for a passenger can be more difficult.

A single vehicle (such as a bus or train) completes multiple trips in a single service day. For instance, once a bus completes its trip (Trip X) from Location A to Location B, it then begins another trip (Trip Y) from Location B to Location C. It then completes a final trip (Trip Z) from Location C back to Location A.

If a passenger boards in the middle of Trip X and is allowed to stay on the bus until the end of the Trip Y, then this should be represented in GTFS by giving Trips X, Y, Z the same block_id value.

The following diagram shows a vehicle that performs two trips, completing opposite directions of the same route. Often a single vehicle will do many more than just two trips in a day.

Block

Note: When trips in a block are represented in stop_times.txt, the final stop of a trip and the first stop of the subsequent trip must both be included, even though they are typically at the same stop (and often the same arrival & departure time). The final stop of the first trip is drop-off only, while the first stop of the next trip is pick-up only.

In the following diagram, the vehicle completes three trips, each for separate routes. As in the previous diagram, the vehicle will likely complete more trips over a single day.

Block

These two diagrams show the difficulties that can arise when trying to calculate trips across multiple blocks. The first diagram is repeated below, this time with stops marked so trips between two locations can be found.

Block

Consider the scenario where you want to retrieve all trip times from Stop S2 to Stop S1. Since the vehicle gets to Stop S3 then turns around and heads back to S1, the trip search returns two options:

  1. Board at S2 on the upper trip, travel via S3, then get to S1.
  2. Board at S2 on the lower trip, travel directly to S1.

The second option is a subset of the first option, which means it is a far shorter trip (plus it avoids the passenger getting annoyed from passing their starting point again). To determine which option to use, you can use the trip duration.

Now consider the other type of block formation, where a vehicle completes subsequent trips from different routes.

Block

If a passenger wants to travel from Stop S2 to S4, you will not get the situation where the vehicle travels past the same location twice. However, it is also possible for a passenger to travel from Stop S2 to Stop S6 without ever having to change vehicles.

Referring back to the previous chapter about performing trip searches, it is relatively straightforward to account for blocks in these searches. Using the query to find trips between two stops as a reference, the following changes need to be made:

  • Instead of joining both occurrences of stop_times to the trips table, you now need two occurrences of trips.
  • Each stop_times table occurrence is joined to a separate trips table occurrence.
  • The two occurrences of trips are joined on the block_id value (if it is available).

The following query shows how to find trips that start at stop S2 and finish at stop S4, departing after 1 PM on the day with service ID C1.

SELECT t1.*, t2.*, st1.*, st2.*
  FROM trips t1, trips t2, stop_times st1, stop_times st2
  WHERE st1.trip_id = t1.trip_id
  AND st2.trip_id = t2.trip_id
  AND st1.stop_id = 'S2'
  AND st2.stop_id = 'S4'
  AND t1.service_id = 'C1'
  AND (
    t1.trip_id = t2.trip_id
    OR (
      LENGTH(t1.block_id) > 0 AND t1.block_id = t2.block_id
    )
  )
  AND st1.departure_time >= '13:00:00'
  AND st1.pickup_type = 0
  AND st2.drop_off_type = 0
  AND st1.departure_time < st2.arrival_time
  ORDER BY st1.departure_time;

Since block_id may be unpopulated, the query joins on both the trip_id and the block_id.

Note: An alternative is to guarantee every single trip has a block_id value when importing -- even if some blocks only consist of a single trip. If you can guarantee this condition, then you can simplify this query by using t1.block_id = t2.block_id.

19. Calculating Fares**

In order to calculate a fare for a trip in GTFS, you must use data from fare_attributes.txt and fare_rules.txt. It is relatively straightforward to calculate the cost of a single trip (that is, boarding a vehicle, traveling for a number of stops, then disembarking), but it becomes much more complicated when you need to take into account multiple trip segments (that is, one or more transfers).

Note: As it stands, many feed providers do not include fare information. This is because many systems have a unique set of rules that cannot be modelled with the current structure of fares in GTFS. Additionally, it is not possible to determine different classes of pricing (such as having a separate price for adults and children). For the purposes of this chapter, these limitations are ignored.

For more discussion on how fares work in GTFS, refer to Fare Definitions (fare_attributes.txt & fare_rules.txt).

This chapter first shows you how to calculate fares for a single trip, then how to account for transfers and for multiple trips.

Calculating a Single Trip Fare

Much of the logic used when calculating fares requires knowledge of the zones used in a trip.

Note: A zone is a physical area within a transit system that contains a series of stops. They are used to group trip pricing into key areas in a system. Some transit systems do not work like this (for instance, they may measure the physical distance travelled rather than specific stops), which is one reason why the GTFS fares model does not work in all cases.

A zone is defined in GTFS by the zone_id column in stops.txt. A single stop can only belong to one zone.

Fares are matched to a trip using a combination of any of the following:

  • The route of the trip
  • The zone of the stop the passenger boards from
  • The zone of the stop the passenger disembarks
  • The zone(s) of any stops on the trip that are passed while the passenger is on board.

Consider the following simplified data set that may appear in stops.txt and stop_times.txt. Assume for this example that the trip T1 belongs to a route with an ID of R1.

stop_id,zone_id
S1,Z1
S2,Z1
S3,Z2
S4,Z3

trip_id,stop_id,stop_sequence
T1,S1,1
T1,S2,2
T1,S3,3
T1,S4,4

If a passenger travels from stop S1 to stop S4, then their starting zone is Z1, their finishing zone is Z3, and the zones they pass through are Z1, Z2 and Z3.

Note: When calculating fares, the start and finish zones are also included in the zones passed through, so in this example you Z3 is also considered as a zone that the trip passes through.

Using this data, you can now calculate matching fares. To do so, you need to find all fares that match either of the following:

  • Fares that have no associated rules.
  • Fares that have rules that match the specified trip. If a fare defines multiple zones that must be passed through (using contains_id), then all zones must be matched.

If multiple fares qualify for a trip, then the cheapest fare is the one to use.

Finding Fares With No Rules

This is the simplest use-case for fares. You can find all matching fares with the following SQL.

SELECT * FROM fare_attributes WHERE fare_id NOT IN (
  SELECT DISTINCT fare_id FROM fare_rules
);

If a feed only has fare_attributes.txt records with no rules, then the difference between the fares is in the transfer rules. This section only covers calculating fares for a single trip with no transfers, so for now you can just select the cheapest fare using the following SQL.

SELECT * FROM fare_attributes WHERE fare_id NOT IN (
    SELECT DISTINCT fare_id FROM fare_rules
  )
  ORDER BY price + 0 LIMIT 1;

Note: You still need to check for fares with one or more rules in order to find the cheapest price. Also, 0 is added in this query in order to cast a string to a number. When you roll your own importer you should instead import this as a numerical value.

Finding Fares With Matched Rules

Next you must check against specific rules for a fare. In order to do this, you need the starting zone, finishing zone, and all zones passed through (including the start and finish zones).

Referring back to the previous example, if a trip starts at Z1, passes through Z2 and finishes at Z3, you can find fare candidates (that is, trips that may match), using the following SQL query.

SELECT * FROM fare_attributes WHERE fare_id IN (
    SELECT fare_id FROM fare_rules
      WHERE (LENGTH(route_id) = 0 OR route_id = 'R1')
      AND (LENGTH(origin_id) = 0 OR origin_id = 'Z1')
      AND (LENGTH(destination_id) = 0 OR destination_id = 'Z3')
      AND (LENGTH(contains_id) = 0 OR contains_id IN ('Z1', 'Z2', 'Z3')
    )
  );

This returns a list of fares that may qualify for the given trip. As some fares have multiple rules, all must be checked. The algorithm to represent this is as follows.

fares = [ result from above query ]

qualifyingFares = [ ]

for (fare in fares) {
  if (qualifies(fare))
    qualifyingFares.add(fare)
}

allFares = qualifyFares + faresWithNoRules

passengerFare = cheapest(allFares)

As shown on the final two lines, once you have the list of qualifying fares, you can combine these with fares that have no rules (from the previous section) and then determine the cheapest fare.

First though, you must determine if a fare with rules qualifies for the given trip. If a fare specifies zones that must be passed through, then all rules must be matched.

Note: If a particular rule specifies a different route, start, or finish than the one you are checking, you do not need to ensure the contains_id matches, since this rule no longer applies. You still need to check the other rules for this fare.

The algorithm needs to build up a list of zone IDs from the fare rules in order to check against the trip. Once this has been done, you need to check that every zone ID collected from the rules is contained in the trip's list of zones.

qualifies(fare, routeId, originId, destinationId, containsIds) {

  fareContains = [ ]
  
  for (rule in fare.rules) {
    if (rule.contains.length == 0)
      continue
      
    if (rule.route.length > 0 AND rule.route != routeId)
      continue
      
    if (rule.origin.length > 0 AND rule.origin != originId)
      continue
      
    if (rule.desination.length > 0 AND rule.destination != destinationId)
      continue
      
    fareContains.add(rule.containsId);
 }
 
 if (fareContains.size == 0)
   return YES
   
 if (containIds HAS EVERY ELEMENT IN fareContains)
   return YES
 else 
   return NO
}

This algorithm achieves the following:

  • Only rules that have a value for contains_id are relevant. Rules that do not have this value fall through and should be considered as qualified.
  • If the route is specified but not equal to the one being checked, it is safe to ignore the rule's contains_id. If the route is empty or equal, the loop iteration can continue.
  • Check for the origin_id and destination_id in the same manner as route_id.
  • If the route, origin and destination all qualify then store the contains_id so it can be checked after the loop.

The algorithm returns yes if the fare qualifies, meaning you can save it as a qualifying fare. You can then return the cheapest qualifying fare to the user.

Calculating Trips With Transfers

Once you introduce transfers, fare calculation becomes more complicated. A "trip with a transfer" is considered to be a trip where the passenger boards a vehicle, disembarks, and then gets on another vehicle. For example:

  • Travel on trip T1 from Stop S1 to Stop S2
  • Walk from Stop S2 to Stop S3
  • Travel on trip T2 from Stop S3 to Stop S4.

In order to calculate the total fare for a trip with transfers, the following algorithm is used:

  1. Retrieve list of qualifying fares for each trip individually
  2. Create a list of every fare combination possible
  3. Loop over all combinations and find the total cost
  4. Return the lowest cost from Step 3.

Step 1 was covered in Calculating a Single Trip Fare, but you must skip the final step of finding the cheapest fare. This is because the cheapest fare may change depending on subsequent transfers. Instead, this step is performed once the cheapest combination is determined.

To demonstrate Step 2, consider the following example:

  • The trip on T1 from S1 to S2 yields the following qualifying fares: F1, F2.
  • The subsequent trip on T2 from S3 to S4 yields the following qualifying fares: F3, F4.

Generating every combination of these fares yields the following possibilities:

  • F1 + F3
  • F1 + F4
  • F2 + F3
  • F2 + F4.

Step 3 can now be performed, which involves finding the total cost for each combinations. As you need to take into account the possibility of timed transfers (according to the data stored in fare_attributes.txt), you also need to know about the times of these trips.

The following algorithm can be used to calculate the total cost using transfer rules. In this example, you would call this function once for each fare combination.

function totalCost(fares) {
  total = 0
 
  for (fare in fares) {
    freeTransfer = NO
 
    if (previousFare ALLOWS TRANSFERS) {
      if (HAS ENOUGH TRANSFERS REMAINING) {
        if (TRANSFER NOT EXPIRED) {
          freeTransfer = YES
        }
      }
    }
 
    if (!freeTransfer)
      total = total + fare.price;
     
    previousFare = fare;
  }
 
  return total;
}

Once all combinations have called the totalCost algorithm, you will have a price for each trip. You can then return the lowest price as the final price for the trip.

20. Trip Patterns**

In a GTFS feed, a route typically has multiple trips that start and finish at the same stops. If you are looking to reduce the size of the data stored, then converting data from stop_times.txt into a series of reusable patterns is an excellent way to do so.

For two trips to share a common pattern, the following must hold true:

  • The stops visited and the order in which they are visited must be the same
  • The time differences between each stop must be the same.

The following table shows some fictional trips to demonstrate this.

StopTrip 1Trip 2Trip 3
S110:00:0010:10:0010:20:00
S210:02:0010:13:0010:22:00
S310:05:0010:15:0010:25:00
S410:06:0010:18:0010:26:00
S510:10:0010:21:0010:30:00

In a GTFS feed, this would correspond to 15 records in stop_times.txt. If you look more closely though, you can see the trips are very similar. The following table shows the differences between each stop time, instead of the actual time.

StopTrip 1Trip 2Trip 3
S100:00:0000:00:0000:00:00
S200:02:00 (+2m)00:03:00 (+3m)00:02:00 (+2m)
S300:05:00 (+5m)00:05:00 (+5m)00:05:00 (+5m)
S400:06:00 (+6m)00:08:00 (+8m)00:06:00 (+6m)
S500:10:00 (+10m)00:11:00 (+11m)00:10:00 (+10m)

You can see from this table that the first and third trip, although they start at different times, have the same offsets between stops (as well as stopping at identical stops).

Instead of using a table to store stop times, you can store patterns. By storing the ID of the pattern with each trip, you can reduce the list of stop times in this example from 15 to 10. As only time offsets are stored for each patterns, the trip starting time also needs to be saved with each trip.

You could use SQL such as the following to model this.

CREATE TABLE trips (
  trip_id TEXT,
  pattern_id INTEGER,
  start_time TEXT,
  start_time_secs INTEGER
);

CREATE TABLE patterns (
  pattern_id INTEGER,
  stop_id TEXT,
  time_offset INTEGER,
  stop_sequence INTEGER
);

The data you would store for trips in this example is shown in the following table.

trip_idpattern_idstart_timestart_time_secs
T1110:00:0036000
T2210:10:0036600
T3110:20:0037200

Note: The above table includes start_time_secs, which is an integer value representing the number of seconds since the day started. Using the hour, minutes and seconds in start_time, this value is H * 3600 + M * 60 + S.

In the patterns table, you would store data as in the following table.

pattern_idstop_idtime_offsetstop_sequence
1S101
1S21202
1S33003
1S43604
1S56005
2S101
2S21802
2S43003
2S54804
2S66605

As you can see, this represents an easy way to significantly reduce the amount of data stored. You could have tens or hundreds of trips each sharing the same pattern. When you scale this to the entire feed, this could reduce, say, 3 million records to about 200,000.

Note: This is a somewhat simplified example, as there is other data available in stop_times.txt (such as separate arrival/departure times, drop-off type and pick-up type). You should take all of this data into account when determining how to allocate patterns.

Updating Trip Searches

Changing your model to reuse patterns instead of storing every stop time means your data lookup routines must also be changed.

For example, to find all stop times for a given trip, you must now find the pattern using the following SQL query.

SELECT * FROM patterns
  WHERE pattern_id = (SELECT pattern_id FROM trips WHERE trip_id = 'YOUR_TRIP_ID')
  ORDER BY stop_sequence;

If you want to determine the arrival/departure time, you must add the offset stored for the pattern record to the starting time stored with the trip. This involves joining the tables and adding time_offset to start_time_secs, as shown in the following query.

SELECT t.start_time_secs + p.time_offset, p.stop_id
  FROM patterns p, trips t
  WHERE p.pattern_id = t.pattern_id
  AND t.trip_id = 'YOUR_TRIP_ID'
  ORDER BY p.stop_sequence;

Other Data Reduction Methods

There are other ways you can reduce the amount of data, such as only using patterns to store the stops (and not timing offsets), and then storing the timings with each trip record. A technique such as this further reduces the size of the database, but the trade-off is that querying the data becomes slightly more complex.

Hopefully you can see that by using the method described in this chapter there are a number of ways to be creative with GTFS data, and that you must make decisions when it comes to speed, size, and ease of querying data.

Conclusion

Thanks for reading The Definitive Guide to GTFS. While there are many techniques that may take some time to comprehend in this book, the content should bring you up to speed with GTFS quickly.

At first glance, GTFS appears to be very simple, but there are a number of "gotchas" which are not immediately apparent until you have spent significant time working with a range of feeds.

The key takeaways from this book are:

  • How GTFS feeds are structured
  • How to import a GTFS feed to SQL and perform queries
  • How to search for trips, handle blocks and calculate fares
  • How to optimize an SQL database to minimize resource usage on mobile devices.

If you have enjoyed this book, please share it or feel free to contribution improvements or changes as necessary.

Quentin Zervaas February, 2014