aboutsummaryrefslogtreecommitdiff
path: root/gtfs-book/ch-13-importing-to-sql.md
diff options
context:
space:
mode:
Diffstat (limited to 'gtfs-book/ch-13-importing-to-sql.md')
-rw-r--r--gtfs-book/ch-13-importing-to-sql.md125
1 files changed, 125 insertions, 0 deletions
diff --git a/gtfs-book/ch-13-importing-to-sql.md b/gtfs-book/ch-13-importing-to-sql.md
new file mode 100644
index 0000000..e96517c
--- /dev/null
+++ b/gtfs-book/ch-13-importing-to-sql.md
@@ -0,0 +1,125 @@
+## 13. Importing a GTFS Feed to SQL
+
+One of the great things about GTFS is that it is already in a format
+conducive to being used in an SQL database. The presence of various IDs
+in each of the different files makes it easy to join the tables in order
+to extract the data you require.
+
+To try this yourself, download `GtfsToSql`
+(<https://github.com/OpenMobilityData/GtfsToSql>). This is a Java
+command-line application that imports a GTFS feed to an SQLite database.
+This application also supports PostgreSQL, but the examples used here
+are for SQLite.
+
+The pre-compiled `GtfsToSql` Java archive can be downloaded from its
+GitHub repository at
+<https://github.com/OpenMobilityData/GtfsToSql/tree/master/dist>.
+
+To use `GtfsToSql`, all you need is an extracted GTFS feed. The
+following instructions demonstrate how you to import the TriMet feed
+that has been referenced throughout this book.
+
+Firstly, download and extract the feed. The following commands use curl
+to download the file, then unzip to extract the file to a sub-directory
+called `trimet`.
+
+```
+$ curl http://developer.trimet.org/schedule/gtfs.zip > gtfs.zip
+
+$ unzip gtfs.zip -d trimet/
+```
+
+To create an SQLite database from this feed, the following command can
+be used.
+
+```
+$ java -jar GtfsToSql.jar -s jdbc:sqlite:./db.sqlite -g ./trimet
+```
+
+This may take a minute or two to complete (you will see progress as it
+imports the feed and then creates indexes), and at the end you will have
+a GTFS database in a file called `db.sqlite`. You can then query this
+database with the command-line `sqlite3` tool, as shown in the
+following example.
+
+```
+$ sqlite3 db.sqlite
+sqlite> SELECT * FROM agency;
+|TriMet|http://trimet.org|America/Los_Angeles|en|503-238-7433
+
+sqlite> SELECT * FROM routes WHERE route_type = 0;
+90|||MAX Red Line||0|http://trimet.org/schedules/r090.htm||
+100|||MAX Blue Line||0|http://trimet.org/schedules/r100.htm||
+190|||MAX Yellow Line||0|http://trimet.org/schedules/r190.htm||
+193||Portland Streetcar|NS Line||0|http://trimet.org/schedules/r193.htm||
+194||Portland Streetcar|CL Line||0|http://trimet.org/schedules/r194.htm||
+200|||MAX Green Line||0|http://trimet.org/schedules/r200.htm||
+250|||Vintage Trolley||0|http://trimet.org/schedules/r250.htm||
+```
+
+The first query above finds all agencies stored in the database, while
+the second finds all routes marked as Light Rail (`route_type` of
+`0`).
+
+**Note: In the following chapters there are more SQL examples. All of
+these examples are geared towards running on an SQLite database that has
+been created in this manner.**
+
+All tables in this database match up with the corresponding GTFS
+filename (so for `agency.txt`, the table name is `agency`, while for
+`stop_times.txt` the table name is `stop_times`). The columns in SQL
+have the same name as the value in the corresponding GTFS file.
+
+***Note:** All data imported using this tool is stored as text in the
+database. This means you may need to be careful when querying integer
+data. For example, ordering stop times by stop_sequence may not produce
+expected results (for instance, 29 as a string comes before 3). Although
+it is a performance hit, you can change this behavior by casting the
+value to integer, such as: ORDER BY stop_sequence + 0. The reason
+`GtfsToSql` works in this way is because it is intended as a lightweight
+tool to be able to quickly query GTFS data. I recommend rolling your own
+importer to treat data exactly as you need it, especially in conjunction
+with some of the optimization techniques recommended later in this book.*
+
+### File Encodings
+
+The GTFS specification does not indicate whether files should be encoded
+using UTF-8, ISO-8859-1 or otherwise. Since a GTFS feed is not
+necessarily in English, you must be willing to handle an extended
+character set.
+
+The GtfsToSql tool introduced above automatically detects the encoding
+of each file using the juniversalchardet Java library
+(<https://code.google.com/p/juniversalchardet/>).
+
+I recommend you take some time looking at the source code of GtfsToSql
+to further understand this so you are aware of handling encodings
+correctly if you write your own parser.
+
+### Optimizing GTFS Feeds
+
+If you are creating a database that is to be distributed onto a mobile
+device such as an iPhone or Android phone, then disk space and
+computational power is at a premium. Even if you are setting up a
+database to be queried on a server only, then making the database
+perform as quickly as possible is still important.
+
+In the following chapters are techniques for optimizing GTFS feeds.
+There are many techniques that can be applied to improve the performance
+of GTFS, such as:
+
+* Using integer identifiers rather than string identifiers (for route
+ IDs, trip IDs, stop IDs, etc.) and creating appropriate indexes
+* Removing redundant shape points and encoding shapes
+* Deleting unused data
+* Reusing repeating trip patterns.
+
+Changing the data to use integer IDs makes the greatest improvement to
+performance, but the other techniques also help significantly.
+
+Depending on your needs, there are other optimizations that can be made
+to reduce file size and speed up querying of the data, but the ease of
+implementing them may depend on your database storage system and the
+programming language used to query the data. The above list is a good
+starting point.
+