gtfs-book/ch-13-importing-to-sql.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125

## 13. Importing a GTFS Feed to SQL

One of the great things about GTFS is that it is already in a format
conducive to being used in an SQL database. The presence of various IDs
in each of the different files makes it easy to join the tables in order
to extract the data you require.

To try this yourself, download `GtfsToSql`
(<https://github.com/OpenMobilityData/GtfsToSql>). This is a Java
command-line application that imports a GTFS feed to an SQLite database.
This application also supports PostgreSQL, but the examples used here
are for SQLite.

The pre-compiled `GtfsToSql` Java archive can be downloaded from its
GitHub repository at
<https://github.com/OpenMobilityData/GtfsToSql/tree/master/dist>.

To use `GtfsToSql`, all you need is an extracted GTFS feed. The
following instructions demonstrate how you to import the TriMet feed
that has been referenced throughout this book.

Firstly, download and extract the feed. The following commands use curl
to download the file, then unzip to extract the file to a sub-directory
called `trimet`.

```
$ curl http://developer.trimet.org/schedule/gtfs.zip > gtfs.zip

$ unzip gtfs.zip -d trimet/
```

To create an SQLite database from this feed, the following command can
be used.

```
$ java -jar GtfsToSql.jar -s jdbc:sqlite:./db.sqlite -g ./trimet
```

This may take a minute or two to complete (you will see progress as it
imports the feed and then creates indexes), and at the end you will have
a GTFS database in a file called `db.sqlite`. You can then query this
database with the command-line `sqlite3` tool, as shown in the
following example.

```
$ sqlite3 db.sqlite
sqlite> SELECT * FROM agency;
|TriMet|http://trimet.org|America/Los_Angeles|en|503-238-7433

sqlite> SELECT * FROM routes WHERE route_type = 0;
90|||MAX Red Line||0|http://trimet.org/schedules/r090.htm||
100|||MAX Blue Line||0|http://trimet.org/schedules/r100.htm||
190|||MAX Yellow Line||0|http://trimet.org/schedules/r190.htm||
193||Portland Streetcar|NS Line||0|http://trimet.org/schedules/r193.htm||
194||Portland Streetcar|CL Line||0|http://trimet.org/schedules/r194.htm||
200|||MAX Green Line||0|http://trimet.org/schedules/r200.htm||
250|||Vintage Trolley||0|http://trimet.org/schedules/r250.htm||
```

The first query above finds all agencies stored in the database, while
the second finds all routes marked as Light Rail (`route_type` of
`0`).

**Note: In the following chapters there are more SQL examples. All of
these examples are geared towards running on an SQLite database that has
been created in this manner.**

All tables in this database match up with the corresponding GTFS
filename (so for `agency.txt`, the table name is `agency`, while for
`stop_times.txt` the table name is `stop_times`). The columns in SQL
have the same name as the value in the corresponding GTFS file.

***Note:** All data imported using this tool is stored as text in the
database. This means you may need to be careful when querying integer
data. For example, ordering stop times by stop_sequence may not produce
expected results (for instance, 29 as a string comes before 3). Although
it is a performance hit, you can change this behavior by casting the
value to integer, such as: ORDER BY stop_sequence + 0. The reason
`GtfsToSql` works in this way is because it is intended as a lightweight
tool to be able to quickly query GTFS data. I recommend rolling your own
importer to treat data exactly as you need it, especially in conjunction
with some of the optimization techniques recommended later in this book.*

### File Encodings

The GTFS specification does not indicate whether files should be encoded
using UTF-8, ISO-8859-1 or otherwise. Since a GTFS feed is not
necessarily in English, you must be willing to handle an extended
character set.

The GtfsToSql tool introduced above automatically detects the encoding
of each file using the juniversalchardet Java library
(<https://code.google.com/p/juniversalchardet/>).

I recommend you take some time looking at the source code of GtfsToSql
to further understand this so you are aware of handling encodings
correctly if you write your own parser.

### Optimizing GTFS Feeds

If you are creating a database that is to be distributed onto a mobile
device such as an iPhone or Android phone, then disk space and
computational power is at a premium. Even if you are setting up a
database to be queried on a server only, then making the database
perform as quickly as possible is still important.

In the following chapters are techniques for optimizing GTFS feeds.
There are many techniques that can be applied to improve the performance
of GTFS, such as:

* Using integer identifiers rather than string identifiers (for route
  IDs, trip IDs, stop IDs, etc.) and creating appropriate indexes
* Removing redundant shape points and encoding shapes
* Deleting unused data
* Reusing repeating trip patterns.

Changing the data to use integer IDs makes the greatest improvement to
performance, but the other techniques also help significantly.

Depending on your needs, there are other optimizations that can be made
to reduce file size and speed up querying of the data, but the ease of
implementing them may depend on your database storage system and the
programming language used to query the data. The above list is a good
starting point.