gtfs-book/ch-15-optimizing-shapes.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158

## 15. Optimizing Shapes

Shape data in a GTFS feed (that is, the records from `shapes.txt`)
represents a large amount of data. There are a number of ways to reduce
this data, which can help to:

* Speed up data retrieval
* Reduce the amount of data to transmit to app / web site users
* Speed up rendering of the shape onto a map (such as a native mobile
  map or a JavaScript map).

Two ways to reduce shape data are as follows:

* **Reducing the number of points in a shape.** The shapes included in
  GTFS are often very precise and include a number of redundant
  points. Many of these can be removed without a noticeable loss of
  shape quality using the *Douglas-Peucker Algorithm*.
* **Encoding all points in a shape into a single value.** The *Encoded
  Polyline Algorithm* used in the Google Maps JavaScript API can also
  be used with GTFS shapes. This reduces the amount of storage
  required and also makes looking up all points in a shape far
  quicker.

### Reducing Points in a Shape

Many of the shapes you find in GTFS feeds are extremely detailed. They
often follow the exact curvature of the road and may consist of hundreds
or thousands of points for a trip that might have only 30 or 40 stops.

While this level of detail is useful, the sheer amount of data required
to be rendered on a map can be a massive performance hit from the
perspective of retrieving the data as well as rendering on a map.
Realistically, shapes do not need this much detail in order to convey
their message to your users.

Consider the following shape from Portland that has been rendered using
Google Maps. The total shape consists of 1913 points.

![Original Shape](images/shape-original.jpg)

Compare this now to the same shape that has had redundant points
removed. The total number of points in this shape is 175, which
represents about a 90% reduction.

![Reduced Shape](images/shape-reduced.jpg)

If you look closely, you can see some minor loss of detail, but for the
most part, the shapes are almost identical.

This reduction in points can be achieved using the Douglas-Peucker
Algorithm. It does so by discarding points that do not deviate
significantly between its surrounding points.

The Douglas-Peucker Algorithm works as follows:

* Begin with the first and last points in the path (A and B). These
  are always kept.
* Find the point between the first and last that is furthest away from
  the line joining the first and last line (the orthogonal distance --
  see the figure below).
* If this point is greater than the allowed distance (the tolerance
  level), the point is kept (call it X).
* Repeat this algorithm twice: once using A as the first point and X
  as the last point, then again using X as the first point and B as
  the last point.

This algorithm is recursive, and continues until all points have been
checked.

***Note:** The tolerance level determines how aggressively points are
removed. A higher tolerance value is less aggressive and discards less
data, while a lower tolerance discards more data.*

The following diagram shows what orthogonal distance means.

![Orthogonal Distance](images/orthogonal-distance.jpg)

The following resources provide more information about the
Douglas-Peucker Algorithm and describe how to implement it in your own
systems:

* <http://en.wikipedia.org/wiki/Ramer-Douglas-Peucker_algorithm>
* <http://www.loughrigg.org/rdp/>
* <http://stackoverflow.com/questions/2573997/reduce-number-of-points-in-line>.

You can often discard about 80-90% of all shape data before seeing a
significant loss of line detail.

### Encoding Shape Points

A single entry in `shapes.txt` corresponds to a single point in a
single shape. Each entry includes a shape ID, a latitude and longitude.

***Note:** The `shape_dist_traveled` field is also included, but you do not
strictly need to use this field (nor the corresponding field in
stop_times.txt). The technique described in this section will not work
if you intend to use `shape_dist_traveled`.*

This means if you want to look up a shape by its ID, you may need to
retrieve several hundreds of rows from a database. Using the Encoded
Polyline Algorithm you can change your GTFS database so each shape is
represented by a single row in a database. This means the shape can be
found much more quickly and much less data needs to be processed to
determine the shape.

Consider the following data, taken from TriMet's `shapes.txt` file.
This data represents the first five points of a shape.

| `shape_id` | `shape_pt_lat` | `shape_pt_lon` |  `shape_pt_sequence` |  `shape_dist_traveled` |
| :--------- | :------------- | :------------- | :------------------- | :--------------------- |
| `185328`   | `45.52291 `    | `-122.677372`  |  `1`                 |  `0.0`                 |
| `185328`   | `45.522921`    | `-122.67737`   |  `2`                 |  `3.7`                 |
| `185328`   | `45.522991`    | `-122.677432`  |  `3`                 |  `34.0`                |
| `185328`   | `45.522992`    | `-122.677246`  |  `4`                 |  `81.5`                |
| `185328`   | `45.523002`    | `-122.676567`  |  `5`                 |  `255.7`               |

If you apply the Encoded Polyline Algorithm to this data, the
coordinates can be represented using the following string.

```
eeztGrlwkVAAML?e@AgC
```

To learn how to arrive at this value, you can read up on the Encoded
Polyline Algorithm at
<https://developers.google.com/maps/documentation/utilities/polylinealgorithm>.

Instead of having every single shape point in a single table, you can
create a table that has one record per shape. The following SQL
statement is a way you could achieve this.

```sql
CREATE TABLE shapes (
  shape_id TEXT,
  encoded_shape TEXT
);
```

The following table shows how this data could be represented in a
database.

| `shape_id` | `encoded_shape`        |
| :--------- | :--------------------- |
| `185328`   | `eeztGrlwkVAAML?e@AgC` |

Storing the shape in this manner means you can retrieve an entire shape
by looking up only one database row and running it through your decoder.

To further demonstrate how both the encoding and decoding works, try out
the polyline utility at
<https://developers.google.com/maps/documentation/utilities/polylineutility>.

You can find implementations for encoding and decoding points for
various languages at the following locations:

* <http://facstaff.unca.edu/mcmcclur/GoogleMaps/EncodePolyline/>
* <https://github.com/emcconville/google-map-polyline-encoding-tool>