aboutsummaryrefslogtreecommitdiff
path: root/.idea/gtfs-book/ch-16-deleting-unused-data.md
blob: f2a2fc004d2eff3229bdf3e2ee675a6f319ff929 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
## 16. Deleting Unused Data**

Once you have imported a GTFS feed into a database, it is possible for
there to be a lot of redundant data. This can ultimately slow down any
querying of that data as well as bloating the size of the database. If
you are making an app where the GTFS database is queried on the device
then disk space and computational time are at a premium, so you must do
what you can to reduce resource usage.

The first thing to check for is expired services. You can do this by
searching `calendar.txt` for entries that expire before today's date.
Be aware though, you also need to ensure there are no
`calendar_dates.txt` entries overriding these services (a service
could have an `end_date` of, say, `20140110`, but also have a
`calendar_dates.txt` entry for `20140131`).

Firstly, find service IDs in `calendar_dates.txt` that are still
active using the following query.

```sql
SELECT service_id FROM calendar_dates WHERE date >= '20140110';
```

***Note:** In order to improve performance of searching for dates, you
should import the date field in `calendar_dates.txt` as an integer, as
well as `start_date` and `end_date` in `calendar.txt`.*

Any services matched in this query should not be removed. You can then
find service IDs in `calendar.txt` with the following SQL query.

```sql
SELECT * FROM calendar WHERE end_date < '20140110'
  AND service_id NOT IN (
    SELECT service_id FROM calendar_dates WHERE date >= '20140110'
  );
```

Before deleting these services, corresponding trips and stop times must
be removed since you need to know the service ID in order to delete a
trip. Likewise, stop times must be deleted before trips since you need
to know the trip IDs to be removed.

```sql
DELETE FROM stop_times WHERE trip_id IN (
   SELECT trip_id FROM trips WHERE service_id IN (
     SELECT service_id FROM calendar WHERE end_date < '20140110'
       AND service_id NOT IN (
         SELECT service_id FROM calendar_dates WHERE date >= '20140110'
       )
    )
  );
```

Now there may be a series of trips with no stop times. Rather than
repeating the above sub-queries, a more thorough way of removing trips
is to remove trips with no stop times.

```sql
DELETE FROM trips WHERE trip_id NOT IN (
  SELECT DISTINCT trip_id FROM stop_times
);
```

With all `service_id` references removed, you can remove the expired
rows from `calendar.txt` using the following SQL query.

```sql
DELETE FROM calendar WHERE end_date < '20140110'
  AND service_id NOT IN (
    SELECT DISTINCT service_id FROM calendar_dates WHERE date >= '20140110'
  );
```

The expired rows in `calendar_dates.txt` can also be removed, which
can be achieved using the following query.

```sql
DELETE FROM calendar_dates WHERE date < '20140110';
```

There may now be some stops that are not used by any trips. These can be
removed using the following query.

```sql
DELETE FROM stops WHERE stop_id NOT IN (
  SELECT DISTINCT stop_id FROM stop_times
);
```

Additionally, you can remove unused shapes and routes using the
following queries.

```sql
DELETE FROM shapes WHERE shape_id NOT IN (
  SELECT DISTINCT shape_id FROM trips
);

DELETE FROM routes WHERE route_id NOT IN (
  SELECT DISTINCT route_id FROM trips
);
```

There are other potential rows that can be removed (such as records in
`transfers.txt` that reference non-existent stops), but hopefully you
get the idea from the previous queries.