aboutsummaryrefslogtreecommitdiff
path: root/gtfs-book/ch-16-deleting-unused-data.md
diff options
context:
space:
mode:
Diffstat (limited to 'gtfs-book/ch-16-deleting-unused-data.md')
-rw-r--r--gtfs-book/ch-16-deleting-unused-data.md106
1 files changed, 106 insertions, 0 deletions
diff --git a/gtfs-book/ch-16-deleting-unused-data.md b/gtfs-book/ch-16-deleting-unused-data.md
new file mode 100644
index 0000000..f2a2fc0
--- /dev/null
+++ b/gtfs-book/ch-16-deleting-unused-data.md
@@ -0,0 +1,106 @@
+## 16. Deleting Unused Data**
+
+Once you have imported a GTFS feed into a database, it is possible for
+there to be a lot of redundant data. This can ultimately slow down any
+querying of that data as well as bloating the size of the database. If
+you are making an app where the GTFS database is queried on the device
+then disk space and computational time are at a premium, so you must do
+what you can to reduce resource usage.
+
+The first thing to check for is expired services. You can do this by
+searching `calendar.txt` for entries that expire before today's date.
+Be aware though, you also need to ensure there are no
+`calendar_dates.txt` entries overriding these services (a service
+could have an `end_date` of, say, `20140110`, but also have a
+`calendar_dates.txt` entry for `20140131`).
+
+Firstly, find service IDs in `calendar_dates.txt` that are still
+active using the following query.
+
+```sql
+SELECT service_id FROM calendar_dates WHERE date >= '20140110';
+```
+
+***Note:** In order to improve performance of searching for dates, you
+should import the date field in `calendar_dates.txt` as an integer, as
+well as `start_date` and `end_date` in `calendar.txt`.*
+
+Any services matched in this query should not be removed. You can then
+find service IDs in `calendar.txt` with the following SQL query.
+
+```sql
+SELECT * FROM calendar WHERE end_date < '20140110'
+ AND service_id NOT IN (
+ SELECT service_id FROM calendar_dates WHERE date >= '20140110'
+ );
+```
+
+Before deleting these services, corresponding trips and stop times must
+be removed since you need to know the service ID in order to delete a
+trip. Likewise, stop times must be deleted before trips since you need
+to know the trip IDs to be removed.
+
+```sql
+DELETE FROM stop_times WHERE trip_id IN (
+ SELECT trip_id FROM trips WHERE service_id IN (
+ SELECT service_id FROM calendar WHERE end_date < '20140110'
+ AND service_id NOT IN (
+ SELECT service_id FROM calendar_dates WHERE date >= '20140110'
+ )
+ )
+ );
+```
+
+Now there may be a series of trips with no stop times. Rather than
+repeating the above sub-queries, a more thorough way of removing trips
+is to remove trips with no stop times.
+
+```sql
+DELETE FROM trips WHERE trip_id NOT IN (
+ SELECT DISTINCT trip_id FROM stop_times
+);
+```
+
+With all `service_id` references removed, you can remove the expired
+rows from `calendar.txt` using the following SQL query.
+
+```sql
+DELETE FROM calendar WHERE end_date < '20140110'
+ AND service_id NOT IN (
+ SELECT DISTINCT service_id FROM calendar_dates WHERE date >= '20140110'
+ );
+```
+
+The expired rows in `calendar_dates.txt` can also be removed, which
+can be achieved using the following query.
+
+```sql
+DELETE FROM calendar_dates WHERE date < '20140110';
+```
+
+There may now be some stops that are not used by any trips. These can be
+removed using the following query.
+
+```sql
+DELETE FROM stops WHERE stop_id NOT IN (
+ SELECT DISTINCT stop_id FROM stop_times
+);
+```
+
+Additionally, you can remove unused shapes and routes using the
+following queries.
+
+```sql
+DELETE FROM shapes WHERE shape_id NOT IN (
+ SELECT DISTINCT shape_id FROM trips
+);
+
+DELETE FROM routes WHERE route_id NOT IN (
+ SELECT DISTINCT route_id FROM trips
+);
+```
+
+There are other potential rows that can be removed (such as records in
+`transfers.txt` that reference non-existent stops), but hopefully you
+get the idea from the previous queries.
+