16. Deleting Unused Data**
Once you have imported a GTFS feed into a database, it is possible for there to be a lot of redundant data. This can ultimately slow down any querying of that data as well as bloating the size of the database. If you are making an app where the GTFS database is queried on the device then disk space and computational time are at a premium, so you must do what you can to reduce resource usage.
The first thing to check for is expired services. You can do this by
searching calendar.txt
for entries that expire before today's date.
Be aware though, you also need to ensure there are no
calendar_dates.txt
entries overriding these services (a service
could have an end_date
of, say, 20140110
, but also have a
calendar_dates.txt
entry for 20140131
).
Firstly, find service IDs in calendar_dates.txt
that are still
active using the following query.
SELECT service_id FROM calendar_dates WHERE date >= '20140110';
Note: In order to improve performance of searching for dates, you
should import the date field in calendar_dates.txt
as an integer, as
well as start_date
and end_date
in calendar.txt
.
Any services matched in this query should not be removed. You can then
find service IDs in calendar.txt
with the following SQL query.
SELECT * FROM calendar WHERE end_date < '20140110'
AND service_id NOT IN (
SELECT service_id FROM calendar_dates WHERE date >= '20140110'
);
Before deleting these services, corresponding trips and stop times must be removed since you need to know the service ID in order to delete a trip. Likewise, stop times must be deleted before trips since you need to know the trip IDs to be removed.
DELETE FROM stop_times WHERE trip_id IN (
SELECT trip_id FROM trips WHERE service_id IN (
SELECT service_id FROM calendar WHERE end_date < '20140110'
AND service_id NOT IN (
SELECT service_id FROM calendar_dates WHERE date >= '20140110'
)
)
);
Now there may be a series of trips with no stop times. Rather than repeating the above sub-queries, a more thorough way of removing trips is to remove trips with no stop times.
DELETE FROM trips WHERE trip_id NOT IN (
SELECT DISTINCT trip_id FROM stop_times
);
With all service_id
references removed, you can remove the expired
rows from calendar.txt
using the following SQL query.
DELETE FROM calendar WHERE end_date < '20140110'
AND service_id NOT IN (
SELECT DISTINCT service_id FROM calendar_dates WHERE date >= '20140110'
);
The expired rows in calendar_dates.txt
can also be removed, which
can be achieved using the following query.
DELETE FROM calendar_dates WHERE date < '20140110';
There may now be some stops that are not used by any trips. These can be removed using the following query.
DELETE FROM stops WHERE stop_id NOT IN (
SELECT DISTINCT stop_id FROM stop_times
);
Additionally, you can remove unused shapes and routes using the following queries.
DELETE FROM shapes WHERE shape_id NOT IN (
SELECT DISTINCT shape_id FROM trips
);
DELETE FROM routes WHERE route_id NOT IN (
SELECT DISTINCT route_id FROM trips
);
There are other potential rows that can be removed (such as records in
transfers.txt
that reference non-existent stops), but hopefully you
get the idea from the previous queries.