We have an AWS Lambda Function pulling One Bus Away API responses for King County Metro.
agency = '1' # this is the agency ID for King County Metro
base_url = 'http://pugetsound.onebusaway.org/api/'
endpoints = {'position': 'gtfs_realtime/vehicle-positions-for-agency/{agency}.pb',
'alert': 'gtfs_realtime/alerts-for-agency/{agency}.pb',
'update': 'gtfs_realtime/trip-updates-for-agency/{agency}.pb'}
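As a minimal sketch of how these settings combine into a request URL: the `build_url` helper below is hypothetical (it is not part of the Lambda function), and any API key or query parameters the service may require are not shown.

```python
agency = '1'  # agency ID for King County Metro
base_url = 'http://pugetsound.onebusaway.org/api/'
endpoints = {
    'position': 'gtfs_realtime/vehicle-positions-for-agency/{agency}.pb',
    'alert': 'gtfs_realtime/alerts-for-agency/{agency}.pb',
    'update': 'gtfs_realtime/trip-updates-for-agency/{agency}.pb',
}


def build_url(kind, agency_id=agency):
    """Return the full feed URL for one endpoint type (hypothetical helper)."""
    return base_url + endpoints[kind].format(agency=agency_id)


position_url = build_url('position')
print(position_url)
```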
We're focusing on the position files. If you call (ping) the One Bus Away API's position endpoint at any point in time, it returns the last recorded vehicle location (lat/lon) for each bus travelling along its route at that time:
entity {
id: "2"
vehicle {
trip {
trip_id: "34746616"
route_id: "100136"
}
position {
latitude: 47.657772
longitude: -122.14334
}
timestamp: 1528154042
vehicle {
id: "7222"
}
}
}
The time and distance between bus location records vary. We're pinging the One Bus Away API every minute.
We can plot one set of vehicle positions along a route (in this case, route 7). In the GTFS, there's a file describing this particular vehicle's (vehicle_id) journey, or trip (trip_id), at a given time of day going in a specific direction along route 7.
We can plot another trip here:
If we plot the trips together, you can see that the vehicle locations are not consistent from one trip to another.
Let's zoom into the route and see what other information we have:
For each route, the gtfs gives us:
- route vertex points (all the lat/lon coordinates that make up a route's "shape")
- and an indication whether the route vertex is associated with a bus stop
From the image above, there are some important things to note:
- route vertex points occur at every street intersection
- if there's a bus stop, another route vertex point occurs at the same location along the centerline of the street
- the bus positions (or vehicle locations) do not line up directly with either the route vertex points or the bus stops
Our goal is to better understand how buses travel along their routes. One natural question is "How fast is the bus moving along the route?" To get speed, we need:
- change in time (we have timestamps for each vehicle location)
- and distance traveled. To think about getting "distance traveled", it's helpful to look at the picture below:
In the picture, there are two vehicle positions along a route. First, the vehicle was at location 1 at time = t1. The next vehicle observation was at location 2 and time = t2. We can find distance traveled in 2 ways:
- Naive approach - we can take the straight line distance between location 1 and location 2 (ignoring the actual route). This works if observations are close together, but if two observations are far apart and the route is non-linear, this naive approach will have a lot of error.
- Route aware - we can find the nearest route shape vertex to each vehicle location. The GTFS provides the distance traveled between each route vertex, so there is enough information to calculate the distance traveled by taking shape_distance_traveled(loc2) - shape_distance_traveled(loc1)
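The two approaches can be compared with a small sketch. Everything in it is made up for illustration: the L-shaped route, the two vehicle observations, and the helper names. It also assumes that shape_distance_traveled is stored as a cumulative distance in meters at each route vertex.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle (straight line) distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_vertex(lat, lon, shape):
    """Return the route vertex closest to a vehicle location."""
    return min(shape, key=lambda v: haversine_m(lat, lon, v['lat'], v['lon']))


# Made-up L-shaped route: roughly 1 km north, then roughly 1 km east.
shape = [
    {'lat': 47.6000, 'lon': -122.3400, 'shape_distance_traveled': 0.0},
    {'lat': 47.6090, 'lon': -122.3400, 'shape_distance_traveled': 1000.0},
    {'lat': 47.6090, 'lon': -122.3267, 'shape_distance_traveled': 2000.0},
]

# Two made-up vehicle observations, 120 seconds apart.
loc1 = {'lat': 47.6001, 'lon': -122.3401, 'timestamp': 1528154042}
loc2 = {'lat': 47.6089, 'lon': -122.3268, 'timestamp': 1528154162}

# Naive approach: straight line between the two observations.
naive_m = haversine_m(loc1['lat'], loc1['lon'], loc2['lat'], loc2['lon'])

# Route aware approach: difference in distance along the route shape.
v1 = nearest_vertex(loc1['lat'], loc1['lon'], shape)
v2 = nearest_vertex(loc2['lat'], loc2['lon'], shape)
route_aware_m = v2['shape_distance_traveled'] - v1['shape_distance_traveled']

# Speed = distance traveled / change in time.
elapsed_s = loc2['timestamp'] - loc1['timestamp']
speed_mps = route_aware_m / elapsed_s

print(round(naive_m), route_aware_m, round(speed_mps, 1))
```

Because the made-up route turns a corner between the two observations, the straight line distance comes out noticeably shorter than the distance along the route, which is the error the naive approach introduces.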
Since bus riders are more familiar with distances and timing between stops, it's helpful to contextualize everything around bus stops. There are two ways we are doing this process:
- Find the nearest route vertex point to each vehicle location. If the nearest route vertex point is a bus stop, keep that row in the dataset. Otherwise, remove the observation (row) from the dataset.
- Find the nearest route vertex point to each vehicle location. Find the distance and time between route vertex points. Interpolate when the vehicle would have been at the bus stop in between route vertex points.
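Both options can be sketched in a few lines. This is an illustration under stated assumptions, not the notebooks' actual code: the field names (`is_stop`, `dist`), the helper `interpolate_stop_time`, and the sample values are all hypothetical, and the interpolation assumes the bus moves at constant speed between the two observations.

```python
def interpolate_stop_time(t1, d1, t2, d2, d_stop):
    """Estimate when the bus passed a stop located d_stop along the route.

    (t1, d1) and (t2, d2) are the timestamps and distances along the route
    of two consecutive observations. Returns None if the stop does not lie
    between them.
    """
    if d2 < d1 or not (d1 <= d_stop <= d2):
        return None
    if d2 == d1:
        return float(t1)  # the bus did not move and is sitting at the stop
    fraction = (d_stop - d1) / (d2 - d1)
    return t1 + fraction * (t2 - t1)


# Option 1: keep only observations whose nearest route vertex is a bus stop.
observations = [
    {'timestamp': 1528154042, 'dist': 500.0, 'is_stop': False},
    {'timestamp': 1528154102, 'dist': 900.0, 'is_stop': True},
]
at_stops = [obs for obs in observations if obs['is_stop']]

# Option 2: interpolate the time at a stop 600 m along the route,
# which lies between the two observations above.
first, second = observations
stop_time = interpolate_stop_time(
    first['timestamp'], first['dist'],
    second['timestamp'], second['dist'],
    d_stop=600.0,
)
print(len(at_stops), stop_time)
```

Option 1 is simpler but discards every observation that is not near a stop. Option 2 keeps more of the data at the cost of the constant-speed assumption.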
Please see the instructions below to set up your python environment and get started with the code.
Note: the data is stored using AWS. If AWS is unfamiliar, scroll down to "Using Amazon Web Services to access Ben’s data" below, for a little tutorial. Since there are some security concerns with an S3 bucket open to the public, please email Ben at ben.malnor@gmail.com to coordinate access.
Additionally, notebooks 2 and 3 reference GTFS feeds that should be downloaded from https://transitfeeds.com/p/king-county-metro/73 and unzipped to data/source/gtfs_YYYYMMDD folders.
download_raw_locations.sh: downloads raw bus position data. In the file, you'll have to edit the year/month you're looking for:
s3://bus350-data/unpacked/2019/06/
The above will download raw data for year 2019, month 06.
To see which months are available for a given year, you can type aws s3 ls s3://bus350-data/unpacked/2018/
01_transform_source_data.ipynb: transforms said data into a pandas DataFrame indexed on the datetime - the output of this is available on S3 in file positions_201801.h5
02_transform_e_locations.ipynb: selects northbound E-line vehicles and calculates closest_stop_id (used in future analysis) - the output of this is available on S3 in file e_northbound_locations_2018-01.h5
03_e_segment_analysis.ipynb: transforms data into a shape that will let us calculate time between two stops for northbound E (denny/aurora and 46th/aurora), then generates histograms for the distribution of those commute times
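To give a feel for the segment analysis, here is a rough sketch of computing the time between two stops for each trip. It is not the notebook's code: it uses plain Python rather than pandas, the rows and stop IDs are invented, and the helper `segment_travel_times` is hypothetical. Only the `closest_stop_id` column name comes from the notebooks described above.

```python
from collections import defaultdict


def segment_travel_times(rows, from_stop, to_stop):
    """Return {trip_id: seconds} between the first sighting at each stop."""
    first_seen = defaultdict(dict)
    for row in sorted(rows, key=lambda r: r['timestamp']):
        stops = first_seen[row['trip_id']]
        # Remember only the earliest timestamp at each stop for this trip.
        stops.setdefault(row['closest_stop_id'], row['timestamp'])

    times = {}
    for trip_id, stops in first_seen.items():
        if from_stop in stops and to_stop in stops and stops[to_stop] > stops[from_stop]:
            times[trip_id] = stops[to_stop] - stops[from_stop]
    return times


# Invented observations: trips A and B pass both stops, trip C only the first.
rows = [
    {'trip_id': 'A', 'closest_stop_id': 'S1', 'timestamp': 1000},
    {'trip_id': 'A', 'closest_stop_id': 'S1', 'timestamp': 1060},
    {'trip_id': 'A', 'closest_stop_id': 'S2', 'timestamp': 1600},
    {'trip_id': 'B', 'closest_stop_id': 'S1', 'timestamp': 2000},
    {'trip_id': 'B', 'closest_stop_id': 'S2', 'timestamp': 2480},
    {'trip_id': 'C', 'closest_stop_id': 'S1', 'timestamp': 3000},
]

times = segment_travel_times(rows, 'S1', 'S2')
print(times)
```

Using the first sighting at each stop is one possible choice. Using the last sighting at the starting stop instead would exclude time spent waiting there. The resulting values are what a histogram of travel times would be built from.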
In a fresh Python 3.6 env:
pip install pandas geopandas numpy shapely fiona six pyproj tables matplotlib tqdm geopy
In more detail, assuming you have successfully installed Anaconda on your system:
On Mac, you can set up a Python 3.6 environment using conda, but you need to install the above packages with pip.
#Create a new conda environment named `realtime-buses` with Python version 3.6 (3.7 does not work) and the ipython kernel
conda create --name realtime-buses python=3.6 ipykernel
#Activate the new environment
source activate realtime-buses
#Use pip to install modules needed for geopandas
pip install geopandas numpy pandas shapely fiona six pyproj tables matplotlib tqdm geopy
#Install the kernel for the new environment (for the current user) so Jupyter will detect it
ipython kernel install --user --name realtime-buses --display-name "Python 3.6 (realtime-buses)"
#Or... not sure what the difference is:
#python -m ipykernel install --user --name realtime-buses --display-name "Python 3.6 (realtime-buses)"
#Launch Jupyter in your browser. The directory from which you type the command will be
#the top level directory in your Jupyter session, and you can navigate down from there
#if needed. Click on a .ipynb file to open it, or click the 'New' button to create
#a new notebook. You may have to explicitly select the "Python 3.6 (realtime-buses)" kernel.
jupyter notebook
On Windows, it should work to install everything with conda. Instead of tables, install pytables (this is needed to work with .h5 files).
#If you don't have Anaconda installed, install it from here. NOTE: if you don't check the box for adding Conda folders to your path, you will likely have trouble later.
#Create a new conda environment named `realtime-buses` with Python version 3.6 (3.7 does not work) and the ipython kernel
conda create --name realtime-buses python=3.6 ipykernel
#Activate the new environment
conda activate realtime-buses
#On Windows, instead of pip:
conda install geopandas numpy pandas shapely fiona six pyproj pytables matplotlib tqdm geopy
#Install the kernel for the new environment (for the current user) so Jupyter will detect it
ipython kernel install --user --name realtime-buses --display-name "Python 3.6 (realtime-buses)"
#Launch Jupyter in your browser. The directory from which you type the command will be
#the top level directory in your Jupyter session, and you can navigate down from there
#if needed. Click on a .ipynb file to open it, or click the 'New' button to create
#a new notebook. You may have to explicitly select the "Python 3.6 (realtime-buses)" kernel.
jupyter notebook
**Windows install and run troubleshooting notes**
- If you followed the conda install's bad advice not to add conda to your path, you may have to add a bunch of stuff to your path (e.g. if you get an HTTP error involving ssl not found). I (Alice) found that I had to run setx PATH "%path%";c:\Users\Alice\Anaconda3;c:\Users\Alice\Anaconda3\scripts;c:\Users\Alice\Anaconda3\condabin.
- If you get a PackagesNotFound error for geopy:
conda config --append channels conda-forge
conda install geopy
- If, when running from a Jupyter notebook, you get a package-not-installed error for a package that installed successfully in previous steps (e.g. geopandas), just reinstall it from conda-forge (e.g. conda install -c conda-forge geopandas).
Using Amazon Web Services to access Ben’s data
If you don’t have an AWS account, sign up at https://aws.amazon.com/
If you don’t have the AWS CLI, install it from https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html
At this point, if aws commands in the cli return “Unable to locate credentials”, you need to get Amazon Identity and Access Management (IAM) credentials
And set up the credentials as described here: https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-started_create-admin-group.html, under "creating an aws use..."r. You will be asked for a user name and password.
This will generate an email. Use the sign-in URL in the email, along with the user name and password you created in the previous step.
At this point you are in the console, which has a link to the Identity and Access Management (IAM) console. Go there.
And get an access key id and access key as described here: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html#cli-quick-configuration
Double click on users
Check box, then double click on a user (e.g. Admin),
Select the Security Credentials tab, and click Get Access Key
Now back in your dos command prompt, type
aws configure
And supply your credentials
Now just grab the data from https://github.com/350Seattle-Transportation-Team/gtfs-realtime (green download button)