Data Experiments at Remix: Estimating Passenger Load with Incomplete Data

Santiago Toso

Data Scientist

Passenger load reveals the demand for transit so that agencies can provide supply smartly. Image by Adi Talwar.

Passenger load is an important performance indicator for a transit system because it shows how full or empty vehicles are. If the passenger load on a particular bus is over capacity, a planner might decide to allocate more resources to increase the frequency of that bus’s route. Ultimately, passenger load reveals the demand for transit so that agencies can provide supply smartly.

Passenger load is calculated by adding boardings and subtracting alightings at each stop.
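As a minimal sketch with made-up stop counts, the running load along a single trip is just a cumulative sum of boardings minus alightings:

```python
# Hypothetical boardings and alightings for one bus trip (five stops).
boardings = [12, 5, 8, 0, 3]
alightings = [0, 2, 6, 9, 11]

load = 0
loads = []
for b, a in zip(boardings, alightings):
    load += b - a  # passengers on board when leaving this stop
    loads.append(load)

print(loads)  # -> [12, 15, 17, 8, 0]
```

A planner would compare the peak of this running sum (17 here) against the vehicle’s capacity.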

The biggest challenge with passenger load is that most agencies do not have the underlying data to calculate it. Most agencies have access to boarding data, which they collect when passengers board the bus and “tap on”. However, boarding data only shows half the story, and because “tapping-off” systems are less common, agencies are usually missing the alighting data that shows where passengers are getting off the bus.

When passengers don't tap off, it is hard to get the alighting information, and therefore, the load.

At Remix, we’re always thinking through innovative solutions to help our customers fill their gaps. We know that alighting data is extremely valuable to agencies for estimating load, but this data is often unavailable in systems that don’t support “tapping off”. We decided to run a pilot with sample data from one of our customer agencies and create a predictive model to estimate missing alighting data.

Our Methodology

We approached this question through predictive modeling where we:

  1. Start by analyzing a historical dataset that includes the target, which is the variable we want to predict (in our case, the number of alightings), and several explanatory variables that can help explain the target’s behavior (in our case these might include boardings, trips per hour, jobs nearby, etc).
  2. Use part of the data (referred to as the train dataset) to create a model that explains the relationship between the explanatory variables and the target.
  3. Then, use the remaining part of the data (the test dataset) to test the performance of the model. Since our test set already includes actual alighting numbers, we can compare the model’s predictions to the real values to evaluate its accuracy.
  4. Ultimately, we want to use the model to predict the target variable (alightings) in incomplete datasets.
The diagram illustrates the predictive modeling process at a high level.
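The steps above can be sketched with a toy, single-variable model. Everything here is synthetic and for illustration only: we invent a noisy linear relationship between boardings (explanatory variable) and alightings (target), fit a least-squares line on the train split, and measure the error on the held-out test split:

```python
import random

random.seed(0)

# Step 1: a hypothetical historical dataset of 200 stops, where alightings
# roughly track boardings plus noise (an invented relationship).
boardings = [random.uniform(0, 100) for _ in range(200)]
alightings = [0.8 * b + random.gauss(0, 5) for b in boardings]

# Step 2: fit a simple linear model (slope and intercept) on the train split.
train_x, train_y = boardings[:150], alightings[:150]
test_x, test_y = boardings[150:], alightings[150:]

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

# Step 3: evaluate on the held-out test split, where actual alightings
# are known, by comparing predictions to real values.
predictions = [slope * x + intercept for x in test_x]
rmse = (sum((p - y) ** 2 for p, y in zip(predictions, test_y))
        / len(test_y)) ** 0.5
```

Step 4 would then apply the fitted `slope` and `intercept` to stops where only boardings are known.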

Our Experiment

We created several predictive models for our target variable, alightings, and compared the performance of these models to the real alighting numbers in the test dataset.

The Results

The best performing model had the following metrics:

  • Coefficient of determination (R squared) = 0.77
  • Root Mean Square Error (RMSE) = 21.9

In other words:

  • The model can explain 77% of the variation of the alightings using the explanatory variables.
  • Roughly speaking, the model’s estimates are off by 21.9 alightings on average (RMSE weights large misses more heavily).

These are pretty good results considering that we had stops with more than 400 alightings!
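As a quick illustration of how these two metrics are computed (with invented numbers, not our experiment’s data):

```python
# Hypothetical actual vs. predicted alightings at five stops.
actual = [10.0, 40.0, 25.0, 60.0, 15.0]
predicted = [12.0, 35.0, 30.0, 55.0, 18.0]

residuals = [a - p for a, p in zip(actual, predicted)]

# RMSE: square the misses, average them, take the square root.
rmse = (sum(r ** 2 for r in residuals) / len(residuals)) ** 0.5

# R squared: 1 minus (unexplained variation / total variation).
mean_actual = sum(actual) / len(actual)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r_squared = 1 - ss_res / ss_tot
```

An R squared of 1 would mean perfect predictions; 0 would mean the model does no better than always guessing the average.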

The scatter plot compares real and predicted alighting values.

Missing by an average of twenty alightings might seem high, but remember that averages are influenced by extreme values. The graph above shows a few points that fall very far from the line. These outliers are responsible for considerably increasing the mean error of the estimates. Our treatment of outliers might change in future predictive models as we introduce more data.
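A toy example (numbers invented for illustration) of how one extreme stop can inflate RMSE:

```python
# Hypothetical absolute errors at five stops: four ordinary stops and one
# badly-missed outlier (say, a terminal stop with 400+ alightings).
errors = [3.0, 5.0, 2.0, 4.0, 120.0]

rmse = (sum(e ** 2 for e in errors) / len(errors)) ** 0.5
rmse_without_outlier = (sum(e ** 2 for e in errors[:-1])
                        / len(errors[:-1])) ** 0.5
median_error = sorted(errors)[len(errors) // 2]
```

Here a single outlier pushes RMSE above 50, even though the typical (median) stop is missed by only 4.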

To investigate the accuracy of our model further, we took a look at the R squared per route.

Performance of the predictive model, by route.

We can see that performance across individual bus routes varies significantly. From this analysis, we discovered that routes with more data have a bigger influence on the model’s fit, skewing the model toward the best-performing lines. The routes that show a performance of 70% or higher are responsible for 83% of the boardings. This is a useful discovery for future modeling: perhaps this model should only be used for routes with a minimum amount of boarding and alighting data.
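A hypothetical sketch of that per-route breakdown: computing R squared separately for each route makes it obvious when the model fits one route well and another poorly (R squared can even go negative when predictions are worse than simply guessing the route’s mean):

```python
# Invented per-stop records for two routes: (actual, predicted) alightings.
stops_by_route = {
    "Route 1": [(40.0, 38.0), (10.0, 14.0), (25.0, 24.0)],
    "Route 2": [(5.0, 1.0), (8.0, 2.0), (6.0, 12.0)],
}

def r_squared(pairs):
    actual = [a for a, _ in pairs]
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in pairs)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

scores = {route: r_squared(pairs) for route, pairs in stops_by_route.items()}
```

In this made-up example the model explains Route 1 well but is useless on Route 2, whose alighting counts are small and noisy, mirroring the spread we saw across real routes.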

But wait, wait, wait. Are you telling me that I need to start with a dataset that includes alightings? I thought the whole point of this experiment was that we don’t have this information!

Don’t worry, this is where things get really interesting. When we built the predictive model, we needed to first test its accuracy before incorporating another layer into the experiment. Now that we know the predictive model can yield a performance of nearly 80%, we felt optimistic enough to think through another methodology for when alighting data is completely missing.

Here’s a hypothetical situation to illustrate our thinking:

  • Let’s say we have data from four cities.
  • These four cities are similar in size, population density, jobs, transit network design, etc.
  • We have data on boardings and alightings in three out of the four cities.
  • In the fourth city, we are missing alighting data.

We can create a model from the data of the first three cities and use said model on the fourth city to get an estimate on alightings.
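As a sketch under purely synthetic assumptions (an invented linear relationship between boardings and alightings, shared by the structurally similar cities), the four-city setup looks like this: pool the three complete cities into one training set, fit a model, and apply it to the fourth city’s boardings:

```python
import random

random.seed(42)

# Invent stop-level data for a city where alightings roughly track
# boardings (a toy stand-in for "these cities are similar").
def make_city(n_stops):
    boardings = [random.uniform(0, 100) for _ in range(n_stops)]
    alightings = [0.9 * b + random.gauss(0, 3) for b in boardings]
    return boardings, alightings

# Three cities with complete boarding AND alighting data, pooled together.
train_x, train_y = [], []
for _ in range(3):
    b, a = make_city(100)
    train_x += b
    train_y += a

# Fit one least-squares line on the pooled data from the three cities.
mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

# The fourth city: boardings are known, alightings are missing.
city4_boardings, _ = make_city(50)
estimated_alightings = [slope * b + intercept for b in city4_boardings]
```

The fourth city never contributes alighting data; its estimates come entirely from the relationship learned in the other three.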

We hope to explore a solution for a city or transit agency that does not have data on alightings.

The meaty part here is being able to find similarities across cities by describing them “abstractly.” For instance, mathematical models don’t care whether a city’s name is Chicago or San Francisco. To find similarities, we flatten a city’s character into variables like boardings, jobs, population, number of trips, etc. In doing so, we make cities comparable. The image below demonstrates how visual abstractions (in this case, a transit network without stops) can help us identify similarities.

Describing cities abstractly allows us to find similarities. Left: Lahti (Finland) vs Örebro (Sweden). Right: Chicago (US) vs San Francisco (US)
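A minimal sketch of that flattening, with invented numbers: each city becomes a vector of comparable variables, and a standard similarity measure (cosine similarity here) can then compare any two cities:

```python
import math

# Purely illustrative feature vectors "flattening" each city into
# comparable variables: [daily boardings, jobs, population, daily trips].
cities = {
    "Chicago":       [780_000, 1_300_000, 2_700_000, 24_000],
    "San Francisco": [670_000, 1_100_000, 880_000, 21_000],
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim = cosine_similarity(cities["Chicago"], cities["San Francisco"])
```

In practice the variables would be normalized first (e.g. per capita), so that a single large feature like population does not dominate the comparison.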

We are excited about this added layer in predictive modeling and hope to see a future in which predictive modeling fills in passenger load data gaps and that data is visualized in Remix. Once passenger load (the “demand” for transit) is in Remix, planners can use our platform to design the “supply” of transit.

Prototype of passenger load data in a Remix map.

But before we invest more in this experiment, we want to hear from you:

  • What level of performance should we strive for in order for planners to trust this approach?
  • If we successfully provided passenger load estimates for you, what would you do with this data?

If you want to share your thoughts or learn more about our experiment, please reach out to us here or schedule a call with your customer success manager. Thanks for sharing!