New York Taxi Demand Prediction And Price Forecasting

Scalable PySpark Solution for Big Data

Dec 16, 2023

This article is co-written with Preeti Reddy Koppolu.

Yellow taxis are a familiar sight on the vibrant streets of New York City, offering a convenient travel option for both residents and visitors. However, for newcomers and tourists, one significant challenge remains: estimating the cost of a taxi journey from one point to another. This article addresses this key issue, exploring the intricate relationship between taxi demand at various pickup locations and how it influences fare prices.

Our approach involved building two machine learning models. The first predicts the demand for taxi services, while the second forecasts the fare for each ride. Both models are integrated into a Flask-based web application, where users enter their journey’s start and end points along with their desired travel time. The application reports current taxi demand and the corresponding fare, along with predictions of how demand might change in the upcoming hour and its potential impact on fares. Users also see basic details about their ride, such as the expected route, distance, and duration.

The code for this project is available on GitHub at this link.

Methodology

The block diagram provides an overview of the methodology applied to address the problem statement, which will be discussed in detail throughout this article.

1. Data Collection

Trip Records:

Source: NYC Taxi and Limousine Commission (TLC)
Coverage: January 2021 to December 2022
Size: 70 million rows
File Format: Parquet

Weather:

Source: Visual Crossing API

Public Holidays:

Source: Azure Open Datasets

Location Coordinates:

Source: Google Maps API

2. Data Storage

The data collected from various sources was initially stored in Azure Blob Storage, ensuring accessibility for further stages.

3. Data Preparation

3.1 Data Joining

To join the datasets, we created hourly time bins in the main trip records dataset. The weather and holiday datasets were then joined to the main dataset on these time bins. The location dataset was joined to the main dataset on pickup location ID and drop-off location ID.
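The hourly-bin join can be illustrated with a minimal plain-Python sketch. This is not the project's PySpark code: the helper names `hour_bin` and `join_weather`, the `temp` field, the `PULocationID` column name, and all sample values are assumptions made for the example.

```python
from datetime import datetime

def hour_bin(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour (used as the join key)."""
    return ts.replace(minute=0, second=0, microsecond=0)

# Hypothetical sample rows; the real data comes from the Parquet trip files.
trips = [
    {"tpep_pickup_datetime": datetime(2022, 7, 4, 9, 17, 30), "PULocationID": 132},
    {"tpep_pickup_datetime": datetime(2022, 7, 4, 10, 2, 5), "PULocationID": 161},
]

# Hourly weather keyed by the same truncated timestamp (values are made up).
weather_by_hour = {
    datetime(2022, 7, 4, 9): {"temp": 78.1},
    datetime(2022, 7, 4, 10): {"temp": 80.4},
}

def join_weather(trips, weather_by_hour):
    """Left-join weather onto trips through the hourly time bin."""
    joined = []
    for trip in trips:
        key = hour_bin(trip["tpep_pickup_datetime"])
        # Trips with no matching weather row keep a missing temperature.
        weather = weather_by_hour.get(key, {"temp": None})
        joined.append({**trip, "time_bin": key, **weather})
    return joined

joined_trips = join_weather(trips, weather_by_hour)
```

The same truncated timestamp could serve as the key for the holiday join, while the location join uses the pickup and drop-off IDs directly.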

3.2 Feature Engineering

“is_holiday”
The is_holiday feature is a binary variable that combines the day of the week with New York's public holidays. It is 0 for regular weekdays, reflecting a typical working day, and 1 for weekends and days recognized as public holidays in New York. This lets the model distinguish holiday and weekend demand patterns from those of regular weekdays. We derived it using a day-of-week function together with New York's holiday calendar.
“trip_duration”
The trip_duration feature captures the length of each taxi trip. We calculated it from the pickup and drop-off timestamps: subtracting the Unix timestamp of tpep_pickup_datetime from that of tpep_dropoff_datetime gives the duration of each trip in seconds. Trip duration is an essential input for the price prediction model.
“speed_mph”
The speed_mph feature represents the average speed of each taxi trip in miles per hour. We first derived trip_duration_hours by converting trip_duration into hours (dividing by 3,600 when the duration is held in seconds, or by 60 if it has already been converted to minutes). Dividing the trip distance by trip_duration_hours then gives the average speed in miles per hour. This lets the model account for speed, which is vital for understanding traffic conditions.
“demand_category”
We introduced the demand_category feature. This categorization was based on the number of pickups (no_of_pickups) during each time bin at each location. We categorized the demand as low (1) for 2 or fewer pickups, medium (2) for pickups ranging from 3 to 19, and high (3) for more than 19 pickups. After categorizing, the no_of_pickups column, having served its purpose, was dropped. This classification provides a clear, tiered understanding of taxi demand, enhancing our predictive model's ability to forecast demand under varying conditions and times.
“demand_category_squared”
We also introduced the demand_category_squared feature. Like demand_category, it is based on the number of pickups (no_of_pickups) during each time bin at each location, but it squares the category value: low (1) remains 1, medium (2) becomes 4, and high (3) becomes 9. Squaring widens the gap between levels, so the model can weight high demand more heavily instead of treating the steps between categories as equal.
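The feature definitions above can be sketched in plain Python as follows. This is an illustrative stand-in for the PySpark column expressions, not the project's code: the function names and the holiday set are assumptions, and the speed helper assumes the duration is held in seconds.

```python
from datetime import date, datetime

# Hypothetical holiday calendar; the project used Azure Open Datasets.
PUBLIC_HOLIDAYS = {date(2022, 7, 4)}

def is_holiday(ts: datetime, holidays=PUBLIC_HOLIDAYS) -> int:
    """Return 1 for weekends and public holidays, 0 for regular weekdays."""
    return int(ts.weekday() >= 5 or ts.date() in holidays)

def trip_duration_seconds(pickup: datetime, dropoff: datetime) -> float:
    """Difference between drop-off and pickup timestamps, in seconds."""
    return (dropoff - pickup).total_seconds()

def speed_mph(distance_miles: float, duration_seconds: float):
    """Average speed; None when the duration is not positive."""
    if duration_seconds <= 0:
        return None
    return distance_miles / (duration_seconds / 3600)

def demand_category(no_of_pickups: int) -> int:
    """1 = low (2 or fewer), 2 = medium (3 to 19), 3 = high (more than 19)."""
    if no_of_pickups <= 2:
        return 1
    if no_of_pickups <= 19:
        return 2
    return 3

def demand_category_squared(no_of_pickups: int) -> int:
    """Square of the demand category: 1, 4, or 9."""
    return demand_category(no_of_pickups) ** 2
```

For example, a 6-mile trip that takes 30 minutes has a trip_duration of 1,800 seconds and an average speed of 12 mph.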

3.3 Data Cleaning

Breaking Down the Date and Time Data
We divided the date and time information into separate columns for better clarity and analysis. This change gave us distinct columns for “year”, “month”, “day”, “hour”, and “minute”.
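As a small illustration, the split could look like this in plain Python. The helper name `split_datetime` is hypothetical; the project itself would have used PySpark column functions for the same purpose.

```python
from datetime import datetime

def split_datetime(ts: datetime) -> dict:
    """Break a timestamp into the separate columns used as model features."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "minute": ts.minute,
    }

parts = split_datetime(datetime(2022, 7, 4, 9, 17))
```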

Refining Fare Calculation
We made adjustments to how we calculate taxi fares. Specifically, we removed the tip amount, as it didn’t significantly impact the overall fare calculation. However, we included congestion surcharge and airport fees, which vary based on the pickup and drop-off locations, to provide a more accurate fare estimate.
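One way to picture the adjusted fare is as a sum of charge columns with the tip left out. The sketch below is an assumption about how the consolidation might work: the article states only that the tip was excluded and that congestion surcharge and airport fees were included, so the exact component list, the `total_fare` helper, and the sample values are illustrative.

```python
# Assumed components of the consolidated fare; tip_amount is deliberately absent.
FARE_COMPONENTS = (
    "fare_amount",
    "extra",
    "mta_tax",
    "tolls_amount",
    "improvement_surcharge",
    "congestion_surcharge",
    "airport_fee",
)

def total_fare(trip: dict) -> float:
    """Sum the fare components, treating missing or null values as zero."""
    return round(sum(trip.get(name) or 0.0 for name in FARE_COMPONENTS), 2)

# Hypothetical trip record.
sample_trip = {
    "fare_amount": 12.5,
    "extra": 1.0,
    "mta_tax": 0.5,
    "tolls_amount": 0.0,
    "improvement_surcharge": 0.3,
    "congestion_surcharge": 2.5,
    "airport_fee": None,   # null in the raw data for some trips
    "tip_amount": 3.0,     # ignored by total_fare
}

fare = total_fare(sample_trip)
```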

Streamlining Our Dataset
To keep the dataset efficient and relevant, we dropped several columns that were no longer necessary: “tip_amount”, “fare_amount”, “extra”, “mta_tax”, “tolls_amount”, “improvement_surcharge”, “congestion_surcharge”, “airport_fee”, “RatecodeID”, “store_and_fwd_flag”, and “LocationID”. Some of these were consolidated into a single column; the others were removed because they didn’t add substantial value to our analysis.

4. Data Modelling

We developed two distinct but interconnected predictive models:

4.1 Demand Prediction

This is framed as a classification problem. We used a range of factors as independent variables to predict demand, including temperature, year, month, day, hour, and the geographical coordinates of the pickup location (Latitude and Longitude). Our goal was to categorize demand into different levels, which served as our dependent variable, named ‘Demand Category’.

To develop and test our models, we divided our dataset into two parts: 80% of the data was used for training, and the remaining 20% for testing. We explored three different classification algorithms to find the most effective one: Logistic Regression, Decision Tree Classifier, and Random Forest Classifier. The main criterion for evaluating the performance of these classifiers was their accuracy in predicting demand categories.
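The 80/20 split and the accuracy criterion can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names, the fixed seed, and the toy labels are not from the project.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle rows reproducibly and split them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true demand category."""
    if not y_true:
        raise ValueError("accuracy is undefined for an empty label list")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

train_rows, test_rows = train_test_split(range(100))
score = accuracy([1, 2, 3, 3], [1, 2, 2, 3])  # three of four labels match
```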

Our findings showed that the Decision Tree Classifier outperformed the others, yielding the highest accuracy. To further refine this model, we applied hyperparameter tuning using PySpark's CrossValidator. This process evaluates candidate combinations of hyperparameter values with k-fold cross-validation and keeps the combination that performs best.

After fine-tuning these parameters, our decision tree classifier reached a final accuracy of 83%.
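The idea behind cross-validated tuning can be sketched without Spark. The example below is not the project's setup: it swaps the decision tree for a toy "binned majority" classifier on hour of day, uses hand-rolled contiguous folds instead of CrossValidator's random ones, and tunes a made-up hyperparameter `bin_width` on synthetic labels. All names and data are assumptions.

```python
from collections import Counter

def k_fold_indices(n_rows, k):
    """Yield (train, valid) index lists for k contiguous, unshuffled folds."""
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, valid
        start += size

def fit_binned_majority(hours, labels, bin_width):
    """Toy stand-in for a tree: majority label per block of bin_width hours."""
    votes = {}
    for hour, label in zip(hours, labels):
        votes.setdefault(hour // bin_width, Counter())[label] += 1
    fallback = Counter(labels).most_common(1)[0][0]
    table = {b: c.most_common(1)[0][0] for b, c in votes.items()}
    return lambda hour: table.get(hour // bin_width, fallback)

def cross_validated_accuracy(hours, labels, bin_width, k=3):
    """Mean validation accuracy over k folds for one hyperparameter value."""
    scores = []
    for train, valid in k_fold_indices(len(hours), k):
        model = fit_binned_majority(
            [hours[i] for i in train], [labels[i] for i in train], bin_width
        )
        hits = sum(model(hours[i]) == labels[i] for i in valid)
        scores.append(hits / len(valid))
    return sum(scores) / len(scores)

def grid_search(hours, labels, grid, k=3):
    """Return the grid value with the best cross-validated accuracy."""
    return max(grid, key=lambda w: cross_validated_accuracy(hours, labels, w, k))

# Synthetic data: three days of hourly rows, high demand (3) in rush hours.
hours = list(range(24)) * 3
labels = [3 if 7 <= h <= 9 or 17 <= h <= 19 else 1 for h in hours]
best_bin_width = grid_search(hours, labels, grid=[1, 3, 6, 24])
```

Each candidate value is scored only on rows held out from fitting, which is what guards the chosen setting against simply memorizing the training data.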

4.2 Price Forecasting

The objective was to forecast the total fare amount for a trip, considering a variety of factors. The independent variables used in the models included time elements (year, month, day, hour, minute), environmental and contextual factors (temperature, whether it was a weekend or holiday), and trip specifics (passenger count, duration, distance, speed, demand category, and squared demand category).

To achieve this, two different regression models, Linear Regression and Random Forest Regressor, were tested. The effectiveness of these models was measured using the R-squared metric, which indicates how well the predicted values align with actual fare amounts.
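For reference, R-squared compares the model's squared errors with those of always predicting the mean fare. A minimal plain-Python sketch follows; the function name and the sample fares are illustrative, not taken from the project.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot

# Hypothetical fares in dollars.
actual_fares = [10.0, 20.0, 30.0]
perfect = r_squared(actual_fares, [10.0, 20.0, 30.0])    # every fare matched
mean_only = r_squared(actual_fares, [20.0, 20.0, 20.0])  # always predicts the mean
```

A score of 1 means the predictions match the actual fares exactly, while a score of 0 means the model does no better than predicting the average fare for every trip.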

Interestingly, while the linear regression model showed better performance in terms of R-squared value, the random forest regression model was ultimately selected for the final implementation. This decision was made because the linear regression model tended to predict a consistent percentage increase in price for the next hour, which did not align well with the more dynamic nature of taxi fares.

To optimize the Random Forest Regressor, hyperparameter tuning was conducted. After this optimization, the Random Forest model achieved an R-squared score of 0.92, meaning it explains about 92% of the variance in taxi fares.

5. Deployment

Taxi demand and fare prediction models are made accessible through a user-friendly web interface, powered by the Flask framework. This interface is designed for users to easily obtain real-time taxi demand and fare estimates.

When using the application, users are prompted to enter key travel details: the pickup and drop-off locations, the number of passengers, and the desired date and time of travel. To enrich the accuracy of predictions, the system dynamically retrieves additional information via API calls—weather conditions from the Visual Crossing API and travel distance and duration from the Google Maps API.

The user-provided and API-retrieved data are first processed by the demand prediction model, which predicts taxi demand for both the current hour and the subsequent hour. These demand predictions are then fed into the price forecast model, which calculates the expected fare for the present time and for the next hour.
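The two-stage flow described above can be sketched as a small function that chains the models for the current and the next hour. Everything here is a hypothetical stand-in: the stub models, their coefficients, the request fields, and the `forecast_trip` helper are assumptions, and the real application would call the trained PySpark models from a Flask route instead.

```python
from datetime import datetime, timedelta

def forecast_trip(request, demand_model, price_model):
    """Predict demand, then fare, for the requested hour and the hour after."""
    results = {}
    for label, offset in (("now", 0), ("next_hour", 1)):
        when = request["travel_time"] + timedelta(hours=offset)
        features = {**request, "travel_time": when}
        demand = demand_model(features)
        # The demand prediction becomes an input to the price model.
        fare = price_model({
            **features,
            "demand_category": demand,
            "demand_category_squared": demand ** 2,
        })
        results[label] = {"demand_category": demand, "fare": fare}
    return results

def stub_demand_model(features):
    """Made-up rule: high demand during the evening rush, low otherwise."""
    return 3 if 17 <= features["travel_time"].hour <= 19 else 1

def stub_price_model(features):
    """Made-up linear fare: base + per-mile rate + demand premium."""
    return 3.0 + 2.5 * features["distance_miles"] + 1.0 * features["demand_category"]

request = {"travel_time": datetime(2022, 7, 5, 16, 30), "distance_miles": 4.0}
forecast = forecast_trip(request, stub_demand_model, stub_price_model)
```

With these stubs, a 4-mile trip requested at 16:30 is priced at low demand for the current hour and at high demand for the next one, which is the kind of comparison the interface shows to the user.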

The results, encompassing both demand and price forecasts, are neatly displayed to the user, ensuring they have a clear understanding of the taxi landscape for their planned journey.

6. Conclusion

In conclusion, our study of New York City’s taxi data from 2021 to 2022 shows us how to predict taxi demand and prices better. We used information like weather, holidays, and where people are in the city to understand when and why people need taxis more and how much rides might cost. This helps taxi companies plan better, makes rides easier for passengers, and could even make city travel smoother and greener. Our work is a big step towards making taxi rides in New York easier to predict and more efficient for everyone.