Taxi Demand Prediction 🚗

Predict the upcoming hour's user demand for NYC taxi. This enables optimization of fleet distribution and potential revenue increase. This guide covers all project aspects, from setup to contribution.

Note: This project is currently a work in progress. I will be making significant updates throughout this week.

Web App Overview

The app employs a trained LightGBM model to predict taxi demands for the next hour. Data from January 2022 onwards is stored in a Feature Store on HopsWorks, and new data is fetched every 60 minutes using GitHub Actions.

The web app serves as an interface to visualize real-time predictions, rendered as a time series plot. It offers filters for location selection and provides monitoring of key performance metrics.

Feature/Training/Inference Pipelines

(Back to top)

The MLOps strategy adopted here is built upon the FTI (Feature, Training, Inference) architecture. This architecture was proposed by Jim Dowling (CEO and Co-Founder of HopsWorks) and provides a unified structure that describes both batch ML systems and real-time ML systems. It does so primarily by categorizing operations into three well-defined and separate pipelines.

A significant part of the development process is prototyping and experimentation. In this project, Jupyter notebooks have been employed extensively for this purpose. Here's a breakdown of the notebooks corresponding to each of the FTI pipelines:

📊 Feature Pipeline

The feature pipeline systematically fetches, transforms, and stores refined time-series taxi data, making it ready for machine learning model consumption and other analytics purposes.

Notebook	Description
11_backfill_feature_store.ipynb	The notebook downloads and consolidates raw taxi ride data from a specified range of years, transforms this data into a time-series format, and then uploads it to a centralized feature store.
12_simulated_feature_pipeline.ipynb	The notebook fetches a batch of raw data simulating recent production information by adjusting historical data, transforms it into a time-series format, and subsequently populates a feature store with this data.

🎯 Training Pipeline

The Training Pipeline involves setting up data infrastructure, preprocessing, feature extraction, model training, and hyperparameter tuning to create a robust machine learning model.

Notebook	Description
13_model_training_pipeline.ipynb	This notebook establishes a training pipeline, creating a feature view, transforming time-series data into features and targets, splitting the data into training and testing sets, and tuning a LightGBM model using Optuna.

🤖 Inference Pipeline

The Inference Pipeline is structured to retrieve the latest data, process it, utilize a pre-trained model for prediction, and store the outputs efficiently for downstream applications.

Notebook	Description
14_inference_pipeline.ipynb	This notebook fetches recent data, predicts taxi demand using LightGBM, and saves predictions to a feature store.

For those interested in the granular details and the intricate processes behind each pillar of the architecture, the notebooks serve as a comprehensive guide. They offer code, visualizations, and explanations that provide clarity on the MLOps strategies employed in the project.

Code structure

(Back to top)

The project follows an organized directory structure, ensuring clarity, modularity, and ease of navigation. Here is a breakdown of the structure:

.
├── README.md                     - provides an overview of the project

│   ├── raw                       - contains the raw, unprocessed ride data.
│   │   ├── rides_2022-01.parquet 
│   │   ├── rides_2022-02.parquet 
│   │   └── ...
│   └── transformed               - contains datasets that have undergone some form of processing
│       ├── tabular_data.parquet  
│       ├── ts_data_rides_2022_01.parquet  
│       └── validated_rides_2022_01.parquet 
│       └── ... 
├── models                        - any machine learning models.
├── notebooks                     - exploratory and developmental Jupyter notebooks.
│   ├── 01_load_and_validate_raw_data.ipynb
│   ├── 02_transform_raw_data_into_ts_data.ipynb
│   ├── 03_transform_ts_data_into_features_and_targets.ipynb
│   ├── ...
├── pyproject.toml                - project metadata and dependencies
├── scripts                       - scripts for automation, data collection, and other utilities.
├── src                           - directory containing reusable code, functions, and classes.
└── tests                         - test scripts for functionalities

Installation

(Back to top)

To get started, you'll need to clone this repository and set up the environment:

git clone https://github.com/carlosfab/taxi_demand_predictor
cd taxi_demand_predictor
poetry install
poetry shell

Name	Name	Last commit message	Last commit date
Latest commit History 36 Commits
.github/workflows	.github/workflows	Change path to notebook 14	Oct 4, 2023
img	img	FTI Architecture explanied	Oct 2, 2023
notebooks	notebooks	Streamlit App updated	Oct 4, 2023
scripts	scripts	Added files to script and src	Sep 25, 2023
src	src	Streamlit App updated	Oct 4, 2023
README.md	README.md	FTI Architecture explanied	Oct 2, 2023
pyproject.toml	pyproject.toml	Fixed bugs and updated Streamlit app	Oct 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Taxi Demand Prediction 🚗

Table of Contents

Web App Overview

Feature/Training/Inference Pipelines

📊 Feature Pipeline

🎯 Training Pipeline

🤖 Inference Pipeline

Code structure

Installation

About

Releases

Packages

Languages

carlosfab/taxi_demand_predictor

Folders and files

Latest commit

History

Repository files navigation

Taxi Demand Prediction 🚗

Table of Contents

Web App Overview

Feature/Training/Inference Pipelines

📊 Feature Pipeline

🎯 Training Pipeline

🤖 Inference Pipeline

Code structure

Installation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages