Skip to content

carlosfab/taxi_demand_predictor

Repository files navigation

Taxi Demand Prediction πŸš—

Predict the upcoming hour's user demand for NYC taxi. This enables optimization of fleet distribution and potential revenue increase. This guide covers all project aspects, from setup to contribution.

Banner

Note: This project is currently a work in progress. I will be making significant updates throughout this week.



Table of Contents



Web App Overview

The app employs a trained LightGBM model to predict taxi demands for the next hour. Data from January 2022 onwards is stored in a Feature Store on HopsWorks, and new data is fetched every 60 minutes using GitHub Actions.

Open in Streamlit

The web app serves as an interface to visualize real-time predictions, rendered as a time series plot. It offers filters for location selection and provides monitoring of key performance metrics.


Feature/Training/Inference Pipelines

(Back to top)

The MLOps strategy adopted here is built upon the FTI (Feature, Training, Inference) architecture. This architecture was proposed by Jim Dowling (CEO and Co-Founder of HopsWorks) and provides a unified structure that describes both batch ML systems and real-time ML systems. It does so primarily by categorizing operations into three well-defined and separate pipelines.

A significant part of the development process is prototyping and experimentation. In this project, Jupyter notebooks have been employed extensively for this purpose. Here's a breakdown of the notebooks corresponding to each of the FTI pipelines:

Image

πŸ“Š Feature Pipeline

The feature pipeline systematically fetches, transforms, and stores refined time-series taxi data, making it ready for machine learning model consumption and other analytics purposes.

Notebook Description
11_backfill_feature_store.ipynb The notebook downloads and consolidates raw taxi ride data from a specified range of years, transforms this data into a time-series format, and then uploads it to a centralized feature store.
12_simulated_feature_pipeline.ipynb The notebook fetches a batch of raw data simulating recent production information by adjusting historical data, transforms it into a time-series format, and subsequently populates a feature store with this data.

🎯 Training Pipeline

The Training Pipeline involves setting up data infrastructure, preprocessing, feature extraction, model training, and hyperparameter tuning to create a robust machine learning model.

Notebook Description
13_model_training_pipeline.ipynb This notebook establishes a training pipeline, creating a feature view, transforming time-series data into features and targets, splitting the data into training and testing sets, and tuning a LightGBM model using Optuna.

πŸ€– Inference Pipeline

The Inference Pipeline is structured to retrieve the latest data, process it, utilize a pre-trained model for prediction, and store the outputs efficiently for downstream applications.

Notebook Description
14_inference_pipeline.ipynb This notebook fetches recent data, predicts taxi demand using LightGBM, and saves predictions to a feature store.

For those interested in the granular details and the intricate processes behind each pillar of the architecture, the notebooks serve as a comprehensive guide. They offer code, visualizations, and explanations that provide clarity on the MLOps strategies employed in the project.


Code structure

(Back to top)

The project follows an organized directory structure, ensuring clarity, modularity, and ease of navigation. Here is a breakdown of the structure:

.
β”œβ”€β”€ README.md                     - provides an overview of the project

β”‚   β”œβ”€β”€ raw                       - contains the raw, unprocessed ride data.
β”‚   β”‚   β”œβ”€β”€ rides_2022-01.parquet 
β”‚   β”‚   β”œβ”€β”€ rides_2022-02.parquet 
β”‚   β”‚   └── ...
β”‚   └── transformed               - contains datasets that have undergone some form of processing
β”‚       β”œβ”€β”€ tabular_data.parquet  
β”‚       β”œβ”€β”€ ts_data_rides_2022_01.parquet  
β”‚       └── validated_rides_2022_01.parquet 
β”‚       └── ... 
β”œβ”€β”€ models                        - any machine learning models.
β”œβ”€β”€ notebooks                     - exploratory and developmental Jupyter notebooks.
β”‚   β”œβ”€β”€ 01_load_and_validate_raw_data.ipynb
β”‚   β”œβ”€β”€ 02_transform_raw_data_into_ts_data.ipynb
β”‚   β”œβ”€β”€ 03_transform_ts_data_into_features_and_targets.ipynb
β”‚   β”œβ”€β”€ ...
β”œβ”€β”€ pyproject.toml                - project metadata and dependencies
β”œβ”€β”€ scripts                       - scripts for automation, data collection, and other utilities.
β”œβ”€β”€ src                           - directory containing reusable code, functions, and classes.
└── tests                         - test scripts for functionalities

Installation

(Back to top)

To get started, you'll need to clone this repository and set up the environment:

git clone https://github.com/carlosfab/taxi_demand_predictor
cd taxi_demand_predictor
poetry install
poetry shell

About

End-to-end ML project for taxi demand predictor.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published