A two-stage predictive machine learning engine that forecasts the on-time performance of flights for 15 different airports in the USA based on data collected between 2016 and 2017.
- Check out the two-stage machine learning model here!
- Check out the details of the project in this Report!
A flight is considered delayed when it arrives later than its scheduled arrival time. This delay is predominantly influenced by environmental conditions. Flight delays inconvenience passengers and cause substantial financial losses for airlines and countries. A structured prediction system can help aviation authorities mitigate flight delays. This project builds a two-stage machine learning engine that predicts the arrival delay of a flight in minutes after departure, based on real-time flight and weather data. A classifier first predicts whether the flight will be delayed; if it is expected to be delayed, a regression model then predicts the arrival delay in minutes.
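The two-stage idea described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the features and labels are synthetic stand-ins (the real data is not public), and the helper `predict_delay` is a hypothetical name. It assumes scikit-learn's `RandomForestClassifier` and `RandomForestRegressor`, and trains the regressor only on delayed flights.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the merged flight + weather features (hypothetical).
n = 600
X = rng.normal(size=(n, 4))
score = X[:, 0] + 0.5 * X[:, 1]          # made-up "bad conditions" score
is_delayed = (score > 0.8).astype(int)   # 1 = delayed, 0 = on-time
noise = rng.normal(scale=2.0, size=n)
delay_minutes = np.where(is_delayed == 1, 15 + 30 * (score - 0.8) + noise, 0.0)

X_train, X_test = X[:450], X[450:]
y_cls_train = is_delayed[:450]
y_reg_train = delay_minutes[:450]

# Stage 1: classify each flight as delayed or on-time.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_cls_train)

# Stage 2: regress the delay in minutes, trained on delayed flights only.
delayed_rows = y_cls_train == 1
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_train[delayed_rows], y_reg_train[delayed_rows])

def predict_delay(X_new):
    """Return (flagged, minutes): minutes stays 0 for flights predicted on-time."""
    flagged = clf.predict(X_new) == 1
    minutes = np.zeros(len(X_new))
    if flagged.any():
        minutes[flagged] = reg.predict(X_new[flagged])
    return flagged, minutes

flagged, minutes = predict_delay(X_test)
print(f"{flagged.sum()} of {len(X_test)} test flights predicted delayed")
```

Only flights flagged by the classifier reach the regressor, so the regressor never has to model the large mass of zero-delay flights.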
- Data directory setup
- Flight data pre-processing
- Weather data pre-processing
- Merging the flight and weather data
- Classifying flights as delayed or on-time
- Study of ways to handle class imbalance in the data set
  - Random Under-sampling
  - Random Over-sampling
  - Synthetic Minority Over-sampling TEchnique (SMOTE)
  - Comparison of the different sampling techniques to handle imbalance
- Regression model to predict the arrival delay in minutes
- Final implementation of the two-stage machine learning model to predict flight delay
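To make the three sampling techniques in the list concrete, here is a hand-rolled NumPy sketch of what each one does to the class counts. It is an illustration under assumptions, not the project's code: the function names, the 80/20 class ratio, and the data are made up, and `smote_like` is a simplified version of SMOTE rather than the implementation in `imblearn`, which provides `RandomUnderSampler`, `RandomOverSampler`, and `SMOTE` for real use.

```python
import numpy as np

def random_under_sample(X, y, rng):
    """Drop majority-class rows at random until all classes have equal size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

def random_over_sample(X, y, rng):
    """Duplicate minority-class rows at random until all classes have equal size."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    parts = []
    for c, n_c in zip(classes, counts):
        members = np.flatnonzero(y == c)
        extra = rng.choice(members, size=n_max - n_c, replace=True)
        parts.append(np.concatenate([members, extra]))
    keep = np.concatenate(parts)
    return X[keep], y[keep]

def smote_like(X, y, minority, k, rng):
    """Simplified SMOTE: create new minority rows by interpolating between a
    minority row and one of its k nearest minority neighbours."""
    X_min = X[y == minority]
    n_new = int((y != minority).sum()) - len(X_min)
    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))
        dist = np.linalg.norm(X_min - X_min[j], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]  # index 0 is the row itself
        nb = X_min[rng.choice(neighbours)]
        synthetic[i] = X_min[j] + rng.random() * (nb - X_min[j])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_new, minority)])
    return X_out, y_out

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)  # made-up ratio: 80 on-time, 20 delayed

X_u, y_u = random_under_sample(X, y, rng)
X_o, y_o = random_over_sample(X, y, rng)
X_s, y_s = smote_like(X, y, minority=1, k=5, rng=rng)
print(np.bincount(y_u), np.bincount(y_o), np.bincount(y_s))
# → [20 20] [80 80] [80 80]
```

Under-sampling discards majority rows, over-sampling repeats minority rows, and the SMOTE-style variant adds new interpolated minority rows instead of exact duplicates. Resampling should be applied to the training split only, so that the evaluation data keeps the original class ratio.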
The flight and weather data were combined into a single data set and pre-processed to train a two-stage machine learning model that predicts flight arrival delay. Due to class imbalance, there was an inherent bias towards the majority class, 'Not Delayed' flights (class 0). To overcome this bias, the data was resampled using SMOTE before classification. Out of five different algorithms, the Random Forest classifier gave the best F1 score (0.78) and recall (0.74) for the delayed flights. A Random Forest regressor was then pipelined after the classifier, giving an MAE of 7.178 minutes and an RMSE of 11.283 minutes with an R² score of 0.977. Overall, the two-stage model predicted flight delays with good performance.
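For readers unfamiliar with the reported metrics, the sketch below computes recall and F1 for the delayed class, and MAE, RMSE, and R² for the predicted delay, using only the standard library. The function names and the tiny label and delay lists are made up for illustration; they are not the project's data and do not reproduce its results.

```python
import math

def recall_and_f1(y_true, y_pred, positive=1):
    """Recall and F1 for the positive (delayed) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, f1

def regression_metrics(y_true, y_pred):
    """Mean absolute error, root mean squared error, and R² score."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Toy classification labels: 1 = delayed, 0 = not delayed.
labels    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
recall, f1 = recall_and_f1(labels, predicted)
print(recall, f1)  # → 0.75 0.75

# Toy arrival delays in minutes for four delayed flights.
actual_delay    = [20, 35, 50, 65]
predicted_delay = [25, 30, 50, 75]
mae, rmse, r2 = regression_metrics(actual_delay, predicted_delay)
print(mae, round(rmse, 3), round(r2, 3))  # → 5.0 6.124 0.867
```

RMSE is larger than MAE here because squaring weights the single 10-minute miss more heavily than the two 5-minute misses.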
It is recommended to have a Linux or macOS development environment for convenience, although the code runs on Windows 10.
Use Anaconda to manage your packages and Python 3 (version >= 3.8.0 recommended).
Running the code on Jupyter Notebook is also recommended.
- matplotlib
- scikit-learn
- numpy
- pandas
Remember to use conda, not pip, for installing these.
- missingno
- imblearn
- texlive-full
The data files for this project are maintained privately and are not available for public use.