In this project, we apply an ETL pipeline, an NLP pipeline, and an ML pipeline to analyze disaster data from Figure Eight and build a model for an API that classifies disaster messages.
This is one of the most important problems in data science and machine learning. During a disaster, millions of messages arrive, either directly or via social media, and perhaps only 1 in 1,000 is relevant. A few key terms, such as "water", "blocked road", or "medical supplies", tend to appear in messages that matter for disaster response. We have a labeled, multi-category dataset with which we can train an ML model to identify which messages are relevant to disaster response.
In this project three main features of a data science project have been utilized:
- Data Engineering - In this section I worked on how to Extract, Transform, and Load the data, then prepared it for model training. I cleaned the data by removing bad records (ETL pipeline), used NLTK to tokenize and lemmatize the text (NLP pipeline), and finally used custom features such as StartingVerbExtractor and StartingNounExtractor to add new features to the main dataset.
- Model Training - For model training I used an XGBoost classifier to build the ML pipeline.
- Model Deployment - For model deployment, I used a Flask API.
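The pipeline described above can be sketched end to end. This is a minimal, self-contained illustration, not the project's actual train_classifier.py: the real pipeline uses NLTK's tokenizer, lemmatizer, and POS tagger plus an XGBoost classifier, while this sketch substitutes a plain regex tokenizer, a small hand-picked verb list, and a logistic regression so it runs without those dependencies. The helper names here (tokenize, build_pipeline, the simplified StartingVerbExtractor) are illustrative.

```python
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


def tokenize(text):
    """Lowercase, strip punctuation, and split into word tokens.

    The project itself uses NLTK's word_tokenize + WordNetLemmatizer;
    this regex version keeps the sketch dependency-free.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


class StartingVerbExtractor(BaseEstimator, TransformerMixin):
    """Flag messages whose first token looks like a verb.

    The real extractor would use nltk.pos_tag; a tiny verb list
    stands in here as a placeholder heuristic.
    """

    VERBS = {"send", "need", "help", "provide", "bring", "is", "are"}

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        flags = [
            1.0 if (tokens := tokenize(text)) and tokens[0] in self.VERBS else 0.0
            for text in X
        ]
        return np.array(flags).reshape(-1, 1)


def build_pipeline():
    """Combine TF-IDF text features with the custom extractor, then classify."""
    return Pipeline([
        ("features", FeatureUnion([
            ("tfidf", TfidfVectorizer(tokenizer=tokenize, token_pattern=None)),
            ("starting_verb", StartingVerbExtractor()),
        ])),
        # The project uses xgboost.XGBClassifier here; LogisticRegression
        # keeps this sketch runnable without the xgboost dependency.
        ("clf", MultiOutputClassifier(LogisticRegression(max_iter=1000))),
    ])
```

Training then amounts to build_pipeline().fit(messages, label_matrix), where label_matrix has one 0/1 column per disaster category.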
This project was developed on the Anaconda platform using Jupyter Notebook. Detailed instructions on how to install Anaconda can be found here. To create a virtual environment, see here.
In the virtual environment, clone the repository:
git clone https://github.com/abhishek-jana/Disaster-Response-Pipelines.git
Python Packages used for this project are:
NumPy
Pandas
Scikit-learn
xgboost
NLTK
regex
sqlalchemy
Flask
Plotly
To install the packages, run the following command:
pip install -r requirements.txt
Run the following commands in the project's root directory to set up your database and model.
- To run the ETL pipeline that cleans the data and stores it in the database:
python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
- To run the ML pipeline that trains the classifier and saves the model:
python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl
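For reference, here is a hedged sketch of what an ETL step like process_data.py typically does with this dataset: merge the messages and categories CSVs on id, expand the single semicolon-separated categories string into one 0/1 column per category, drop duplicates, and write the result to an SQLite database. The column names (id, categories) follow the Figure Eight CSV layout; the function names are illustrative, not the project's actual API.

```python
import sqlite3

import pandas as pd


def clean_categories(df):
    """Split the single 'categories' string (e.g. 'related-1;request-0;...')
    into one 0/1 column per category."""
    cats = df["categories"].str.split(";", expand=True)
    # Use the first row to recover the column names ('related-1' -> 'related').
    cats.columns = cats.iloc[0].str.rsplit("-", n=1).str[0]
    # Keep only the trailing digit, convert to int, and clip to 0/1.
    cats = cats.apply(lambda col: col.str.rsplit("-", n=1).str[1].astype(int).clip(0, 1))
    return pd.concat([df.drop(columns="categories"), cats], axis=1)


def run_etl(messages_csv, categories_csv, db_path, table="messages"):
    """Extract the two CSVs, transform/merge them, and load into SQLite."""
    messages = pd.read_csv(messages_csv)
    categories = pd.read_csv(categories_csv)
    df = messages.merge(categories, on="id").pipe(clean_categories).drop_duplicates()
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table, conn, index=False, if_exists="replace")
    return df
```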
Run the following command in the app's directory to run your web app.
python run.py
Go to http://0.0.0.0:3001/
The project is structured as follows:
The data folder contains the datasets "disaster_categories.csv" and "disaster_messages.csv" used to extract the messages and categories. "DisasterResponse.db" is a cleaned version of the dataset saved in an SQLite database. "ETL Pipeline Preparation.ipynb" is the Jupyter notebook explaining the data preparation method, and "process_data.py" is the Python script version of that notebook.
"ML Pipeline Preparation.ipynb" is the Jupyter notebook explaining the model training method. The corresponding Python script, "train_classifier.py", can be found in the models folder. The final trained model is saved as "classifier.pkl" in the models folder.
The app folder contains the "run.py" script that renders the visualizations and results on the web. The templates folder contains the .html files for the web interface.
The accuracy, precision, and recall of the model are shown below:
[accuracy figure]
[precision and recall figure]
Some example predictions on messages are shown as well:
[message 1 screenshot]
[message 2 screenshot]
[message 3 screenshot]
In the future I am planning to work on the following areas of the project:
- Testing different estimators and adding new features to the data to improve model accuracy.
- Adding more visualizations to understand the data.
- Improving the web interface.
- Based on the categories that the ML algorithm classifies text into, advising which organizations to connect to.
- This dataset is imbalanced (i.e., some labels like water have few examples). Discussing in the README how this imbalance affects training the model, and my thoughts on emphasizing precision or recall for the various categories.
I am thankful to the Udacity Data Science Nanodegree program for motivating me to do this project. I am also thankful to Figure Eight for making the data publicly available.