End-to-end news classification

Repository for end-to-end training and serving a news classifier.

Uses the News Aggregator Dataset to train a headline classifier.
Uses mlflow for keeping track of experiments and model management.
Uses FastAPI to provide simple REST APIs for serving and training models.

Repostitory structure

Illustration below does not include all files and may look different on your machine

.
├── README.md
├── data
│   ├── raw                                     # unzipped .csv dataset
│   └── zip                                     # zipped dataset
├── mlruns                                      # tracking ml runs
│   └── 1
│       └── 592dfd384f8243d3a772b1343d1646c0    # exemplary run log dir
├── notebooks
│   ├── exploratory.ipynb                       # eda notebook
│   └── modeling.ipynb                          # initial modeling notebook
├── pipeline.yaml                               # config for training a model
├── run_ml.py                                   # functions for running ml pipeline
├── app.py                                      # serving app entry point
└── src
    ├── api
    │   ├── endpoints.py                        # endpoint prediction training
    │   └── model.py                            # request response models
    ├── ml
    │   ├── eval.py                             # evaluation functionalities
    │   └── train.py                            # training functionalities
    └── utils
        └── utils.py                            # downloading and handling data

Setup

Docker (recommended)

The image can be built by cloning this repository and running:

$ docker build -t ing:latest .

The app can then be started by running:

$ docker run -d -p 8000:8000 --name ing-service ing

Anaconda (not recommended)

Alternatively, the project can be setup using Anaconda and the provided environment.yaml by running:

$ conda create -f environment.yml --name news

The app can be started by running:

$ uvicorn app:app

Usage

Train and deploy a news classifier

Assuming the app is running, the first step is to train a classifier using the train_and_deploy endpoint. Per default, the pipeline.yaml file is used to perform a gridsearch over hyperparameters before evaluation on the test dataset. Processing time is expected to be ~ 30 seconds. The trained classifier is then autoamtically deployed to serve the predict API. Optionally, this file can be modified with valid sklearn parameters before(!) starting the service with docker run.

$ curl -X GET "http://localhost:8000/train_and_deploy" -H  "accept: application/json" | json_pp

Example response:

    {
        "new model id": "5ae747ac8c274ae6ae2c722040403d79",
        "final hyperparameters": {
            "naivebayes__alpha": 1,
            "tfidf__use_idf": true,
            "vectorizer__max_features": 20000
        },
        "evaluation results": {
            "business": {
            "precision": 0.8956759882969313,
            "recall": 0.908455685719708,
            "f1-score": 0.9020205740520262,
            "support": 57961
            },
            "science and technology": {
            "precision": 0.9497659946513063,
            "recall": 0.9685197325291727,
            "f1-score": 0.9590511925009412,
            "support": 76270
            },
            "entertainment": {
            "precision": 0.9590336649189004,
            "recall": 0.8594703789908217,
            "f1-score": 0.90652647181435,
            "support": 22771
            },
            "health": {
            "precision": 0.898412581352901,
            "recall": 0.8989263577331759,
            "f1-score": 0.8986693961105425,
            "support": 54208
            },
            "accuracy": 0.9224184460963023,
            "macro avg": {
            "precision": 0.9257220573050097,
            "recall": 0.9088430387432196,
            "f1-score": 0.916566908619465,
            "support": 211210
            },
            "weighted avg": {
            "precision": 0.9227415044911694,
            "recall": 0.9224184460963023,
            "f1-score": 0.9222405845306619,
            "support": 211210
            }
        },
        "run information": {
            "artifact_uri": "file:///ing/mlruns/1/5ae747ac8c274ae6ae2c722040403d79/artifacts",
            "end_time": null,
            "experiment_id": "1",
            "lifecycle_stage": "active",
            "run_id": "5ae747ac8c274ae6ae2c722040403d79",
            "run_uuid": "5ae747ac8c274ae6ae2c722040403d79",
            "start_time": 1604862517413,
            "status": "RUNNING",
            "user_id": "root"
        }
    }

Serve the news classifier

After training and deploying a classifier, the predict endpoint can be used make predictions for new news titles. Note that the endpoint only uses the news title as a feature and accepts only a single string as input, i.e. does not support batch predictions (!).

$ curl -X POST -H "Content-Type: application/json" -d '{"title": "apple earnings crash after insider leak"}' localhost:8000/predict | json_pp

Example response:

    {
        "title": "apple earnings crash after insider leak",
        "label": "b",
        "category": "business"
    }

Offline experimentation (optional)

The app provides a command line interface (CLI) for ml experimentation and hyperparametersearch. One possible workflow is the following:

Edit the steps in the src.ml.trainmodule to different models.
Edit the pipeline.yaml to search for the models best hyperparameters.
Run $ python run_ml.py pipeline.yaml to experiment
Track experiments using the mlflow Tracking APIs e.g. by running $ mlflow ui

Note that the docker image must be rebuilt e.g. when any code or the pipeline.yaml changes.

Check the docs

Since this app uses FastAPI, visit http://localhost:8000/docs to checkout the API documentation and run requests.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

End-to-end news classification

Repostitory structure

Setup

Docker (recommended)

Anaconda (not recommended)

Usage

Train and deploy a news classifier

Serve the news classifier

Offline experimentation (optional)

Check the docs

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
data		data
notebooks		notebooks
src		src
.env		.env
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
app.py		app.py
environment.yaml		environment.yaml
pipeline.yaml		pipeline.yaml
requirements.txt		requirements.txt
run_ml.py		run_ml.py

maxhardt/news

Folders and files

Latest commit

History

Repository files navigation

End-to-end news classification

Repostitory structure

Setup

Docker (recommended)

Anaconda (not recommended)

Usage

Train and deploy a news classifier

Serve the news classifier

Offline experimentation (optional)

Check the docs

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages