
Built a few anomaly detection models to determine the anomalies from the data

SFLazarus/Anomaly-Detector

Anomaly-Detector

D(St)reams of Anomalies

The real world does not slow down for bad data

  1. Set up a data science project structure in a new git repository in your GitHub account
  2. Download the benchmark data set from:
  3. Load one of the data sets into a pandas DataFrame
  4. Formulate one or two ideas on how feature engineering could add value to the data set, using exploratory data analysis
  5. Build one or more anomaly detection models to determine the anomalies using the other columns as features
  6. Document your process and results
  7. Commit your notebook, source code, visualizations, and other supporting files to the git repository in GitHub

Data Description

This dataset (nyc_taxi) contains the number of NYC taxi passengers, with five known anomalies that occur during the NYC marathon, Thanksgiving, Christmas, New Year's Day, and a snow storm. The raw data is from the NYC Taxi and Limousine Commission. The data file included here aggregates the total number of taxi passengers into 30-minute buckets.
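To make the 30-minute bucketing concrete, here is a minimal sketch in pandas. It is not the process used to produce the published file: the trip records are made up, and the column names `pickup_time`, `passengers`, `timestamp`, and `value` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical raw trip records (made-up values): one row per pickup.
trips = pd.DataFrame({
    "pickup_time": pd.to_datetime([
        "2014-07-01 00:05", "2014-07-01 00:20", "2014-07-01 00:40",
        "2014-07-01 01:10", "2014-07-01 01:25",
    ]),
    "passengers": [1, 2, 3, 1, 4],
})

# Sum passenger counts inside each 30-minute window.
buckets = (
    trips.set_index("pickup_time")["passengers"]
    .resample("30min")
    .sum()
    .rename("value")
    .reset_index()
    .rename(columns={"pickup_time": "timestamp"})
)

print(buckets)
```

Each output row is one bucket, labeled by the start of its 30-minute window.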

What do we want to do?

  • Main goal: detect anomalies in the New York City taxi data.
  • First, we go through the dataset to understand which features could improve the results of our anomaly detector.
  • We created a new feature, changeValue, which stores the difference between the values of two consecutive samples.
  • We also created another feature, moving_average, which stores the moving average over 5 consecutive samples.
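The two engineered features can be sketched with pandas as follows. This is an illustration rather than the notebook's exact code: the input column name `value`, the helper `add_features`, and the sample numbers are assumptions, while `changeValue` and `moving_average` are the feature names described above.

```python
import pandas as pd

def add_features(df: pd.DataFrame, value_col: str = "value", window: int = 5) -> pd.DataFrame:
    """Return a copy of df with changeValue and moving_average columns added."""
    out = df.copy()
    # Difference between each sample and the one before it (first row is NaN).
    out["changeValue"] = out[value_col].diff()
    # Mean of the current sample and the previous window-1 samples
    # (NaN until a full window is available).
    out["moving_average"] = out[value_col].rolling(window=window).mean()
    return out

# Made-up sample at 30-minute spacing, only to show the shape of the output.
sample = pd.DataFrame({
    "timestamp": pd.date_range("2014-07-01", periods=8, freq="30min"),
    "value": [10, 12, 11, 15, 14, 40, 13, 12],
})
features = add_features(sample)
print(features)
```

The leading NaN rows need to be dropped or filled before the features are passed to a model.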

Results:

One Class SVM based Anomaly detector:

  • It flagged about 29 outliers, although the data contains only 5 known anomalies, so it did not perform well in our case.

Isolation Forest based Anomaly detector:

  • It detected the NYC marathon, Thanksgiving, Christmas, and New Year's Day. Overall, it performed well.

Local Outlier Factor based Anomaly detector:

  • It also detected the NYC marathon, Thanksgiving, and New Year's Day, so this model performed well too.
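For readers who want to reproduce a comparison like the one above, here is a minimal scikit-learn sketch that fits the three detector types on the same feature matrix. It does not reproduce the notebook's settings: the helper `flag_anomalies`, the synthetic data, and every parameter value are assumptions. `OneClassSVM` has no `contamination` argument, so its `nu` parameter is used here as a rough analogue.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

def flag_anomalies(X, contamination=0.01, random_state=0):
    """Fit three detectors on X; return a boolean anomaly mask per detector."""
    X_scaled = StandardScaler().fit_transform(X)
    detectors = {
        "OneClassSVM": OneClassSVM(kernel="rbf", gamma="scale", nu=contamination),
        "IsolationForest": IsolationForest(
            contamination=contamination, random_state=random_state
        ),
        "LocalOutlierFactor": LocalOutlierFactor(
            n_neighbors=20, contamination=contamination
        ),
    }
    # fit_predict returns -1 for outliers and 1 for inliers.
    return {name: det.fit_predict(X_scaled) == -1 for name, det in detectors.items()}

# Synthetic stand-in for the feature matrix: 500 ordinary points plus
# five far-away points placed in the first five rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X[:5] = [[8, 8], [-8, 8], [8, -8], [-8, -8], [0, 10]]

flags = flag_anomalies(X, contamination=0.01)
for name, mask in flags.items():
    print(name, "flagged", int(mask.sum()), "points")
```

Comparing the flagged timestamps against the five known events is one way to judge each detector, as in the results above.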

Conclusion and Further improvements:

  • Our models performed quite well on this data, but it is unclear whether they would do as well on other, similar data. A fair assessment would require testing them on other data sets and evaluating their performance.
  • From this project, we see that Local Outlier Factor and Isolation Forest performed better than the OneClassSVM model.
  • In the future, we can tune parameters such as contamination, which we used in this project, along with others, to get the most out of these models.
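As a starting point for that tuning, the sketch below shows how the number of flagged points changes with `contamination` in scikit-learn's `IsolationForest`. The helper `count_flags`, the synthetic data, and the candidate values are assumptions for illustration, not settings taken from this project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def count_flags(X, contaminations, random_state=0):
    """Return {contamination: number of points flagged as anomalies}."""
    counts = {}
    for c in contaminations:
        model = IsolationForest(contamination=c, random_state=random_state)
        # fit_predict returns -1 for points the model treats as outliers.
        counts[c] = int((model.fit_predict(X) == -1).sum())
    return counts

# Synthetic one-dimensional data standing in for a real feature column.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 1))

counts = count_flags(X, [0.001, 0.005, 0.01, 0.05])
for c, n in counts.items():
    print(f"contamination={c}: {n} points flagged")
```

Because `contamination` sets the score threshold, a larger value flags more points. When the number of true anomalies is roughly known, as with the five events here, that number divided by the sample count gives a reasonable first guess.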

Project Structure:

Readme.md

  • Project description

Data

  • Contains link to dataset

Notebooks

  • Jupyter Notebook for exploratory data analysis, visualization, feature engineering, and anomaly detection.

Reports

  • Plot: change in values
  • Plot: raw data visualization
  • Plot: moving averages
  • Plot: OneClassSVM based Anomaly detector
  • Plot: Local Outlier Factor based Anomaly detector
  • Plot: Isolation Forest based Anomaly detector

Requirements.txt

  • Information about the tools, frameworks, and libraries required to reproduce the workflow
