Folder with data science exercises, updated as I study and practice

JAMorello/data-science-exercises

Data science exercises

Projects

[Unstructured data] [Natural Language Processing] [R] [Last update: 01/10/20]

This is the Data Science Capstone project of the Data Science Specialization on Coursera. Using R, the goal was to build an algorithm that predicts the next word given one or more words (a phrase or sentence) as input. Natural language processing techniques were used to analyze the text and to build a predictive model. A large corpus of more than 4 million documents was loaded, sampled, tokenized, and analyzed. N-grams (1 to 4) were extracted from the corpus and used to build the predictive model.
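
The project itself is written in R, but the core idea of n-gram prediction with backoff can be sketched in a few lines of Python. This is a minimal illustration over a tiny made-up corpus, not the project's actual model (which works over millions of documents and uses smoothing):

```python
from collections import Counter, defaultdict

def build_ngram_model(corpus, max_n=4):
    """Count n-grams (2 to max_n) over a tokenized corpus."""
    model = defaultdict(Counter)  # maps context tuple -> Counter of next words
    for sentence in corpus:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model

def predict_next(model, phrase, max_n=4):
    """Backoff: try the longest context first, then progressively shorten it."""
    tokens = phrase.lower().split()
    for n in range(max_n - 1, 0, -1):
        context = tuple(tokens[-n:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None  # phrase shares no context with the corpus

corpus = ["the cat sat on the mat", "the cat ate the fish", "a dog sat on the rug"]
model = build_ngram_model(corpus)
print(predict_next(model, "sat on the"))
```

A real model would also need smoothing (e.g. stupid backoff or Kneser-Ney) so that unseen n-grams do not get zero probability.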

The Shiny application can be viewed here

You can see an RMarkdown report here

[Unstructured data] [Multiple Categories Problem] [Python] [Last update: 24/06/20]

In this project I use deep neural networks to build a classifier that identifies a dog's breed from a photo. The data comes from the Kaggle dog breed identification competition, in which submissions are evaluated on the multi-class log loss between the predicted probabilities and the observed targets. The model used, mobilenet_v2_130_224, reached a log loss of ≈0.97 when submitting the test-set predictions as a late submission. I went through:

- Preprocessing Images (turning them into Tensors)
- Creating data batches
- Creating, training, and evaluating a model
- Making predictions on test dataset and custom images
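
The batching step above is done in the project with TensorFlow's data pipeline; the framework-agnostic idea can be sketched in plain Python. The file names and label scheme below are made up for illustration:

```python
import random

def make_batches(samples, batch_size=32, shuffle=True, seed=42):
    """Split (image_path, label) pairs into fixed-size batches.

    In the real project the equivalent step also decodes each image and
    resizes it (e.g. to 224x224 for mobilenet_v2_130_224) before stacking
    the batch into a single tensor.
    """
    items = list(samples)
    if shuffle:
        random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Hypothetical file list: 100 images across 120 possible breeds.
samples = [(f"images/dog_{i:04}.jpg", i % 120) for i in range(100)]
batches = make_batches(samples, batch_size=32)
print(len(batches), [len(b) for b in batches])  # the last batch is a remainder
```

Shuffling before batching matters for training: without it, each batch would over-represent whatever ordering the files happen to have on disk.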

Note: There is still work to do in this project

[Structured data] [Regression Problem] [Python] [Last update: 19/06/20]

Using data from an old Kaggle competition, I try to predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for. The evaluation metric used was the RMSLE (root mean squared log error) between the actual and predicted auction prices. I went through:

- Preprocessing of data / Feature Engineering
- Building a Random Forest Regressor model
- Hyperparameter tuning with RandomizedSearchCV
- Feature Importance, followed by exploratory data analysis

Note: I ran into some issues getting the RMSLE formula right
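
Since the note above mentions trouble with the RMSLE formula, here is one common way to write it (a sketch, not necessarily the exact variant the competition used). Using `log1p(x) = log(1 + x)` keeps the metric defined when a price is 0 and makes the error relative rather than absolute:

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared log error between actual and predicted prices."""
    sq_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(sq_log_errors) / len(sq_log_errors))

# Made-up auction prices for illustration.
actual = [10_000, 25_000, 40_000]
predicted = [12_000, 20_000, 41_000]
print(round(rmsle(actual, predicted), 4))
```

A common pitfall is taking the log of the squared error instead of squaring the difference of logs; the implementation above follows the latter, standard definition.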

[Structured data] [Classification Problem] [Python] [Last update: 16/06/20]

Using data from the UCI Machine Learning Repository, I try to predict whether or not a patient has heart disease given their clinical parameters. The model, a Random Forest Classifier, reached an accuracy, precision, recall, and F1 score of ≈80%. I went through:

- Exploratory data analysis (EDA)
- Model training (Logistic Regression, KNeighborsClassifier, RandomForestClassifier)
- Model evaluation (ROC curve and AUC score, Confusion matrix, Classification report)
- Model comparison
- Model fine-tuning (with RandomizedSearchCV and GridSearchCV)
- Cross-validation
- Feature importance
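
The evaluation step in the project uses scikit-learn's confusion matrix and classification report; the scores it produces can be derived by hand from the four confusion-matrix counts. The labels below are made up; this is only a sketch of how accuracy, precision, recall, and F1 relate:

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived scores for the positive class
    (here, 1 = patient has heart disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical predictions: one false negative and one false positive out of 10.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
```

In a medical setting recall is usually the score to watch: a false negative (missing a diseased patient) is costlier than a false positive.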

[Structured data] [Classification Problem] [Python] [Last update: 10/05/20]

This is a Kaggle competition, the first challenge for newcomers to dive into ML competitions and familiarize themselves with how the platform works. The task is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. Link to the competition overview: here. Here I use pandas, seaborn, and scikit-learn. This project consisted of:

- visual analysis of the data
- feature engineering
- initializing and training several prediction models (best score: 0.830527, Support Vector Machine)
- hyperparameter tuning using grid search (best score: 0.833895, Random Forest Classifier)
- outputting and submitting predictions to the Kaggle platform (score: 0.78947)
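
At its core, the grid search used above just evaluates every combination in a parameter grid and keeps the best one (scikit-learn's GridSearchCV adds cross-validation on top). A minimal sketch with a stand-in scoring function (the parameter names and scores below are hypothetical, not the project's actual grid):

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Exhaustively evaluate every combination in param_grid and return the
    best-scoring parameters."""
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Stand-in for "fit a model with these settings and return its CV accuracy".
def fake_cv_score(n_estimators, max_depth):
    return 0.75 + 0.0001 * n_estimators - 0.01 * abs(max_depth - 6)

grid = {"n_estimators": [100, 300, 500], "max_depth": [4, 6, 8]}
best_params, best_score = grid_search(fake_cv_score, grid)
print(best_params)
```

The cost is the product of all grid sizes, which is why RandomizedSearchCV (sampling a fixed number of combinations) is often tried first on larger grids.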

Resources I am going through

Legend: ☑️ Currently doing ✅ Finished

Courses

Video series

Books

  • ☑️ McKinney, W. (2017). Python for Data Analysis
  • Myatt, G. J. & Johnson, W. P. (2014). Making Sense of Data I: A Practical Guide to Exploratory Data Analysis and Data Mining
  • Myatt, G. J. & Johnson, W. P. (2009). Making Sense of Data II: A Practical Guide to Data Visualization, Advanced Data Mining Methods, and Applications
  • Müller, A. C. & Guido, S. (2016). Introduction to Machine Learning with Python
  • Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
