Folder with data science exercises, updated as I study and practice

JAMorello/data-science-exercises

Data science exercises

Projects

[Unstructured data] [Natural Language Processing] [R] [Last update: 01/10/20]

This is the Data Science Capstone project of the Data Science Specialization on Coursera. Using R, the goal was to build an algorithm that predicts the next word given one or more words (a phrase or sentence) as input. Natural language processing techniques were used to analyze the text and to build a predictive model. A large corpus of more than 4 million documents was loaded, sampled, tokenized, and analyzed. N-grams (1 to 4) were extracted from the corpus and used to build the predictive model.
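
The project itself is written in R, but the core idea of n-gram prediction with backoff can be sketched in a few lines of Python. This is a minimal illustration over a tiny made-up corpus, not the project's actual model (which works over millions of documents and uses smoothing):

```python
from collections import Counter, defaultdict

def build_ngram_model(corpus, max_n=4):
    """Count n-grams (2 to max_n) over a tokenized corpus."""
    model = defaultdict(Counter)  # maps context tuple -> Counter of next words
    for sentence in corpus:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model

def predict_next(model, phrase, max_n=4):
    """Backoff: try the longest context first, then progressively shorten it."""
    tokens = phrase.lower().split()
    for n in range(max_n - 1, 0, -1):
        context = tuple(tokens[-n:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None  # phrase shares no context with the corpus

corpus = ["the cat sat on the mat", "the cat ate the fish", "a dog sat on the rug"]
model = build_ngram_model(corpus)
print(predict_next(model, "sat on the"))
```

A real model would also need smoothing (e.g. stupid backoff or Kneser-Ney) so that unseen n-grams do not get zero probability.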

The Shiny application can be viewed here

You can see an RMarkdown report here

[Unstructured data] [Multiple Categories Problem] [Python] [Last update: 24/06/20]

In this project I use deep neural networks to build a classifier that identifies a dog's breed from a photo. The data comes from the Kaggle dog breed identification competition, in which submissions are evaluated on the multi-class log loss between the predicted probabilities and the observed targets. The model used, mobilenet_v2_130_224, reached a log loss of ≈0.97 when submitting the test-set predictions as a late submission. I went through:

- Preprocessing Images (turning them into Tensors)
- Creating data batches
- Creating, training, and evaluating a model
- Making predictions on test dataset and custom images
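
The batching step above is done in the project with TensorFlow's data pipeline; the framework-agnostic idea can be sketched in plain Python. The file names and label scheme below are made up for illustration:

```python
import random

def make_batches(samples, batch_size=32, shuffle=True, seed=42):
    """Split (image_path, label) pairs into fixed-size batches.

    In the real project the equivalent step also decodes each image and
    resizes it (e.g. to 224x224 for mobilenet_v2_130_224) before stacking
    the batch into a single tensor.
    """
    items = list(samples)
    if shuffle:
        random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Hypothetical file list: 100 images across 120 possible breeds.
samples = [(f"images/dog_{i:04}.jpg", i % 120) for i in range(100)]
batches = make_batches(samples, batch_size=32)
print(len(batches), [len(b) for b in batches])  # the last batch is a remainder
```

Shuffling before batching matters for training: without it, each batch would over-represent whatever ordering the files happen to have on disk.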

Note: There is still work to do in this project

[Structured data] [Regression Problem] [Python] [Last update: 19/06/20]

Using data from an old Kaggle competition, I try to predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for. The evaluation metric used was the RMSLE (root mean squared log error) between the actual and predicted auction prices. I went through:

- Preprocessing of data / Feature Engineering
- Building a Random Forest Regressor model
- Hyperparameter tuning with RandomizedSearchCV
- Feature Importance, followed by exploratory data analysis

Note: I ran into some issues getting the RMSLE formula right
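
Since the note above mentions trouble with the RMSLE formula, here is one common way to write it (a sketch, not necessarily the exact variant the competition used). Using `log1p(x) = log(1 + x)` keeps the metric defined when a price is 0 and makes the error relative rather than absolute:

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared log error between actual and predicted prices."""
    sq_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(sq_log_errors) / len(sq_log_errors))

# Made-up auction prices for illustration.
actual = [10_000, 25_000, 40_000]
predicted = [12_000, 20_000, 41_000]
print(round(rmsle(actual, predicted), 4))
```

A common pitfall is taking the log of the squared error instead of squaring the difference of logs; the implementation above follows the latter, standard definition.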

[Structured data] [Classification Problem] [Python] [Last update: 16/06/20]

Using data from the UCI Machine Learning Repository, I try to predict whether or not a patient has heart disease given their clinical parameters. The model, a Random Forest Classifier, reached an accuracy, precision, recall, and F1 score of ≈80%. I went through:

- Exploratory data analysis (EDA)
- Model training (Logistic Regression, KNeighborsClassifier, RandomForestClassifier)
- Model evaluation (ROC curve and AUC score, Confusion matrix, Classification report)
- Model comparison
- Model fine-tuning (with RandomizedSearchCV and GridSearchCV)
- Cross-validation
- Feature importance
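
The evaluation step in the project uses scikit-learn's confusion matrix and classification report; the scores it produces can be derived by hand from the four confusion-matrix counts. The labels below are made up; this is only a sketch of how accuracy, precision, recall, and F1 relate:

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived scores for the positive class
    (here, 1 = patient has heart disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical predictions: one false negative and one false positive out of 10.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
```

In a medical setting recall is usually the score to watch: a false negative (missing a diseased patient) is costlier than a false positive.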

[Structured data] [Classification Problem] [Python] [Last update: 10/05/20]

This is a Kaggle competition, the first challenge for newcomers to dive into ML competitions and familiarize themselves with how the platform works. The task is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. Link to the competition overview: here. Here I use pandas, seaborn, and scikit-learn. This project consisted of:

- visual analysis of the data
- feature engineering
- initializing and training several prediction models (best score: 0.830527, Support Vector Machine)
- hyperparameter tuning using grid search (best score: 0.833895, Random Forest Classifier)
- outputting and submitting predictions to the Kaggle platform (score: 0.78947)
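
At its core, the grid search used above just evaluates every combination in a parameter grid and keeps the best one (scikit-learn's GridSearchCV adds cross-validation on top). A minimal sketch with a stand-in scoring function (the parameter names and scores below are hypothetical, not the project's actual grid):

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Exhaustively evaluate every combination in param_grid and return the
    best-scoring parameters."""
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Stand-in for "fit a model with these settings and return its CV accuracy".
def fake_cv_score(n_estimators, max_depth):
    return 0.75 + 0.0001 * n_estimators - 0.01 * abs(max_depth - 6)

grid = {"n_estimators": [100, 300, 500], "max_depth": [4, 6, 8]}
best_params, best_score = grid_search(fake_cv_score, grid)
print(best_params)
```

The cost is the product of all grid sizes, which is why RandomizedSearchCV (sampling a fixed number of combinations) is often tried first on larger grids.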

Resources I am going through

Legend: ☑️ Currently doing ✅ Finished

Courses

Video series

Books

  • ☑️ McKinney, W. (2017). Python for Data Analysis
  • Myatt, G. J. & Johnson, W. P. (2014). Making Sense of Data I: A Practical Guide to Exploratory Data Analysis and Data Mining
  • Myatt, G. J. & Johnson, W. P. (2009). Making Sense of Data II: A Practical Guide to Data Visualization, Advanced Data Mining Methods, and Applications
  • Müller, A. C. & Guido, S. (2016). Introduction to Machine Learning with Python
  • Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
