- Projects
- Predictive Text ❗ Latest project!
- Dog Breed Detector
- Bulldozer Price Prediction
- Heart Disease Prediction
- Titanic
- Resources I am going through
[Unstructured data] [Natural Language Processing] [R] [Last update: 01/10/20]
This is the Data Science Capstone project of the Data Science Specialization on Coursera. The goal was to build, in R, an algorithm that predicts the next word given one or more words (a phrase or sentence) as input. Natural language processing techniques were used to analyze the data and to build a predictive model. A large corpus of more than 4 million documents was loaded, sampled, tokenized, and analyzed. N-grams (1 to 4 words long) were extracted from the corpus and then used to build the predictive model.
The Shiny application can be viewed here
You can see an R Markdown report here
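The idea of predicting the next word from 1- to 4-grams can be sketched as follows. This is a minimal illustration in Python, not the project's R code: the function names, the toy corpus, and the simple "back off to a shorter context" rule are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_ngram_tables(tokens, max_n=4):
    """Map each context tuple (0 to max_n - 1 words) to a Counter of next words."""
    tables = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            tables[tuple(context)][word] += 1
    return tables

def predict_next(tables, phrase, max_n=4):
    """Try the longest available context first, then back off to shorter ones."""
    words = phrase.lower().split()
    for k in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - k:])  # last k words; () when k == 0
        if context in tables:
            return tables[context].most_common(1)[0][0]
    return None  # only reached when the tables are empty

# Toy corpus (hypothetical); the real project sampled a much larger corpus.
corpus = "the cat sat on the mat and the cat sat on the sofa".split()
tables = build_ngram_tables(corpus)

print(predict_next(tables, "the cat"))       # uses the context ("the", "cat")
print(predict_next(tables, "a dog sat on"))  # backs off to ("sat", "on")
```

When no stored context matches, the sketch falls back to the most frequent single word, which keeps the predictor from returning nothing on unseen input.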
[Unstructured data] [Multiple Categories Problem] [Python] [Last update: 24/06/20]
In this project I use deep neural networks to create a classifier capable of identifying the breed of a dog from a photo. I use data from the Kaggle dog breed identification competition, in which submissions are evaluated on Multi Class Log Loss between the predicted probabilities and the observed targets. The model used, mobilenet_v2_130_224, reached a Log Loss of ≈0.97 when the test dataset predictions were submitted as a Late Submission. I went through:
- Preprocessing Images (turning them into Tensors)
- Creating data batches
- Creating, training, and evaluating a model
- Making predictions on test dataset and custom images
Note: There is still work to do in this project
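For reference, the Multi Class Log Loss metric mentioned above is the mean negative log of the probability the model assigned to each sample's true class. The sketch below is a plain-Python illustration of that definition; the function name, the clipping constant `eps`, and the example values are assumptions, not code from the project.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability of the true class.

    y_true: list of true class indices.
    y_prob: list of rows, each holding one predicted probability per class.
    """
    total = 0.0
    for label, row in zip(y_true, y_prob):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += math.log(p)
    return -total / len(y_true)

# Hypothetical predictions for three samples and three classes.
y_true = [0, 2, 1]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
print(multiclass_log_loss(y_true, y_prob))
```

As a point of comparison, a model that spreads its probability evenly over K classes scores ln(K), so lower values mean the model puts more weight on the correct breed.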
[Structured data] [Regression Problem] [Python] [Last update: 19/06/20]
Using data from an old Kaggle competition, I try to predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for. The evaluation metric used was the RMSLE (root mean squared log error) between the actual and predicted auction prices. I went through:
- Preprocessing of data / Feature Engineering
- Building a Random Forest Regressor model
- Hyperparameter tuning with RandomizedSearchCV
- Feature Importance, followed by exploratory data analysis
Note: I ran into some problems with the RMSLE formula
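For reference, RMSLE is commonly computed as the square root of the mean squared difference between log(1 + predicted) and log(1 + actual). The sketch below assumes that common definition; the function name and the example prices are hypothetical and not taken from the project notebook.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + x) so zero values are allowed."""
    squared = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical auction prices (actual vs. predicted).
actual = [10000, 25000, 60000]
predicted = [12000, 24000, 50000]
print(rmsle(actual, predicted))
```

Because the error is taken on the log scale, RMSLE penalizes relative differences: being off by 10% costs about the same on a cheap machine as on an expensive one.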
[Structured data] [Classification Problem] [Python] [Last update: 16/06/20]
Using data from the UCI Machine Learning Repository, I try to predict whether or not a patient has heart disease given their clinical parameters. The model, a Random Forest Classifier, reached an accuracy, precision, recall, and F1 score of ≈80%. I went through:
- Exploratory data analysis (EDA)
- Model training (Logistic Regression, KNeighborsClassifier, RandomForestClassifier)
- Model evaluation (ROC curve and AUC score, Confusion matrix, Classification report)
- Model comparison
- Model fine-tuning (with RandomizedSearchCV and GridSearchCV)
- Cross-validation
- Feature importance
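The four reported metrics all derive from the confusion matrix counts. The following is a minimal sketch of those definitions in plain Python; the function name and the labels are hypothetical, and in practice scikit-learn's metric functions would normally be used instead.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 1 = heart disease, 0 = no heart disease.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

In a medical setting recall is often the metric to watch, since a false negative means a patient with the disease is missed.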
[Structured data] [Classification Problem] [Python] [Last update: 10/05/20]
This is a Kaggle competition, the first challenge for newcomers to dive into ML competitions and familiarize themselves with how the platform works. The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. Link to the competition overview: here. Here I use Pandas, Seaborn, and scikit-learn. This project consisted of:
- visual analysis of data
- feature engineering
- initializing and training of various prediction models (highest prob.: 0.830527, Support Vector Machine)
- hyperparameter tuning using grid search (highest prob: 0.833895, Random Forest Classifier)
- outputting and submitting predictions to the Kaggle platform (score: 0.78947)
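The grid search step above boils down to trying every combination of candidate hyperparameter values and keeping the best-scoring one. The sketch below shows that loop in plain Python; `grid_search`, `toy_score`, and the parameter grid are hypothetical stand-ins, and scikit-learn's `GridSearchCV` additionally cross-validates each combination.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Hypothetical stand-in for 'train a model and return validation accuracy'."""
    return (
        1.0
        - abs(params["max_depth"] - 5) * 0.02
        - abs(params["n_estimators"] - 100) * 0.001
    )

grid = {"max_depth": [3, 5, 8], "n_estimators": [50, 100, 200]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The number of evaluations is the product of the list lengths (9 here), which is why a randomized search is often preferred when the grid grows large.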
Legend: ☑️ Currently doing ✅ Finished
- ✅ Data Science Specialization by Johns Hopkins University
- ✅ Complete Python Developer in 2020: Zero to Mastery
- Certification: here
- ✅ Complete Machine Learning and Data Science: Zero to Mastery
- Certification: here
- Statistics 110: Probability
- Fundamentals of Statistics
- ☑️ McKinney, W. (2017). Python for Data Analysis
- Myatt, G. J. & Johnson, W. P. (2014). Making Sense of Data I: A Practical Guide to Exploratory Data Analysis and Data Mining
- Myatt, G. J. & Johnson, W. P. (2009). Making Sense of Data II: A Practical Guide to Data Visualization, Advanced Data Mining Methods, and Applications
- Müller, A. C. & Guido, S. (2016). Introduction to Machine Learning with Python
- Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow