
In this little project I am practicing ML techniques to predict things like whether a patient has heart disease.


joethecoderr/ml_predictions


What will you find in this repo?

- Experiments I've been running with ML models. In the "test models" directory you will find some explorations and tests that I did. The same goes for the plots directory, where you will find plots I generated to analyze the data.
- In main.py you will find the core of the Flask app, including a route that consumes the previously trained model to predict heart disease.
- In models.py you will find the class that trains the models. This file will be modified in the near future so it can train models depending on whether the problem is classification or regression.

What is currently working?

A predictor for heart disease.

How was it trained?

Using the heart.csv dataset included in this repo. Training used a bagging technique together with GridSearch, with the purpose of finding the model and hyperparameters that give the best performance.
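The combination above can be sketched roughly as follows. This is a minimal illustration assuming scikit-learn (which the technique names suggest), with a synthetic dataset standing in for heart.csv, not the exact training code:

```python
# Sketch of bagging + GridSearch; synthetic data stands in for heart.csv
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Search over bagging hyperparameters for the best-performing combination
param_grid = {"n_estimators": [10, 50], "max_samples": [0.5, 1.0]}
grid = GridSearchCV(BaggingClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))
```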

The packages for this project are listed in the requirements.txt file, so you just have to run pip install -r requirements.txt

run_app.sh is a shortcut that activates the venv (I advise you to name it venv) and runs the app.

The following notes are annotations I made during the exploration and experimentation with the models.

The first dataset we are working with is heart.csv, which contains the data to determine whether or not a patient has heart disease. This is a supervised ML problem, since it is a classification task.

Heart Disease

1 - Loading the data for predicting heart disease

All datasets used in this project are in the datasets directory. They are Kaggle datasets, slightly modified to work better with our models. The first thing we try is running a PCA algorithm to determine which features are best for predicting a result.

2 - Using PCA

We used the PCA algorithm to determine how many features are really useful for prediction. We found that using 3 artificial features, generated by the PCA algorithm from the original ones, we got a score between 78% and 80%, depending on whether PCA or IPCA was used. The advantage of IPCA is that it runs better on computers with limited resources.
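A minimal sketch of the PCA-vs-IPCA comparison, assuming scikit-learn; the bundled breast-cancer dataset stands in for heart.csv, so the exact scores will differ:

```python
# PCA vs. IncrementalPCA with 3 components, each feeding a classifier
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import IncrementalPCA, PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for reducer in (PCA(n_components=3), IncrementalPCA(n_components=3, batch_size=50)):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(reducer.fit_transform(X_train), y_train)
    score = clf.score(reducer.transform(X_test), y_test)
    print(type(reducer).__name__, round(score, 3))
```

IncrementalPCA processes the data in batches (batch_size above), which is why it fits on machines that cannot hold the full matrix in memory at once.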

3 - Using KPCA

We used KernelPCA to project the data into a higher-dimensional space. We observed that the performance was quite similar to the previous PCAs.
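The same setup with KernelPCA, again as a scikit-learn sketch on a stand-in dataset; the kernel choice here is illustrative:

```python
# KernelPCA projects through a non-linear kernel before reducing to 3 components
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

kpca = KernelPCA(n_components=3, kernel="poly")
clf = LogisticRegression(max_iter=1000)
clf.fit(kpca.fit_transform(X_train), y_train)
print(round(clf.score(kpca.transform(X_test), y_test), 3))
```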

Happiness

1 - Loading the data

We load the data from the felicidad.csv dataset.

2 - Selection of models

Since we are trying to predict the happiness score of the people of a country given certain metrics, it is convenient to use regression, because the variable we are trying to predict is continuous. We tried three common regression models:

- Linear Regression
- Lasso Regression
- Ridge Regression

3 - Implementation of models

To train the models we instantiate each one of them and pass the data as parameters. The alpha parameter controls the strength of the regularization penalty in Lasso and Ridge. We also used mean squared error as our loss metric.
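The three models can be compared side by side like this; a minimal scikit-learn sketch with synthetic data standing in for felicidad.csv, and illustrative alpha values:

```python
# Fit LinearRegression, Lasso, and Ridge and compare mean squared error
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=7, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# alpha sets how strongly Lasso/Ridge shrink their coefficients
for model in (LinearRegression(), Lasso(alpha=0.02), Ridge(alpha=1.0)):
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(type(model).__name__, round(mse, 2))
```

Inspecting each fitted model's `coef_` is how you can see Lasso zeroing out a feature, as described in the conclusions below.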

4 - Conclusions

We observed that all of the models have an accuracy above 90%. However, Linear Regression seems to have a much smaller loss, and Lasso gives technically zero relevance to the "corruption" feature, unlike Linear Regression. Also, if we look carefully, we can see that Linear Regression treats every feature as almost equally important. It seems that the best option for this problem is the Linear model.

**It is important to mention that Lasso and Ridge apply regularization to the data**

Happiness with robust regression

Robust regression algorithms allow a good estimation even with outliers in our data, so we don't have to worry about dealing with them; these estimators do that work for us. In Robust.py there is an implementation of such models, and in the robusts folder there are plots that compare the predicted values to the expected ones.

- HuberRegressor penalizes the atypical values found during training, giving them less importance. A value is considered atypical if it exceeds the given epsilon value; 1.35 is the widely recommended value.
- RANSAC trains on small batches and assumes they contain no atypical values. When this algorithm identifies atypical values, it removes them from the model's training set.
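Both estimators can be exercised on data with deliberately injected outliers; a scikit-learn sketch (the outlier injection and dataset are illustrative, not from Robust.py):

```python
# Huber and RANSAC stay close to the true relationship despite outliers
from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor, RANSACRegressor

X, y = make_regression(n_samples=200, n_features=3, noise=4.0, random_state=0)
y[:10] += 500.0  # corrupt the first 10 targets with large outliers

huber = HuberRegressor(epsilon=1.35).fit(X, y)  # 1.35: the recommended epsilon
ransac = RANSACRegressor(random_state=0).fit(X, y)

# Score on the uncorrupted portion of the data
print(round(huber.score(X[10:], y[10:]), 3))
print(round(ransac.score(X[10:], y[10:]), 3))
```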

Heart disease with bagging techniques

A bagging classifier is an ensemble meta-estimator that fits base classifiers on random subsets of the data, compares their individual predictions, and chooses a final prediction. This can be done by taking the mean of all predictions or by voting.
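In scikit-learn terms (assuming that is the library in use), the meta-estimator exposes the individually trained base classifiers whose votes it aggregates:

```python
# Each base estimator is fit on its own random subset; predict() aggregates them
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, random_state=1)
bag = BaggingClassifier(n_estimators=25, max_samples=0.8, random_state=1)
bag.fit(X, y)

print(len(bag.estimators_))  # 25 independently trained base classifiers
print(round(bag.score(X, y), 3))
```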

Clustering

For classification problems with no labeled data, it is a good option to apply clustering techniques.
Algorithms like KMeans or MeanShift are good options. We used the candy dataset to assign each candy to a cluster according to its properties. KMeans can be the better choice when you know how many clusters you should end up with; when you don't know that information, you may want to explore MeanShift.
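That trade-off looks like this in a scikit-learn sketch (synthetic blobs stand in for the candy dataset):

```python
# KMeans needs the cluster count up front; MeanShift estimates it from the data
from sklearn.cluster import KMeans, MeanShift
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)

km = KMeans(n_clusters=4, n_init=10, random_state=7).fit(X)
ms = MeanShift().fit(X)

print(len(set(km.labels_)))  # 4, as requested
print(len(set(ms.labels_)))  # cluster count estimated by MeanShift itself
```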

Crossvalidation

Cross-validation is a better way to test and validate your models. There are two common options for cross-validation:
- K-Folds
- LOOCV

LOOCV is much more intensive and requires far more resources than K-Folds. However, if you have limited data, it could be a good choice.
