- Experiments I've been running with ML models. In the "test models" directory you will find some explorations and tests I did. The same goes for the plots directory, where you will find plots I generated in order to analyze the data.
- In main.py you will find the core of the Flask app. There you will find a route that consumes the previously trained model to predict heart disease.
- In models.py you will find the class that trains the models. This file will be modified in the near future so that it can train models for either classification or regression problems.
A predictor for heart disease.
It uses the heart.csv dataset attached in this repo. For training, a bagging technique was used along with GridSearch, with the purpose of finding the model and hyperparameters that give the best performance.
The packages for this project are listed in the requirements.txt file, so you just have to run `pip install -r requirements.txt`.
The following lines are annotations I made during the exploration and experimentation with the models.
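As a rough illustration of the approach described above (bagging tuned with GridSearch), here is a minimal sketch. The grid values are illustrative and heart.csv is replaced with synthetic data, so this is not the exact setup in models.py:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in for heart.csv: a small synthetic binary-classification dataset
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Grid-search over the number of bagged estimators (decision trees by default)
grid = GridSearchCV(
    BaggingClassifier(random_state=42),
    param_grid={"n_estimators": [10, 50, 100]},
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_test, y_test), 2))
```

GridSearchCV refits the best combination on the full training split, so `grid` itself can then be used as the final predictor.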
The first dataset we are working with is heart.csv, which contains the data needed to determine whether or not a patient has heart disease. This is a supervised ML problem, specifically a classification problem.
All datasets used in this project are in the datasets directory. They are Kaggle datasets, slightly modified to work better with our models. The first thing we try is to run a PCA algorithm to determine which features are best for predicting a result.
We used the PCA algorithm to determine how many features are really useful for prediction. We found that using 3 artificial features, generated by the PCA algorithm from the original ones, we got a score between 78% and 80%, depending on whether PCA or IPCA was used. The advantage of IPCA is that it runs better on computers with limited resources, since it processes the data in batches.
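A minimal sketch of that PCA vs. IPCA comparison, using synthetic stand-in data and a logistic regression on the 3 reduced components (the batch size is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for reducer in (PCA(n_components=3), IncrementalPCA(n_components=3, batch_size=50)):
    # Fit the reducer on the training split, then classify in the reduced space
    Xtr = reducer.fit_transform(X_train)
    Xte = reducer.transform(X_test)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, y_train)
    scores[type(reducer).__name__] = clf.score(Xte, y_test)
print(scores)
```

IncrementalPCA fits on mini-batches of `batch_size` rows instead of the whole matrix at once, which is what makes it friendlier to machines with limited memory.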
We used KernelPCA to project the data into a higher-dimensional space. We observed that the performance was quite similar to the previous PCAs.
We load the data from the felicidad.csv dataset.
Since what we are trying to predict is the happiness score of the people of a country given certain metrics, it is convenient to use regression, because the target is a continuous variable. We tried three common regression models:
- Linear Regression
- Lasso Regression
- Ridge Regression
To train the models we instantiate each one of them and pass the data as parameters. The alpha parameter indicates the penalty the model will suffer for inaccurate predictions. Also, we implemented mean squared error as our loss metric.
We observed that all of the models have an accuracy above 90%; however, Linear seems to have a much smaller loss, and Lasso gives practically zero relevance to the "corruption" feature, as opposed to Linear. Also, if we look carefully, we can see that Linear treats every feature as almost equally important. It seems that the best option for this problem is the Linear model.
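The training loop described above could be sketched like this; felicidad.csv is replaced with synthetic data and the alpha values are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Stand-in for felicidad.csv: 7 numeric features, continuous target
X, y = make_regression(n_samples=200, n_features=7, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

losses = {}
for model in (LinearRegression(), Lasso(alpha=0.02), Ridge(alpha=1.0)):
    model.fit(X_train, y_train)
    # Mean squared error on the held-out split, as in the notes above
    losses[type(model).__name__] = mean_squared_error(y_test, model.predict(X_test))
print(losses)
```

Inspecting each fitted model's `coef_` attribute is what lets you see Lasso driving some coefficients to exactly zero while LinearRegression keeps them all.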
**It is important to mention that Lasso and Ridge apply regularization to the data.**
Robust regression algorithms allow a good estimation even with outliers in our data, and we don't have to worry about dealing with them: these estimators do that work for us. In Robust.py there is an implementation of such models, and in the robusts folder there are plots that compare the predicted values to the expected ones.
- The Huber Regressor penalizes the atypical values found during training, giving them less importance. A value is considered atypical if it exceeds the given epsilon value; 1.35 is the widely recommended value.
- RANSAC trains on small batches and assumes that none of them contain atypical values. When this algorithm identifies the atypical values, it removes them from the model's training set.
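A small sketch of both robust estimators on synthetic data with injected outliers (the data and outlier magnitudes are illustrative, not the ones from Robust.py):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor, RANSACRegressor

X, y = make_regression(n_samples=200, n_features=3, noise=2.0, random_state=0)
y[:10] += 500  # inject a few outliers into the target

# Huber down-weights samples whose residual exceeds epsilon (1.35 is the default)
huber = HuberRegressor(epsilon=1.35, max_iter=1000).fit(X, y)
# RANSAC fits on random subsets and discards samples it flags as outliers
ransac = RANSACRegressor(random_state=0).fit(X, y)
print(huber.coef_, ransac.inlier_mask_.sum())
```

`ransac.inlier_mask_` shows which samples the algorithm kept as inliers, which is a quick way to check whether the injected outliers were correctly excluded.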
A bagging classifier is an ensemble meta-estimator that fits base classifiers on random subsets of the data, compares their individual predictions, and chooses a final prediction; this can be done by taking the mean of all predictions or by voting.
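To make the mechanics concrete, here is a hedged sketch comparing a single decision tree against a bagged ensemble of trees on noisy synthetic data (dataset and sizes are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, flip_y=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

single = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
# 50 trees, each trained on a random bootstrap subset; predictions are aggregated
bagged = BaggingClassifier(n_estimators=50, random_state=1).fit(X_train, y_train)
print(round(single.score(X_test, y_test), 2), round(bagged.score(X_test, y_test), 2))
```

Averaging over many trees trained on different subsets usually reduces the variance of a single overfit tree, which is the point of the meta-estimator.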
For classification problems with no labeled data, it is a good option to apply clustering techniques.
Algorithms like KMeans or MeanShift are good options.
We used the candy dataset to classify each candy into a cluster according to its properties.
KMeans can be a better choice when you know how many clusters you should have as a result; when you don't know that information, you may want to explore MeanShift.
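That trade-off can be seen in a small sketch on synthetic blobs (the candy dataset is not loaded here; the number of blobs is illustrative):

```python
from sklearn.cluster import KMeans, MeanShift
from sklearn.datasets import make_blobs

# Synthetic stand-in data with 4 well-separated groups
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# KMeans requires the number of clusters up front
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
# MeanShift estimates the number of clusters from the data itself
ms = MeanShift().fit(X)
print(len(set(km.labels_)), len(set(ms.labels_)))
```

On clearly separated blobs both usually agree; on messier data MeanShift's estimate is the useful starting point when the true number of groups is unknown.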
Cross-validation is a better way to test and validate your models. There are two options for cross-validation:
- K-Folds
- LOOCV
LOOCV is a lot more computationally intensive and requires many more resources than K-Folds. However, if you have limited data, it can be a good choice.
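A minimal sketch of both strategies on a small synthetic dataset (model, scoring, and sizes are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=60, n_features=4, noise=1.0, random_state=0)

# K-Folds: 5 fits, each validated on a fifth of the data
kf_scores = cross_val_score(Ridge(), X, y, cv=KFold(n_splits=5))
# LOOCV: one fit per sample, so 60 fits here; MSE is used because R^2 is
# undefined on single-sample validation folds
loo_scores = cross_val_score(
    Ridge(), X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
)
print(len(kf_scores), len(loo_scores))
```

The fold counts (5 vs. 60) make the resource difference obvious: LOOCV's cost grows linearly with the number of samples, which is why it only pays off on small datasets.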