This project introduces a low-code machine learning platform designed to streamline the training and deployment of models, making it accessible to users with limited technical expertise. The platform features a user-friendly interface for easy model training and deployment, catering to both novices and experienced users by offering customization options. It includes robust features such as model versioning, performance analytics, and data preprocessing, all aimed at simplifying the machine learning process while providing comprehensive insights and flexibility.
Link to demo video: https://drive.google.com/drive/folders/1TcIyn4QQDKZfMlI27sqQu95duF-Dwxjc?usp=sharing
All the APIs are present in the `APIs` folder.
Description of all files & APIs in it:
- Consists of a single endpoint which is used to deploy a model on Docker. The Dockerfile is present in the backend folder.
  - `/deployModel` (POST method): Takes in the model ID and version number as input and deploys the model on Docker by first building the image and then running the container. The Docker container is exposed on port 8080. Internally, another Flask server (app2) runs on the Docker container and is exposed on port 5000; this server is used to make predictions on the deployed model.
    The request body for this API endpoint is a JSON object with the following structure:
    { "model_id": "string or integer", "version": "string or integer" }
    It returns a JSON object which contains the status of the deployment.
    Return Object:
    { 'status': 'success' or 'failure' }
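For example, a deployment request could look like the following (a minimal sketch; the base URL, model ID, and version values are assumptions and should be adjusted to the actual setup):

```python
# Hypothetical example of calling /deployModel; host/port and IDs are assumptions.
import requests

BASE = "http://localhost:5000"  # adjust to wherever the backend Flask server runs

resp = requests.post(f"{BASE}/deployModel", json={"model_id": 42, "version": 1})
print(resp.json())  # expected: {'status': 'success'} or {'status': 'failure'}
```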
- Consists of an endpoint which performs exploratory data analysis (EDA) on a dataset.
  - `/eda/<dataset_id>` (GET method): Takes in the dataset ID as a parameter in the URL and performs EDA on the corresponding dataset. The EDA includes operations like computing the mean, median, standard deviation, minimum and maximum values for numerical columns, and the number of unique values and missing values for all columns. The results of the EDA are returned in the response body.
    The response body for this API endpoint is a JSON object. Here's the structure:
    { 'message': 'string', 'data': 'list of dictionaries representing rows of the dataset', 'columns': 'list of column names', 'column_details': 'dictionary with details for each column' }
    The column_details dictionary for each column includes the column index, column type (numerical or categorical), and various statistics depending on the column type. For numerical columns, it includes the mean, standard deviation, median, minimum and maximum values, number of missing values, and range. For categorical columns, it includes the number of unique values and number of missing values.
    Example Return Object:
    { 'message': 'EDA completed successfully', 'data': [{'column1': 'value1', 'column2': 'value2', ...}, ...], 'columns': ['column1', 'column2', ...], 'column_details': { 'column1': { 'index': 0, 'column_type': 'numerical', 'mean': 50.0, 'std_dev': 10.0, 'median': 50.0, 'min': 0, 'max': 100, 'num_missing_values': 0, 'range': 100 }, 'column2': { 'index': 1, 'column_type': 'categorical', 'num_unique_values': 5, 'num_missing_values': 0 }, ... } }
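A hedged usage sketch for the EDA endpoint (the base URL and dataset ID are assumptions):

```python
# Hypothetical example: running EDA on dataset 7 and printing per-column statistics.
import requests

BASE = "http://localhost:5000"  # adjust to the backend's actual host/port

result = requests.get(f"{BASE}/eda/7").json()
for column, details in result["column_details"].items():
    if details["column_type"] == "numerical":
        print(f"{column}: mean={details['mean']}, std={details['std_dev']}")
    else:
        print(f"{column}: {details['num_unique_values']} unique values")
```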
- Consists of dataset-related queries on the database.
  - `/getDatasets` (GET method): Returns information about all datasets present in the database as a list. Each list element is an object whose structure was given earlier.
    Return Object:
    { 'dataset_list': dataset_list }
    dataset_list is a list of objects of type dataset.
  - `/getDatasetInfo/<dataset_id>` (GET method): Returns information about the dataset having ID dataset_id.
    Return Object:
    dataset
    It directly returns an object of type dataset which corresponds to the details of the dataset with dataset ID dataset_id.
  - `/getDatasets/<dataset_id>` (GET method): Returns the dataset file corresponding to dataset ID dataset_id.
  - `/getDatasets/columns/<model_name>` (GET method): Returns the column names of the dataset on which the model corresponding to model_name was trained.
    Return Object:
    { 'target_column': target_column_name, 'non_target_columns': other_column_names_list }
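A sketch of how these dataset endpoints might be called together (the base URL, dataset ID, and model name are assumptions):

```python
# Hypothetical example: listing datasets, then fetching details and column names.
import requests

BASE = "http://localhost:5000"  # adjust to the backend's actual host/port

datasets = requests.get(f"{BASE}/getDatasets").json()["dataset_list"]
print(f"{len(datasets)} datasets in the database")

info = requests.get(f"{BASE}/getDatasetInfo/1").json()          # dataset with ID 1
columns = requests.get(f"{BASE}/getDatasets/columns/my_model").json()
print(info, columns["target_column"], columns["non_target_columns"])
```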
- Consists of queries related to fetching information about already trained models in the database.
  - `/getTrainedModels` (GET method): Returns information about all models present in the database as a list. Each list element is an object of type Model.
    Return Object:
    { 'trained_models': trained_model_list }
    trained_model_list is a list of objects of type Model.
  - `/getTrainedModels/<model_id>` (GET method): Returns information about the model having ID model_id.
    Return Object:
    Model
    It directly returns an object of type Model which corresponds to the details of the model with model ID model_id.
  - `/getTrainedModelFile/<model_id>/<version>` (GET method): Returns the pickle file of the model with model ID model_id and version number version.
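A sketch of fetching model information and downloading a model's pickle file (the base URL, model ID, and version are assumptions):

```python
# Hypothetical example: listing trained models, then downloading one model file.
import requests

BASE = "http://localhost:5000"  # adjust to the backend's actual host/port

models = requests.get(f"{BASE}/getTrainedModels").json()["trained_models"]
details = requests.get(f"{BASE}/getTrainedModels/3").json()  # model with ID 3

# The file endpoint streams the .pkl file; save it to disk.
resp = requests.get(f"{BASE}/getTrainedModelFile/3/1")        # model 3, version 1
with open("model_3_v1.pkl", "wb") as f:
    f.write(resp.content)
```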
- This script provides endpoints for making predictions using trained models. The models are identified by their IDs and are expected to be stored in the backend/Models/ directory.
  - `/inference/single` (POST method): This endpoint takes in a model ID and a list of input values, makes a prediction using the specified model, and returns the prediction. The model ID and input values are expected to be provided in the request body as a JSON object with the following structure:
    { "model_id": "string or integer", "input_values": ["value1", "value2", ...] }
    The endpoint returns a JSON object with the prediction, or an error message if the model is not found or the input values are invalid. (A usage sketch for both inference endpoints follows the batch endpoint below.)
  - `/inference/batch` (POST method): This endpoint takes in a model ID and a CSV file, makes predictions for each row in the CSV file using the specified model, and returns a CSV file with the predictions appended as a new column. The model ID is expected to be provided in the request body as a JSON object with the following structure:
    { "model_id": "string or integer" }
    The CSV file is expected to be uploaded as a file with the key 'file'. The endpoint returns a CSV file with the predictions, or an error message if the model is not found, no file is uploaded, or the file format is invalid.
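A hedged sketch of both inference requests (the base URL, model ID, input values, and file names are assumptions; exactly how the model ID is encoded alongside the uploaded file in the batch request may differ from the actual implementation):

```python
# Hypothetical examples for the single and batch inference endpoints.
import requests

BASE = "http://localhost:5000"  # adjust to the backend's actual host/port

# Single prediction: model ID plus a list of input values in a JSON body.
single = requests.post(
    f"{BASE}/inference/single",
    json={"model_id": 3, "input_values": [5.1, 3.5, 1.4, 0.2]},
)
print(single.json())  # the prediction, or an error message

# Batch prediction: upload a CSV with key 'file'; the response body is a CSV
# with a predictions column appended.
with open("new_data.csv", "rb") as f:
    batch = requests.post(
        f"{BASE}/inference/batch",
        data={"model_id": 3},
        files={"file": ("new_data.csv", f, "text/csv")},
    )
with open("predictions.csv", "wb") as out:
    out.write(batch.content)
```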
- This script provides an endpoint for preprocessing a specified dataset. The dataset is identified by its ID and is expected to be in CSV format in the backend/Datasets/ directory.
  - `/preprocess` (POST method): This endpoint takes in a dataset ID and a list of preprocessing tasks, performs the specified tasks on the dataset, and saves the preprocessed dataset in the ./processedDatasets/ directory. The dataset ID and preprocessing tasks are expected to be provided in the request body as a JSON object with the following structure:
    { "dataset_id": "string or integer", "preprocessing_steps": ["task1", "task2", ...] }
    The available tasks are "Drop Duplicate Rows", "Interpolate Missing Values", and "Normalise Features". If "Drop Duplicate Rows" is specified, the endpoint removes all duplicate rows from the dataset. If "Interpolate Missing Values" is specified, the endpoint fills in missing values in numerical columns with the column mean and in non-numerical columns with the most frequent value. If "Normalise Features" is specified, the endpoint normalizes all numerical columns to have a mean of 0 and a standard deviation of 1.
    The endpoint returns a JSON object with a success message and some statistics about the preprocessing operation, including the number of duplicate rows dropped, the number of missing values interpolated, and the list of columns normalized. The preprocessed dataset is saved as a CSV file in the ./processedDatasets/ directory with the same ID as the original dataset.
    Example Return Object:
    { 'message': 'Preprocessing completed successfully', 'num_duplicate': 5, 'num_interpolate': 10, 'normalized_columns': ['column1', 'column2', ...] }
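A sketch of a preprocessing request using all three supported tasks (the base URL and dataset ID are assumptions):

```python
# Hypothetical example: preprocessing dataset 7 with all three supported steps.
import requests

BASE = "http://localhost:5000"  # adjust to the backend's actual host/port

payload = {
    "dataset_id": 7,
    "preprocessing_steps": [
        "Drop Duplicate Rows",
        "Interpolate Missing Values",
        "Normalise Features",
    ],
}
resp = requests.post(f"{BASE}/preprocess", json=payload)
print(resp.json())  # includes num_duplicate, num_interpolate, normalized_columns
```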
- Consists of queries related to storing a new dataset in the database.
  - `/storeDataset` (POST method): Takes in a request form object in the body, saves the dataset info (name, size, etc.) in the database, and saves the actual dataset file on the local disk.
    The request body for this API endpoint is of form-data type. Here's the structure:
    { "file": (binary), "filename": (string), "filesize": (string) }
    file: the file to upload, sent as binary data.
    filename: the name of the file, sent as a string.
    filesize: the size of the file, also sent as a string.
    Return Object:
    { 'status': 'success', 'dataset_id': dataset_id, 'dataset_name': dataset_name }
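A sketch of uploading a dataset as form-data (the base URL and file path are assumptions):

```python
# Hypothetical example: uploading a local CSV file to /storeDataset.
import os
import requests

BASE = "http://localhost:5000"  # adjust to the backend's actual host/port
path = "iris.csv"               # placeholder dataset file

with open(path, "rb") as f:
    resp = requests.post(
        f"{BASE}/storeDataset",
        data={
            "filename": os.path.basename(path),
            "filesize": str(os.path.getsize(path)),
        },
        files={"file": f},
    )
print(resp.json())  # {'status': 'success', 'dataset_id': ..., 'dataset_name': ...}
```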
- Consists of an endpoint which trains a new model.
  - `/trainModel` (POST method): Takes in all the fields necessary for training a model and trains a new model accordingly. After training, it stores the model info in the database and the model pickle file on the local disk.
    The request body for the API endpoint is a JSON object:
    { "dataset_id": "string", "model_name": "string", "target_column": "string", "objective": "string", "metric_mode": "string", "metric_type": "string", "training_mode": "string", "model_type": "string", "isUpdate": "string", "hyperparameters": "object", "model_id": "string or integer" }
    It calls the appropriate subprocess depending on the fields and returns an object of type Model containing all the details of the trained model.
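A sketch of a training request (the base URL and all field values are assumptions, including the exact strings accepted for objective, training_mode, and metric_mode):

```python
# Hypothetical example: requesting an AutoML training run via /trainModel.
import requests

BASE = "http://localhost:5000"  # adjust to the backend's actual host/port

payload = {
    "dataset_id": "7",
    "model_name": "churn_classifier",
    "target_column": "churned",
    "objective": "classification",   # assumed value
    "metric_mode": "autoselect",     # assumed value
    "metric_type": "",
    "training_mode": "automl",       # assumed value
    "model_type": "",
    "isUpdate": "false",
    "hyperparameters": {},
    "model_id": "",
}
model = requests.post(f"{BASE}/trainModel", json=payload).json()
print(model)  # an object of type Model with the trained model's details
```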
- Consists of endpoints related to updating a model. A model can be updated in three ways: training on additional data, changing the hyperparameters, or changing the type of the estimator. The newly trained model is stored as a new version of the existing model.
  - `/trainOnMoreData` (POST method): Takes the new dataset file and the model details, and trains a new model on the concatenated data (the original data combined with the new data).
    The request body for this endpoint is of form-data type:
    { "original_dataset_id": "string", "file": (binary), "model_name": "string", "target_column": "string", "objective": "string", "metric_mode": "string", "metric_type": "string", "model_type": "string", "model_id": "string or integer" }
    It internally calls the /trainModel API with appropriate parameters. It returns an object which contains information about the new version of the model.
  - `/changeHyperparameters` (POST method): Takes in the new hyperparameters selected for the model and retrains it on the same dataset with the new hyperparameters.
    The request body for the API endpoint is a JSON object:
    { "model_details": { "dataset_id": "string or integer", "model_id": "string or integer", "model_name": "string", "target_column": "string", "objective": "string", "metric_mode": "string", "metric_type": "string", "estimator_type": "string" }, "new_hyperparameters": "object" }
    It internally calls the /trainModel API with appropriate parameters. It returns an object which contains information about the new version of the model.
  - `/changeEstimatorType` (POST method): Takes in the new estimator type for the current model and retrains it on the same dataset with the changed estimator. It also takes the modified hyperparameters (if modified by the user).
    The request body for the API endpoint is a JSON object:
    { "model_details": { "dataset_id": "string or integer", "model_id": "string or integer", "model_name": "string", "target_column": "string", "objective": "string", "metric_mode": "string", "metric_type": "string" }, "new_hyperparameters": "object", "estimator_type": "string" }
    It internally calls the /trainModel API with appropriate parameters. It returns an object which contains information about the new version of the model.
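For example, a hyperparameter-update request could look like the following (the base URL, IDs, metric names, estimator name, and hyperparameter names are all assumptions):

```python
# Hypothetical example: retraining model 3 with new hyperparameters.
import requests

BASE = "http://localhost:5000"  # adjust to the backend's actual host/port

payload = {
    "model_details": {
        "dataset_id": 7,
        "model_id": 3,
        "model_name": "churn_classifier",
        "target_column": "churned",
        "objective": "classification",
        "metric_mode": "manual",
        "metric_type": "f1",
        "estimator_type": "Random Forest",
    },
    "new_hyperparameters": {"n_estimators": 200, "max_depth": 10},
}
resp = requests.post(f"{BASE}/changeHyperparameters", json=payload)
print(resp.json())  # details of the new model version
```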
- Contains endpoints that provide utility information such as classifiers, regressors, parameters for a classifier, etc.
  - `/getAllClassifiers` (GET method): Returns the list of classifiers available to train.
  - `/getAllRegressors` (GET method): Returns the list of regressors available to train.
  - `/getClassifierMap` (GET method): Returns the classifier map object which maps a classifier's name to its corresponding sklearn classifier's name.
  - `/getRegressorMap` (GET method): Returns the regressor map object which maps a regressor's name to its corresponding sklearn regressor's name.
  - `/getHyperparameters` (POST method): Returns an object which contains the hyperparameters for the given estimator. The request body is a JSON object:
    { "estimator_name": "string" }
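A sketch of combining these utility endpoints (the base URL and estimator name are assumptions):

```python
# Hypothetical example: listing classifiers and fetching hyperparameters for one.
import requests

BASE = "http://localhost:5000"  # adjust to the backend's actual host/port

classifiers = requests.get(f"{BASE}/getAllClassifiers").json()
params = requests.post(
    f"{BASE}/getHyperparameters",
    json={"estimator_name": "Random Forest"},  # assumed estimator name
).json()
print(classifiers, params)
```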
It consists of functions that are used by the API files in the `APIs` folder.
Description of all files in it:
- Contains the class ClassificationUtility, which is used to train models by trainModelAutoML.py and trainModelCustom.py. It is designed to facilitate the process of training and evaluating various classification models on a given dataset. It includes methods for data preprocessing, model training, evaluation, and saving the trained model.
  Key features of this class include:
  - It supports both AutoML (automated machine learning) and custom training modes. In AutoML mode, it trains multiple classifiers and selects the best one based on a specified metric. In custom mode, it trains a specified model with given hyperparameters.
  - It handles categorical and numerical features separately during preprocessing. Categorical features with high cardinality are target encoded, while others are one-hot encoded. Numerical features are scaled (a sketch of this scheme follows the list).
  - It calculates various evaluation metrics such as accuracy, precision, recall, F1 score, and AUC (Area Under the ROC Curve) for the trained models.
  - It provides methods to get feature importance, precision-recall data, and AUC data.
  - It allows saving the best model to a specified path.
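A minimal sketch of the preprocessing scheme described above, not the actual ClassificationUtility code; it assumes scikit-learn >= 1.3 (for sklearn.preprocessing.TargetEncoder) and a hypothetical cardinality threshold of 10:

```python
# Sketch: target-encode high-cardinality categoricals, one-hot encode the rest,
# and scale numerical features, as described in the feature list above.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, TargetEncoder

def build_preprocessor(X: pd.DataFrame, cardinality_threshold: int = 10) -> ColumnTransformer:
    numerical = X.select_dtypes(include="number").columns.tolist()
    categorical = X.select_dtypes(exclude="number").columns.tolist()
    high_card = [c for c in categorical if X[c].nunique() > cardinality_threshold]
    low_card = [c for c in categorical if c not in high_card]
    return ColumnTransformer([
        ("scale", StandardScaler(), numerical),                       # scale numerical features
        ("target_enc", TargetEncoder(), high_card),                   # high-cardinality categoricals
        ("one_hot", OneHotEncoder(handle_unknown="ignore"), low_card) # remaining categoricals
    ])
```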
- Contains the class RegressionUtility, which is used to train models by trainModelAutoML.py and trainModelCustom.py. It is designed to facilitate the process of training and evaluating various regression models on a given dataset. It includes methods for data preprocessing, model training, evaluation, and saving the trained model.
  Key features of this class include:
  - It supports both AutoML (automated machine learning) and custom training modes. In AutoML mode, it trains multiple regressors and selects the best one based on a specified metric (a sketch of this selection loop follows the list). In custom mode, it trains a specified model with given hyperparameters.
  - It handles categorical and numerical features separately during preprocessing. Categorical features with high cardinality are target encoded, while others are one-hot encoded. Numerical features are scaled.
  - It calculates various evaluation metrics such as R2 score, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) for the trained models.
  - It provides methods to get feature importance, scatter plot data, and residual plot data.
  - It allows saving the best model to a specified path.
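A minimal sketch of an AutoML-style selection loop over regressors, not the actual RegressionUtility implementation; the candidate estimators and the metric are assumptions:

```python
# Sketch: fit several candidate regressors and keep the one with the best metric.
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def select_best_regressor(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    candidates = {"linear": LinearRegression(), "random_forest": RandomForestRegressor()}
    results = {}
    for name, est in candidates.items():
        est.fit(X_tr, y_tr)
        results[name] = mean_squared_error(y_te, est.predict(X_te))  # lower is better
    best = min(results, key=results.get)
    return candidates[best], results
```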
- Contains the function which trains a model for classification/regression in AutoML mode, i.e. it trains all the available classifiers/regressors on the dataset and stores the results/metrics in the database.
  Brief outline of how the function works:
  - It reads the dataset from a CSV file.
  - If the metric_mode is set to 'autoselect', it automatically selects an appropriate evaluation metric based on the task (classification or regression) and the balance of the target classes (for classification); a sketch of one possible rule appears after this outline.
  - It initializes a ClassificationUtility or RegressionUtility object based on the task.
  - It trains the model using AutoML.
  - It retrieves the best model and its parameters based on the specified or auto-selected metric.
  - It prepares a list of evaluation metrics and model parameters.
  - It saves the best model to a file.
  - It generates various visualization data such as the confusion matrix, feature importance, precision-recall curve, ROC curve, scatter plot, and residual plot.
  - It saves the model metadata, including training time, model ID, model name, training mode, estimator type, metric mode, metric type, saved model path, dataset ID, objective, target column, parameters, evaluation metrics, all models' results, input schema, output schema, output mapping, and graph data, to a MongoDB collection named 'Model_zoo'.
  - It returns the model metadata as a dictionary. The function is designed to be run from the command line with arguments for dataset ID, model name, model type, hyperparameters, target column, metric mode, metric type, objective, model ID, and update status.
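A minimal sketch of what an 'autoselect' rule could look like, not the actual implementation; the imbalance threshold and the chosen metric names are assumptions:

```python
# Sketch: pick a metric from the task type and, for classification, class balance.
import pandas as pd

def autoselect_metric(y: pd.Series, objective: str) -> str:
    if objective == "regression":
        return "r2"
    # For classification, fall back to F1 when the classes are imbalanced.
    class_share = y.value_counts(normalize=True)
    return "accuracy" if class_share.min() >= 0.3 else "f1"
```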
- Provides functionality similar to trainModelAutoML.py, with the only difference being that it trains a single specified model instead of training all available models.
The enums folder contains the enums.py file, which contains enumerations for metrics, classifiers, and regressors.
It is used to store the .csv files of the datasets on the local disk.
It is used to store the .pkl files of trained models on the local disk.
Worked under the guidance of Dr. Karthik Vaidhyanathan