Add Grid Search Functionality for Best Hyperparameter Tuning #52

Open · Reinaldo-Kn wants to merge 8 commits into main

Conversation

@Reinaldo-Kn commented on Oct 16, 2024

The GridSearchModel class encapsulates the process of hyperparameter tuning using GridSearchCV, providing a modular and flexible interface for model selection and training. It includes methods for fitting a model, predicting, saving/loading models, and additional utility functions.

Parameters

model: sklearn estimator, optional (default=RandomForestRegressor())
    The machine learning model to be tuned. This can be any model compatible with GridSearchCV.

param_grid: dict, optional (default={'n_estimators': [10, 100], 'max_depth': [None, 10], 'min_samples_split': [2, 4]})
    A dictionary containing the hyperparameters and their respective ranges for the grid search. The grid will search through all possible combinations of these hyperparameters.

scoring: str, optional (default='neg_mean_absolute_error')
    The scoring metric used for evaluating the models during grid search. This should be a valid scoring metric recognized by scikit-learn.

cv: int, optional (default=5)
    The number of cross-validation folds to be used during the grid search.

test_size: float, optional (default=0.01)
    The proportion of the dataset to use as the test set. The default is 1% of the data.
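
For reference, here is a minimal sketch of how these defaults might be wired into the constructor. This is illustrative only, not code from the PR; the attribute names (e.g. best_estimator_) are assumptions:

from sklearn.ensemble import RandomForestRegressor

class GridSearchModel:
    def __init__(self, model=None, param_grid=None,
                 scoring='neg_mean_absolute_error', cv=5, test_size=0.01):
        # Mutable defaults are created inside __init__ so instances
        # never share the same dict/estimator object.
        self.model = model if model is not None else RandomForestRegressor()
        self.param_grid = param_grid if param_grid is not None else {
            'n_estimators': [10, 100],
            'max_depth': [None, 10],
            'min_samples_split': [2, 4],
        }
        self.scoring = scoring
        self.cv = cv
        self.test_size = test_size
        self.best_estimator_ = None  # populated by fit()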

Methods

  • fit(self, df, target_column)
    Fits the model using grid search on the specified dataset and target variable (a sketch of a possible implementation follows this list).
    Parameters:
        df (pandas.DataFrame): The input dataset containing both features and the target variable.
        target_column (str): The name of the column that represents the target variable (the variable to predict).
    Returns:
        best_estimator_ (sklearn estimator): The model fitted with the best hyperparameters found during the grid search.
  • predict(self, X)
    Makes predictions using the best model found by grid search.
    Parameters:
        X (pandas.DataFrame or numpy.ndarray): Input data to make predictions on.
    Returns:
        y_pred (numpy.ndarray): The predicted values.
  • score(self, X_test, y_test)
    Evaluates the performance of the best model on a test dataset.
    Parameters:
        X_test (pandas.DataFrame or numpy.ndarray): Features of the test set.
        y_test (pandas.Series or numpy.ndarray): True values of the test set.
    Returns:
        score (float): The performance score of the best model on the test set.
  • get_best_params(self)
    Returns the best hyperparameters found by the grid search.
    Returns:
        best_params (dict): Dictionary of the best hyperparameters.
  • save_model(self, filename)
    Saves the best model to a file.
    Parameters:
        filename (str): The path where the model should be saved.
    Returns:
        None
  • load_model(self, filename)
    Loads a previously saved model from a file.
    Parameters:
        filename (str): The path where the model is saved.
    Returns:
        None
  • plot_feature_importance(self, feature_names)
    Plots the feature importances from the best model. Only works for models that have the feature_importances_ attribute, such as RandomForest.
    Parameters:
        feature_names (list): List of feature names in the same order as they appear in the dataset.
    Returns:
        None
  • cross_val_score_summary(self, X, Y)
    Generates a summary of cross-validation scores for the best model.
    Parameters:
        X (pandas.DataFrame): The features for cross-validation.
        Y (pandas.Series): The target variable for cross-validation.
    Returns:
        summary (dict): A dictionary containing the mean and standard deviation of the cross-validation scores, along with individual fold scores.
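
The following sketch shows one way these methods could fit together, continuing the constructor sketch above. It is illustrative, not the exact code in this PR: it assumes fit() holds out test_size of the data internally with train_test_split, and the 'fold_scores' key is an invented name ('mean_score' and 'std_dev' match the example usage below). Only standard scikit-learn, joblib, numpy, and matplotlib calls are used:

import joblib
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

class GridSearchModel:
    # ... __init__ as sketched under "Parameters" ...

    def fit(self, df, target_column):
        X = df.drop(columns=[target_column])
        y = df[target_column]
        # Hold out test_size of the data; grid search runs on the rest.
        # Storing the holdout on self is an assumption of this sketch.
        X_train, self.X_test_, y_train, self.y_test_ = train_test_split(
            X, y, test_size=self.test_size)
        search = GridSearchCV(self.model, self.param_grid,
                              scoring=self.scoring, cv=self.cv)
        search.fit(X_train, y_train)
        self.best_params_ = search.best_params_
        self.best_estimator_ = search.best_estimator_
        return self.best_estimator_

    def predict(self, X):
        return self.best_estimator_.predict(X)

    def score(self, X_test, y_test):
        # Delegates to the estimator's default scorer (R^2 for regressors).
        return self.best_estimator_.score(X_test, y_test)

    def get_best_params(self):
        return self.best_params_

    def save_model(self, filename):
        joblib.dump(self.best_estimator_, filename)

    def load_model(self, filename):
        self.best_estimator_ = joblib.load(filename)

    def plot_feature_importance(self, feature_names):
        # Requires an estimator exposing feature_importances_.
        importances = self.best_estimator_.feature_importances_
        order = np.argsort(importances)[::-1]  # most important first
        plt.bar([feature_names[i] for i in order], importances[order])
        plt.xticks(rotation=90)
        plt.ylabel('Importance')
        plt.tight_layout()
        plt.show()

    def cross_val_score_summary(self, X, Y):
        scores = cross_val_score(self.best_estimator_, X, Y,
                                 scoring=self.scoring, cv=self.cv)
        return {'mean_score': scores.mean(),
                'std_dev': scores.std(),
                'fold_scores': scores.tolist()}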

Example Usage

from sklearn.ensemble import RandomForestRegressor
from _gridsearch import GridSearchModel

# Initialize the model with default or custom hyperparameters
grid_model = GridSearchModel(
    model=RandomForestRegressor(),
    param_grid={'n_estimators': [50, 100], 'max_depth': [None, 10]},
    scoring='neg_mean_squared_error',
    cv=5,
    test_size=0.2
)

# Fit the model to the dataset (df is a pandas DataFrame that
# includes the 'target_column' to predict)
best_model = grid_model.fit(df, 'target_column')

# Get the best hyperparameters
best_params = grid_model.get_best_params()
print("Best Parameters: ", best_params)

# Predict on new data
y_pred = grid_model.predict(X_new)

# Evaluate the model on the test set
test_score = grid_model.score(X_test, y_test)
print("Test Score: ", test_score)

# Save the model
grid_model.save_model('best_model.joblib')

# Plot feature importance
grid_model.plot_feature_importance(feature_names)

# Get cross-validation summary
cv_summary = grid_model.cross_val_score_summary(X, Y)
print("CV Mean Score: ", cv_summary['mean_score'])
print("CV Standard Deviation: ", cv_summary['std_dev'])

You can view the new functions in Colab.
