Add Functions for Evaluating Target Variables in Predictive Modeling #57

Open: wants to merge 10 commits into base: main

Reinaldo-Kn

This pull request adds a set of functions in _best_columns.py that evaluate every column of a dataset as a candidate target variable, using both statistical measures and predictive performance. The functions rely on correlation coefficients, regression error metrics, feature importance, and mutual information to indicate which columns are most relevant for predictive modeling.

Functions

  • bestColumn_pearson_spearman( )

Calculates the Pearson and Spearman correlation coefficients between all pairs of columns in the provided DataFrame. For each method, it identifies the column with the highest average positive correlation with the other columns. Pearson measures linear relationships, while Spearman measures monotonic relationships.

    Parameters:
        df (pd.DataFrame): The input DataFrame containing the dataset.
    Returns:
        dict: A dictionary containing the best column for each correlation method:
            pearson: The column with the highest average Pearson correlation.
            spearman: The column with the highest average Spearman correlation.
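
A minimal sketch of the idea described above, not the code from this PR. The function name best_column_by_correlation and the toy DataFrame are hypothetical, and it assumes the "average correlation" is the signed mean over the other columns, with the diagonal excluded. The PR may instead filter to positive values or use absolute values.

```python
import numpy as np
import pandas as pd


def best_column_by_correlation(df: pd.DataFrame) -> dict:
    """Return, per method, the column with the highest mean correlation to the others.

    Hypothetical helper: assumes a signed mean over off-diagonal entries.
    """
    numeric = df.select_dtypes(include="number")
    best = {}
    for method in ("pearson", "spearman"):
        corr = numeric.corr(method=method)
        # Blank out the diagonal so each column's self-correlation (always 1) is ignored.
        off_diag = corr.mask(np.eye(len(corr), dtype=bool))
        # mean() skips NaN by default, so only the other columns are averaged.
        best[method] = off_diag.mean(axis=1).idxmax()
    return best


# Toy data: "a" and "b" are strongly related, "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + rng.normal(scale=0.1, size=200),
        "c": rng.normal(size=200),
    }
)
result = best_column_by_correlation(demo)
print(result)
```

On this toy data the winner for both methods should be either a or b, since c is unrelated to the other two.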
  • bestColumn_with_least_mae_or_r2( )

Treats each column in turn as the regression target and calculates the Mean Absolute Error (MAE) and R-squared (R²) of predictions made by an XGBoost regressor. It identifies which column minimizes MAE and which maximizes R², indicating the best target variable in terms of predictive performance.

    Parameters:
        df (pd.DataFrame): The input DataFrame containing the dataset.
    Returns:
        dict: A dictionary with sorted results for MAE and R²:
            mae: Sorted list of columns minimizing Mean Absolute Error.
            r2: Sorted list of columns maximizing R-squared.
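
The loop described above could look roughly like the following sketch. It is an illustration under assumptions, not the PR's implementation: the name rank_targets_by_error, the 75/25 train/test split, the n_estimators value, and the toy data are all hypothetical. The fallback to scikit-learn's GradientBoostingRegressor when xgboost is not installed is also an addition of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBRegressor as Regressor
except ImportError:
    # Assumption of this sketch only: stand-in model when xgboost is unavailable.
    from sklearn.ensemble import GradientBoostingRegressor as Regressor


def rank_targets_by_error(df: pd.DataFrame, test_size: float = 0.25, seed: int = 0) -> dict:
    """Fit one regressor per candidate target and rank the targets by MAE and R²."""
    numeric = df.select_dtypes(include="number").dropna()
    scores = []
    for target in numeric.columns:
        X = numeric.drop(columns=target)  # every other column is a feature
        y = numeric[target]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = Regressor(n_estimators=50, random_state=seed)
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        scores.append((target, mean_absolute_error(y_te, pred), r2_score(y_te, pred)))
    return {
        # Lowest MAE first.
        "mae": sorted(((c, m) for c, m, _ in scores), key=lambda t: t[1]),
        # Highest R² first.
        "r2": sorted(((c, r) for c, _, r in scores), key=lambda t: t[1], reverse=True),
    }


# Toy data: "a" and "b" predict each other well, "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + rng.normal(scale=0.1, size=300),
        "c": rng.normal(size=300),
    }
)
ranking = rank_targets_by_error(demo)
print(ranking["mae"])
print(ranking["r2"])
```

One caveat: MAE is expressed in the units of the target, so comparing it across columns with different scales favors small-valued columns. R² is scale-free and is the safer of the two for cross-column comparison unless the columns are standardized first.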
  • bestColumn_feature_importance( )

Evaluates the importance of each feature by training an XGBoost regressor for each column and averaging the resulting feature importances. This helps identify which columns contribute most to predicting the others.

    Parameters:
        df (pd.DataFrame): The input DataFrame containing the dataset.
    Returns:
        list: A sorted list of feature importances for each column, indicating their contribution to predictive modeling.
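
The description leaves open how the average is taken. The sketch below shows one plausible reading, stated as an assumption: each column is used once as the target, and every other column's importance is then averaged over the models in which it appeared as a feature. The name rank_columns_by_importance, the toy data, and the scikit-learn fallback are hypothetical and not taken from the PR.

```python
import numpy as np
import pandas as pd

try:
    from xgboost import XGBRegressor as Regressor
except ImportError:
    # Assumption of this sketch only: stand-in model when xgboost is unavailable.
    from sklearn.ensemble import GradientBoostingRegressor as Regressor


def rank_columns_by_importance(df: pd.DataFrame, seed: int = 0) -> list:
    """Average each column's feature importance over all models where it is a feature."""
    numeric = df.select_dtypes(include="number").dropna()
    collected = {col: [] for col in numeric.columns}
    for target in numeric.columns:
        X = numeric.drop(columns=target)
        model = Regressor(n_estimators=50, random_state=seed)
        model.fit(X, numeric[target])
        # feature_importances_ is aligned with the column order of X.
        for feature, weight in zip(X.columns, model.feature_importances_):
            collected[feature].append(float(weight))
    averaged = {col: float(np.mean(vals)) for col, vals in collected.items()}
    # Most important column first.
    return sorted(averaged.items(), key=lambda t: t[1], reverse=True)


# Toy data: "a" and "b" explain each other, "c" explains nothing.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + rng.normal(scale=0.1, size=300),
        "c": rng.normal(size=300),
    }
)
importance_ranking = rank_columns_by_importance(demo)
print(importance_ranking)
```

Because the importances within a single model sum to 1, averaging per feature across models (rather than per model) is what makes the scores differ between columns.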

  • bestColumn_mutual_information( )

Calculates the mutual information scores between each column and the other columns in the DataFrame. Mutual information quantifies the amount of information obtained about one variable through the other, helping to determine which features are most informative for predicting the target variable.

    Parameters:
        df (pd.DataFrame): The input DataFrame containing the dataset.
    Returns:
        list: A sorted list of mutual information scores for each column, providing insights into their informational value.
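
As a rough illustration of the scoring described above, the sketch below uses scikit-learn's mutual_info_regression. Whether the PR uses that estimator is not stated here, so treat it as an assumption, along with the name rank_columns_by_mutual_information, the choice to average a column's scores over the other columns, and the toy data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression


def rank_columns_by_mutual_information(df: pd.DataFrame, seed: int = 0) -> list:
    """Score each column by its mean estimated mutual information with the other columns."""
    numeric = df.select_dtypes(include="number").dropna()
    scores = {}
    for target in numeric.columns:
        X = numeric.drop(columns=target)
        # One non-negative estimate per remaining column.
        mi = mutual_info_regression(X, numeric[target], random_state=seed)
        scores[target] = float(mi.mean())
    # Most informative column first.
    return sorted(scores.items(), key=lambda t: t[1], reverse=True)


# Toy data: "a" and "b" share a lot of information, "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + rng.normal(scale=0.1, size=300),
        "c": rng.normal(size=300),
    }
)
mi_ranking = rank_columns_by_mutual_information(demo)
print(mi_ranking)
```

Unlike Pearson or Spearman correlation, mutual information also picks up non-monotonic dependence, so a column can rank high here while ranking low on the correlation-based function.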

You can view the new functions in Colab
