Add Functions for Evaluating Target Variables in Predictive Modeling #57
This pull request introduces a set of functions in _best_columns.py designed to evaluate and identify the best target variable from a given dataset based on various statistical measures and predictive performance. The functions leverage correlation metrics, regression error metrics, feature importance, and mutual information to provide insights into the most relevant columns for predictive modeling.
Functions
bestColumn_pearson_spearman( )
Calculates the Pearson and Spearman correlation coefficients between all columns in the provided DataFrame. It identifies the column with the highest average positive correlation with the other columns, using both correlation methods. Pearson measures linear relationships, while Spearman measures monotonic relationships.
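The idea can be sketched roughly as follows. This is an illustration, not the code from _best_columns.py: the helper name average_correlation, the synthetic data, and the choice to exclude the diagonal before averaging are all assumptions.

```python
import numpy as np
import pandas as pd


def average_correlation(df: pd.DataFrame, method: str) -> pd.Series:
    """Mean correlation of each column with every other column (hypothetical helper)."""
    corr = df.corr(method=method)
    # Mask the diagonal: a column's correlation with itself is always 1
    # and would inflate every average equally.
    off_diagonal = corr.where(~np.eye(len(corr), dtype=bool))
    return off_diagonal.mean(axis=1)


# Synthetic data: "b" depends strongly on "a"; "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pearson_avg = average_correlation(df, "pearson")
spearman_avg = average_correlation(df, "spearman")
best_pearson = pearson_avg.idxmax()
best_spearman = spearman_avg.idxmax()
print(best_pearson, best_spearman)
```

Whether negative correlations should be averaged as-is, clipped to zero, or taken in absolute value is a design choice; the sketch averages the signed values.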
bestColumn_with_least_mae_or_r2( )
Evaluates each column as a target variable for regression and calculates the Mean Absolute Error (MAE) and R-squared (R²) scores for predictions made by an XGBoost regressor. It identifies which column minimizes MAE and maximizes R², providing insights on the best target variable based on predictive performance.
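A minimal sketch of this loop is shown below, under several assumptions. To keep the example light on dependencies it swaps in scikit-learn's GradientBoostingRegressor in place of the XGBoost regressor the PR describes. The helper name score_each_target, the 75/25 train/test split, and the synthetic data are also illustrative and not taken from the PR.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBoost
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def score_each_target(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Treat every column in turn as the target and score a regressor on it."""
    rows = {}
    for target in df.columns:
        X = df.drop(columns=target)
        y = df[target]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=seed
        )
        model = GradientBoostingRegressor(random_state=seed).fit(X_train, y_train)
        pred = model.predict(X_test)
        rows[target] = {
            "mae": mean_absolute_error(y_test, pred),
            "r2": r2_score(y_test, pred),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic data: "a" and "b" predict each other; "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

scores = score_each_target(df)
best_by_mae = scores["mae"].idxmin()
best_by_r2 = scores["r2"].idxmax()
print(scores)
```

MAE is expressed in the units of the target, so comparing it across columns with different scales can be misleading unless the columns are standardized first. R² is scale-free and may be the safer basis for a cross-column comparison.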
bestColumn_feature_importance( )
Evaluates the importance of each feature by training an XGBoost regressor with each column in turn as the target and computing the average feature importance across those models. This helps to identify which columns contribute most to predicting the target variable.
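One plausible reading of this procedure is sketched below. It is an assumption about how the averaging works, not the PR's actual code. As a stand-in for XGBoost it uses scikit-learn's GradientBoostingRegressor, which exposes a comparable feature_importances_ attribute. The helper name average_feature_importance and the data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBoost


def average_feature_importance(df: pd.DataFrame, seed: int = 0) -> pd.Series:
    """Average each column's importance over the models in which it is a feature."""
    totals = pd.Series(0.0, index=df.columns)
    for target in df.columns:
        X = df.drop(columns=target)
        model = GradientBoostingRegressor(random_state=seed).fit(X, df[target])
        # Importances are aligned with X.columns, which excludes the target.
        totals.loc[X.columns] += model.feature_importances_
    # Every column acts as a feature in all models except its own.
    return totals / (len(df.columns) - 1)


# Synthetic data: "a" and "b" predict each other; "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

avg_importance = average_feature_importance(df)
print(avg_importance.sort_values(ascending=False))
```

Note that this ranks columns by how useful they are as predictors of the others, which is a different question from how well a column can itself be predicted.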
bestColumn_mutual_information( )
Calculates the mutual information scores between each column and the other columns in the DataFrame. Mutual information quantifies the amount of information obtained about one variable through the other, helping to determine which features are most informative for predicting the target variable.
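As a rough illustration, the sketch below estimates mutual information with scikit-learn's mutual_info_regression and averages it per column. Whether the PR uses this estimator is an assumption, and the helper name average_mutual_information and the synthetic data are hypothetical. The data includes a non-monotonic dependency (a squared term) that a correlation coefficient would largely miss.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression


def average_mutual_information(df: pd.DataFrame, seed: int = 0) -> pd.Series:
    """Mean estimated mutual information between each column and the others."""
    scores = {}
    for target in df.columns:
        X = df.drop(columns=target)
        # One non-negative estimate per remaining column, in nats.
        mi = mutual_info_regression(X, df[target], random_state=seed)
        scores[target] = mi.mean()
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic data: "b" is a noisy square of "a"; "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
df = pd.DataFrame({
    "a": a,
    "b": a ** 2 + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
})

avg_mi = average_mutual_information(df)
best_by_mi = avg_mi.index[0]
print(avg_mi)
```

The estimator is based on nearest neighbors, so scores vary somewhat with sample size and the n_neighbors setting; they are best read as a ranking rather than as exact quantities.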
You can view the new functions in Colab