Added functions for outlier detection and handling #53

Open: 1 commit to merge into base branch main

@castrokelly commented Oct 17, 2024

Added functions for outlier detection and handling

This pull request adds two new functions to the bibmon.PreProcess class for outlier detection and handling:

  • detect_outliers_iqr(df, cols):

    • Detects outliers in the specified columns of a DataFrame using the Interquartile Range (IQR) method.
    • Returns a DataFrame with the same shape as the input, where outliers are flagged with 1 and other points with 0.
    • This function can be used to identify potential outliers in the data before applying a machine learning model.
  • remove_outliers(df, cols, method='remove'):

    • Removes or handles outliers in the specified columns of a DataFrame using the IQR method.
    • Offers three methods for handling outliers:
      • remove: Removes outliers from the DataFrame.
      • median: Replaces outliers with the median value of the column.
      • winsorize: Applies winsorization to limit extreme values.
    • Returns a DataFrame with the outliers removed or handled.
    • This function can be used to preprocess the data and improve the robustness of machine learning models.
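
For readers unfamiliar with the IQR method, the behaviour described above can be sketched in plain pandas. This is a minimal illustration, not the code in this pull request: the helper names `iqr_bounds`, `flag_outliers_iqr` and `handle_outliers_iqr` are hypothetical, the fence multiplier `k = 1.5` is the conventional default rather than something stated here, `remove` is assumed to drop whole rows, and `winsorize` is assumed to clip values to the IQR fences (classical winsorization clips to fixed percentiles instead).

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) IQR fences for one column (k=1.5 is an assumption)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers_iqr(df, cols):
    """Same shape as df: 1 where a value in `cols` lies outside the fences, else 0."""
    flags = pd.DataFrame(0, index=df.index, columns=df.columns)
    for col in cols:
        lower, upper = iqr_bounds(df[col])
        flags[col] = ((df[col] < lower) | (df[col] > upper)).astype(int)
    return flags


def handle_outliers_iqr(df, cols, method="remove"):
    """Hypothetical handler: drop rows, replace with the median, or clip to the fences."""
    if method not in ("remove", "median", "winsorize"):
        raise ValueError(f"unknown method: {method!r}")
    out = df.copy()
    if method == "remove":
        # Keep only rows with no flagged value in any of the selected columns.
        keep = flag_outliers_iqr(df, cols)[cols].sum(axis=1) == 0
        return out[keep]
    for col in cols:
        lower, upper = iqr_bounds(df[col])
        if method == "median":
            mask = (df[col] < lower) | (df[col] > upper)
            out[col] = out[col].astype(float)  # the median may not be an integer
            out.loc[mask, col] = df[col].median()
        else:  # "winsorize": assumed here to mean clipping to the IQR fences
            out[col] = df[col].clip(lower=lower, upper=upper)
    return out


# Toy data: 100.0 in col1 is the only value outside the fences.
df = pd.DataFrame({"col1": [1.0, 2.0, 3.0, 4.0, 100.0],
                   "col2": [10.0, 11.0, 12.0, 13.0, 14.0]})

print(flag_outliers_iqr(df, ["col1", "col2"]))      # col1 flags: 0, 0, 0, 0, 1
print(handle_outliers_iqr(df, ["col1"], "median"))  # 100.0 becomes 3.0
```

For `col1` the quartiles are 2 and 4, so the fences are -1 and 7. Under these assumptions `remove` drops the last row, `median` replaces 100.0 with 3.0, and `winsorize` clips it to 7.0.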

Motivation:

Outliers can significantly affect the performance of machine learning models, especially those used for anomaly detection. By detecting and handling outliers, we can improve the accuracy and reliability of the models, which is crucial for the effective use of BibMon with real-world datasets like the 3W Dataset, known for its diverse and potentially noisy data.

Benefits:

  • Improved model accuracy
  • Reduced false alarms
  • Increased efficiency in data preprocessing

Example usage:

from bibmon import PreProcess

# Create a PreProcess object
preprocessor = PreProcess(f_pp=['detect_outliers_iqr', 'remove_outliers'],
                          a_pp={'detect_outliers_iqr__cols': ['col1', 'col2'],
                                'remove_outliers__cols': ['col1', 'col2'],
                                'remove_outliers__method': 'median'})

# Apply the outlier handling functions
# (df is an existing pandas DataFrame that contains the columns 'col1' and 'col2')
df_processed = preprocessor.apply(df)

By creating this pull request, I confirm that I have read and fully accept and agree with one of Petrobras' Contributor License Agreements (CLAs):

Our CLAs are based on the Apache Software Foundation's CLAs:
