Added functions for outlier detection and handling #53

castrokelly · 2024-10-17T04:52:03Z

This pull request adds two new functions to the bibmon.PreProcess class for outlier detection and handling:

detect_outliers_iqr(df, cols):
- Detects outliers in the specified columns of a DataFrame using the Interquartile Range (IQR) method.
- Returns a DataFrame with the same shape as the input, where outliers are flagged with 1 and other points with 0.
- This function can be used to identify potential outliers in the data before applying a machine learning model.
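As a rough illustration of the IQR rule described above (not the code from this PR), a minimal sketch could look like the following. The name `detect_outliers_iqr_sketch` and the fence multiplier `k=1.5` are assumptions: the description does not say which multiplier the implementation uses, and 1.5 is only the conventional choice.

```python
import pandas as pd


def detect_outliers_iqr_sketch(df: pd.DataFrame, cols: list, k: float = 1.5) -> pd.DataFrame:
    """Hypothetical helper: flag values outside [Q1 - k*IQR, Q3 + k*IQR] with 1, others with 0."""
    # Start from an all-zero frame with the same shape, index and columns as the input.
    flags = pd.DataFrame(0, index=df.index, columns=df.columns)
    for col in cols:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr  # k = 1.5 is an assumed default
        flags[col] = ((df[col] < lower) | (df[col] > upper)).astype(int)
    return flags


df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0, 100.0],     # 100.0 lies far above the upper fence
    "col2": [10.0, 11.0, 12.0, 13.0, 14.0],  # no extreme values
})
flags = detect_outliers_iqr_sketch(df, ["col1", "col2"])
```

Columns not listed in `cols` stay at 0 in this sketch, which keeps the output shape equal to the input shape.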
remove_outliers(df, cols, method='remove'):
- Removes or handles outliers in the specified columns of a DataFrame using the IQR method.
- Offers three methods for handling outliers:
  - remove: Removes outliers from the DataFrame.
  - median: Replaces outliers with the median value of the column.
  - winsorize: Applies winsorization to limit extreme values.
- Returns a DataFrame with the outliers removed or handled.
- This function can be used to preprocess the data and improve the robustness of machine learning models.
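The three handling options listed above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the name `handle_outliers_sketch`, the multiplier `k=1.5`, the use of the full-column median as the replacement value, and the choice to approximate winsorization by clipping to the IQR fences are all assumptions, since the description does not specify them.

```python
import pandas as pd


def handle_outliers_sketch(df: pd.DataFrame, cols: list, method: str = "remove", k: float = 1.5) -> pd.DataFrame:
    """Hypothetical helper: remove, median-replace, or clip IQR outliers in the given columns."""
    if method not in {"remove", "median", "winsorize"}:
        raise ValueError(f"unknown method: {method!r}")
    out = df.copy()
    keep = pd.Series(True, index=df.index)  # rows to keep when method == "remove"
    for col in cols:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr  # k = 1.5 is an assumed default
        is_outlier = (df[col] < lower) | (df[col] > upper)
        if method == "remove":
            keep &= ~is_outlier  # drop a row if any listed column is an outlier
        elif method == "median":
            # Assumption: replace with the median of the whole column, outliers included.
            out.loc[is_outlier, col] = df[col].median()
        else:
            # Assumption: "winsorize" is approximated by clipping to the IQR fences.
            out[col] = df[col].clip(lower=lower, upper=upper)
    return out[keep] if method == "remove" else out


df = pd.DataFrame({"col1": [1.0, 2.0, 3.0, 4.0, 100.0]})
removed = handle_outliers_sketch(df, ["col1"], method="remove")
replaced = handle_outliers_sketch(df, ["col1"], method="median")
clipped = handle_outliers_sketch(df, ["col1"], method="winsorize")
```

Note that `remove` changes the number of rows while the other two methods preserve it, which matters if the result has to stay aligned with another DataFrame.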

Motivation:

Outliers can significantly affect the performance of machine learning models, especially those used for anomaly detection. By detecting and handling outliers, we can improve the accuracy and reliability of the models, which is crucial for the effective use of BibMon with real-world datasets like the 3W Dataset, known for its diverse and potentially noisy data.

Benefits:

- Improved model accuracy
- Reduced false alarms
- Increased efficiency in data preprocessing

Example usage:

from bibmon import PreProcess

# Create a PreProcess object
preprocessor = PreProcess(f_pp=['detect_outliers_iqr', 'remove_outliers'],
                          a_pp={'detect_outliers_iqr__cols': ['col1', 'col2'],
                                'remove_outliers__cols': ['col1', 'col2'],
                                'remove_outliers__method': 'median'})

# Apply the outlier handling functions
# (df is assumed to be a pandas DataFrame containing the columns 'col1' and 'col2')
df_processed = preprocessor.apply(df)

By creating this pull request, I confirm that I have read, fully accept, and agree to one of Petrobras' Contributor License Agreements (CLAs):

ICLA: Individual Contributor License Agreement;
CCLA: Corporate Contributor License Agreement.

Our CLAs are based on the Apache Software Foundation's CLAs:

ICLA: Individual Contributor License Agreement;
CCLA: Corporate Contributor License Agreement.

Added functions for detection and treatment of outliers

a7992ad
