Example 2 of D4.3 - Pre-processing #11

cozzolinoac11 · 2023-05-08T08:41:26Z

Use case

common

Name of resource

SMOTE dataset balancing

ID

SMOTE_dataset_balancing

Description

Dataset balancing using the SMOTE (Synthetic Minority Oversampling Technique) oversampling technique. A balanced dataset is one in which each output class (or target class) is represented by the same number of input samples. Imbalanced data is not always a bad thing, and real data sets always show some degree of imbalance. If the level of imbalance is relatively low, the impact on model performance should be small, but in some cases training on unbalanced data can introduce a high error rate. Class imbalance is a well-known potential problem in data mining and machine learning, and the first step in addressing it is to analyze the class distribution of the data.

One way to address the problem is to oversample the minority class. The simplest form is to duplicate minority-class examples in the training dataset before fitting a model. This balances the class distribution but gives the model no additional information. An improvement on duplication is to synthesize new minority-class examples.

SMOTE works by selecting examples that are close in the feature space, drawing a line between them, and creating a new sample at a point along that line. Specifically, a random example from the minority class is chosen first. Then its k nearest neighbors within the minority class are found (typically k=5). One of those neighbors is selected at random, and a synthetic example is created at a randomly selected point between the two examples in feature space.
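The steps described above can be illustrated with a minimal pure-Python sketch. This is not the imblearn implementation used for this resource; the function name `smote_sketch`, its parameters, and the toy data are all hypothetical and chosen only to make the four steps visible.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of minority-class feature vectors.

    Hypothetical helper for illustration only, not the imblearn API.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a random example from the minority class.
        base = rng.choice(minority)
        # 2. Find its k nearest minority-class neighbours (Euclidean distance),
        #    excluding the chosen example itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # 3. Pick one of those neighbours at random.
        neighbour = rng.choice(neighbours)
        # 4. Create a synthetic point at a random position on the segment
        #    between the chosen example and the chosen neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical toy data: 6 minority points in 2-D, and a majority class of 9 samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.3, 1.1)]
majority_count = 9

# Generate just enough synthetic points to match the majority class size.
new_points = smote_sketch(minority, n_new=majority_count - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority), majority_count)
```

Because every synthetic point lies on a segment between two existing minority examples, the new samples stay inside the region already occupied by the minority class rather than being exact duplicates.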

Main category

Pre-processing

Other category

No response

Publication date

2023-08-05

Objective

dataset-balancing

Platform

Google Colab

Framework

imblearn

Architecture

None

Approach

None

Algorithm

SMOTE - Synthetic-Minority-Oversampling-TEchnique

Processor

cpu

OS

linux

Keyword

dataset balancing, SMOTE

Characteristics of input data

NumPy arrays with unbalanced classes (47.9% - 52.1%)

Biases and ethical aspects

Initial unbalanced dataset

Output data obtained

https://public.epsilon-italia.it/FAIRiCUBE/wildfire-classification/balanced_data_numpy.zip

Characteristics of output data

NumPy arrays with balanced classes (50% - 50%)

Performance

No response

Conditions for access and use

cc-by-4.0

Constraints

No response

KathiSchleidt · 2023-05-08T11:55:09Z

Nice!

However, I'm wondering if a bit more descriptive text would be valuable. Maybe I'm the only one who gets a bit lost in your description, but I fear I don't quite understand what's meant by:

SMOTE
dataset balancing

The Description of "Dataset balancing using SMOTE oversampling technique" doesn't add much more information. Thus, at least to me, it's unclear what this could be used for.

Going back to the FAIRiCUBE core objective, I doubt domain experts like Martin or Heimo will understand more than I do. How can you help us understand???

cozzolinoac11 · 2023-05-11T09:59:32Z

Clearly this is a fairly simplistic example, so it has not been documented in detail, but it was intended to give an idea of how (and with what) to fill in each field according to the type of resource.
That said, I agree that my description was too short and aimed at readers already familiar with the topic. I have just updated the ‘Description’ field, adding a few more details to make the resource more understandable. In the form, we can certainly recommend providing more explanatory descriptions that are also aimed at less expert readers. What do you think?

KathiSchleidt · 2023-05-13T10:56:58Z

Agreed! Thanks!

In addition, I'm wondering if we should try to extract relevant keywords (e.g. "SMOTE", "balanced dataset") for inclusion in the Knowledge Base.

cozzolinoac11 added the documentation and good first issue labels May 8, 2023

cozzolinoac11 changed the title ~~Example 2 of D4.3~~ Example 2 of D4.3 - Pre-processing May 8, 2023

cozzolinoac11 mentioned this issue May 8, 2023

additional values ( for several codelists) #5

Open

KathiSchleidt mentioned this issue May 8, 2023

Example 3 of D4.3 - Pre-processing #12

Open

cozzolinoac11 added the a/p metadata label Jun 29, 2023
