Increase (bearish) labeled data #18
Options are:

- SMOTE Variants: There are many variants of SMOTE, each designed to address specific issues or dataset characteristics, including Borderline-SMOTE, K-Means SMOTE, and SVMSMOTE. Each variant modifies the synthetic sample generation process in a way that is better suited for certain types of data (see the sketch after this list).
- Generative Adversarial Networks (GANs): GANs can generate new synthetic samples that are remarkably similar to the original dataset. They can be particularly effective for complex data types, like images or text, where simpler oversampling techniques might not capture the nuances of the data.
- Ensemble Methods: Some approaches combine oversampling with ensemble learning techniques, like bagging and boosting, to create more robust models. These methods can reduce the likelihood of overfitting, a common risk with oversampled data.
- Advanced Algorithmic Approaches: Techniques like cluster-based oversampling, where the minority class is oversampled within clusters, can help maintain the intrinsic distribution of the minority class.
- Deep Learning-Based Techniques: For certain types of data, especially in fields like NLP and image processing, deep learning models can be trained to augment the dataset with new synthetic instances that are diverse and realistic.
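As a rough sketch, a few of these SMOTE variants could be compared side by side with imbalanced-learn. The toy dataset below is purely illustrative, standing in for our real features:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE

# Toy imbalanced dataset standing in for real features
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.8, 0.2], random_state=42
)

# Compare class counts after each variant resamples the minority class
for sampler in (SMOTE(random_state=42),
                BorderlineSMOTE(random_state=42),
                SVMSMOTE(random_state=42)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```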
GPT-3.5 or GPT-4 might be able to create quality synthetic data.
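A minimal sketch of what that could look like with the openai Python client (assuming an OPENAI_API_KEY is set in the environment; the prompt below is our own illustration, not one from this thread):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt, for illustration only
prompt = (
    "Write 5 short financial tweets with a clearly bearish sentiment. "
    "Return one tweet per line, without numbering."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)

synthetic_tweets = response.choices[0].message.content.splitlines()
print(synthetic_tweets)
```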
ADASYN:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import ADASYN
from datasets import load_dataset

# Load dataset
dataset = load_dataset(
    "TimKoornstra/financial-tweets-sentiment",
    split="train",
    cache_dir="data/finetune/",
)

# Extract texts and labels
texts = [example["tweet"] for example in dataset]
labels = [example["sentiment"] for example in dataset]

# Vectorize texts
tfidf = TfidfVectorizer(max_features=1000)  # adjust the number of features as needed
X = tfidf.fit_transform(texts)

# Apply ADASYN to oversample the minority class
adasyn = ADASYN(random_state=42, n_neighbors=5, sampling_strategy="minority")
X_resampled, y_resampled = adasyn.fit_resample(X, labels)

# Map the resampled vectors back to their TF-IDF terms
# (note: this yields unordered bags of terms, not readable tweets)
texts_resampled = tfidf.inverse_transform(X_resampled)
print(texts_resampled)
```
We could also look into SentiGAN: https://github.com/Nrgeup/SentiGAN
Both ADASYN and SMOTE return bags of TF-IDF terms rather than readable text. Given the limitations of converting SMOTE's synthetic samples back to text, it is often more practical in NLP tasks to use data augmentation techniques specifically designed for text, such as paraphrasing or generating new text samples with language models. Here are some approaches that are often considered better suited for textual data augmentation:

- Rule-based systems that change certain words or phrases while keeping the overall meaning intact (a minimal sketch follows below).
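A minimal sketch of such a rule-based augmenter (the synonym table below is a made-up illustration; a real system would use a curated financial lexicon):

```python
import random

# Illustrative synonym table (hypothetical entries)
SYNONYMS = {
    "drop": ["fall", "slide", "decline"],
    "rise": ["climb", "gain", "rally"],
    "stock": ["share", "equity"],
}

def augment(tweet: str, p: float = 0.5) -> str:
    """Replace known words with a random synonym with probability p."""
    out = []
    for word in tweet.split():
        key = word.lower()
        if key in SYNONYMS and random.random() < p:
            out.append(random.choice(SYNONYMS[key]))
        else:
            out.append(word)
    return " ".join(out)

print(augment("Analysts expect the stock to drop after weak earnings"))
```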
Generate 10k rows of neutral-labeled text to balance the classes. I will start with the following methods to test performance:
Research into GAN performance on balancing datasets: https://www-sciencedirect-com.proxy.library.uu.nl/science/article/pii/S1110866522000342

CatGAN (and more): https://github.com/williamSYSU/TextGAN-PyTorch
Some good neutral examples from our dataset:
Use the examples above in combination with Mixtral (available on HuggingChat) and the following prompt:
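The prompt itself is not preserved in this thread. As a hypothetical stand-in, something like the following could be run against Mixtral through the Hugging Face Inference API (the InferenceClient usage and model id are our assumptions, not from this issue):

```python
from huggingface_hub import InferenceClient

client = InferenceClient(model="mistralai/Mixtral-8x7B-Instruct-v0.1")

# Hypothetical stand-in prompt; the prompt actually used is not shown above
prompt = (
    "[INST] Here are some neutral financial tweets: <examples go here>. "
    "Write 5 new tweets in the same neutral style, one per line. [/INST]"
)

print(client.text_generation(prompt, max_new_tokens=256))
```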
Bullish Sentiments: 17,368
Bearish Sentiments: 8,542
Neutral Sentiments: 12,181
It would be nice if we could balance the datasets and increase them all to, for instance, 25k each.
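For reference, a quick calculation of how many synthetic samples each class would need to reach that 25k target:

```python
counts = {"bullish": 17_368, "bearish": 8_542, "neutral": 12_181}
target = 25_000

# Samples to generate per class to reach the target
for label, n in counts.items():
    print(f"{label}: generate {target - n:,} samples")
# bullish: generate 7,632 samples
# bearish: generate 16,458 samples
# neutral: generate 12,819 samples
```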