Increase (bearish) labeled data #18

Open · 1 of 5 tasks
StephanAkkerman opened this issue Dec 15, 2023 · 10 comments

StephanAkkerman (Collaborator) commented Dec 15, 2023

Bullish Sentiments: 17,368
Bearish Sentiments: 8,542
Neutral Sentiments: 12,181

It would be nice if we could balance the dataset and increase all three classes to, for instance, 25k each:

  • Find bullish tweet examples
  • Find bearish tweet examples
  • Find neutral tweet examples
  • Create a prompt we can use with Mixtral 8x7B to generate synthetic tweets
  • Use together.ai / pplx.ai / anyscale.com to get the LLM results
StephanAkkerman (Collaborator, Author) commented:

Options are:

  • Simple Random Oversampling
  • SMOTE (Synthetic Minority Over-sampling Technique)
  • ADASYN (Adaptive Synthetic Sampling)
  • GAN

Full description:
ADASYN (Adaptive Synthetic Sampling): This method focuses on generating synthetic data for the minority class, similar to SMOTE. However, ADASYN adapts to the dataset's characteristics and generates more synthetic data for minority class samples that are harder to learn, rather than treating all minority class samples equally.

SMOTE Variants: There are many variants of SMOTE, each designed to address specific issues or dataset characteristics. These include Borderline-SMOTE, K-Means SMOTE, and SVMSMOTE, among others. Each variant modifies the synthetic sample generation process in a way that is better suited for certain types of data.
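
A minimal sketch of how these variants can be swapped via imbalanced-learn's shared fit_resample interface (the toy data is an assumption standing in for the vectorized tweets):

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, KMeansSMOTE

# Toy imbalanced data standing in for the TF-IDF matrix of the real tweets
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=42)

# All variants share the fit_resample interface, so they can be swapped freely
samplers = {
    "smote": SMOTE(random_state=42),
    "borderline": BorderlineSMOTE(random_state=42),
    "svm": SVMSMOTE(random_state=42),
    # KMeansSMOTE also covers cluster-based oversampling; the lowered
    # threshold helps small toy clusters qualify for oversampling
    "kmeans": KMeansSMOTE(random_state=42, cluster_balance_threshold=0.1),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, X_res.shape)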

Generative Adversarial Networks (GANs): GANs can generate new, synthetic samples of data that are remarkably similar to the original dataset. They can be particularly effective for complex data types, like images or text, where simpler oversampling techniques might not capture the nuances of the data.

Ensemble Methods: Some approaches combine oversampling with ensemble learning techniques, like bagging and boosting, to create more robust models. These methods can reduce the likelihood of overfitting, which is a common risk with oversampled data.
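
imbalanced-learn also ships ensemble estimators along these lines; a minimal sketch (the toy data is again an assumption):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from imblearn.ensemble import BalancedBaggingClassifier

# Toy imbalanced data; in practice this would be the vectorized tweets
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=42)

# Each bootstrap sample is balanced before fitting, which reduces the
# overfitting risk that plain oversampling carries
clf = BalancedBaggingClassifier(n_estimators=50, random_state=42)
print(cross_val_score(clf, X, y, scoring="f1").mean())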

Advanced Algorithmic Approaches: Techniques like cluster-based oversampling, where the minority class is oversampled within clusters, can help in maintaining the intrinsic distribution of the minority class.

Deep Learning-Based Techniques: For certain types of data, especially in fields like NLP and image processing, deep learning models can be trained to augment the dataset with new, synthetic instances that are diverse and realistic.

TimKoornstra (Owner) commented:

GPT-3.5 or GPT-4 might be able to create quality synthetic data.
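
A minimal sketch of what that could look like with the OpenAI Python client; the model name, prompt, and parameters are assumptions, not part of the comment above:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt: ask for synthetic bearish tweets in the dataset's style
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": (
                "Write 10 short, realistic bearish tweets about the stock "
                "market, in the style of retail-trader Twitter (cashtags, "
                "hashtags, informal tone). One tweet per line."
            ),
        }
    ],
    temperature=1.0,
)
print(response.choices[0].message.content)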

StephanAkkerman (Collaborator, Author) commented Dec 16, 2023

ADASYN:

from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import ADASYN
from datasets import load_dataset

# Load dataset
dataset = load_dataset(
    "TimKoornstra/financial-tweets-sentiment",
    split="train",
    cache_dir="data/finetune/",
)

# Extract texts and labels
texts = [example["tweet"] for example in dataset]
labels = [example["sentiment"] for example in dataset]

# Vectorize texts
tfidf = TfidfVectorizer(max_features=1000)  # Adjust the number of features as needed
X = tfidf.fit_transform(texts)

# Apply ADASYN to oversample the minority class
adasyn = ADASYN(random_state=42, n_neighbors=5, sampling_strategy="minority")
X_resampled, y_resampled = adasyn.fit_resample(X, labels)

# Convert X_resampled back to terms
# Note: inverse_transform recovers unordered bags of TF-IDF terms, not sentences
texts_resampled = tfidf.inverse_transform(X_resampled)
print(texts_resampled)

StephanAkkerman (Collaborator, Author) commented:

We could also look into SentiGAN: https://github.com/Nrgeup/SentiGAN

StephanAkkerman (Collaborator, Author) commented Dec 17, 2023

Both ADASYN and SMOTE return bags of words like this: 'amd', 'big', 'for', 'it', 'not', 'while', 'would'

Given the limitations of converting SMOTE's synthetic samples back to text, it's often more practical in NLP tasks to use other data augmentation techniques specifically designed for text, such as paraphrasing or generating new text samples using language models.

Generating new text samples for a minority class in an NLP task, especially for balancing datasets, can be more effective with methods specifically designed for text data. Here are some approaches that are often considered better suited for textual data augmentation:

  1. Paraphrasing: Generate new sentences that convey the same meaning as existing sentences in the minority class. This can be done using:
     • Rule-based systems that change certain words or phrases while keeping the overall meaning intact.
     • Machine translation models, where you translate the text to another language and then back to the original language (back-translation).
     • Pre-trained models like T5 or BART that can be fine-tuned for paraphrasing tasks.

  2. Using Pre-trained Language Models: Leverage large pre-trained models like GPT-3, GPT-2, or BERT for text generation. You can prompt these models with existing sentences and let them generate continuations or variations.

  3. Synonym Replacement: Replace words in sentences with their synonyms. This method keeps the sentence structure intact while slightly altering the content. Tools like NLTK or spaCy can be used to find synonyms.

  4. Random Insertion, Deletion, and Swapping: Randomly insert, delete, or swap words in a sentence. This method creates variations of the existing sentences, although it can sometimes alter the meaning or grammatical correctness.

  5. Data Augmentation Libraries: Use NLP data augmentation libraries like nlpaug or textattack, which offer a variety of text transformation techniques including the ones mentioned above (see the sketch after this list).

  6. Conditional Text Generation Models: Train a text generation model on your dataset, conditioned on the class label. This way, the model learns to generate text that is representative of a specific class.

  7. Crowdsourcing or Expert Generation: If resources permit, manually creating new text samples through crowdsourcing or expert input can be highly effective, especially for domain-specific tasks.
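
A minimal sketch of technique 5 using nlpaug's WordNet-based synonym replacement (the example tweet is hypothetical, and the nltk WordNet corpus must be available):

import nlpaug.augmenter.word as naw

# WordNet-based synonym replacement; keeps structure, swaps individual words
aug = naw.SynonymAug(aug_src="wordnet")

tweet = "$TSLA is dropping fast, bad quarter ahead"  # hypothetical example
augmented = aug.augment(tweet, n=3)  # generate 3 variants
for variant in augmented:
    print(variant)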

StephanAkkerman (Collaborator, Author) commented:

Generate 10k rows of neutral-labeled text to balance the classes. I will start with the following methods to test performance (a sketch of method 1 follows the list):

  1. Simple Random Oversampling
  2. Synonym replacement using nlpaug
  3. GAN model
  4. GPT API
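
A minimal sketch of method 1 with imbalanced-learn's RandomOverSampler, which duplicates existing rows rather than synthesizing new ones (the toy tweets and label encoding are assumptions):

import numpy as np
from imblearn.over_sampling import RandomOverSampler

# Hypothetical toy data standing in for the real tweets and labels
texts = ["$TSLA to the moon", "$AAPL looks weak", "$NVDA sideways", "$AMZN flat"]
labels = [1, 0, 2, 2]  # e.g. 1 = bullish, 0 = bearish, 2 = neutral

# RandomOverSampler only duplicates rows, so it works directly on raw text
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(np.array(texts).reshape(-1, 1), labels)
print(X_res.ravel().tolist(), list(y_res))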

StephanAkkerman added a commit that referenced this issue Dec 17, 2023
StephanAkkerman added a commit that referenced this issue Dec 18, 2023
StephanAkkerman (Collaborator, Author) commented:

Research into GANs performance on balancing datasets: https://www-sciencedirect-com.proxy.library.uu.nl/science/article/pii/S1110866522000342

CatGAN (and more): https://github.com/williamSYSU/TextGAN-PyTorch

StephanAkkerman (Collaborator, Author) commented:

Some good neutral examples from our dataset:

  • $TSLA is this going up, down, left, or right tomorrow 💫
  • Who is playing $fb ? Lol
  • make stock trading in the metaverse a #reality @SpeakerPelosi $FB
  • Should You Follow Berkshire Hathaway Into Apple Stock?
  • $NVDA sideways
  • Apache Co. $APA Given Consensus Rating of “Hold” by Analysts https://t.co/ahyp2cxFKb #stocks
  • $AMZN 2266 is today target
  • Most searched small-cap stocks, Tue Mar 30th - $WKEY $VET $UEC $SEAC $NNOX $MAXN $HMBL $GNUS $FLGT $DLPN $EYES… https://t.co/kXJS7n9ly3
  • $WNRS https://t.co/i14tn4QZ8P
  • Thank you @jack for making @twitter. It is time to move on. $TWTR

StephanAkkerman (Collaborator, Author) commented:

Use the examples above in combination with Mixtral (available on HuggingChat) and the following prompt:

Create synthetic neutral tweets about the financial market. Use the following tweets as an example:
$TSLA is this going up, down, left, or right tomorrow 💫
Who is playing $fb ? Lol
make stock trading in the metaverse a #reality @SpeakerPelosi $FB
Should You Follow Berkshire Hathaway Into Apple Stock?
$NVDA sideways
Apache Co. $APA Given Consensus Rating of “Hold” by Analysts https://t.co/ahyp2cxFKb #stocks
$AMZN 2266 is today target
Most searched small-cap stocks, Tue Mar 30th - $WKEY $VET $UEC $SEAC $NNOX $MAXN $HMBL $GNUS $FLGT $DLPN $EYES… https://t.co/kXJS7n9ly3
$WNRS https://t.co/i14tn4QZ8P
Thank you @jack for making @twitter. It is time to move on. $TWTR
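
A minimal sketch of sending that prompt to Mixtral programmatically; the together.ai base URL and model id are assumptions that should be checked against the provider's documentation:

import os
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model id for together.ai
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

prompt = (
    "Create synthetic neutral tweets about the financial market. "
    "Use the following tweets as an example:\n"
    "$TSLA is this going up, down, left, or right tomorrow 💫\n"
    "Who is playing $fb ? Lol\n"
    "$NVDA sideways"  # ...plus the remaining examples from the list above
)

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # assumed model id
    messages=[{"role": "user", "content": prompt}],
    temperature=0.9,
)
print(response.choices[0].message.content)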

StephanAkkerman changed the title from "Increase neutral labeled data" to "Increase (neutral) labeled data" on Dec 21, 2023
StephanAkkerman changed the title from "Increase (neutral) labeled data" to "Increase (bearish) labeled data" on Jan 2, 2024
StephanAkkerman added a commit that referenced this issue Jan 23, 2024