Increase (bearish) labeled data #18
Options are:

- SMOTE Variants: There are many variants of SMOTE, each designed to address specific issues or dataset characteristics, including Borderline-SMOTE, K-Means SMOTE, and SVMSMOTE. Each variant modifies the synthetic sample generation process in a way that is better suited for certain types of data (see the sketch after this list).
- Generative Adversarial Networks (GANs): GANs can generate new synthetic samples that are remarkably similar to the original dataset. They can be particularly effective for complex data types, like images or text, where simpler oversampling techniques might not capture the nuances of the data.
- Ensemble Methods: Some approaches combine oversampling with ensemble learning techniques, like bagging and boosting, to create more robust models. These methods can reduce the likelihood of overfitting, a common risk with oversampled data.
- Advanced Algorithmic Approaches: Techniques like cluster-based oversampling, where the minority class is oversampled within clusters, can help maintain the intrinsic distribution of the minority class.
- Deep Learning-Based Techniques: For certain types of data, especially in fields like NLP and image processing, deep learning models can be trained to augment the dataset with new synthetic instances that are diverse and realistic.
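As a rough sketch, a few of these SMOTE variants could be compared side by side with imbalanced-learn. The toy dataset below is purely illustrative, standing in for our real features:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE

# Toy imbalanced dataset standing in for real features
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.8, 0.2], random_state=42
)

# Compare class counts after each variant resamples the minority class
for sampler in (SMOTE(random_state=42),
                BorderlineSMOTE(random_state=42),
                SVMSMOTE(random_state=42)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```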
GPT-3.5 or GPT-4 might be able to create quality synthetic data.
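A minimal sketch of what that could look like with the openai Python client (assuming an OPENAI_API_KEY is set in the environment; the prompt below is our own illustration, not one from this thread):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt, for illustration only
prompt = (
    "Write 5 short financial tweets with a clearly bearish sentiment. "
    "Return one tweet per line, without numbering."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)

synthetic_tweets = response.choices[0].message.content.splitlines()
print(synthetic_tweets)
```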
ADASYN:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import ADASYN
from datasets import load_dataset

# Load dataset
dataset = load_dataset(
    "TimKoornstra/financial-tweets-sentiment",
    split="train",
    cache_dir="data/finetune/",
)

# Extract texts and labels
texts = [example["tweet"] for example in dataset]
labels = [example["sentiment"] for example in dataset]

# Vectorize texts
tfidf = TfidfVectorizer(max_features=1000)  # adjust the number of features as needed
X = tfidf.fit_transform(texts)

# Apply ADASYN to oversample the minority class
adasyn = ADASYN(random_state=42, n_neighbors=5, sampling_strategy="minority")
X_resampled, y_resampled = adasyn.fit_resample(X, labels)

# Map the resampled vectors back to their TF-IDF terms
# (note: this yields unordered bags of terms, not readable tweets)
texts_resampled = tfidf.inverse_transform(X_resampled)
print(texts_resampled)
```
We could also look into SentiGAN: https://github.com/Nrgeup/SentiGAN
Both ADASYN and SMOTE return bags of TF-IDF terms rather than readable text. Given the limitations of converting SMOTE's synthetic samples back to text, it is often more practical in NLP tasks to use data augmentation techniques specifically designed for text, such as paraphrasing or generating new text samples with language models. Here are some approaches that are often considered better suited for textual data augmentation:

- Rule-based systems that change certain words or phrases while keeping the overall meaning intact (a minimal sketch follows below).
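A minimal sketch of such a rule-based augmenter (the synonym table below is a made-up illustration; a real system would use a curated financial lexicon):

```python
import random

# Illustrative synonym table (hypothetical entries)
SYNONYMS = {
    "drop": ["fall", "slide", "decline"],
    "rise": ["climb", "gain", "rally"],
    "stock": ["share", "equity"],
}

def augment(tweet: str, p: float = 0.5) -> str:
    """Replace known words with a random synonym with probability p."""
    out = []
    for word in tweet.split():
        key = word.lower()
        if key in SYNONYMS and random.random() < p:
            out.append(random.choice(SYNONYMS[key]))
        else:
            out.append(word)
    return " ".join(out)

print(augment("Analysts expect the stock to drop after weak earnings"))
```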
Generate 10k rows of neutral-labeled text to balance the classes. I will start with the following methods to test performance:
Research into GAN performance on balancing datasets: https://www-sciencedirect-com.proxy.library.uu.nl/science/article/pii/S1110866522000342

CatGAN (and more): https://github.com/williamSYSU/TextGAN-PyTorch
Some good neutral examples from our dataset:
Use the examples above in combination with Mixtral (available on HuggingChat) and the following prompt:
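The prompt itself is not preserved in this thread. As a hypothetical stand-in, something like the following could be run against Mixtral through the Hugging Face Inference API (the InferenceClient usage and model id are our assumptions, not from this issue):

```python
from huggingface_hub import InferenceClient

client = InferenceClient(model="mistralai/Mixtral-8x7B-Instruct-v0.1")

# Hypothetical stand-in prompt; the prompt actually used is not shown above
prompt = (
    "[INST] Here are some neutral financial tweets: <examples go here>. "
    "Write 5 new tweets in the same neutral style, one per line. [/INST]"
)

print(client.text_generation(prompt, max_new_tokens=256))
```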
Bullish Sentiments: 17,368
Bearish Sentiments: 8,542
Neutral Sentiments: 12,181
It would be nice if we could balance the datasets and increase them all to, for instance, 25k each.
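For reference, a quick calculation of how many synthetic samples each class would need to reach that 25k target:

```python
counts = {"bullish": 17_368, "bearish": 8_542, "neutral": 12_181}
target = 25_000

# Samples to generate per class to reach the target
for label, n in counts.items():
    print(f"{label}: generate {target - n:,} samples")
# bullish: generate 7,632 samples
# bearish: generate 16,458 samples
# neutral: generate 12,819 samples
```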