
Unable to Reproduce Results for Supervised Training on Echo Dataset with Mistral-7B-Instruct-v2 #154

Open
ThonyPan opened this issue Nov 12, 2024 · 7 comments


ThonyPan commented Nov 12, 2024

Hi @vaibhavad,
I’m currently working with an 8xA100 80G setup and attempting to replicate the supervised training of the Mistral-7B-Instruct-v2 model. However, despite following the tutorial instructions, I haven’t been able to match the results reported in the paper on the echo dataset. Some of my outcomes are summarized in the table below.

| Dataset | Reported | Reproduced | Difference |
| --- | --- | --- | --- |
| AmazonCounterfactualClassification | 77.58 | 81.52 | +3.94 |
| AskUbuntuDupQuestions | 63.98 | 63.74 | -0.24 |
| BIOSSES | 85.24 | 82.68 | -2.56 |
| MTOPDomainClassification | 96.04 | 94.69 | -1.35 |
| MassiveScenarioClassification | 81.64 | 81.03 | -0.61 |
| STS12 | 78.80 | 73.46 | -5.34 |
| STS13 | 86.37 | 80.02 | -6.35 |
| STS14 | 84.04 | 79.09 | -4.95 |
| STS15 | 88.99 | 86.61 | -2.38 |
| STS22 | 67.68 | 66.14 | -1.54 |
| SprintDuplicateQuestions | 96.82 | 94.06 | -2.76 |
| SummEval | 29.96 | 31.14 | +1.18 |
| ToxicConversationsClassification | 69.26 | 67.82 | -1.44 |
| TwitterSemEval2015 | 80.60 | 79.09 | -1.51 |

I’ve noticed that in your data_loader code, there are references to files like allnli_split1.jsonl, which are not present in the echo-data that I downloaded. Could you clarify if further preprocessing was applied to the echo-data before training? If so, could you share details on the preprocessing steps? Alternatively, if there was no preprocessing involved, would it be possible for you to provide a Docker image version that can reproduce the reported results?
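In case it helps, my working assumption is that the single allnli.jsonl simply needs to be sharded into the allnli_split*.jsonl files the loader looks for. A minimal sketch of that assumption is below; the shard count and output file names are my guesses, not necessarily your actual preprocessing:

```python
# Guess at the missing preprocessing step: shard cache/echo-data/allnli.jsonl
# into the allnli_split{1,2}.jsonl files referenced in the data_loader code.
# The shard count and output names are assumptions, not the authors' recipe.
from pathlib import Path

src = Path("cache/echo-data/allnli.jsonl")
lines = src.read_text().splitlines()

n_shards = 2
shard_size = (len(lines) + n_shards - 1) // n_shards

for i in range(n_shards):
    shard = lines[i * shard_size:(i + 1) * shard_size]
    out = src.with_name(f"allnli_split{i + 1}.jsonl")
    out.write_text("\n".join(shard) + "\n")
```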

Environment Details:
• Model: Mistral-7B-Instruct-v2
• Hardware: 8xA100 80G GPUs
• Installed Version: 0.2.2 (installed via pip install -e . locally)
• Code: No modifications made to the original framework
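
For comparing environments, here is the version dump I can share (a generic snippet; the package list is just what I expect to matter for reproduction):

```python
# Print the installed versions of the packages most likely to affect reproduction.
from importlib.metadata import PackageNotFoundError, version

for pkg in ["llm2vec", "transformers", "peft", "torch", "flash-attn", "mteb"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```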

Thank you very much for your help!

ShengYun-Peng commented Nov 22, 2024

I also have the same question. @ThonyPan: do you have results on the retrieval datasets, e.g., ArguAna and SciFact? My local evaluation is ~15% lower than the numbers reported in the paper for Llama-3 8B Instruct.

BtlWolf commented Nov 24, 2024

@ThonyPan May I ask if the split used is test or dev?

ThonyPan (Author) commented:

> I also have the same question. @ThonyPan: do you have results on the retrieval datasets, e.g., ArguAna and SciFact? My local evaluation is ~15% lower than the numbers reported in the paper for Llama-3 8B Instruct.

I have tested the self-trained model on more datasets, including some retrieval datasets. SciFact comes out at 74.05 (a 4.81-point decline), while ArguAna is 58.59 (a 1.11-point increase). Overall my results fluctuate quite a bit relative to the reported numbers, with most subsets coming in lower than expected. I’m not sure whether this is caused by a package version issue or an oversight in dataset preprocessing.

ThonyPan (Author) commented:

> @ThonyPan May I ask if the split used is test or dev?

All results are computed on the test split of the MTEB datasets.
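
For concreteness, my evaluation is essentially the following (a minimal sketch assuming an mteb 1.x-style API; the wrapper is simplified and does not prepend task instructions the way the repo's own evaluation does, so treat it as illustrative only, and the local adapter path is a placeholder):

```python
# Sketch of the MTEB test-split evaluation. The local adapter path is a
# placeholder for my trained checkpoint; instruction handling is omitted.
import torch
from llm2vec import LLM2Vec
from mteb import MTEB

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="output/mntp-supervised/Mistral-7B-Instruct-v2",  # placeholder
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)


class MTEBWrapper:
    """Thin adapter so MTEB can call .encode() with plain sentences."""

    def __init__(self, model):
        self.model = model

    def encode(self, sentences, batch_size=8, **kwargs):
        reps = self.model.encode(list(sentences), batch_size=batch_size)
        return reps.cpu().float().numpy()


evaluation = MTEB(tasks=["SciFact", "ArguAna", "STS12"])
evaluation.run(MTEBWrapper(l2v), eval_splits=["test"], output_folder="results")
```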

stefanhgm commented:

Hi everyone,

I have a similar issue when training the supervised Llama 3.1 model based on the provided MNTP version. I use the provided training config and only replaced the local path /home/toolkit/llm2vec/output/mntp/Meta-Llama-3.1-8B-Instruct/checkpoint-1000 with McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp (see below). I evaluate on different end tasks than MTEB, but McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp-supervised consistently performs slightly better than the supervised version I trained myself.

This is the training config I use on 8xA100:

```json
{
    "model_name_or_path": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "peft_model_name_or_path": "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp",
    "bidirectional": true,
    "pooling_mode": "mean",
    "dataset_name": "E5",
    "dataset_file_path": "cache/echo-data",
    "remove_unused_columns": false,
    "learning_rate": 2e-4,
    "num_train_epochs": 3,
    "warmup_steps": 300,
    "per_device_train_batch_size": 64,
    "per_device_eval_batch_size": 64,
    "gradient_accumulation_steps": 1,
    "do_train": true,
    "disable_tqdm": false,
    "max_seq_length": 512,
    "overwrite_output_dir": true,
    "output_dir": "output/mntp-supervised/Meta-Llama-3.1-8B-Instruct",
    "logging_steps": 50,
    "save_steps": 200,
    "save_only_model": true,
    "stop_after_n_steps": 1000,
    "lora_r": 16,
    "gradient_checkpointing": true,
    "torch_dtype": "bfloat16",
    "attn_implementation": "flash_attention_2",
    "seed": 42
}
```
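
And this is roughly how I load the resulting checkpoint for my downstream evaluation (a sketch: the local supervised-adapter path is a placeholder for my own output directory, and the merge steps mirror the usual base + MNTP + supervised adapter loading):

```python
# Load the base/MNTP model, merge the MNTP LoRA weights, then apply the
# supervised LoRA adapter. The local path is a placeholder for my output dir.
import torch
from peft import PeftModel
from transformers import AutoConfig, AutoModel, AutoTokenizer

from llm2vec import LLM2Vec

mntp = "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp"
supervised = "output/mntp-supervised/Meta-Llama-3.1-8B-Instruct"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(mntp)
config = AutoConfig.from_pretrained(mntp, trust_remote_code=True)
model = AutoModel.from_pretrained(
    mntp,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
model = PeftModel.from_pretrained(model, mntp)        # MNTP LoRA weights
model = model.merge_and_unload()
model = PeftModel.from_pretrained(model, supervised)  # supervised LoRA weights

l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
```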

TianBaoGe commented:

I strongly believe the authors need to disclose their complete conda env, training configuration, and training logs, as the majority of people, including myself, have not been able to reproduce the results of the paper so far.

vaibhavad (Collaborator) commented:

@TianBaoGe @stefanhgm @ThonyPan - I completely agree. Unfortunately, we could not do this earlier due to the deadline rush and the data-purging rules of our university cluster. However, I have now started re-training to verify this issue and am logging everything carefully. I will report back with my findings here.
