
Unable to Reproduce Results for Supervised Training on Echo Dataset with Mistral-7B-Instruct-v2 #154

Open
ThonyPan opened this issue Nov 12, 2024 · 7 comments


ThonyPan commented Nov 12, 2024

Hi @vaibhavad,
I’m currently working with an 8xA100 80G setup and attempting to replicate the supervised training of the Mistral-7B-Instruct-v2 model. However, despite following the tutorial instructions, I haven’t been able to match the results reported in the paper on the echo dataset. Some of my outcomes are summarized in the table below.

| Dataset | Reported | Reproduced | Difference |
| --- | --- | --- | --- |
| AmazonCounterfactualClassification | 77.58 | 81.52 | +3.94 |
| AskUbuntuDupQuestions | 63.98 | 63.74 | -0.24 |
| BIOSSES | 85.24 | 82.68 | -2.56 |
| MTOPDomainClassification | 96.04 | 94.69 | -1.35 |
| MassiveScenarioClassification | 81.64 | 81.03 | -0.61 |
| STS12 | 78.80 | 73.46 | -5.34 |
| STS13 | 86.37 | 80.02 | -6.35 |
| STS14 | 84.04 | 79.09 | -4.95 |
| STS15 | 88.99 | 86.61 | -2.38 |
| STS22 | 67.68 | 66.14 | -1.54 |
| SprintDuplicateQuestions | 96.82 | 94.06 | -2.76 |
| SummEval | 29.96 | 31.14 | +1.18 |
| ToxicConversationsClassification | 69.26 | 67.82 | -1.44 |
| TwitterSemEval2015 | 80.60 | 79.09 | -1.51 |

I’ve noticed that in your data_loader code, there are references to files like allnli_split1.jsonl, which are not present in the echo-data that I downloaded. Could you clarify if further preprocessing was applied to the echo-data before training? If so, could you share details on the preprocessing steps? Alternatively, if there was no preprocessing involved, would it be possible for you to provide a Docker image version that can reproduce the reported results?
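In case it helps, my working assumption is that the single allnli.jsonl simply needs to be sharded into the allnli_split*.jsonl files the loader looks for. A minimal sketch of that assumption is below; the shard count and output file names are my guesses, not necessarily your actual preprocessing:

```python
# Guess at the missing preprocessing step: shard cache/echo-data/allnli.jsonl
# into the allnli_split{1,2}.jsonl files referenced in the data_loader code.
# The shard count and output names are assumptions, not the authors' recipe.
from pathlib import Path

src = Path("cache/echo-data/allnli.jsonl")
lines = src.read_text().splitlines()

n_shards = 2
shard_size = (len(lines) + n_shards - 1) // n_shards

for i in range(n_shards):
    shard = lines[i * shard_size:(i + 1) * shard_size]
    out = src.with_name(f"allnli_split{i + 1}.jsonl")
    out.write_text("\n".join(shard) + "\n")
```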

Environment Details:
• Model: Mistral-7B-Instruct-v2
• Hardware: 8xA100 80G GPUs
• Installed Version: 0.2.2 (installed via pip install -e . locally)
• Code: No modifications made to the original framework
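
For comparing environments, here is the version dump I can share (a generic snippet; the package list is just what I expect to matter for reproduction):

```python
# Print the installed versions of the packages most likely to affect reproduction.
from importlib.metadata import PackageNotFoundError, version

for pkg in ["llm2vec", "transformers", "peft", "torch", "flash-attn", "mteb"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```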

Thank you very much for your help!

ShengYun-Peng commented Nov 22, 2024

I also have the same question. @ThonyPan: do you have results on the retrieval datasets, e.g., ArguAna and SciFact? My local evaluation is ~15% lower than the numbers reported in the paper for Llama-3 8B Instruct.

BtlWolf commented Nov 24, 2024

@ThonyPan May I ask if the split used is test or dev?

ThonyPan (Author) commented:

> I also have the same question. @ThonyPan: do you have results on the retrieval datasets, e.g., ArguAna and SciFact? My local evaluation is ~15% lower than the numbers reported in the paper for Llama-3 8B Instruct.

I have tested the self-trained model on more datasets, including some retrieval datasets. SciFact comes out at 74.05 (a 4.81-point decline), while ArguAna is 58.59 (a 1.11-point increase). Overall my results fluctuate quite a bit relative to the reported numbers, with most subsets coming in lower than expected. I’m not sure whether this is caused by a package version issue or an oversight in dataset preprocessing.

ThonyPan (Author) commented:

> @ThonyPan May I ask if the split used is test or dev?

All results are computed on the test split of the MTEB datasets.
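
For concreteness, my evaluation is essentially the following (a minimal sketch assuming an mteb 1.x-style API; the wrapper is simplified and does not prepend task instructions the way the repo's own evaluation does, so treat it as illustrative only, and the local adapter path is a placeholder):

```python
# Sketch of the MTEB test-split evaluation. The local adapter path is a
# placeholder for my trained checkpoint; instruction handling is omitted.
import torch
from llm2vec import LLM2Vec
from mteb import MTEB

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="output/mntp-supervised/Mistral-7B-Instruct-v2",  # placeholder
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)


class MTEBWrapper:
    """Thin adapter so MTEB can call .encode() with plain sentences."""

    def __init__(self, model):
        self.model = model

    def encode(self, sentences, batch_size=8, **kwargs):
        reps = self.model.encode(list(sentences), batch_size=batch_size)
        return reps.cpu().float().numpy()


evaluation = MTEB(tasks=["SciFact", "ArguAna", "STS12"])
evaluation.run(MTEBWrapper(l2v), eval_splits=["test"], output_folder="results")
```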

stefanhgm commented:

Hi everyone,

I have a similar issue when training the supervised Llama 3.1 model based on the provided MNTP version. I use the provided training config and only replaced the local path /home/toolkit/llm2vec/output/mntp/Meta-Llama-3.1-8B-Instruct/checkpoint-1000 with McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp (see below). I evaluate on different end tasks than MTEB, but McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp-supervised consistently performs slightly better than the supervised version I trained myself.

This is the training config I use on 8xA100:

```json
{
    "model_name_or_path": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "peft_model_name_or_path": "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp",
    "bidirectional": true,
    "pooling_mode": "mean",
    "dataset_name": "E5",
    "dataset_file_path": "cache/echo-data",
    "remove_unused_columns": false,
    "learning_rate": 2e-4,
    "num_train_epochs": 3,
    "warmup_steps": 300,
    "per_device_train_batch_size": 64,
    "per_device_eval_batch_size": 64,
    "gradient_accumulation_steps": 1,
    "do_train": true,
    "disable_tqdm": false,
    "max_seq_length": 512,
    "overwrite_output_dir": true,
    "output_dir": "output/mntp-supervised/Meta-Llama-3.1-8B-Instruct",
    "logging_steps": 50,
    "save_steps": 200,
    "save_only_model": true,
    "stop_after_n_steps": 1000,
    "lora_r": 16,
    "gradient_checkpointing": true,
    "torch_dtype": "bfloat16",
    "attn_implementation": "flash_attention_2",
    "seed": 42
}
```
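
And this is roughly how I load the resulting checkpoint for my downstream evaluation (a sketch: the local supervised-adapter path is a placeholder for my own output directory, and the merge steps mirror the usual base + MNTP + supervised adapter loading):

```python
# Load the base/MNTP model, merge the MNTP LoRA weights, then apply the
# supervised LoRA adapter. The local path is a placeholder for my output dir.
import torch
from peft import PeftModel
from transformers import AutoConfig, AutoModel, AutoTokenizer

from llm2vec import LLM2Vec

mntp = "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp"
supervised = "output/mntp-supervised/Meta-Llama-3.1-8B-Instruct"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(mntp)
config = AutoConfig.from_pretrained(mntp, trust_remote_code=True)
model = AutoModel.from_pretrained(
    mntp,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
model = PeftModel.from_pretrained(model, mntp)        # MNTP LoRA weights
model = model.merge_and_unload()
model = PeftModel.from_pretrained(model, supervised)  # supervised LoRA weights

l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
```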

TianBaoGe commented:

I strongly believe the authors need to disclose their complete conda env, training configuration, and training logs, as the majority of people, including myself, have not been able to reproduce the results of the paper so far.

vaibhavad (Collaborator) commented:

@TianBaoGe @stefanhgm @ThonyPan - I completely agree. Unfortunately, we could not do this earlier due to the deadline rush and the data-purging rules of our university cluster. However, I have now started re-training to verify this issue and am logging everything carefully. I will report back with my findings here.
