
Discrepancy in STS17 Results for S-LLaMA-1.3B #151

Open
newfish-lab opened this issue Oct 20, 2024 · 3 comments

Comments

@newfish-lab

Hello,

I followed the instructions in the README to evaluate the S-LLaMA-1.3B model on the STS17 task. However, the results I obtained are significantly different from those reported in the paper.

Command Used:
python experiments/mteb_eval.py \
    --model_name McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp-unsup-simcse \
    --task_name STS17 \
    --task_to_instructions_fp test_configs/mteb/task_to_instructions.json \
    --output_dir results

Results:
{
"dataset_revision": "faeb762787bd10488a50c8b5be4a3b82e411949c",
"evaluation_time": 57.001840114593506,
"kg_co2_emissions": null,
"mteb_version": "1.15.7",
"scores": {
"test": [
{
"cosine_pearson": 0.8068578432316841,
"cosine_spearman": 0.8169906637042349,
"euclidean_pearson": 0.7955408874118524,
"euclidean_spearman": 0.7998558030435642,
"hf_subset": "en-en",
"languages": [
"eng-Latn"
],
"main_score": 0.8169906637042349,
"manhattan_pearson": 0.8035899663153354,
"manhattan_spearman": 0.8065731327735807,
"pearson": 0.8068578432316841,
"spearman": 0.8169906637042349
},
{
"cosine_pearson": 0.15780754110461276,
"cosine_spearman": 0.12083310897735632,
"euclidean_pearson": 0.030643892007343198,
"euclidean_spearman": 0.03322321405535968,
"hf_subset": "en-ar",
"languages": [
"eng-Latn",
"ara-Arab"
],
"main_score": 0.12083310897735632,
"manhattan_pearson": -0.023743020301998822,
"manhattan_spearman": -0.019246085009935885,
"pearson": 0.15780754110461276,
"spearman": 0.12083310897735632
},
{
"cosine_pearson": 0.6054351354779678,
"cosine_spearman": 0.6162430244748419,
"euclidean_pearson": 0.40173825736167984,
"euclidean_spearman": 0.3933921991621478,
"hf_subset": "en-de",
"languages": [
"eng-Latn",
"deu-Latn"
],
"main_score": 0.6162430244748419,
"manhattan_pearson": 0.457962615505441,
"manhattan_spearman": 0.4838933699573373,
"pearson": 0.6054351354779678,
"spearman": 0.6162430244748419
},
{
"cosine_pearson": 0.7618940413983072,
"cosine_spearman": 0.7911176554969499,
"euclidean_pearson": 0.7687817432571042,
"euclidean_spearman": 0.7859527624136942,
"hf_subset": "es-es",
"languages": [
"spa-Latn"
],
"main_score": 0.7911176554969499,
"manhattan_pearson": 0.7794036752505948,
"manhattan_spearman": 0.7938492963891066,
"pearson": 0.7618940413983072,
"spearman": 0.7911176554969499
},
{
"cosine_pearson": 0.48420847537315587,
"cosine_spearman": 0.49317078071750625,
"euclidean_pearson": 0.3563428734227356,
"euclidean_spearman": 0.3584458165353449,
"hf_subset": "nl-en",
"languages": [
"nld-Latn",
"eng-Latn"
],
"main_score": 0.49317078071750625,
"manhattan_pearson": 0.4198271485260075,
"manhattan_spearman": 0.3836819578854696,
"pearson": 0.48420847537315587,
"spearman": 0.49317078071750625
},
{
"cosine_pearson": 0.5288823574377371,
"cosine_spearman": 0.562709079229829,
"euclidean_pearson": 0.2993909341777525,
"euclidean_spearman": 0.2992292640046535,
"hf_subset": "es-en",
"languages": [
"spa-Latn",
"eng-Latn"
],
"main_score": 0.562709079229829,
"manhattan_pearson": 0.424622955248226,
"manhattan_spearman": 0.4414351300043983,
"pearson": 0.5288823574377371,
"spearman": 0.562709079229829
},
{
"cosine_pearson": 0.47180324561704823,
"cosine_spearman": 0.5279642783201307,
"euclidean_pearson": 0.5100329065437332,
"euclidean_spearman": 0.5196472282696352,
"hf_subset": "ko-ko",
"languages": [
"kor-Hang"
],
"main_score": 0.5279642783201307,
"manhattan_pearson": 0.5156613233195979,
"manhattan_spearman": 0.5229021826790656,
"pearson": 0.47180324561704823,
"spearman": 0.5279642783201307
},
{
"cosine_pearson": 0.45567578510194695,
"cosine_spearman": 0.4759593706055199,
"euclidean_pearson": 0.5070517972856654,
"euclidean_spearman": 0.48261827460057777,
"hf_subset": "ar-ar",
"languages": [
"ara-Arab"
],
"main_score": 0.4759593706055199,
"manhattan_pearson": 0.5177089933394046,
"manhattan_spearman": 0.4914281145552357,
"pearson": 0.45567578510194695,
"spearman": 0.4759593706055199
},
{
"cosine_pearson": 0.15581738888547184,
"cosine_spearman": 0.1495641486110853,
"euclidean_pearson": 0.1349394931764104,
"euclidean_spearman": 0.13081927359268367,
"hf_subset": "en-tr",
"languages": [
"eng-Latn",
"tur-Latn"
],
"main_score": 0.1495641486110853,
"manhattan_pearson": 0.12970038975673026,
"manhattan_spearman": 0.14147949612647964,
"pearson": 0.15581738888547184,
"spearman": 0.1495641486110853
},
{
"cosine_pearson": 0.5667402604578623,
"cosine_spearman": 0.5826079418593232,
"euclidean_pearson": 0.38450094531647017,
"euclidean_spearman": 0.40102938983888436,
"hf_subset": "fr-en",
"languages": [
"fra-Latn",
"eng-Latn"
],
"main_score": 0.5826079418593232,
"manhattan_pearson": 0.4720522029904462,
"manhattan_spearman": 0.47395940988793417,
"pearson": 0.5667402604578623,
"spearman": 0.5826079418593232
},
{
"cosine_pearson": 0.5427131797486067,
"cosine_spearman": 0.5631847685301092,
"euclidean_pearson": 0.34861333517971227,
"euclidean_spearman": 0.32753161608389836,
"hf_subset": "it-en",
"languages": [
"ita-Latn",
"eng-Latn"
],
"main_score": 0.5631847685301092,
"manhattan_pearson": 0.40648821394761325,
"manhattan_spearman": 0.3918795987336719,
"pearson": 0.5427131797486067,
"spearman": 0.5631847685301092
}
]
},
"task_name": "STS17"
}

Could you please help me understand why there is such a discrepancy? Is there any additional setup or configuration that I might have missed?

@vaibhavad
Collaborator

Hi @newfish-lab,

The STS17 task in the current version of MTEB is a cross-lingual task, but the leaderboard only considers the English subset. As you can see in your results file, there are multiple language subsets (denoted by hf_subset). The score on the en-en subset in your results is 81.69; the reported score is 81.672.

To select a specific subset of a task, please refer to the MTEB library documentation.
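
For example, something along these lines should restrict the run to the English subset (a rough sketch, not tested; the stand-in SentenceTransformer encoder is only there to keep the snippet self-contained, and exclusive_language_filter is an mteb option I believe exists but you should verify against the version you have installed):

# Restrict STS17 to its English-only subset via the mteb Python API (mteb 1.x).
import mteb
from sentence_transformers import SentenceTransformer

# Without exclusive_language_filter, languages=["eng"] keeps every subset that
# contains English (en-ar, en-de, ...); with it, only en-en should remain.
tasks = mteb.get_tasks(
    tasks=["STS17"],
    languages=["eng"],
    exclusive_language_filter=True,
)
evaluation = mteb.MTEB(tasks=tasks)

# Stand-in encoder; swap in the LLM2Vec wrapper that experiments/mteb_eval.py builds.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
results = evaluation.run(model, output_folder="results")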

Let me know if you have any more questions.

@newfish-lab
Author

newfish-lab commented Oct 21, 2024

Thank you so much! I also wanted to know how you trained the Uni + SimCSE variant of S-LLaMA-1.3B. Did you simply set the is_causal attribute of the attention back to true and comment out the overridden _update_causal_mask function? I tried that, but the performance was quite poor.
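
Concretely, the change I made amounts to something like the sketch below (shown on the vanilla Hugging Face LlamaModel for illustration; the class I actually edited is the modified bidirectional LLaMA inside llm2vec, so the module paths may not line up exactly):

# Keep causal (unidirectional) attention in every decoder layer. On the vanilla
# model this flag is already True; on the llm2vec bidirectional subclass the idea
# was to undo the is_causal = False patch and stop using the overridden
# _update_causal_mask.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "princeton-nlp/Sheared-LLaMA-1.3B", torch_dtype=torch.bfloat16
)
for layer in model.layers:
    layer.self_attn.is_causal = True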

@newfish-lab
Author

newfish-lab commented Oct 21, 2024

Moreover, I have encountered another issue: the evaluation code seems to have problems when testing other tasks:
Command Used:
python experiments/mteb_eval.py \
    --model_name McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp-unsup-simcse \
    --task_name SciFact \
    --task_to_instructions_fp test_configs/mteb/task_to_instructions.json \
    --output_dir results

Error: TypeError: LLM2Vec.encode() got an unexpected keyword argument 'task_name'
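
A thin shim around encode() that drops the extra keyword arguments looks like a possible local workaround, though I am not sure it is the intended fix (sketch only; the class and attribute names below are my own, not from the repo):

# Wrapper that swallows keyword arguments newer mteb versions pass to encode()
# (e.g. task_name / prompt_name) but that LLM2Vec.encode() does not accept.
class EncodeKwargsShim:
    def __init__(self, model):
        self._model = model

    def encode(self, sentences, **kwargs):
        kwargs.pop("task_name", None)
        kwargs.pop("prompt_name", None)
        return self._model.encode(sentences, **kwargs)

    def __getattr__(self, name):
        # Delegate everything else to the wrapped model.
        return getattr(self._model, name)

# Usage (hypothetical): wrapped = EncodeKwargsShim(l2v_model); evaluation.run(wrapped, ...)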
