
Discrepancy in STS17 Results for S-LLaMA-1.3B #151

Open
newfish-lab opened this issue Oct 20, 2024 · 3 comments

Comments

@newfish-lab

Hello,

I followed the instructions in the README to evaluate the S-LLaMA-1.3B model on the STS17 task. However, the results I obtained are significantly different from those reported in the paper.

Command Used:
python experiments/mteb_eval.py \
    --model_name McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp-unsup-simcse \
    --task_name STS17 \
    --task_to_instructions_fp test_configs/mteb/task_to_instructions.json \
    --output_dir results

Results:
{
"dataset_revision": "faeb762787bd10488a50c8b5be4a3b82e411949c",
"evaluation_time": 57.001840114593506,
"kg_co2_emissions": null,
"mteb_version": "1.15.7",
"scores": {
"test": [
{
"cosine_pearson": 0.8068578432316841,
"cosine_spearman": 0.8169906637042349,
"euclidean_pearson": 0.7955408874118524,
"euclidean_spearman": 0.7998558030435642,
"hf_subset": "en-en",
"languages": [
"eng-Latn"
],
"main_score": 0.8169906637042349,
"manhattan_pearson": 0.8035899663153354,
"manhattan_spearman": 0.8065731327735807,
"pearson": 0.8068578432316841,
"spearman": 0.8169906637042349
},
{
"cosine_pearson": 0.15780754110461276,
"cosine_spearman": 0.12083310897735632,
"euclidean_pearson": 0.030643892007343198,
"euclidean_spearman": 0.03322321405535968,
"hf_subset": "en-ar",
"languages": [
"eng-Latn",
"ara-Arab"
],
"main_score": 0.12083310897735632,
"manhattan_pearson": -0.023743020301998822,
"manhattan_spearman": -0.019246085009935885,
"pearson": 0.15780754110461276,
"spearman": 0.12083310897735632
},
{
"cosine_pearson": 0.6054351354779678,
"cosine_spearman": 0.6162430244748419,
"euclidean_pearson": 0.40173825736167984,
"euclidean_spearman": 0.3933921991621478,
"hf_subset": "en-de",
"languages": [
"eng-Latn",
"deu-Latn"
],
"main_score": 0.6162430244748419,
"manhattan_pearson": 0.457962615505441,
"manhattan_spearman": 0.4838933699573373,
"pearson": 0.6054351354779678,
"spearman": 0.6162430244748419
},
{
"cosine_pearson": 0.7618940413983072,
"cosine_spearman": 0.7911176554969499,
"euclidean_pearson": 0.7687817432571042,
"euclidean_spearman": 0.7859527624136942,
"hf_subset": "es-es",
"languages": [
"spa-Latn"
],
"main_score": 0.7911176554969499,
"manhattan_pearson": 0.7794036752505948,
"manhattan_spearman": 0.7938492963891066,
"pearson": 0.7618940413983072,
"spearman": 0.7911176554969499
},
{
"cosine_pearson": 0.48420847537315587,
"cosine_spearman": 0.49317078071750625,
"euclidean_pearson": 0.3563428734227356,
"euclidean_spearman": 0.3584458165353449,
"hf_subset": "nl-en",
"languages": [
"nld-Latn",
"eng-Latn"
],
"main_score": 0.49317078071750625,
"manhattan_pearson": 0.4198271485260075,
"manhattan_spearman": 0.3836819578854696,
"pearson": 0.48420847537315587,
"spearman": 0.49317078071750625
},
{
"cosine_pearson": 0.5288823574377371,
"cosine_spearman": 0.562709079229829,
"euclidean_pearson": 0.2993909341777525,
"euclidean_spearman": 0.2992292640046535,
"hf_subset": "es-en",
"languages": [
"spa-Latn",
"eng-Latn"
],
"main_score": 0.562709079229829,
"manhattan_pearson": 0.424622955248226,
"manhattan_spearman": 0.4414351300043983,
"pearson": 0.5288823574377371,
"spearman": 0.562709079229829
},
{
"cosine_pearson": 0.47180324561704823,
"cosine_spearman": 0.5279642783201307,
"euclidean_pearson": 0.5100329065437332,
"euclidean_spearman": 0.5196472282696352,
"hf_subset": "ko-ko",
"languages": [
"kor-Hang"
],
"main_score": 0.5279642783201307,
"manhattan_pearson": 0.5156613233195979,
"manhattan_spearman": 0.5229021826790656,
"pearson": 0.47180324561704823,
"spearman": 0.5279642783201307
},
{
"cosine_pearson": 0.45567578510194695,
"cosine_spearman": 0.4759593706055199,
"euclidean_pearson": 0.5070517972856654,
"euclidean_spearman": 0.48261827460057777,
"hf_subset": "ar-ar",
"languages": [
"ara-Arab"
],
"main_score": 0.4759593706055199,
"manhattan_pearson": 0.5177089933394046,
"manhattan_spearman": 0.4914281145552357,
"pearson": 0.45567578510194695,
"spearman": 0.4759593706055199
},
{
"cosine_pearson": 0.15581738888547184,
"cosine_spearman": 0.1495641486110853,
"euclidean_pearson": 0.1349394931764104,
"euclidean_spearman": 0.13081927359268367,
"hf_subset": "en-tr",
"languages": [
"eng-Latn",
"tur-Latn"
],
"main_score": 0.1495641486110853,
"manhattan_pearson": 0.12970038975673026,
"manhattan_spearman": 0.14147949612647964,
"pearson": 0.15581738888547184,
"spearman": 0.1495641486110853
},
{
"cosine_pearson": 0.5667402604578623,
"cosine_spearman": 0.5826079418593232,
"euclidean_pearson": 0.38450094531647017,
"euclidean_spearman": 0.40102938983888436,
"hf_subset": "fr-en",
"languages": [
"fra-Latn",
"eng-Latn"
],
"main_score": 0.5826079418593232,
"manhattan_pearson": 0.4720522029904462,
"manhattan_spearman": 0.47395940988793417,
"pearson": 0.5667402604578623,
"spearman": 0.5826079418593232
},
{
"cosine_pearson": 0.5427131797486067,
"cosine_spearman": 0.5631847685301092,
"euclidean_pearson": 0.34861333517971227,
"euclidean_spearman": 0.32753161608389836,
"hf_subset": "it-en",
"languages": [
"ita-Latn",
"eng-Latn"
],
"main_score": 0.5631847685301092,
"manhattan_pearson": 0.40648821394761325,
"manhattan_spearman": 0.3918795987336719,
"pearson": 0.5427131797486067,
"spearman": 0.5631847685301092
}
]
},
"task_name": "STS17"
}

Could you please help me understand why there is such a discrepancy? Is there any additional setup or configuration that I might have missed?

@vaibhavad
Collaborator

Hi @newfish-lab,

The STS17 task in the current version of MTEB is a cross-lingual task, but the leaderboard only considers the English subset. As you can see in your results file, there are multiple language subsets (denoted by hf_subset). The score on the en-en subset in your results is 81.69; the reported score is 81.672.

To select a specific subset of a task, please refer to the MTEB library documentation.
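
For example, something along these lines should restrict the run to the English subset (a rough sketch, not tested; the stand-in SentenceTransformer encoder is only there to keep the snippet self-contained, and exclusive_language_filter is an mteb option I believe exists but you should verify against the version you have installed):

# Restrict STS17 to its English-only subset via the mteb Python API (mteb 1.x).
import mteb
from sentence_transformers import SentenceTransformer

# Without exclusive_language_filter, languages=["eng"] keeps every subset that
# contains English (en-ar, en-de, ...); with it, only en-en should remain.
tasks = mteb.get_tasks(
    tasks=["STS17"],
    languages=["eng"],
    exclusive_language_filter=True,
)
evaluation = mteb.MTEB(tasks=tasks)

# Stand-in encoder; swap in the LLM2Vec wrapper that experiments/mteb_eval.py builds.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
results = evaluation.run(model, output_folder="results")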

Let me know if you have any more questions.

@newfish-lab
Author

newfish-lab commented Oct 21, 2024

Thank you so much! I also wanted to know how you trained the Uni + SimCSE variant of S-LLaMA-1.3B. Did you simply set the is_causal attribute of the attention back to true and comment out the overridden _update_causal_mask function? I tried that, but the performance was quite poor.
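
Concretely, the change I made amounts to something like the sketch below (shown on the vanilla Hugging Face LlamaModel for illustration; the class I actually edited is the modified bidirectional LLaMA inside llm2vec, so the module paths may not line up exactly):

# Keep causal (unidirectional) attention in every decoder layer. On the vanilla
# model this flag is already True; on the llm2vec bidirectional subclass the idea
# was to undo the is_causal = False patch and stop using the overridden
# _update_causal_mask.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "princeton-nlp/Sheared-LLaMA-1.3B", torch_dtype=torch.bfloat16
)
for layer in model.layers:
    layer.self_attn.is_causal = True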

@newfish-lab
Author

newfish-lab commented Oct 21, 2024

Moreover, I have encountered another issue: the evaluation code seems to have problems when testing other tasks:
Command Used:
python experiments/mteb_eval.py \
    --model_name McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp-unsup-simcse \
    --task_name SciFact \
    --task_to_instructions_fp test_configs/mteb/task_to_instructions.json \
    --output_dir results

Error: TypeError: LLM2Vec.encode() got an unexpected keyword argument 'task_name'
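
A thin shim around encode() that drops the extra keyword arguments looks like a possible local workaround, though I am not sure it is the intended fix (sketch only; the class and attribute names below are my own, not from the repo):

# Wrapper that swallows keyword arguments newer mteb versions pass to encode()
# (e.g. task_name / prompt_name) but that LLM2Vec.encode() does not accept.
class EncodeKwargsShim:
    def __init__(self, model):
        self._model = model

    def encode(self, sentences, **kwargs):
        kwargs.pop("task_name", None)
        kwargs.pop("prompt_name", None)
        return self._model.encode(sentences, **kwargs)

    def __getattr__(self, name):
        # Delegate everything else to the wrapped model.
        return getattr(self._model, name)

# Usage (hypothetical): wrapped = EncodeKwargsShim(l2v_model); evaluation.run(wrapped, ...)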
