
Add support for FinMTEB benchmark #1379

Open
wants to merge 7 commits into base: v2.0.0
Conversation

@alt-glitch alt-glitch commented Nov 4, 2024

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Adding datasets checklist

Reason for dataset addition:

Fixes #1267

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command (see the Python sketch after this checklist).

    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      • Ran only on FiQAClassification as of now.
    • intfloat/multilingual-e5-small
      • Ran only on FINAL as of now.
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).

  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().

  • I have filled out the metadata object in the dataset file (find documentation on it here).

  • Run tests locally to make sure nothing is broken using make test.

  • Run the formatter to format the code using make lint.
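
For reference, a minimal Python sketch of the run described in the first checklist item, assuming the standard MTEB / SentenceTransformer API; the task name used is just the one mentioned above, and the output folder is an arbitrary choice:

from sentence_transformers import SentenceTransformer

from mteb import MTEB

models = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]

for model_name in models:
    # Load the embedding model and evaluate it on a single FinMTEB task.
    model = SentenceTransformer(model_name)
    evaluation = MTEB(tasks=["FiQAClassification"])  # any task from this PR works here
    evaluation.run(model, output_folder=f"results/{model_name}")

The CLI form from the checklist would be equivalent, e.g. mteb -m sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 -t FiQAClassification.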

@alt-glitch (Author) commented Nov 4, 2024

Hey @Muennighoff @KennethEnevoldsen @isaac-chung!

Here's a WIP PR to close #1267.

I had a few questions/notes:

  1. Should I run and get the results for all the tasks?
  2. Should the relevant PRs to embeddings-benchmark/results and embeddings-benchmark/leaderboard be made after merging this PR?
  3. FiQA2018 is already in MTEB, so I have left that out from FinMTEB. Otherwise, there were no conflicting tasks.
  4. Some tasks don't have a reference URL.
  5. The Summarization tasks are still pending. I have yet to look into the changes highlighted by @yixuantt in Add FinMTEB #1267 for summarization.

I'll add the summarization changes and make the PRs to results and leaderboard once this is done.
Is there anything else I'm missing out on?

@isaac-chung (Collaborator)

Hi @alt-glitch, thanks for working on this!

  1. Yes, I'd suggest running the whole thing on a small model mentioned in the paper like all-MiniLM-L12-v2, and only using the quickest settings as a sanity check, e.g. n_experiments=1 for classification (see the sketch below).
  2. Afterwards, yes for the leaderboard. I'll leave the results repo part to @KennethEnevoldsen.
  3. Sounds good.
  4. I think it's ok to use the paper's URL or its GitHub URL as reference. Otherwise, there are individual references for each dataset mentioned in the paper.
  5. Re: summarization tasks, we can add the column names as class attributes to AbsTaskSummarization, the way we did in MIEB's AbsTaskImageClassification.

Let me know if anything is unclear.
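
A rough sketch of that quick sanity-check setup, assuming classification tasks expose an overridable n_experiments attribute and that mteb.get_tasks is available (both are assumptions about the library version in use):

from sentence_transformers import SentenceTransformer

import mteb

# Grab one (or more) of the new FinMTEB tasks and shrink the classification settings.
tasks = mteb.get_tasks(tasks=["FiQAClassification"])
for task in tasks:
    if hasattr(task, "n_experiments"):
        task.n_experiments = 1  # fewer classifier fits, so the sanity run finishes faster

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
mteb.MTEB(tasks=tasks).run(model, output_folder="results/sanity-check")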

@KennethEnevoldsen (Contributor)

Re. 2: PRs to embeddings-benchmark/results can be made after this PR. I don't believe a PR to embeddings-benchmark/leaderboard will be required once the new leaderboard is up and running, as long as the benchmark is added to benchmarks.py.

@alt-glitch (Author)

Thanks for the comments!

Some more info:

  1. The main_score of PairClassification tasks needed to be fixed to max_ap.
  2. Summarization tasks use STSEvaluator
  3. Added reference_summaries_column and generated_summaries_column for specifying column names (see the sketch below).
  4. Added Summarization tasks.
  5. Added some more Clustering tasks that I had missed.
  6. I'm currently taking a look at the results from the sanity run I did through the benchmark. All tasks ran fine.

I'm interested in helping with getting the results too!

cc @KennethEnevoldsen @isaac-chung
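
A minimal sketch of what points 2-3 could look like in a task file, assuming AbsTaskSummarization (as updated in this PR) reads these two class attributes; the class name and column values are placeholders, not taken from the actual PR:

from mteb.abstasks.AbsTaskSummarization import AbsTaskSummarization


class ExampleFinSummarization(AbsTaskSummarization):
    # Point the task at the dataset's own columns instead of renaming them
    # inside dataset_transform(); both values here are placeholders.
    reference_summaries_column = "text"
    generated_summaries_column = "summary"

    # metadata = TaskMetadata(...)  # filled out as for any other task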

mteb/abstasks/AbsTaskSTS.py (outdated review thread, resolved)
mteb/tasks/Classification/eng/ESGClassification.py (outdated review thread, resolved)
from mteb.abstasks.TaskMetadata import TaskMetadata


class FOMCClassification(AbsTaskClassification):
Contributor

General comment: metadata is required to be filled out.

Author

Ah understood. Working on it.

mteb/tasks/Classification/zho/FinNSPClassification.py (outdated review thread, resolved)
@KennethEnevoldsen (Contributor)

Not sure what is meant by:

Summarization tasks use STSEvaluator

@alt-glitch (Author)

Not sure what is meant by:

Summarization tasks use STSEvaluator

The summarization tasks here don't have human_summaries or relevance scores, so the Spearman correlation is calculated between the summary and the text. Hence the STSEvaluator is used.

See: yixuantt/FinMTEB#2


  1. Updated AbsTaskSTS, AbsTaskSummarization, AbsTaskPairClassification and the respective tasks to use configurable column names instead of dataset_transform.
  2. Added missing reference to the tasks.
  3. I'm going to work on filling out the metadata for the tasks @KennethEnevoldsen. I'll update you once I'm done.

@alt-glitch (Author) commented Nov 10, 2024

Update: It's taking me a couple more days to fill out all the metadata fields for this benchmark as this seems to be mostly a manual process — reading the paper referenced for each dataset to understand and derive the date of dataset creation, annotation creators, and sample creation process since there are 64 datasets :)

If there's something I'm missing, do let me know!

Thanks!
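
To illustrate the kind of metadata being filled out, here is a hedged sketch of a TaskMetadata block. The field set follows the fields mentioned in this thread (date, annotations_creators, sample_creation) plus the usual ones; every concrete value is an illustrative placeholder, and exact field names and accepted values may differ between mteb versions, so treat this as a shape rather than a spec:

from mteb.abstasks.AbsTaskClassification import AbsTaskClassification
from mteb.abstasks.TaskMetadata import TaskMetadata


class ExampleFinClassification(AbsTaskClassification):
    metadata = TaskMetadata(
        name="ExampleFinClassification",
        description="Placeholder description of the dataset and its labels.",
        reference="https://example.com/dataset-paper",  # paper or GitHub URL, per the review discussion
        dataset={
            "path": "org/dataset-name",  # placeholder Hugging Face dataset id
            "revision": "0000000000000000000000000000000000000000",
        },
        type="Classification",
        category="s2s",
        eval_splits=["test"],
        eval_langs=["eng-Latn"],
        main_score="accuracy",
        date=("2020-01-01", "2021-12-31"),  # creation window, derived from the source paper
        domains=["Written"],
        task_subtypes=[],
        license="not specified",
        annotations_creators="derived",
        dialect=[],
        sample_creation="found",
        bibtex_citation="",
    )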

@KennethEnevoldsen (Contributor)

Update: It's taking me a couple more days to fill out all the metadata fields for this benchmark as this seems to be mostly a manual process — reading the paper referenced for each dataset to understand and derive the date of dataset creation, annotation creators, and sample creation process since there are 64 datasets :)

Thanks for taking the time on this. I believe metadata is the only thing missing and then it can be reviewed and merged.

@KennethEnevoldsen changed the base branch from main to v2.0.0 on November 11, 2024 at 09:27
@KennethEnevoldsen (Contributor)

Moving this to v2.0.0 to avoid merge conflicts in the future. I can resolve the current merge conflicts once the metadata is added.
