Limit the amount of data used for distillation #905

Open
Tracked by #453 ...
gregtatum opened this issue Oct 29, 2024 · 3 comments
Labels
cost & perf Speeding up and lowering cost for the pipeline

Comments

@gregtatum
Member

In #771 I ran an experiment to measure how the size of the distillation corpus affects the COMET score of the student models. Adding more data to this step did not change the COMET score beyond the standard deviation (±0.12 COMET) observed when training student models.
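
To make the "within the standard deviation" criterion concrete, here is a minimal sketch of the comparison. The helper name `within_noise` and the choice of one standard deviation as the cut-off are assumptions for illustration, not the analysis actually used in #771.

```python
# Standard deviation of student training runs reported in this issue.
STUDENT_COMET_STD = 0.12


def within_noise(baseline: float, candidate: float, std: float = STUDENT_COMET_STD) -> bool:
    """Return True when two COMET scores differ by no more than `std`.

    Hypothetical helper: treating one standard deviation as the threshold
    is an assumption made for this sketch.
    """
    return abs(candidate - baseline) <= std


# Placeholder scores for illustration only, not results from #771.
print(within_noise(80.00, 80.05))
print(within_noise(80.00, 80.50))
```

With these placeholder values, the first pair counts as indistinguishable from training noise and the second does not.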

Synthesizing the training pairs from the monolingual data is one of the more expensive parts of the pipeline, so we should limit the amount of data we throw at it.

For this work we need to:

  1. Determine the threshold at which we cut off the data.
  2. Determine how we mix the source side of the parallel corpus with the source monolingual data.

1. Threshold cut-off

In our 1:1, @eu9ene proposed 50 million, which feels like a reasonable initial threshold to me. He mentioned that we shouldn't rely 100% on the evaluation metrics, since more data diversity could create a better general translation model for translating the web. There is a risk that our evaluation data is not diverse enough to capture this, so we should be conservative in how much we cut off.

I think we can probably go even lower if we wanted, as the results were the same for 30M in da-en. I have an experiment still running with 1M and 10,000 to further test the limits here.

We should verify that these results still hold for a Balto-Slavic language, like en-lt.

2. How to mix

I'm not sure how we want to mix our data, or if @eu9ene has thoughts here. We could collect all of our source parallel data and all of the available monolingual data, and then mix and truncate it. This is what I was doing in my experiment.

It's likely that we'll have more parallel source data than the 50 million cut-off for many languages.
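
As a rough illustration of the "mix and truncate" option, the sketch below pools both sources, shuffles them with a fixed seed, and keeps at most `limit` lines. The function name `mix_and_truncate` and its parameters are hypothetical and are not the pipeline's actual code. It also holds everything in memory, which would not scale to tens of millions of lines without a streaming approach.

```python
import random
from typing import Iterable, List


def mix_and_truncate(
    parallel_source: Iterable[str],
    monolingual: Iterable[str],
    limit: int,
    seed: int = 0,
) -> List[str]:
    """Pool both sources, shuffle them together, and keep at most `limit` lines.

    Hypothetical sketch: an in-memory toy version for illustration only.
    """
    pooled = [*parallel_source, *monolingual]
    # A seeded shuffle makes the truncated sample reproducible.
    random.Random(seed).shuffle(pooled)
    # Truncating after the shuffle keeps each source roughly in proportion
    # to its share of the pooled data.
    return pooled[:limit]


# Toy data: more parallel lines than monolingual ones.
parallel = [f"parallel-{i}" for i in range(6)]
mono = [f"mono-{i}" for i in range(4)]
sample = mix_and_truncate(parallel, mono, limit=5, seed=1)
print(len(sample))
```

Because truncation happens after shuffling, a language with far more parallel source data than the cut-off would end up with a sample dominated by parallel lines. Whether that is desirable is the open question in this section.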

@gregtatum gregtatum added the cost & perf Speeding up and lowering cost for the pipeline label Oct 29, 2024
@ZJaume
Collaborator

ZJaume commented Oct 30, 2024

It may be that the effect of adding synthetically translated monolingual data is more noticeable if the language pair is low/mid-resource. Backtranslation usually has a big impact in low-resource settings.

@eu9ene
Collaborator

eu9ene commented Oct 30, 2024

> It may be that the effect of adding synthetically translated monolingual data is more noticeable if the language pair is low/mid-resource. Backtranslation usually has a big impact in low-resource settings.

We currently don't use back-translations for distillation.

@eu9ene
Collaborator

eu9ene commented Oct 30, 2024

I think when we train a "tiny" student model it has limited complexity and adding more data doesn't help at some point. So the model kind of underfits. When we increase the model size to the "base" architecture it will be a completely different picture. Also, I'm pretty sure it's different for each language.

Just to clarify, if I understand correctly @gregtatum is talking about limiting the whole mix of original corpus + mono data, not only the mono part.

We can't run such an experiment for each language and config. With all that said, I'd rather oversupply the data, because undersupplying it risks losing quality, both on evaluation benchmarks and in the wild, without us knowing it. I'd rather show the model more diverse data than train it for multiple epochs on the same data. 50M sounds too low to me. Maybe 200M or so would be safer. Just a guess.

From a cost-efficiency perspective, I recommend focusing on #453 first. There's a lot of GPU underutilization, etc. there.

3 participants