Limit the amount of data used for distillation #905

gregtatum · 2024-10-29T15:18:11Z

In #771 I ran an experiment to see how the size of the distillation corpus affects the COMET score of the student models. Adding more data to this step did not change the COMET score beyond the standard deviation (±0.12 COMET) observed when training student models.

Synthesizing the training pairs from the monolingual data is one of the more expensive parts of the pipeline, so we should limit the amount of data we throw at it.

For this work we need to:

1. Determine the threshold at which we cut off the data.
2. Determine how we mix the source side of the parallel corpus and the source monolingual data.

1. Threshold cut-off

In our 1:1, @eu9ene proposed 50 million, which feels like a reasonable initial threshold to me. He mentioned that we shouldn't rely 100% on the evaluation metrics, since more data diversity could create a better general translation model for translating the web. There is a risk that our evaluation data is not diverse enough to capture this, so we should be conservative in how much we cut off.

I think we could probably go even lower if we wanted to, as the results were the same for 30M in da-en. I still have an experiment running with 1M and 10,000 to further test the limits here.

We should verify that these results still hold for a Balto-Slavic language, like en-lt.
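One way to read results like these when picking a cut-off is to look for the smallest corpus size whose student score is within the training noise of the score from the largest corpus. The sketch below assumes the ±0.12 COMET standard deviation quoted above is used directly as the tolerance, which is a simplification and not a significance test. The function name and the input mapping are hypothetical, and the scores must be on the same scale as the noise figure.

```python
from typing import Dict, Optional


def smallest_sufficient_size(
    scores_by_size: Dict[int, float],
    noise: float = 0.12,
) -> Optional[int]:
    """Return the smallest corpus size whose COMET score is within `noise`
    of the score obtained with the largest corpus.

    scores_by_size maps corpus size (in lines) to the student COMET score.
    Returns None when no scores are given.
    """
    if not scores_by_size:
        return None
    # The largest corpus is treated as the reference point.
    reference = scores_by_size[max(scores_by_size)]
    for size in sorted(scores_by_size):
        if abs(scores_by_size[size] - reference) <= noise:
            return size
    # Unreachable in practice: the largest size always matches itself.
    return None
```

The same check could be repeated per language pair (for example da-en and en-lt) to see whether the cut-off transfers.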

2. How to mix

I'm not sure how we want to mix our data, or if @eu9ene has thoughts here. We could collect all of our source parallel data and all of the available monolingual data, and then mix and truncate it. This is what I was doing in my experiment.

It's likely that we'll have more parallel source data than the 50 million cut-off for many languages.
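As a minimal sketch of the "mix and truncate" idea, assuming both inputs are plain iterables of source-language lines: pool them and keep a uniform random sample capped at the threshold. This is not the pipeline's actual implementation. The function name `mix_and_truncate` and its parameters are hypothetical, and reservoir sampling (Algorithm R) is swapped in here so the pooled corpus never has to be held in memory.

```python
import random
from typing import Iterable, List


def mix_and_truncate(
    parallel_source: Iterable[str],
    mono_source: Iterable[str],
    max_lines: int,
    seed: int = 0,
) -> List[str]:
    """Pool both sources and keep a uniform random sample of at most max_lines."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    seen = 0
    for source in (parallel_source, mono_source):
        for line in source:
            seen += 1
            if len(reservoir) < max_lines:
                # Fill the reservoir until the cap is reached.
                reservoir.append(line)
            else:
                # Keep the new line with probability max_lines / seen.
                j = rng.randrange(seen)
                if j < max_lines:
                    reservoir[j] = line
    # Shuffle so the output order does not reflect the input order.
    rng.shuffle(reservoir)
    return reservoir
```

Because the sample is uniform over the pool, the parallel-to-monolingual ratio in the output mirrors the ratio in the input. If a fixed ratio is wanted instead, each source would need its own cap.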


ZJaume · 2024-10-30T13:25:10Z

It may be that the effect of adding synthetically translated monolingual data is more noticeable if the language pair is low/mid-resource. Backtranslation usually has a big impact in low-resource settings.

eu9ene · 2024-10-30T18:21:22Z

> It may be that the effect of adding synthetically translated monolingual data is more noticeable if the language pair is low/mid-resource. Backtranslation usually has a big impact in low-resource settings.

We currently don't use back-translations for distillation.

eu9ene · 2024-10-30T18:30:49Z

I think when we train a "tiny" student model it has limited complexity and adding more data doesn't help at some point. So the model kind of underfits. When we increase the model size to the "base" architecture it will be a completely different picture. Also, I'm pretty sure it's different for each language.

Just to clarify, if I understand correctly @gregtatum is talking about limiting the whole mix of original corpus + mono data, not only the mono part.

We can't run such an experiment for each language and config. With all that said, I'd rather oversupply the data, because undersupplying it risks losing quality without our knowing it, both on evaluation benchmarks and in the wild. I'd rather show the model more diverse data than train it in a loop of multiple epochs on the same data. I'd say 50M sounds too low to me. Maybe 200M or so would be safer. Just a guess.
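The epochs concern can be put in rough numbers. The sketch below assumes student training consumes a fixed budget of training lines regardless of corpus size. That budget is a hypothetical parameter, not a figure from this thread.

```python
def passes_over_corpus(training_budget_lines: int, corpus_lines: int) -> float:
    """Return how many times a corpus is repeated for a fixed training budget.

    training_budget_lines is the total number of lines the student sees
    during training (a hypothetical, assumed-fixed quantity).
    """
    if corpus_lines <= 0:
        raise ValueError("corpus_lines must be positive")
    return training_budget_lines / corpus_lines
```

Under that assumption, whatever the budget is, a 50M cap means the student repeats the same data four times as often as it would with a 200M cap.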

From a cost-efficiency perspective, I recommend focusing on #453 first. There's a lot of GPU underutilization etc. there.

gregtatum added the "cost & perf" label (Speeding up and lowering cost for the pipeline) Oct 29, 2024

gregtatum mentioned this issue Oct 30, 2024

[meta] Kick off a 2024-H2 training run #912

Open

marco-c mentioned this issue Nov 7, 2024

[meta] Cost efficiency #453

Open
