🚀 Feature
`CombinedStreamingDataset` allows you to combine multiple `StreamingDataset`s with a sampling ratio -- but it assumes that the `batch_size` is the same for each dataset.
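For reference, a minimal sketch of the current behavior, assuming litdata's public API (the dataset paths are placeholders):

```python
from litdata import CombinedStreamingDataset, StreamingDataLoader, StreamingDataset

dataset_a = StreamingDataset(input_dir="s3://my-bucket/dataset_a")  # larger input tensors
dataset_b = StreamingDataset(input_dir="s3://my-bucket/dataset_b")  # smaller input tensors

combined = CombinedStreamingDataset(
    datasets=[dataset_a, dataset_b],
    weights=[0.5, 0.5],  # sampling ratio between the datasets
)

# A single batch_size applies to every constituent dataset.
loader = StreamingDataLoader(combined, batch_size=2)
```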
Motivation
If the different datasets have tensors of different sizes, it would be great to use a different batch size per dataset to maximize throughput for a given memory budget (e.g. a batch size of 1 for the dataset with larger input tensors, a batch size of 2 for the dataset with smaller input tensors).
Pitch
Allow `set_batch_size` to take a list of batch sizes -- one per dataset.
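A sketch of what the proposed call could look like; the list-accepting signature is the suggestion here, not existing API:

```python
# Today: one batch size shared by all constituent datasets.
combined.set_batch_size(2)

# Proposed: one batch size per constituent dataset, in the same order
# as the `datasets` argument (hypothetical extension, not existing API).
combined.set_batch_size([1, 2])  # 1 for dataset_a, 2 for dataset_b
```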
Alternatives
One thing that would need to be considered is gradient accumulation. For example, if dataset A has large tensors, with only 1 fitting in memory per batch, and dataset B has small tensors, with 4 fitting in memory per batch, you would want 4 steps of gradient accumulation when acting on samples from dataset A to keep a 50-50 split between dataset A and dataset B during training. If you want a different ratio of samples from dataset A vs. dataset B, the number of gradient accumulation steps would need to be configurable.
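To make the arithmetic concrete, here is a small helper (hypothetical, not part of litdata) that computes per-dataset accumulation steps from the per-dataset batch sizes and the desired integer sample ratio:

```python
from math import gcd, lcm

def accumulation_steps(batch_sizes, ratio):
    """Per-dataset gradient-accumulation steps so that each optimizer step
    sees samples in the desired integer ratio. Illustrative only."""
    # Smallest k such that k * ratio[i] is a multiple of batch_sizes[i],
    # i.e. each dataset's sample count fills whole micro-batches.
    k = lcm(*(b // gcd(b, r) for b, r in zip(batch_sizes, ratio)))
    return [k * r // b for b, r in zip(batch_sizes, ratio)]

# Dataset A: batch_size 1, dataset B: batch_size 4, 50-50 split:
# 4 micro-batches of A and 1 of B per optimizer step, as described above.
print(accumulation_steps([1, 4], [1, 1]))  # -> [4, 1]

# A 2:1 split in favor of dataset A needs 8 micro-batches of A per step.
print(accumulation_steps([1, 4], [2, 1]))  # -> [8, 1]
```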
Additional context