A potential problem on training batch size #164

Open
tomato18463 opened this issue May 11, 2024 · 2 comments

@tomato18463

Hi,

Thank you for this exciting project! I have been using it for some time and recently found a potential problem that may lead to very small batch sizes in some training steps. Please feel free to correct me if any of my understanding is wrong.

I note that the batch size for the dataloader is set to None here. So I think each worker returns batches with the size defined in the config file (like here). As a result, the last batch produced by each worker may be very small if the length of the data list returned by the sampling step is not divisible by the batch size and the remainder is small. The number of such small batches equals the number of workers. This leads to some noisy backward passes with a small batch size. If the amount of training data is small, the effect of these updates may be non-negligible (consider an extreme case: 8 workers, batch size 128, and a data list of 129 items per worker; then there are 8 updates with batch size 128 and 8 updates with batch size 1).
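
For illustration, here is a minimal sketch with a plain PyTorch IterableDataset (not the project's actual dataset code; the class and numbers are just stand-ins mirroring the extreme case above):

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class ShardedBatchingDataset(IterableDataset):
    """Stand-in dataset: every worker holds 129 items and yields batches of 128."""
    def __init__(self, items_per_worker=129, batch_size=128):
        self.items_per_worker = items_per_worker
        self.batch_size = batch_size

    def __iter__(self):
        # Each worker batches its own data list; the remainder becomes a tiny batch.
        data = list(range(self.items_per_worker))
        for i in range(0, len(data), self.batch_size):
            yield torch.tensor(data[i:i + self.batch_size])

if __name__ == "__main__":
    # batch_size=None disables automatic batching, so each worker-built batch
    # passes through the loader as-is.
    loader = DataLoader(ShardedBatchingDataset(), batch_size=None, num_workers=8)
    print(sorted(b.shape[0] for b in loader))  # eight batches of 1 and eight of 128
```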

What do you think of it? Thanks!

@mlxu995 (Collaborator) commented May 14, 2024

Thank you for bringing this potential issue to our attention.

It is important to consider the impact of small batch sizes on the training process, especially for some tasks with imbalanced data distribution.
If your models fail to converge because of this problem, one potential solution is to set a fixed batch size for the dataloader, ensuring that all batches have the same size and minimizing the impact of noisy gradients.
Also, for datasets that contain enough training samples, we are interested in whether the current training process still leads to better performance.
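
A rough sketch of that workaround, assuming the dataset can be changed to yield individual samples so the DataLoader does the batching itself (the dataset below is just a placeholder, not part of this project):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder standing in for a dataset that yields one sample at a time.
per_sample_dataset = TensorDataset(torch.randn(1032, 16))

loader = DataLoader(
    per_sample_dataset,
    batch_size=128,   # batching handled by the DataLoader, so every kept batch has 128 samples
    num_workers=8,
    shuffle=True,
    drop_last=True,   # drop the short remainder batch instead of training on it
)
```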

Thank you for your contribution and for helping to improve the project!

@tomato18463 (Author)

I see. Thanks!
