In what settings should I be concerned about reproducibility? #1994
Replies: 5 comments
-
@Phuoc-Hoan-Le no, I have not observed this behaviour in a 'stable' training setup. Even without being more explicit about determinism (setting extra flags), the same seed is usually enough to stay within roughly +/- 0.2-0.3%, I feel. I'm not sure I've ever seen a 3% swing from changing the seed itself. Now, if you're on the edge of stability, it's possible for smaller perturbations to tip the balance towards collapsing/blowing up... Can you reproduce the run-to-run variation in a completely unmodified ViT-Small?
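(For reference, the "extra flags" for determinism in PyTorch usually look something like the minimal sketch below. This is an assumed example, not the exact setup discussed here; the seed value is arbitrary and the strict-determinism call depends on the PyTorch version.)

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 42):
    # Seed the Python, NumPy and PyTorch RNGs (CPU and all GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(42)

# Stricter determinism flags -- usually not needed for run-to-run stability,
# and they can slow training down or error out on non-deterministic ops.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True, warn_only=True)  # PyTorch >= 1.11
```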
-
@rwightman thank you for your response. Unfortunately, I don't have the resources to run that kind of experiment right now. However, in the past I have trained an unmodified DeiT-S 4 times, as well as a few other unmodified ViTs such as DeiT-T, DeiT-B, PVT-S, etc., on different machines, using the full DeiT hyperparameters (drop-path=0.1, weight-decay=0.05, warmup-epochs=5, aa=rand-m9-mstd0.5-inc1, repeated-aug, reprob=0.25, mixup=0.8, cutmix=1.0, etc.), and I have never experienced a case where training would stall or temporarily blow up; I was always able to match the accuracy reported in the paper.

I have also trained a "variant A" of DeiT-S with no problem, but with AdamW with zero weight decay, cosine learning rate decay with 0 warmup epochs, no regularization of any kind, and no data augmentation except RandomResizedCrop and RandomHorizontalFlip. For a "variant B" of DeiT with the same hyperparameters as "variant A", however, the training curve would look fine for a while and then stall or even blow up a bit for a few epochs (from epoch 20 to epoch 40) in about 30% of my trials. In the other 70% of trials, that same "variant B" model would train perfectly fine and the accuracy would be much better than "variant A".

Do you think the instability in 30% of my trials could stem from the lack of regularization, lack of data augmentation, and/or lack of warmup, which leaves "variant B" training on the edge of stability? Should I reintroduce some regularization such as weight decay, or add warmup back?
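(For concreteness, reintroducing weight decay and a short warmup in plain PyTorch might look roughly like the sketch below. This is an assumed, minimal example rather than the DeiT recipe; the learning rate, weight decay, and warmup length are placeholders.)

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(384, 1000)  # stand-in for the ViT variant

# Non-zero weight decay with AdamW (the DeiT recipe uses 0.05).
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)

warmup_epochs, total_epochs = 5, 300

# Linear warmup for the first few epochs, then cosine decay for the remainder.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs),
        CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)

for epoch in range(total_epochs):
    # ... run one training epoch here ...
    scheduler.step()
```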
-
@rwightman Any thoughts?
-
@Phuoc-Hoan-Le I really can't say without looking carefully, but I've not observed any significant concerns in this area, as I've mentioned, and I've trained and fine-tuned a LOT of models. I'd never train any ViT-based arch from scratch without gradient clipping, so I'm not sure if you have that; it's typically unstable without it unless you use a really, really low LR. No WD is also not ideal for training from scratch. So yeah, probably not a stable training setup?
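(A minimal sketch of what per-step gradient clipping looks like in PyTorch is below. The max_norm=1.0 value is just a commonly used choice, not a tuned recommendation, and the tiny model and batch are stand-ins.)

```python
import torch

model = torch.nn.Linear(384, 1000)  # stand-in for the ViT
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(8, 384)
y = torch.randint(0, 1000, (8,))

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
# Clip the global gradient norm before the optimizer step;
# 1.0 is a commonly used value for ViT-style training.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```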
-
@Phuoc-Hoan-Le moving to discussion as no actionable issue
-
Have you ever experienced a huge variation in training results, around 3% accuracy, for the same type of ViT model on ImageNet across multiple trials (2 to 3 different trials) in a multi-node setting compared to a multi-GPU setting, or in general? If yes, does this happen only for the first few epochs, or throughout the whole training process?
In my case, the training curves of different trials for the same type of model are similar for the first 15 epochs, and after that they dramatically diverge: one training curve will stall or even blow up a bit for a few epochs (from epoch 20 to epoch 40) while the other keeps going down consistently. Why does this happen?
To maybe give you some clues: I am training a variant of the DeiT-S model that uses batch normalization instead of layer normalization, with the DeiT training pipeline from GitHub. I am training with AdamW with zero weight decay for 300 epochs and cosine learning rate decay with 0 warmup epochs. I also don't use any regularization or data augmentation, except for RandomResizedCrop and RandomHorizontalFlip. These hyperparameters are specifically chosen to be consistent with previous works on that variant of ViT.
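(For concreteness, the augmentation pipeline described above amounts to roughly the torchvision sketch below; the image size and normalization statistics are assumed ImageNet defaults, not values taken from the actual run.)

```python
from torchvision import transforms

# Only RandomResizedCrop and RandomHorizontalFlip, no other augmentation.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```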
With regards to training a normal DeiT-S with the original DeiT hyperparameters, I am not sure whether this huge variation would also occur, since I have only trained that model about 3 times.