In what settings should I be concerned about reproducibility? #1994
Replies: 5 comments
-
@Phuoc-Hoan-Le no, I have not observed this behaviour in a 'stable' training setup. Even without being more explicit about determinism (setting extra flags), the same seed is usually enough to stay within roughly +/- 0.2-0.3%, I feel. I'm not sure I've ever seen a 3% swing from changing the seed itself. Now, if you're on the edge of stability, it's possible for smaller perturbations to tip the balance towards collapsing/blowing up... Can you reproduce the run-to-run variation in a completely unmodified ViT-Small?
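(For reference, the "extra flags" for determinism in PyTorch usually look something like the minimal sketch below. This is an assumed example, not the exact setup discussed here; the seed value is arbitrary and the strict-determinism call depends on the PyTorch version.)

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 42):
    # Seed the Python, NumPy and PyTorch RNGs (CPU and all GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(42)

# Stricter determinism flags -- usually not needed for run-to-run stability,
# and they can slow training down or error out on non-deterministic ops.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True, warn_only=True)  # PyTorch >= 1.11
```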
-
@rwightman thank you for your response. Unfortunately, I don't have the resources to run that kind of experiment right now. However, in the past I have trained an unmodified DeiT-S 4 times, as well as a few other unmodified ViTs such as DeiT-T, DeiT-B, PVT-S, etc., on different machines, using the full DeiT hyperparameters (drop-path=0.1, weight-decay=0.05, warmup-epochs=5, aa=rand-m9-mstd0.5-inc1, repeated-aug, reprob=0.25, mixup=0.8, cutmix=1.0, etc.), and I have never experienced a case where training would stall or temporarily blow up; I was always able to match the accuracy reported in the paper.

I have also trained a "variant A" of DeiT-S with no problem, but with AdamW with zero weight decay, cosine learning rate decay with 0 warmup epochs, no regularization of any kind, and no data augmentation except RandomResizedCrop and RandomHorizontalFlip. For a "variant B" of DeiT with the same hyperparameters as "variant A", however, the training curve would look fine for a while and then stall or even blow up a bit for a few epochs (from epoch 20 to epoch 40) in about 30% of my trials. In the other 70% of trials, that same "variant B" model would train perfectly fine and the accuracy would be much better than "variant A".

Do you think the instability in 30% of my trials could stem from the lack of regularization, lack of data augmentation, and/or lack of warmup, which leaves "variant B" training on the edge of stability? Should I reintroduce some regularization such as weight decay, or add warmup back?
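(For concreteness, reintroducing weight decay and a short warmup in plain PyTorch might look roughly like the sketch below. This is an assumed, minimal example rather than the DeiT recipe; the learning rate, weight decay, and warmup length are placeholders.)

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(384, 1000)  # stand-in for the ViT variant

# Non-zero weight decay with AdamW (the DeiT recipe uses 0.05).
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)

warmup_epochs, total_epochs = 5, 300

# Linear warmup for the first few epochs, then cosine decay for the remainder.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs),
        CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)

for epoch in range(total_epochs):
    # ... run one training epoch here ...
    scheduler.step()
```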
-
@rwightman Any thoughts?
-
@Phuoc-Hoan-Le I really can't say without looking carefully, but I've not observed any significant concerns in this area, as I've mentioned, and I've trained and fine-tuned a LOT of models. I'd never train any ViT-based arch from scratch without gradient clipping, so I'm not sure if you have that; it's typically unstable without it unless you use a really, really low LR. No WD is also not ideal for training from scratch. So yeah, probably not a stable training setup?
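(A minimal sketch of what per-step gradient clipping looks like in PyTorch is below. The max_norm=1.0 value is just a commonly used choice, not a tuned recommendation, and the tiny model and batch are stand-ins.)

```python
import torch

model = torch.nn.Linear(384, 1000)  # stand-in for the ViT
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(8, 384)
y = torch.randint(0, 1000, (8,))

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
# Clip the global gradient norm before the optimizer step;
# 1.0 is a commonly used value for ViT-style training.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```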
-
@Phuoc-Hoan-Le moving to discussion as no actionable issue
-
Have you ever experienced a huge variation in training results, around 3% accuracy, for the same type of ViT model on ImageNet across multiple trials (2 to 3 different trials) in a multi-node setting compared to a multi-GPU setting, or in general? If yes, does this happen only for the first few epochs, or throughout the whole training process?
In my case, the training curves of different trials for the same type of model are similar for the first 15 epochs, and after that they dramatically diverge: one training curve will stall or even blow up a bit for a few epochs (from epoch 20 to epoch 40) while the other keeps going down consistently. Why does this happen?
To maybe give you some clues: I am training a variant of the DeiT-S model that uses batch normalization instead of layer normalization, with the DeiT training pipeline from GitHub. I am training with AdamW with zero weight decay for 300 epochs and cosine learning rate decay with 0 warmup epochs. I also don't use any regularization or data augmentation, except for RandomResizedCrop and RandomHorizontalFlip. These hyperparameters are specifically chosen to be consistent with previous works on that variant of ViT.
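(For concreteness, the augmentation pipeline described above amounts to roughly the torchvision sketch below; the image size and normalization statistics are assumed ImageNet defaults, not values taken from the actual run.)

```python
from torchvision import transforms

# Only RandomResizedCrop and RandomHorizontalFlip, no other augmentation.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```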
With regards to training a normal DeiT-S with the original DeiT hyperparameters, I am not sure whether this huge variation would also occur, since I have only trained that model about 3 times.