Can't get the same accuracy as with PyTorch transforms on the iNaturalist2018 dataset #5721
Hi @lyyi599, Thank you for reaching out and using DALI.
DALI's file reader builds the list of samples (from file_root or the file_list index), optionally shuffles it, and returns every sample exactly once per epoch. So it should not favor less-represented samples. Can you gather statistics of the classes the DALI reader returns? Do they match the sample distribution in the dataset?
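For example, something along these lines would do (a rough sketch; `train_loader` stands for your DALIClassificationIterator and the index-file path is a placeholder):

```python
from collections import Counter

# Count the labels the DALI loader actually returns over one epoch.
# `train_loader` is assumed to be a DALIClassificationIterator yielding
# dicts with "data" and "label" keys, as in the resnet50 example.
seen = Counter()
for data in train_loader:
    labels = data[0]["label"].squeeze(-1).long()
    seen.update(labels.cpu().tolist())
train_loader.reset()  # rewind the iterator before real training

# Ground truth from the index file; each line is assumed to be
# "<relative/path.jpg> <class_id>".
truth = Counter()
with open("iNaturalist18_train.txt") as f:  # placeholder path
    for line in f:
        truth[int(line.rsplit(maxsplit=1)[1])] += 1

print("classes never returned:", set(truth) - set(seen))
print("largest per-class deviation:",
      max(abs(truth[c] - seen[c]) for c in truth))
```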
Hi @JanuszL, thank you for your reply. After my verification, it is indeed the case that each epoch samples every item in the train set exactly once, and the returned samples match the sample distribution in the dataset. However, the code still produces the results mentioned earlier. For example, in the first epoch, the accuracy on the val set with PyTorch transforms is many: 20.0%, med: 21.0%, few: 22.2%, average: 25.4%, while with DALI it is many: 3.4%, med: 9.3%, few: 16.5%, average: 11.5%. This is very confusing to me.

The main training code I'm currently using is attached (including the pipeline and transform): trainer.py.txt. Simply switching the dali_dataset flag to include iNaturalist2018 (so that the DALI pipeline is used for data loading) produces the significant accuracy drop mentioned above. Could you help me check whether I am using DALI correctly?

It is worth mentioning that the iNaturalist2018 dataset uses a .txt file for indexing, so the create_dali_pipeline function includes a corresponding data_list_dir parameter for reading it. Thank you for any suggestions you may have.
Hello @lyyi599,
Hi @mzient, thank you for your reply, and I apologize for not explaining the issue more clearly. Let me provide some context: this is about long-tailed recognition. Many real-world problems exhibit a long-tailed distribution, meaning that the number of training samples per class varies widely. The iNaturalist2018 dataset is an example of this and has been widely studied; you can find existing examples here: https://github.com/shijxcs/LIFT/tree/661ead9b78368f05ba79abe4672d63154467f823/datasets/iNaturalist18. Therefore, the issue is likely not related to the .txt file used for indexing.

In the trainer.py code above, the only difference is how iNaturalist18 is loaded, specifically the use of the DALI pipeline and dataloader. This difference is causing the accuracy drop, so I suspect that there might be an issue with how I'm using DALI, or there could be a bug in DALI that I haven't identified. For comparison, the ImageNet_LT dataset is also indexed using a .txt file and uses a similar pipeline and dataloader approach, yet there I obtain comparable results.
We are not saying it is the cause, but we want to make sure we are looking at the same things. If the index file misses some samples, DALI will not return them, underrepresenting some classes and overrepresenting others. If you could also share how you generated these files, or the files themselves, that would be great.
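A quick way to double-check the index file itself (a sketch; the paths are placeholders for your layout):

```python
from collections import Counter
import os

index_file = "datasets/iNaturalist18/iNaturalist18_train.txt"  # placeholder
data_root = "datasets/iNaturalist18"                           # placeholder

per_class = Counter()
missing = 0
with open(index_file) as f:
    for line in f:
        path, label = line.rsplit(maxsplit=1)  # tolerates spaces in paths
        per_class[int(label)] += 1
        if not os.path.exists(os.path.join(data_root, path)):
            missing += 1

print(f"{sum(per_class.values())} samples, {len(per_class)} classes, "
      f"{missing} entries point to missing files")
print("5 rarest classes:", per_class.most_common()[-5:])
```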
I get you. I downloaded the files from the repo https://github.com/shijxcs/LIFT/tree/661ead9b78368f05ba79abe4672d63154467f823/datasets/iNaturalist18 and verified that they properly index the corresponding data.
Hello @lyyi599,
Thank you for your reply, and sorry for my late response; I made many changes these days for debugging. To make sure it runs correctly, I retested the code, and it is now complete.

First, I would like to point out that the iNaturalist2018 dataset is quite large, so you may have difficulty running the code simply because of how long the dataset takes to download. I have therefore kept the log files mentioned in our previous discussion, so you can review them for reference:

- [file0] MLA\output\inat2018_clip_vit_b16_adaptformer_True_num_epochs_20\log.txt-2024-11-20-16-35-42: the complete log of training with PyTorch transforms; accuracy: 79.4%
- [file1] MLA\output\inat2018_clip_vit_b16_adaptformer_True_loss_type_LA_num_epochs_20_lr_0.01\log.txt-2024-11-19-20-24-00: the complete log of training with DALI; many: 40.7%, med: 67.0%, few: 75.3%, average: 67.6%
- [file2] MLA\output\inat2018_clip_vit_b16_adaptformer_True_loss_type_LA_num_epochs_20_lr_0.01\log.txt-2024-11-22-14-51-57: the one-epoch log of training with PyTorch transforms; many: 20.0%, med: 21.0%, few: 22.2%
- [file3] MLA\output\inat2018_clip_vit_b16_adaptformer_True_loss_type_LA_num_epochs_20_lr_0.01\log.txt-2024-11-22-16-21-59: the one-epoch log of training with DALI; many: 3.4%, med: 9.3%, few: 16.5%

From the logs, we can see that file0 reaches roughly 10 points higher accuracy than file1, and over a single epoch file2 likewise performs significantly better than file3. Looking at file3, the training process also differs substantially, with the accuracy on the "many" and "few" categories growing in a different pattern from the start.

I appreciate everyone's outstanding work, as the training time has been reduced from 2 days to 8 hours (on a single 3090 GPU). However, for some currently unknown reason, there is a significant drop in accuracy, which is troubling me. If it's convenient for you to run the code, I have published a runnable version in the following repository: https://github.com/lyyi599/MLA. You can refer to the repository's README for instructions. In any case, thank you for your help; if you have any good solutions, or if I am using DALI incorrectly, please let me know. Thank you again.
Thanks for this awesome tool.
Describe the question.
For my experiments, I need to modify a piece of open-source code. Due to the time-consuming nature of the transformation and the large size of the iNaturalist2018 dataset (over 400,000 images), I plan to switch to using DALI for data loading.
Below is the original open-source code snippet (from https://github.com/shijxcs/LIFT/blob/661ead9b78368f05ba79abe4672d63154467f823/trainer.py#L102).
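That snippet is not reproduced here; in rough outline it builds a CLIP-style torchvision transform stack like the following sketch (an approximation, not the repo's exact code; the normalization constants are CLIP's published statistics, assumed because the backbone is clip_vit_b16):

```python
from torchvision import transforms

# Sketch of a typical CLIP-style training transform (NOT the exact LIFT code).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(
        224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    # CLIP's published normalization statistics.
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])
```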
Referring to the DALI example (DALI/docs/examples/use_cases/pytorch/resnet50/main.py, line 275 at commit 4562f15), I rewrote the data loading as a DALI pipeline.
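The condensed shape of my pipeline is roughly the following (a sketch modeled on that example; the file_list handling and the CLIP-style mean/std are assumptions on my side, not code from the DALI repo):

```python
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.pipeline import pipeline_def

@pipeline_def
def create_dali_pipeline(data_dir, data_list_dir, crop, shard_id, num_shards):
    # file_list points at the .txt index ("<path> <label>" per line);
    # file_root is prepended to every relative path in that list.
    images, labels = fn.readers.file(file_root=data_dir,
                                     file_list=data_list_dir,
                                     shard_id=shard_id,
                                     num_shards=num_shards,
                                     random_shuffle=True,
                                     name="Reader")
    images = fn.decoders.image_random_crop(images, device="mixed",
                                           output_type=types.RGB,
                                           random_aspect_ratio=[0.75, 1.33],
                                           random_area=[0.08, 1.0])
    images = fn.resize(images, resize_x=crop, resize_y=crop)
    mirror = fn.random.coin_flip(probability=0.5)
    images = fn.crop_mirror_normalize(images,
                                      dtype=types.FLOAT,
                                      output_layout="CHW",
                                      # CLIP stats scaled to the 0..255 range
                                      mean=[0.48145466 * 255, 0.4578275 * 255,
                                            0.40821073 * 255],
                                      std=[0.26862954 * 255, 0.26130258 * 255,
                                           0.27577711 * 255],
                                      mirror=mirror)
    return images, labels
```

The pipeline is then wrapped in a DALIClassificationIterator with reader_name="Reader", as in the example.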
As mentioned earlier, the code has been validated on the ImageNet_LT dataset. However, when applied to the iNaturalist2018 dataset, it results in significant differences. With PyTorch transforms, the model reaches many: 20.0%, med: 21.0%, few: 22.2% after one epoch. In contrast, the DALI pipeline prioritizes learning the tail classes (categories with fewer samples, which are harder to collect), reaching many: 3.4%, med: 9.3%, few: 16.5% after one epoch.
It is evident that the learning processes of the two methods differ significantly. However, such a large discrepancy is not observed on the ImageNet_LT dataset.
Of course, after a certain number of epochs, the logs for both PyTorch transforms and the DALI pipeline eventually show lower accuracy for head classes and higher accuracy for tail classes. However, the overall accuracy drops by over 10%, which is an unacceptable difference.
Possible Explanations
The sampling processes of the DALI pipeline and the PyTorch dataloader are different. However, I don't think that alone would cause such a significant discrepancy; one concrete difference worth ruling out is sketched below.
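Specifically (an assumption worth checking, not a confirmed cause): PyTorch's DataLoader(shuffle=True) draws a fresh full permutation of the dataset every epoch, whereas DALI's random_shuffle only shuffles within a prefetch buffer of initial_fill samples (1024 by default). If the index .txt file is ordered by class, as long-tail split files often are, such a small buffer keeps same-class samples clustered in consecutive batches. Two possible mitigations for the reader call inside create_dali_pipeline (real fn.readers.file arguments; the buffer size is illustrative):

```python
import nvidia.dali.fn as fn

# Option 1: enlarge the shuffle buffer. Note that the buffer holds encoded
# images, so sizing it toward the full dataset costs host memory.
images, labels = fn.readers.file(file_root=data_dir,
                                 file_list=data_list_dir,
                                 random_shuffle=True,
                                 initial_fill=100_000,  # illustrative value
                                 name="Reader")

# Option 2: keep the default buffer but re-permute the whole file list
# between epochs (shuffle_after_epoch cannot be combined with random_shuffle).
images, labels = fn.readers.file(file_root=data_dir,
                                 file_list=data_list_dir,
                                 shuffle_after_epoch=True,
                                 name="Reader")
```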
Heeeelp
Thank you for any suggestions!