Training on a large set is much slower than on a smaller set - proportionally #1624
Thanks Jan. Not quite sure. I was just (like 2 minutes ago) discussing with a colleague the fact that pyannote is missing some kind of profiling to debug this kind of behavior. Did you train for multiple epochs, or just one, to report this number? The initial data loading takes a veeeeery long time, so that's why I ask (and that's also the point of the caching mechanism that has just been merged). It could also be related to the system's built-in caching mechanism, where recently opened files are faster to access. It could also be related to file formats that do not support fast seeking to a specific time. Are all your files in the same format? Whatever you find, I'd love to know about it so that we can fix it. cc'ing @flyingleafe just in case.
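A minimal sketch of the kind of manual profiling that can narrow this down, assuming a generic torch.utils.data.DataLoader (the dataloader and batch count are placeholders, not pyannote API):

```python
import time
import cProfile
import pstats

def time_batches(dataloader, num_batches=100):
    """Time how long the dataloader takes to yield `num_batches` batches."""
    start = time.perf_counter()
    for i, _ in enumerate(dataloader):
        if i + 1 >= num_batches:
            break
    elapsed = time.perf_counter() - start
    print(f"{num_batches} batches in {elapsed:.1f}s ({num_batches / elapsed:.2f} it/s)")

# cProfile can then point at hot spots inside __getitem__ or the sampler:
# profiler = cProfile.Profile()
# profiler.enable(); time_batches(dataloader); profiler.disable()
# pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```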
Thank you for the response.
Obviously, I need to run this on the larger set.
So I ran the same code on 2000 hours of training audio. There is a huge shift in …
Let me know if/how I can help.
Not sure how to profile this in a more advanced way, but with some manual profiling I believe that one of the issues is on this line
(please note that this is 3.1.0). I used a different data structure (a dictionary keyed by file_id) and got an improvement on the 2k-hour set, from the previous 61.26% to 41.13%. But I can see that caching also touched the code I just modified; would you recommend updating?
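For illustration, a sketch of the dictionary-based lookup described above, assuming self.annotations is a structured numpy array with a file_id field (the field name comes from the snippet discussed in this thread; the helper itself is hypothetical):

```python
import numpy as np
from collections import defaultdict

def build_file_id_index(annotations: np.ndarray) -> dict:
    """Group annotation row indices by file_id once, so that each lookup in the
    training loop is a dict access instead of a full boolean scan."""
    index = defaultdict(list)
    for row, file_id in enumerate(annotations["file_id"]):
        index[int(file_id)].append(row)
    return {fid: np.asarray(rows) for fid, rows in index.items()}

# In the hot path, instead of:
#   annotations = self.annotations[self.annotations["file_id"] == file_id]
# one would do:
#   annotations = self.annotations[index[file_id]]
```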
It is a good idea to do your tests with the latest version.

About this line of code:

```python
annotations = self.annotations[self.annotations["file_id"] == file_id]
```

It is indeed far from being efficient. Instead of using a dictionary, I would use np.searchsorted. Would you give it a go and open a PR? Something that would look like (!!! untested code !!!):

```python
start_idx, end_idx = np.searchsorted(self.annotations["file_id"], [file_id, file_id + 1])
annotations = self.annotations[start_idx:end_idx]
```
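One caveat worth making explicit: np.searchsorted only returns correct bounds if the key column is sorted, so a cheap one-time assertion makes that assumption visible (a sketch continuing the untested snippet above):

```python
import numpy as np

# One-time precondition check (e.g. right after self.annotations is built):
# searchsorted needs the file_id column to be sorted in ascending order.
file_ids = self.annotations["file_id"]
assert np.all(np.diff(file_ids) >= 0), "annotations must be sorted by file_id"

# Each lookup is then O(log n) instead of the O(n) boolean mask:
start_idx, end_idx = np.searchsorted(file_ids, [file_id, file_id + 1])
annotations = self.annotations[start_idx:end_idx]
```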
This is a version with the current code.
I ran the current code and it is very close to the np.searchsorted implementation. Again, I tried an implementation with the dictionary (this time it was more difficult, and it took me a while to realize what was going on with caching and why my dictionary was suddenly an empty np.ndarray) ...
Anyway, seems like the new …
Thanks Jan! It would be great indeed if you could run with an SSD with the 3 versions (current code, np.searchsorted, dictionary).
So I ran all three on a local volume with the highest possible disk setup on AWS.
But what is strange: when I change the number of workers from 2 to 4 (I have 4 physical cores and 8 threads per GPU), the numbers change dramatically. This is …
Oh. I did not realize you were using such a small number of workers.
I was starting on g4dn.xlarge, which has half of the resources of p3.2xlarge, but that drop when increasing the number of workers is huge; I was not expecting that. So far it seems like the sweet spot is the number of threads (usually twice the number of cores). But since my goal is to train on 26k hours with the current … Since one epoch is taking quite a long time, I can't run the whole profiler, so these are just approximate numbers.
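A quick way to find that sweet spot empirically is to benchmark a few num_workers values on the same data; a minimal sketch with a plain torch DataLoader (dataset and batch size are placeholders, not the actual training setup):

```python
import time
from torch.utils.data import DataLoader

def benchmark_num_workers(dataset, batch_size=32, num_batches=50):
    """Measure dataloader throughput for several num_workers settings.
    Assumes the dataset yields at least num_batches + 1 batches."""
    for num_workers in (0, 2, 4, 8):
        loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers,
                            persistent_workers=num_workers > 0)
        batches = iter(loader)
        next(batches)  # warm up worker processes before timing
        start = time.perf_counter()
        for _ in range(num_batches):
            next(batches)
        elapsed = time.perf_counter() - start
        print(f"num_workers={num_workers}: {num_batches / elapsed:.2f} it/s")
```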
I'd be curious to have a look at your code. Regarding your last point (long epoch), you could actually use the …
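One standard way to keep profiling runs short when training with PyTorch Lightning is the Trainer's limit_train_batches option; a minimal sketch (whether this is the option meant above is an assumption):

```python
import pytorch_lightning as pl

# Sketch: cap each epoch at a fixed number of batches so profiling runs
# finish quickly; `model` is a placeholder LightningModule.
trainer = pl.Trainer(
    max_epochs=1,
    limit_train_batches=500,  # only 500 training batches per epoch
    limit_val_batches=50,     # optionally cap validation as well
)
# trainer.fit(model)
```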
I am trying to push my changes to a branch named segments_dict. When running

git push --set-upstream origin segments_dict

I got an error, and git remote -v says …

Sorry to bother you with this. Any ideas about what might be wrong here? The SSH key has been added on GitHub for quite some time.
I guess you need to fork the repo, push to your own fork, and open a PR from it?
Right, makes sense. Thank you.
Closing this, since this is only critical for large amounts of data.
Re-opening it, as I think it is worth looking into (and also there's your related PR that I still need to have a look at).
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Tested versions
Running 3.1
System information
Ubuntu 20.04, V100 GPU (AWS p3.2xlarge instance)
Issue description
Bonjour Hervé,
I noticed that when training PyanNet on a large set, training speed deteriorates significantly. I have a training and development set (statistics below).
Train:
Dev:
When I train on the training set, one epoch takes 1 day and 17 hours, at around 1.05 it/s.
When I swap the training set for the dev set, one epoch takes 17 minutes, at around 6.50 it/s.
I have ~48x more audio in the training set; however, if I iterated 48 times over the development set, it would take me ~13.5 hours, which is around 3 times faster than training on the train set. Do you have some ideas where this comes from? Both sets are on the same disk. I am going to investigate further; I just wanted to know if you have an idea where to start.
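A quick back-of-the-envelope check of those numbers, using the figures reported above:

```python
# Sanity check of the reported numbers.
train_epoch_h = 24 + 17           # one training epoch: 1 day 17 hours
dev_epoch_h = 17 / 60             # one dev epoch: 17 minutes
dev_scaled_h = 48 * dev_epoch_h   # ~48x more audio in the training set
print(f"48 dev epochs   ~ {dev_scaled_h:.1f} h")                 # ~13.6 h
print(f"actual slowdown ~ {train_epoch_h / dev_scaled_h:.1f}x")  # ~3x
```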
Thanks.
-Jan
Minimal reproduction example (MRE)
can't share my data, sorry