
Unable to use records_per_epoch for scheduling? 🤔[question] #6869

Open
charles-viss opened this issue May 17, 2023 · 9 comments

@charles-viss

Describe your question

According to the docs, records_per_epoch can be used to schedule validation and checkpoint frequencies in conjunction with the epoch scheduling unit in the config file: https://docs.determined.ai/latest/reference/reference-training/experiment-config-reference.html?highlight=gc%20policy#config-records-per-epoch. However, after upgrading to a newer version of Determined, experiments seem to ignore the records_per_epoch field and instead define an epoch by the length of the dataloader. Is there still a way to use records_per_epoch to define the epoch length?

@azhou-determined
Contributor

records_per_epoch was deprecated for PyTorch trials in 0.21.0. The doc you linked needs to be updated to reflect this; I will make a ticket to do that.

Previously, this value was only necessary because our code could not determine the length of the dataset at the time the trial was initialized. After some refactoring we no longer have this lifecycle issue, so we opted not to require users to provide this value. You are correct that we now use the chief worker's dataset length to determine the epoch length.

records_per_epoch is no longer supported for PyTorch trials. If you need the epoch length to be different from the length of the dataloader, you can configure lengths in batches, or modify the length of the actual dataloader.
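For example, one way to pin the epoch length on the dataloader side is to give the sampler a fixed num_samples. This is a rough sketch in plain PyTorch (placeholder dataset and sizes, nothing Determined-specific):

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Placeholder dataset standing in for the real training set.
dataset = TensorDataset(torch.randn(10_000, 8), torch.randint(0, 2, (10_000,)))

RECORDS_PER_EPOCH = 50_000  # the fixed epoch size previously set via records_per_epoch

# num_samples fixes how many records one pass over the loader yields,
# independent of len(dataset).
sampler = RandomSampler(dataset, replacement=True, num_samples=RECORDS_PER_EPOCH)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

print(len(loader))  # ceil(RECORDS_PER_EPOCH / 32) batches per "epoch"
```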

@charles-viss
Author

This still seems undesirable, though, because, as you mention in the docs, the length of a dataset/dataloader can vary depending on your augmentation or sampling strategy.

@garrett361
Contributor

Hi @charles-viss, can you expand upon your use case a bit more? Is it that you have a fixed dataset, but then run some augmentation so that the size of the dataset is effectively expanded? If so, is it safe to assume that the size of the expanded dataset is the same for every epoch, or will that also vary?

We're thinking about how we might best address your scenario, but would like to make sure we understand the situation precisely first.

@garrett361
Contributor

Also, which version of Determined are you currently running?

@charles-viss
Author

For example, one use case is training over a fixed dataset with or without category-weighted sampling. Because custom data samplers change the size of the dataloader, I've used the records_per_epoch parameter to set a universal epoch length and ensure consistent training times across experiments. The alternative, as you've said, is to do all scheduling in batches, which is doable but would require rewrites in several experiments where I've used the epoch unit to control things like LR scheduling or augmentation strategies.

@charles-viss
Author

charles-viss commented May 17, 2023

One helpful feature could be to have the length of an epoch defined by the length of the dataloader by default, but allow it to be overridden if a records_per_epoch parameter is provided. This would at least enable backward compatibility. We are currently unable to fork experiments or use config files from before the version update due to this scheduling change.
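In pseudocode, the behavior I'm imagining would look roughly like this (purely illustrative, not a reference to Determined's actual internals; epoch_length_in_batches is a made-up helper name):

```python
import math

def epoch_length_in_batches(config: dict, dataloader, batch_size: int) -> int:
    """Illustrative only: prefer records_per_epoch when it is set,
    otherwise fall back to the dataloader's own length."""
    records_per_epoch = config.get("records_per_epoch")
    if records_per_epoch is not None:
        return math.ceil(records_per_epoch / batch_size)
    return len(dataloader)  # default: one full pass over the dataloader
```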

@charles-viss
Author

We are currently using Determined version 0.21.2.

@garrett361
Contributor

In 0.21.2, I think converting everything to batch-based scheduling is the most robust solution, though I understand this will necessitate some unfortunate refactoring of your current code. Our apologies!

One helpful feature could be to have the length of an epoch defined by the length of the dataloader by default, but allow it to be overridden if a records_per_epoch parameter is provided.

Thank you for the suggestion; we will consider adding such a feature.

One more question about your use case: are you using WeightedRandomSampler in the scenario described above? If so, can you use its num_samples argument to keep a fixed epoch size, rather than Determined's records_per_epoch? I understand that this does not address the issue regarding forking, in any case.
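Concretely, something along these lines (a rough sketch with a placeholder dataset, placeholder weights, and a made-up RECORDS_PER_EPOCH constant):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Placeholder dataset and labels.
dataset = TensorDataset(torch.randn(10_000, 8), torch.randint(0, 5, (10_000,)))
labels = dataset.tensors[1]

# Per-record weights, e.g. inverse class frequency.
class_counts = torch.bincount(labels)
weights = 1.0 / class_counts[labels].float()

RECORDS_PER_EPOCH = 50_000  # fixed epoch size, independent of len(dataset)

sampler = WeightedRandomSampler(weights, num_samples=RECORDS_PER_EPOCH, replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```

With num_samples pinned, each pass over the loader yields the same number of records regardless of how the underlying dataset changes.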

@charles-viss
Author

In one situation we use a WeightedRandomSampler where that could work. In other cases, though, such as when using a custom data sampler for triplet loss, the data samplers have essentially infinite length, and a switch to batches for scheduling will be necessary.
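(For context, the kind of sampler I mean looks roughly like this; it's a simplified sketch and the names are made up:)

```python
import random
from torch.utils.data import Sampler

class InfiniteTripletSampler(Sampler):
    """Simplified sketch: yields anchor/positive/negative indices forever,
    so the dataloader has no natural epoch length."""

    def __init__(self, labels):
        self.indices_by_label = {}
        for idx, label in enumerate(labels):
            self.indices_by_label.setdefault(label, []).append(idx)
        self.labels = list(self.indices_by_label)

    def __iter__(self):
        while True:  # never exhausts; no __len__ is defined
            anchor_label, negative_label = random.sample(self.labels, 2)
            anchor, positive = random.choices(self.indices_by_label[anchor_label], k=2)
            negative = random.choice(self.indices_by_label[negative_label])
            yield from (anchor, positive, negative)
```

Because a sampler like this has no length, an epoch isn't well defined, so batch-based scheduling is the only option there.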
