fix: ddp bug #145
base: main
Conversation
Will we encounter the same issue if using MeanAveragePrecisionKeypoints or other custom metrics?
Yes, if the output metrics are stored on the CPU.
Let's test this out and include the fixes for the others as well under this PR in that case.
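To make that concrete, here is a minimal sketch of how one might check whether a given metric is affected. It assumes a torchmetrics-style detection metric (standing in for MeanAveragePrecisionKeypoints, which is not shown here); the dummy inputs are illustrative:

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

device = "cuda" if torch.cuda.is_available() else "cpu"
metric = MeanAveragePrecision().to(device)

# A single dummy prediction/target pair so compute() has something to work on.
preds = [{"boxes": torch.tensor([[0.0, 0.0, 10.0, 10.0]], device=device),
          "scores": torch.tensor([0.9], device=device),
          "labels": torch.tensor([0], device=device)}]
target = [{"boxes": torch.tensor([[0.0, 0.0, 10.0, 10.0]], device=device),
           "labels": torch.tensor([0], device=device)}]
metric.update(preds, target)

results = metric.compute()
# Any tensor still on the CPU here would break an nccl all_reduce when logged.
cpu_keys = [k for k, v in results.items()
            if isinstance(v, torch.Tensor) and v.device.type == "cpu"]
print("metrics returned on CPU:", cpu_keys)
```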
Codecov Report

All modified and coverable lines are covered by tests ✅
✅ All tests successful. No failed tests found.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #145      +/-   ##
==========================================
- Coverage   96.31%   95.01%   -1.31%
==========================================
  Files         147      170      +23
  Lines        6304     7517    +1213
==========================================
+ Hits         6072     7142    +1070
- Misses        232      375     +143
```

☔ View full report in Codecov by Sentry.
LGTM
PR Description
This PR updates the metric computation step to ensure that all returned metrics (`map` and related values) are moved to the GPU device before logging. Without this change, calls to `pl.log` would trigger runtime errors by attempting distributed reductions (`all_reduce`) on CPU tensors under the `nccl` backend in multi-GPU training environments.
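A rough sketch of the fix described above, assuming a PyTorch Lightning module with a torchmetrics-style metric; the class name, the model's forward signature, and the metric attribute are hypothetical, not the repository's actual code:

```python
import torch
import pytorch_lightning as pl
from torchmetrics.detection import MeanAveragePrecision


class KeypointModule(pl.LightningModule):
    # Hypothetical module for illustration only.
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.metric = MeanAveragePrecision()

    def validation_step(self, batch, batch_idx):
        preds, targets = self.model(batch)  # hypothetical forward returning both
        self.metric.update(preds, targets)

    def on_validation_epoch_end(self):
        results = self.metric.compute()  # may return CPU tensors
        # Move scalar results onto the module's device before logging.
        # Under DDP with the nccl backend, logging with sync_dist=True
        # performs an all_reduce, which nccl cannot do on CPU tensors.
        scalars = {k: v.to(self.device)
                   for k, v in results.items()
                   if isinstance(v, torch.Tensor) and v.numel() == 1}
        self.log_dict(scalars, sync_dist=True)
        self.metric.reset()
```

The same pattern would apply to any custom metric whose compute() builds its outputs on the CPU.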