fix: ddp bug #145
base: main
Conversation
Will we encounter the same issue if using MeanAveragePrecisionKeypoints or other custom metrics?
Yes, if the output metrics are stored on the CPU.
Let's test this out and include the fixes for the others as well under this PR in that case.
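To make that concrete, here is a minimal sketch of how one might check whether a given metric is affected. It assumes a torchmetrics-style detection metric (standing in for MeanAveragePrecisionKeypoints, which is not shown here); the dummy inputs are illustrative:

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

device = "cuda" if torch.cuda.is_available() else "cpu"
metric = MeanAveragePrecision().to(device)

# A single dummy prediction/target pair so compute() has something to work on.
preds = [{"boxes": torch.tensor([[0.0, 0.0, 10.0, 10.0]], device=device),
          "scores": torch.tensor([0.9], device=device),
          "labels": torch.tensor([0], device=device)}]
target = [{"boxes": torch.tensor([[0.0, 0.0, 10.0, 10.0]], device=device),
           "labels": torch.tensor([0], device=device)}]
metric.update(preds, target)

results = metric.compute()
# Any tensor still on the CPU here would break an nccl all_reduce when logged.
cpu_keys = [k for k, v in results.items()
            if isinstance(v, torch.Tensor) and v.device.type == "cpu"]
print("metrics returned on CPU:", cpu_keys)
```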
Codecov Report

All modified and coverable lines are covered by tests ✅
✅ All tests successful. No failed tests found.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #145      +/-   ##
==========================================
- Coverage   96.31%   95.01%   -1.31%
==========================================
  Files         147      170      +23
  Lines        6304     7517    +1213
==========================================
+ Hits         6072     7142    +1070
- Misses        232      375     +143
```

☔ View full report in Codecov by Sentry.
LGTM
PR Description
This PR updates the metric computation step to ensure that all returned metrics (`map` and related values) are moved to the GPU device before logging. Without this change, calls to `pl.log` would trigger runtime errors by attempting distributed reductions (`all_reduce`) on CPU tensors under the `nccl` backend in multi-GPU training environments.
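A rough sketch of the fix described above, assuming a PyTorch Lightning module with a torchmetrics-style metric; the class name, the model's forward signature, and the metric attribute are hypothetical, not the repository's actual code:

```python
import torch
import pytorch_lightning as pl
from torchmetrics.detection import MeanAveragePrecision


class KeypointModule(pl.LightningModule):
    # Hypothetical module for illustration only.
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.metric = MeanAveragePrecision()

    def validation_step(self, batch, batch_idx):
        preds, targets = self.model(batch)  # hypothetical forward returning both
        self.metric.update(preds, targets)

    def on_validation_epoch_end(self):
        results = self.metric.compute()  # may return CPU tensors
        # Move scalar results onto the module's device before logging.
        # Under DDP with the nccl backend, logging with sync_dist=True
        # performs an all_reduce, which nccl cannot do on CPU tensors.
        scalars = {k: v.to(self.device)
                   for k, v in results.items()
                   if isinstance(v, torch.Tensor) and v.numel() == 1}
        self.log_dict(scalars, sync_dist=True)
        self.metric.reset()
```

The same pattern would apply to any custom metric whose compute() builds its outputs on the CPU.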