Consolidate material on debugging NaNs. #24989

emilyfertig · 2024-11-19T23:19:54Z

Updating the code examples uncovered a couple regressions in jax_debug_nans, described in #24955. This PR comments out parts of docs/error messages that are inconsistent with how the code currently behaves.

yashk2810 · 2024-11-19T23:21:45Z

jax/_src/pjit.py

-           "If you see this error, consider opening a bug report at "
-           "https://github.com/jax-ml/jax.")
-    raise FloatingPointError(msg)
+    # TODO(emilyaf): Re-enable the below when https://github.com/jax-ml/jax/issues/24955 is fixed.


Let's not comment this out. We should do a proper fix instead of this change.

I agree this isn't a substitute for a proper fix. @mattjj and I are working on one. In the meantime, I think commenting this out temporarily is better than the status quo of giving users wrong information (i.e. that their code doesn't produce NaNs without JIT, when it in fact does).

I thought about instead adding a note to flags.md explaining that this error message is broken/misleading, but that seems worse than just omitting it for now.

Maybe we can hold off on this until the actual fix lands? Or maybe your doc change PR can be put on hold for some time?

I feel that this still provides some value rather than raising no error.

The FloatingPointError is still raised with the changes in this PR. This PR just removes the addendum to the error message about the non-jitted code failing to produce NaNs, which currently appears regardless of whether the non-jitted code produces NaNs or not and is therefore often wrong.

The docs are out of sync with how the code behaves, since there have been a couple regressions in debug_nans (the error message addendum always appearing, and the stack trace stopping at the call site of a jitted function instead of the line in the function that produced the NaN). This PR makes the docs/code consistent, so I think it'd be good to merge this as we continue to work on a fix, and update the docs again when the fix is in.
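For reference, here is a minimal sketch of the behavior being discussed, assuming a recent JAX install. The function `f` and the helper `raises_fpe` are made up for illustration and are not part of the PR. It only shows that a `FloatingPointError` is raised when a jitted function produces a NaN with `jax_debug_nans` enabled; it makes no claim about the exact message text or stack trace, which are what the regressions affect.

```python
import jax
import jax.numpy as jnp

# Turn on NaN checking (the flag this thread is about).
jax.config.update("jax_debug_nans", True)


@jax.jit
def f(x):  # hypothetical example function, not taken from the PR
    # The log of a negative number is NaN.
    return jnp.log(x)


def raises_fpe(x):
    """Return True if calling f(x) raises FloatingPointError."""
    try:
        f(x).block_until_ready()
    except FloatingPointError:
        return True
    return False


caught = raises_fpe(jnp.float32(-1.0))
print("FloatingPointError raised:", caught)
```

The flag can also be set through the `JAX_DEBUG_NANS` environment variable instead of calling `jax.config.update`.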

I wanna find out what it takes to fix this. If it's a couple of days or hours, then I would wait. If it's weeks worth of work, then it sounds fine.

It's probably unlikely to be fixed before Thanksgiving (I'm out next week and have other stuff to focus on the next couple days). @mattjj please correct me if I'm wrong and you see a quicker fix.

Consolidate material on debugging NaNs. Comment out parts of docs/error messages that are inconsistent with how the code behaves. (f274508)

emilyfertig requested a review from jakevdp November 19, 2024 23:19

emilyfertig mentioned this pull request Nov 19, 2024

debug_nans error always says the de-optimized function did not produce NaNs #24955

Open

yashk2810 reviewed Nov 19, 2024


emilyfertig commented Nov 19, 2024

yashk2810 Nov 19, 2024

emilyfertig Nov 20, 2024

yashk2810 Nov 20, 2024

emilyfertig Nov 20, 2024

yashk2810 Nov 20, 2024

emilyfertig Nov 20, 2024
