-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stabilize simulations to avoid differences between CI and local runs #2007
Conversation
Review checklistThis checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging. Purpose and scope
Code quality
Documentation
Testing
Performance
Verification
Created with ❤️ by the Trixi.jl community. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2007 +/- ##
==========================================
- Coverage 96.32% 96.32% -0.00%
==========================================
Files 470 470
Lines 37485 37483 -2
==========================================
- Hits 36106 36104 -2
Misses 1379 1379
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
After c1d0c5e, which used the locally calculated errors, only the sedov blast elixir fails on CI. With 1.11, all tests work locally. |
In c47e09a, I added CI tests on macOS, windows and with julia1.11.
|
Removing the use of |
I checked if the maximum number of iterations (maxIterations) is reached for the newton method. Elixir
And CFL test (Not tested if CI test fails)
Why are there still deviations at all? |
I checked the results of
on a local macOS machine (arm64-apple-darwin22.4.0 with an 10 × Apple M2 Pro CPU). The results are identical (up to the 8th digit) to my local results (ubuntu) and the ubuntu CI results. And therefore different to the macOS CI results. |
Because I received the "correct"(=same as on ubuntu) results on a local macOS machine (see comment) for the structured sedov blast, I added a macOS test using aarch64 instead of x64 in ed77e0f. The tree_part2 job just didn't at first (This seems to happen elsewhere as well (see here)). |
So, as a conclusion
|
@sloede In my opinion, this process is finished.
I still have to remove many tests from the CI runs and do some clean up, so the PR is still in draft mode. Anyway, getting a review from you would still help a lot @sloede. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many tests are still disabled and should be re-enabled before merging, but the changes to the elixirs seem to be reasonable.
However, does this also mean that the subcell limiting methods as implemented here is not unconditionally safe and that one has to choose "correct" parameters? Or was this already known and just surfaced here in an unsuspecting way?
Yes of course, I just wanted to keep it with all the different variants until it decided which one I will keep
Since we use an iterative method to find a stable amount of limiting for nonlinear variables, you can say that, yes. When one only has to look at the bounds checking to see if the limiting works properly, why did we have these issues here? This is basically just because I didn't pay too much attention to this before. And of course, when changing setups within the tests, this has to be checked again. |
I agree. It seems that the major issue was the choice of the Newton method parameters and the CFL. The subcell-limiting methods are safe when the CFL is below a threshold and when the bounds are selected properly for the equation being solved. For instance, positivity of density and pressure is "safe" for the Euler equations. In the case of non-linear constraints, such as pressure, the subcell-limited method converges to the right/safe scheme only if the non-linear solver manages to converge (related to the Newton parameters). |
OK, thanks for the clarification! If not already in there, I think it would be good if you could add such an (extended) description to the docs for future reference. |
So, a final conclusion before I remove all additional tests. All these elixirs are susceptible to differences with local limiting (bounds checking functionality only shows errors in machine precision). That includes failing CI tests for macOS, x64, but also possible differences between runs on different systems/architectures/Julia versions etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nearly done 👍 Just one small comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There were always problems with failing tests (mostly the subcell blast waves simulations) with different setups (e.g. local, CI and also different julia versions 1.9 vs. 1.10 vs. 1.11).
Moreover, simulations on macOS were completely different to windows and Linux (see trixi-framework/Trixi2Vtk.jl#67 (comment)).
I realized that most of the time it's the same elixirs with local limiting (see trixi-framework/Trixi2Vtk.jl#67 (comment)).
Looking into those simulations showed me that in these simulations there are also deviations from the calculated subcell bounds.
I adapted parameters to hopefully stabilize the simulations and hope that this fixes also the different results with different setups.
Updates are following...