
Solve segmentation faults for FULLMG_CYCLE #1362

Merged: 10 commits merged into su2code:develop on May 17, 2022

Conversation

@suargi (Contributor) commented Aug 24, 2021

FULLMG_CYCLE produces segmentation faults in different ways:

  1. When using the SST turbulent solver
  2. When running in parallel (mpirun/mpiexec)

This PR tries to fix these segmentation faults.

Proposed Changes

First issue
In CTurbSSTSolver::CTurbSSTSolver() neither the residual nor the solution is initialized when a FULLMG_CYCLE is used, which produces a segmentation fault. I have added a line of code to handle the FULLMG_CYCLE scenario.
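A minimal sketch of the kind of guard this refers to (the condition and surrounding allocation are illustrative, not the exact SU2 code):

  // The residual/solution storage used to be set up only for the finest grid;
  // with a full multigrid cycle the solver also starts on the coarse grids,
  // so the initialization must cover that case as well.
  const bool full_mg = (config->GetMGCycle() == FULLMG_CYCLE);  // accessor/enum names assumed
  if (iMesh == MESH_0 || full_mg) {
    // ... allocate residual and solution arrays here ...
  }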

Second issue
This issue is not related to MPI per se but to domain partitioning.
In CMultiGridGeometry::CMultiGridGeometry() the ratio between the number of points in the finest grid and a given coarse grid level is computed. If this ratio is below 2.5 (I do not know why we make this evaluation, nor why the threshold is exactly 2.5), a multigrid level is removed without warning the user, see lines 629-632. For a few cases that I have tested, certain grid levels have a ratio below 2.5 when running in parallel and are therefore removed. This does not happen when running in serial.
In CMultiGridIntegration::MultiGrid_Iteration() the index of the "finest grid" is then required, and the system of equations is solved at that level. By default this index is set to the number of multigrid levels specified by the user in the config file. If multigrid levels have been removed, accessing the "finest level" mesh in CMultiGridIntegration::MultiGrid_Iteration() fails because that level no longer exists, which produces a segmentation fault.
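Schematically, the failure mode looks like this (index values and accessor names for illustration only):

  // With MGLEVEL= 3 and a full MG cycle, the "finest" starting level is
  // post-processed to 3. If CMultiGridGeometry later drops the coarsest level,
  // only geometry[0..2] exist, so this access in MultiGrid_Iteration() reads
  // past the grids that were actually built:
  CGeometry* start_grid = geometry[config->GetFinestMesh()];  // index 3 -> segfault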

To solve this issue:

  • Stop removing multigrid levels in CMultiGridGeometry::CMultiGridGeometry(), or improve the function. Why do we need to check whether the ratio is below a certain threshold? We should just check whether the coarse level is empty, i.e., has no control volumes.
  • When removing a MG level in CMultiGridGeometry::CMultiGridGeometry(), subtract one from the index of the finest grid, i.e., call config->SubtractFinestMesh();

What are your thoughts?

After solving the previous issues, with mpirun and depending on the number of cores used, SU2 will still produce a segmentation fault when using full MG. This problem is not present when using mpiexec. I traced it back and found that in CFVMFlowSolverBase<V, FlowRegime>::Friction_Forces the method GetNormal_Neighbor() from CVertex returns a non-existing point. Before delving into the problem, it would be nice if you could corroborate these findings. My versions of mpirun and mpiexec are 3.1.3 and 3.3 respectively.
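One simple way to corroborate this, assuming the usual loop over boundary vertices in Friction_Forces (the bound used in the check is an assumption about which indices are valid):

  const unsigned long iPointNormal =
      geometry->vertex[iMarker][iVertex]->GetNormal_Neighbor();
  if (iPointNormal >= geometry->GetnPoint())
    SU2_MPI::Error("Normal neighbor index out of range.", CURRENT_FUNCTION);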

Additional Work

Additionally, I have corrected the description of the variable Convergence_FullMG in CIntegration.hpp. Related to this same variable, what is the definition of convergence for the full multigrid?
Furthermore, Convergence_FullMG is set to false in CIntegration::CIntegration() and never updated, i.e., there is no evaluation of whether the full MG has converged or not. Consequently, the function SetProlongated_Solution() in CMultiGridIntegration::MultiGrid_Iteration() is never executed.
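In simplified form, the dead path described above looks roughly like this (the exact guard in MultiGrid_Iteration() may differ):

  // Set once in CIntegration::CIntegration() and never updated afterwards:
  Convergence_FullMG = false;
  // Guard in CMultiGridIntegration::MultiGrid_Iteration(), simplified:
  if (Convergence_FullMG) {
    SetProlongated_Solution(/* ... */);  // unreachable: the flag never becomes true
  }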

PR Checklist

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with the '-Wall -Wextra -Wno-unused-parameter -Wno-empty-body' compiler flags, or simply --warnlevel=2 when using meson).
  • My contribution is commented and consistent with SU2 style.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp) , if necessary.

- In CTurbSSTSolver.cpp consider the FULLMG_CYCLE scenario
- In CIntegration.hpp fixed the description for the variable "Convergence_FullMG"
@pcarruscag changed the title from "[WIP] Solve segmenation faults for FULLMG_CYCLE" to "[WIP] Solve segmentation faults for FULLMG_CYCLE" on Aug 24, 2021
@pcarruscag (Member)

You should be able to use CConfig::SetFinestMesh in CDriver::Geometrical_Preprocessing_FVM to initialize this to the coarsest grid that was produced.

The threshold to stop agglomeration should stay. But you can expose it as a config option if you want.

@pcarruscag (Member)

By the way, thank you for fixing this 👍. And the reason the agglomeration stops earlier in parallel than in serial is that the algorithm is less effective (due to the shape of the MPI partitions not being very nice).

@TobiKattmann (Contributor) commented Aug 27, 2021

@suargi Just a service note on the hybrid_regression_AD.py regression tests that fail: they seem to fail spuriously from time to time. So if you re-run them in the Checks tab above, or merge the latest develop, they might not fail on you.

@suargi (Contributor, Author) commented Aug 30, 2021

You should be able to use CConfig::SetFinestMesh in CDriver::Geometrical_Preprocessing_FVM to initialize this to the coarsest grid that was produced.

The threshold to stop agglomeration should stay. But you can expose it as a config option if you want.

Your suggestion is also valid, but in my opinion it would be more coherent, in terms of code structure, to reduce the finest mesh level whenever we modify the number of multigrid levels. So in CMultiGridGeometry::CMultiGridGeometry, add this line of code
config->SetFinestMesh(iMesh-1);
after reducing the number of MG levels.

The threshold to stop agglomeration should stay. But you can expose it as a config option if you want.
I can do that.

@suargi (Contributor, Author) commented Aug 30, 2021

@suargi Just a service note on the hybrid_regression_AD.py regression tests that fail: they seem to fail spuriously from time to time. So if you re-run them in the Checks tab above, or merge the latest develop, they might not fail on you.

Thank you @TobiKattmann. I will keep it in mind!

@pcarruscag (Member)

In that case you can also modify CConfig::SetMGLevels to also set the FinestLevel; that way it is always up to date, even if we use it somewhere else.
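A sketch of that suggestion (the member names follow the existing CConfig fields, the guard on the cycle type is an assumption):

  void CConfig::SetMGLevels(unsigned short val_nMGLevels) {
    nMGLevels = val_nMGLevels;
    // Keep the starting ("finest") level of the full MG cycle consistent with
    // the number of coarse levels that actually exist.
    if (MGCycle == FULLMG_CYCLE) SetFinestMesh(val_nMGLevels);
  }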

@pcarruscag (Member)

Btw please change one of the regressions to cover this feature.

@suargi (Contributor, Author) commented Aug 30, 2021

That sounds great. After introducing those fixes I can definitely create a regression test to cover the full multigrid feature. Nevertheless, there is still the segmentation fault problem when using mpirun; I have to delve into that.

@maxaehle (Contributor) commented Nov 9, 2021

Commit a458251 deals with the second issue as proposed by Pedro (#1362 (comment)).

@maxaehle (Contributor) commented Nov 9, 2021

Let me recall the other sub-issues reported by @suargi:

Third issue

After solving the previous issues, with mpirun and depending on the number of cores used, SU2 will still produce a segmentation fault when using full MG. This problem is not present when using mpiexec. I traced it back and found that in CFVMFlowSolverBase<V, FlowRegime>::Friction_Forces the method GetNormal_Neighbor() from CVertex returns a non-existing point.

I was able to partially reproduce this with @suargi's help, by modifying SU2/TestCases/rans/rae2822/turb_SA_RAE2822.cfg, setting MGLEVEL=6 (instead of 3) and MGCYCLE=FULLMG_CYCLE (instead of W_CYCLE). This could become the new regression test, by the way. Both with mpirun -n 3 and mpiexec -n 3 we observed non-existing point IDs for the normal neighbor. In #1430 I describe what I think is the problem. For mpirun -n 2 and mpiexec -n 2 I get another kind of error, I'll further investigate this. However, we did not see a difference in behaviour between mpirun and mpiexec on my system.

Edit: My versions are mpirun (Open MPI) 4.1.0 and mpiexec (OpenRTE) 4.1.0.

Fourth issue

Furthermore, Convergence_FullMG is set to false in CIntegration::CIntegration() and never updated, i.e., there is no evaluation of whether the full MG has converged or not. Consequently, the function SetProlongated_Solution() in CMultiGridIntegration::MultiGrid_Iteration() is never executed.

@maxaehle (Contributor) commented Nov 9, 2021

For mpirun -n 2 and mpiexec -n 2 I get another kind of error, I'll further investigate this.

I had a look at the following part of the output:

Setting the multigrid structure.
+-------------------------------------------+
|  MG Level|       CVs|Aggl. Rate|       CFL|
+-------------------------------------------+
|         0|     14165|    1/1.00|       2.5|
|         1|      3892|    1/3.64|   1.96566|
|         2|      1031|    1/3.77|   1.51755|
|         3|       276|    1/3.74|   1.17777|
|         4|        70|    1/3.94|  0.889704|
|         5|        19|    1/3.68|  0.695288|

So I think that MGLEVEL=6 is too high; it makes the coarsest mesh have too few points. We should choose e.g. MGLEVEL=4 for testing. The turb_SA_RAE2822.cfg with MGLEVEL=4, MGCYCLE=FULLMG_CYCLE runs without error with mpirun -n 2. With mpirun -n 3, I sometimes get an error. I think this is related to the third issue / #1430.
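Based on that, the proposed regression candidate would only change the multigrid options of the existing config (values taken from the discussion above):

  % turb_SA_RAE2822.cfg, multigrid options for the full-MG regression candidate
  MGLEVEL= 4
  MGCYCLE= FULLMG_CYCLE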

@maxaehle (Contributor)

In CConfig::SetPostprocessing:

if (Restart) MGCycle = V_CYCLE;

I don't know what the purpose of that is, or whether we should warn the user when we deviate from what the cfg file asks us to do.

@pcarruscag (Member)

I can see why FULL MG would not be useful for restarts, but ruling out W_CYCLE seems unnecessary.
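A sketch of how the deviation could be narrowed to full MG only and reported to the user (the wording and exact placement inside CConfig::SetPostprocessing are assumptions):

  if (Restart && MGCycle == FULLMG_CYCLE) {
    cout << "WARNING: Full multigrid is not used for restarts, switching to V_CYCLE." << endl;
    MGCycle = V_CYCLE;
  }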

stale bot added the stale label on Mar 2, 2022
@su2code deleted a comment from the stale bot on Mar 2, 2022
stale bot removed the stale label on Mar 2, 2022
stale bot commented May 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is still a relevant issue please comment on it to restart the discussion. Thank you for your contributions.

stale bot added the stale label on May 2, 2022
@bigfooted (Contributor)

do we need much more work on this PR?

stale bot removed the stale label on May 2, 2022
@maxaehle (Contributor) commented May 2, 2022

The following tasks have been finished:

The following optional tasks remain:

@suargi (Contributor, Author) commented May 3, 2022

As @maxaehle has commented above, the two main issues of this pull request have already been solved. Consequently, I would suggest merging and closing this PR. Some related problems have been discovered which could be added to #1487.

@suargi changed the title from "[WIP] Solve segmentation faults for FULLMG_CYCLE" to "Solve segmentation faults for FULLMG_CYCLE" on May 3, 2022
@pcarruscag linked an issue on May 15, 2022 that may be closed by this pull request
@pcarruscag merged commit 6ca6ec2 into su2code:develop on May 17, 2022
Successfully merging this pull request may close these issues.

Illegal access to normal neighbor of halo vertex