Use of SoftRobots in High Performance Computing #252

andersonnardin · 2023-11-20T11:57:33Z

Hi all,

I have a few questions about using the SoftRobots plugin in High-Performance Computing. To exploit its full capacity, I am also using the SOFA MultiThreading plugin.

First, SOFA version 23.06 does not come with the SoftRobots tutorials folder. Is this permanent?

Second, in version 22.12 I cannot use the component "ParallelTetrahedronFEMForceField". It raises the error:
Python exception:
ValueError: Object type ParallelTetrahedronFEMForceField was not created
The object 'ParallelTetrahedronFEMForceField' is not in the factory.
But the following exits:
: ParallelHexahedronFEMForceField (92% match)
: TetrahedronFEMForceField (85% match)
: HexahedronFEMForceField (76% match)
: StandardTetrahedralFEMForceField (75% match)
: TetrahedronDiffusionFEMForceField (73% match)
: TetrahedralCorotationalFEMForceField (70% match)
However, the same component is available in version 23.06. Was this a bug that has since been fixed?

Third (and most important), I have a scene that uses the MultiThreading plugin with the components ParallelTetrahedronFEMForceField (only available in 23.06), ParallelBruteForceBroadPhase, and ParallelBVHNarrowPhase. The scene has a large number of tetrahedra (seriously!) and runs on an HPC system with a large number of CPUs. However, I observed no improvement in computation time compared to the same scene without these components (without Parallel...), or even compared to running it on my personal computer, which has far fewer cores. Any clue about this behavior?

Thank you in advance!

alxbilger · 2023-11-20T12:54:27Z

Maintainer

Hi,

ParallelTetrahedronFEMForceField was introduced in January. That is why it is not available in previous versions. See sofa-framework/sofa#3552.

Parallel components are usually not trivially parallelized. They have synchronization mechanisms, which means the speed-up does not scale with the number of CPUs. In addition, the speed-up due to the parallelism of a component affects only a part of a time step, not the entire time step.
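
As a rough illustration of that last point, here is a minimal sketch based on Amdahl's law. It is a generic bound, not a model of SOFA itself, and the 20% parallel fraction is a made-up number chosen only for the example; `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on the whole-step speed-up when only a fraction of the step is parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical case: 20% of the time step is spent in a perfectly parallel component.
for workers in (4, 16, 64):
    print(workers, "workers ->", round(amdahl_speedup(0.20, workers), 3))
```

Even with perfect scaling of the parallel part, the bound stays below 1.25 in this example, however many cores are added. Synchronization overhead only makes the real figure lower.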

Not knowing the scene, here are my guesses if you observe no improvement:

If you use a direct linear solver, a large part of the matrix assembly time can be due to the matrix data structure. It is not thread-safe, which is why the assembly must be done sequentially, and it is rather time-consuming.
The force computation is not (yet) parallelized. In simulations where the number of tets is more modest, this step is usually not the bottleneck.
You do have a speed-up, but it is not significant compared to other steps of the simulation, for example constraint solving.
We never tried this component on an HPC. This component and its underlying task pool may not be efficient for this kind of architecture.

I advise you to run a few time steps and analyze the timers: https://www.sofa-framework.org/community/doc/using-sofa/performances/inspect-performances/

Alex

0 replies

andersonnardin · 2023-11-20T15:46:15Z

Author

Hi @alxbilger,

Thank you for your response.

1- Yes, I am using a linear solver (LinearSolverConstraintCorrection). What would be an alternative to avoid the sequential execution due to its presence?

2- Yes, I am computing forces, but this is unavoidable for me due to the nature of the study I am conducting. I look forward to a solution that includes the parallelized version of this.

3- I have used the AdvancedTimer profiler, and I cannot verify any speed-up. The ConstraintSolver and Collision Detection steps each take around 50% of the time, both on my PC and on the HPC.

4- I tried to run SOFA without the graphical interface to test it (using the argument -g batch), but it reports the error:
[ERROR] [DAGSimulation(Simulation)] Cannot load file 'C:\Users...': extension ( py) is only supported if the plugin SofaPython3 is loaded. SofaPython3 must be loaded first before being able to load the file.

However, SofaPython3 was loaded. This happens in versions 22.12 and 23.06, on both Windows and Linux. Any idea how to solve it?

9 replies

alxbilger Nov 21, 2023
Maintainer

Keep in mind that performance optimization is about absolute timings, not relative percentages.
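
To make this concrete, here is a small sketch with made-up numbers (the step durations are hypothetical, not measurements from this scene; `absolute_times_ms` is an illustrative helper). Two runs can show the same 50%/50% split while one of them is clearly faster in absolute terms.

```python
def absolute_times_ms(step_ms: float, percentages: dict) -> dict:
    """Convert per-timer percentages of a time step into absolute milliseconds."""
    return {name: step_ms * pct / 100.0 for name, pct in percentages.items()}


# Hypothetical profiles: identical percentages, different total step durations.
shares = {"ConstraintSolver": 50.0, "CollisionDetection": 50.0}
pc = absolute_times_ms(400.0, shares)   # assumed 400 ms per step on the PC
hpc = absolute_times_ms(250.0, shares)  # assumed 250 ms per step on the HPC

for name in shares:
    print(f"{name}: PC {pc[name]:.1f} ms, HPC {hpc[name]:.1f} ms")
```

Comparing the millisecond columns, rather than the percentage columns, is what tells you whether a step actually got faster.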

The ComputeForce timer is included in the FreeMotion timer, but in your case I guess this step is in an asynchronous thread (parallelCollisionDetectionAndFreeMotion set to true). That's why it is hidden. You can set parallelCollisionDetectionAndFreeMotion to false to benchmark the free motion step (or use Tracy), and set it back to true later.

You have 2 steps which are the bottlenecks:

The free motion step. In this step, you have:
1. The force computation -> not really time-consuming, but it needs to be measured. Does not use parallelism for now.
2. The linear system assembly -> time-consuming. Does not use parallelism for now.
3. The linear system solving -> really time-consuming. Does not use parallelism for now. To leverage parallelism in this step, I recommend https://github.com/SofaDefrost/SofaCUDALinearSolver or https://github.com/sofa-framework/sofa/tree/master/applications/plugins/SofaPardisoSolver/. The latter is probably not up to date. I have never tried it myself.
The constraint solving -> not parallelized, except with parallelInverseProduct.

If some steps of the simulation are not parallelized, they won't get any speed up if run on an HPC. So you must leverage as much parallelism as possible.

Another option to speed up the linear system solving is the combination of a preconditioned conjugate gradient and an asynchronous linear solver (see for example https://github.com/sofa-framework/sofa/blob/master/examples/Component/LinearSolver/Preconditioner/FEMBAR_PCG_AsyncSparseLDLSolver.scn). With this configuration, the parallelism from ParallelTetrahedronFEMForceField is used in the conjugate gradient. The asynchronous solver factorizes the matrix in a different thread. However, you won't gain anything in the constraint solving.

andersonnardin Nov 30, 2023
Author

Hi @alxbilger

First, I toggled parallelCollisionDetectionAndFreeMotion and found the ComputeForce timer. However, as you can see in the picture below, it takes less than 1%.

Second, I tried the combination of the preconditioned conjugate gradient and the asynchronous linear solver. In practice, I replaced the previous SparseLDLSolver with:

    # unit.addObject('SparseLDLSolver', name='preconditioner', template='CompressedRowSparseMatrixMat3x3d')
    unit.addObject('ShewchukPCGLinearSolver', name='PCG', iterations='1000', preconditioner='@preconditioner')
    unit.addObject('AsyncSparseLDLSolver', name='preconditioner', template='CompressedRowSparseMatrixMat3x3d')

Then I got the error:
[ERROR] [LinearSolverConstraintCorrection(LinearSolverConstraintCorrection)] Can not use the solver PCG because it is templated on GraphScatteredType

Third, before testing parallelInverseProduct, I need to compile the SOFA master branch. I pulled the Ubuntu image from the recommended Docker Hub repository:
https://hub.docker.com/r/sofaframework/sofabuilder_ubuntu
I expected the container to have a graphical desktop environment, but I could not access it via VNC Viewer or a browser. I verified that the image was updated yesterday. Could you check and let me know whether it works properly for you?

alxbilger Dec 7, 2023
Maintainer

Yes, ComputeForce is really not the bottleneck.
Your LinearSolverConstraintCorrection should refer to the preconditioner, not the PCG.
I never used docker with SOFA. I recommend opening a discussion on the SOFA repository for this specific issue.

andersonnardin Dec 13, 2023
Author

Hi @alxbilger ,

I think I misunderstood something. You said:
"Your LinearSolverConstraintCorrection should refer to the preconditioner, not the PCG"
But... doesn't PCG stand for PRECONDITIONED conjugate gradient? So, how should I refer to the preconditioner and not the PCG?
I checked the example you sent again, but I did not find the LinearSolverConstraintCorrection component in it. This is mine:

alxbilger Dec 13, 2023
Maintainer

In the example you gave:

unit.addObject('ShewchukPCGLinearSolver', name = 'PCG', iterations = '1000', preconditioner = '@preconditioner')
unit.addObject('AsyncSparseLDLSolver', name='preconditioner', template='CompressedRowSparseMatrixMat3x3d')

The component named preconditioner is the preconditioner for the preconditioned CG named PCG. Your LinearSolverConstraintCorrection should refer to the preconditioner (AsyncSparseLDLSolver), not to ShewchukPCGLinearSolver. From your screenshot, that seems to be the case. But do you still get the error [ERROR] [LinearSolverConstraintCorrection(LinearSolverConstraintCorrection)] Can not use the solver PCG because it is templated on GraphScatteredType?
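
For reference, the wiring described above could look like the sketch below. Two things are assumptions and not taken from this thread: that LinearSolverConstraintCorrection selects its solver through a link named linearSolver (check the component documentation for your SOFA version), and the helper names RecordingNode and add_pcg_with_async_preconditioner, which are purely illustrative. RecordingNode stands in for a real SOFA node so the sketch runs without SOFA installed; in a real scene you would pass your unit node instead.

```python
class RecordingNode:
    """Stand-in for a SOFA node: records addObject calls so the sketch runs without SOFA."""

    def __init__(self):
        self.objects = []

    def addObject(self, type_name, **data):
        self.objects.append((type_name, data))
        return data


def add_pcg_with_async_preconditioner(node):
    """Add the PCG, its asynchronous preconditioner, and the constraint correction."""
    node.addObject('ShewchukPCGLinearSolver', name='PCG', iterations='1000',
                   preconditioner='@preconditioner')
    node.addObject('AsyncSparseLDLSolver', name='preconditioner',
                   template='CompressedRowSparseMatrixMat3x3d')
    # Assumption: the constraint correction picks its solver via a 'linearSolver' link.
    # It must point at the direct solver (the preconditioner), not at the PCG.
    node.addObject('LinearSolverConstraintCorrection', linearSolver='@preconditioner')


unit = RecordingNode()
add_pcg_with_async_preconditioner(unit)
for type_name, data in unit.objects:
    print(type_name, data)
```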

andersonnardin · 2024-08-30T10:55:35Z

Author

Hi @alxbilger,
I revisited this issue and can now show a few results.

Consider these blocks of code as the options I tested. Here we have blocks A1 (uncommented) vs B1 (commented):

    rootNode.addObject('FreeMotionAnimationLoop')
    rootNode.addObject('GenericConstraintSolver', tolerance=1e-12, maxIterations=10000, name='GCS', computeConstraintForces=True)
    rootNode.addObject('DefaultPipeline')
    # rootNode.addObject('FreeMotionAnimationLoop', parallelCollisionDetectionAndFreeMotion=True, parallelODESolving=True)
    # rootNode.addObject('GenericConstraintSolver', multithreading = True, tolerance=1e-12, maxIterations=10000, name='GCS', computeConstraintForces=True)
    # rootNode.addObject('CollisionPipeline')

Here we have A2 (uncommented) vs B2 (commented):

    rootNode.addObject('BruteForceBroadPhase')
    rootNode.addObject('BVHNarrowPhase')
    # rootNode.addObject('ParallelBruteForceBroadPhase')
    # rootNode.addObject('ParallelBVHNarrowPhase')

Here we have A3 (uncommented) vs B3 (commented):

        unit.addObject('SparseLDLSolver', name='preconditioner')
        # unit.addObject('SparseLDLSolver', name='preconditioner', template='CompressedRowSparseMatrixMat3x3d')
        # unit.addObject('ShewchukPCGLinearSolver', name='PCG', iterations='1000', preconditioner='@preconditioner')
        # unit.addObject('AsyncSparseLDLSolver', name='preconditioner', template='CompressedRowSparseMatrixMat3x3d')

Here we have A4 (uncommented) vs B4 (commented):

        unit.addObject('TetrahedronFEMForceField', template='Vec3d', name='FEM', method='large', poissonRatio=poissonratiounit, youngModulus=youngmodulusunit)#, showVonMisesStressPerElement=True, computeVonMisesStress=1)
        unit.addObject('LinearSolverConstraintCorrection')
        # unit.addObject('ParallelTetrahedronFEMForceField', template='Vec3d', name='FEM', method='large', poissonRatio=poissonratiounit, youngModulus=youngmodulusunit)#, showVonMisesStressPerElement=True, computeVonMisesStress=1)
        # unit.addObject('PrecomputedConstraintCorrection')

As you may notice, all B blocks are attempts to extract some performance improvement over the ordinary approach (A).
For the case where I used all the A blocks, I got:

As you can see, the bottlenecks are buildStiffness and numeric_factorization.

Then, for the case where I used all the B blocks (i.e., what was commented became uncommented and vice versa), I got:

As you can see, in this case WaitFreeMotion is the bottleneck.
Of course, I also tried all combinations of A and B blocks. What I noticed is that using B3 instead of A3 alone saves around 1 second... and that is all!
None of the other B attempts produced any noticeable improvement.

Now, my question is: given the above-mentioned bottlenecks, is there a way to optimize them?
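
For anyone who wants to repeat this comparison, here is a minimal sketch of how the 16 A/B combinations of the four blocks can be enumerated so that each one is benchmarked exactly once. The helper names (all_configurations, label) are illustrative only, and the scene-building code itself is left out.

```python
from itertools import product

BLOCKS = ("1", "2", "3", "4")


def all_configurations():
    """Yield every A/B choice for blocks 1 to 4, e.g. {'1': 'A', '2': 'B', ...}."""
    for choice in product("AB", repeat=len(BLOCKS)):
        yield dict(zip(BLOCKS, choice))


def label(config):
    """Short name for a configuration, e.g. 'A1A2B3A4'."""
    return "".join(config[block] + block for block in BLOCKS)


configs = list(all_configurations())
print(len(configs), "configurations")
for config in configs:
    # A real benchmark would build the scene from `config` and record the step time here.
    print(label(config))
```

Recording the absolute step time next to each label makes it easy to see which single block change (here B3) carries the gain.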

5 replies

alxbilger Aug 30, 2024
Maintainer

The problem with the profiling of the B blocks is that the true bottlenecks are hidden in WaitFreeMotion. The profiler does not support multiple threads, so I cannot really help to identify the bottleneck in the B blocks. I suggest that you disable the parallel collision detection and free motion tasks (parallelCollisionDetectionAndFreeMotion in FreeMotionAnimationLoop). You can also use Tracy, which does support multiple threads.

But the A blocks are a good starting point to improve the performance. You identified two bottlenecks, and your optimization attempts should focus on them. For example, it is useless to improve the performance of the collision detection by using parallel implementations, as it takes only 2% of the whole time step.

Note that you can easily improve performance by reducing the number of degrees of freedom. Can you give me an idea of the number of DoFs and the number of elements?

buildStiffness: two tasks are performed here: 1) computing the matrix entries, and 2) adding the matrix entries into the matrix. The second task is always sequential. The only thing that affects its performance is the template of the linear solver. In this regard, I recommend template='CompressedRowSparseMatrixMat3x3d'. Just change that in your A blocks and see if it improves. The first task (computing the matrix entries) can be done in parallel with ParallelTetrahedronFEMForceField. However, because of the second task, you won't get an optimal performance boost.
numeric_factorization: this task is highly sequential and very complex to parallelize. All the parallel implementations only improve performance for a very large number of DoFs. You can try https://github.com/SofaDefrost/SofaCUDALinearSolver. Another solution is to use a preconditioned conjugate gradient, like what you suggested:

unit.addObject('ShewchukPCGLinearSolver', name = 'PCG', iterations = '1000', preconditioner = '@preconditioner')
unit.addObject('AsyncSparseLDLSolver', name='preconditioner', template='CompressedRowSparseMatrixMat3x3d')

Change that only, and see if it improves.

andersonnardin Aug 30, 2024
Author

So, by setting parallelCollisionDetectionAndFreeMotion to false when I use the B blocks, I got:

As you can see, the only difference is that instead of WaitFreeMotion I now have waitParallelTasks, with no further information. To use Tracy, I would need to compile SOFA with the CMake variable SOFA_TRACY set, but for now I am using the SOFA binaries...

Number of DoFs: rootNode.unit1.tetras.position.size = 7091
Number of tetrahedrons = rootNode.unit1.container.nbTetrahedra = 20848
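
To put these figures in terms of linear system size, here is a quick calculation. It assumes the Vec3d template used in the blocks above, i.e. 3 scalar unknowns per node; the variable names are only illustrative.

```python
nodes = 7091          # rootNode.unit1.tetras.position.size
tetrahedra = 20848    # rootNode.unit1.container.nbTetrahedra
scalars_per_node = 3  # assumption: Vec3d template, 3 scalar unknowns per node

scalar_dofs = nodes * scalars_per_node
print("scalar unknowns:", scalar_dofs)
print("block rows with 3x3 blocks:", nodes)
print("tetrahedra per node:", round(tetrahedra / nodes, 2))
```

So the assembled system has 21273 scalar rows, or 7091 block rows when the matrix is stored with 3x3 blocks.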

Using template='CompressedRowSparseMatrixMat3x3d' is exactly what makes me gain 1 FPS on average when I use B3 instead of A3, as I mentioned previously; it is the first line of B3. ParallelTetrahedronFEMForceField, on the other hand, gives no noticeable improvement.

The code:

unit.addObject('ShewchukPCGLinearSolver', name = 'PCG', iterations = '1000', preconditioner = '@preconditioner')
unit.addObject('AsyncSparseLDLSolver', name='preconditioner', template='CompressedRowSparseMatrixMat3x3d')

corresponds to the second and third lines of B3, which alone do not provide any noticeable improvement.

I also looked at the examples in plugins\SofaCUDA\share\sofa\examples\SofaCUDA\benchmarks\, where the GPU scenes are 10x faster than their CPU counterparts. In my scene, I added the plugin and set
template = 'CudaVec3f' instead of 'Vec3d' in MechanicalObject and TetrahedronFEMForceField.
Then some problems arise... First, it complains about a FixedPlaneConstraint in the scene:
[ERROR] [SofaPython3::SceneLoader] Unable to completely load the scene from file 'C:/.../version1/environment.py'.
Python exception:
ValueError: Object type FixedPlaneConstraint<> was not created
The object is in the factory but cannot be created.
Requested template : None
Used template : None

If I get rid of the fixed plane, two things may happen. Most commonly, the body explodes. If it does run, it is 0.5 to 0.9 FPS slower than without the CUDA template.

alxbilger Aug 31, 2024
Maintainer

"So, by imposing parallelCollisionDetectionAndFreeMotion to false when I use the blocks B I got: as you can see the only difference is that instead of having WaitFreeMotion I have waitParallelTasks, no more information."

Also set parallelODESolving to false.
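
A minimal sketch of keeping both flags behind a single switch, so that profiling runs are single-threaded at the animation-loop level and production runs are not. The two data names come from the B1 block earlier in this thread; RecordingNode and add_animation_loop are hypothetical helpers, and RecordingNode only stands in for a real SOFA node so the sketch runs without SOFA.

```python
class RecordingNode:
    """Stand-in for a SOFA node: records addObject calls so the sketch runs without SOFA."""

    def __init__(self):
        self.objects = []

    def addObject(self, type_name, **data):
        self.objects.append((type_name, data))
        return data


def add_animation_loop(root, profiling: bool):
    """Disable both parallel options while profiling so the timers are not hidden."""
    parallel = not profiling
    return root.addObject('FreeMotionAnimationLoop',
                          parallelCollisionDetectionAndFreeMotion=parallel,
                          parallelODESolving=parallel)


root = RecordingNode()
print(add_animation_loop(root, profiling=True))
```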

For the CUDA problems, I suggest you read the other discussions, for example sofa-framework/sofa#4705. But I am not sure the mapping method works with projective constraints (FixedPlaneConstraint)...

andersonnardin Sep 1, 2024
Author

Hi @alxbilger,

If I also set parallelODESolving to false, then the profile is pretty much the one I sent for the all-A-blocks option.

I worked around the problem with the constraint in order to try CUDA. In general, I noticed that, for some reason, SOFA version 22.12 is more stable than 23.06. I can then run with the CudaVec3f templates; however, it is always around 0.5 FPS slower, on top of the already slow simulations...
I also monitored the GPU usage: it is low (less than 5% on my local machine), whereas when I run the benchmarks it goes up to 50%. Then I noticed that the benchmarks rely on 'CGLinearSolver' or 'RungeKutta4Solver' instead of 'SparseLDLSolver' or 'ShewchukPCGLinearSolver'. In fact, with those solvers the GPU usage increases and the simulation is faster, but the body stays still. To explain: I have a class that, on onAnimateBeginEvent, automatically displaces the body in the scene. This automatic displacement works if I have 'LinearSolverConstraintCorrection', but that does not work with 'CGLinearSolver' or 'RungeKutta4Solver'. When I try, I get the error:
[ERROR] [LinearSolverConstraintCorrection(LinearSolverConstraintCorrection)] Can not use the solver linear solver because it is templated on GraphScatteredType
So LinearSolverConstraintCorrection requires the 'SparseLDLSolver', which does not use the GPU and is therefore slower.

I have tried, as best I could, to use all the resources mentioned above. Do you have any other suggestions?

Anyway, thank you for your support!

alxbilger Sep 2, 2024
Maintainer

Have you considered using Model Order Reduction? https://github.com/SofaDefrost/ModelOrderReduction
