Ascent hanging with MPI #1384

Open
FrankFrank9 opened this issue Sep 15, 2024 · 10 comments
Comments

FrankFrank9 commented Sep 15, 2024

Hello,

I have an issue where Ascent hangs my simulation when running with MPI across multiple cluster nodes.

I compile with:
env enable_mpi=ON ./build_ascent.sh

Has this ever happened before?
Do you have any recommendations?

Best

cyrush (Member) commented Sep 16, 2024

@FrankFrank9 can you share what actions you are using?

Are you passing the mpi comm handle id as an option during ascent.open()?

FrankFrank9 (Author) commented Sep 17, 2024

Hi @cyrush, thanks for reaching out.

I'm using PyFR, where ascent_mpi.so is wrapped with ctypes; the Ascent wrappers are in ascent.py in the plugin directory.
The simulation hangs only when multiple ranks/nodes are required to perform a render operation. I believe there is something wrong in the way I compile and link against the MPI implementation on the cluster.
The actions I use are q-criterion, iso-values, and pseudocolor.

Are you passing the mpi comm handle id as an option during ascent.open()?

In PyFR the MPI communicator is passed to Ascent as:
self.ascent_config['mpi_comm'] = comm.py2f()
where self.ascent_config is a ctypes wrapper around conduit.so.
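For comparison, here is a minimal sketch of the same handoff using the official Ascent/Conduit Python modules plus mpi4py (this is not PyFR's ctypes path, just a reference for what Ascent expects in the open() options):

import conduit
import ascent.mpi
from mpi4py import MPI

opts = conduit.Node()
# Ascent expects the Fortran handle of the communicator, not the Python object
opts["mpi_comm"] = MPI.COMM_WORLD.py2f()
# forward errors as exceptions instead of failing silently
opts["exceptions"] = "forward"

a = ascent.mpi.Ascent()
a.open(opts)
# ... a.publish(mesh) / a.execute(actions) ...
a.close()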

Any ideas?

Also, does specifying the VTK-m backend matter?
In PyFR there is currently:
self.ascent_config['runtime/vtkm/backend'] = vtkm_backend
with a 'serial' default value.

Best regards

cyrush (Member) commented Sep 17, 2024

@FrankFrank9

We do provide and test Python modules for Ascent and Conduit. There are some extra checks in there with respect to MPI vs non-MPI use; however, since you are using ascent_mpi directly, there should not be any confusion there.

The backend should not matter.

It is possible that you have an error on an MPI task and we aren't seeing it.

Can you try the following:

self.ascent_config['exceptions'] = "forward"

That will allow exceptions to flow up, and it will likely crash instead of hanging.
If that happens, we know we have an error case rather than an algorithm hang.

When compiling Ascent with enable_mpi=ON, it's important that the same MPI modules you will use with PyFR are loaded.

FrankFrank9 (Author) commented Sep 21, 2024

@cyrush

self.ascent_config['exceptions'] = "forward"

I tried it, but I don't get any exceptions. Is that the correct syntax?

When compiling ascent with enable_mpi=ON it's important for the same modules you will use with PyFR to be loaded.

I did that, but the hang persists. I also tried building with enable_find_mpi="${enable_find_mpi:=ON}", but nothing changed.

An update on this: the hang happens in the execute call, specifically self.lib.ascent_execute(self.ascent_ptr, self.actions).
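To narrow down which ranks get stuck, a small per-rank tracing sketch around that call might help (traced_execute is a hypothetical helper; lib, ascent_ptr, and actions stand for the PyFR ctypes objects mentioned above):

from mpi4py import MPI

def traced_execute(lib, ascent_ptr, actions):
    # log entry/exit per rank so hung ranks are identifiable
    rank = MPI.COMM_WORLD.Get_rank()
    print(f"[rank {rank}] entering ascent_execute", flush=True)
    lib.ascent_execute(ascent_ptr, actions)
    print(f"[rank {rank}] ascent_execute returned", flush=True)

Ranks that print the first line but never the second are the ones waiting inside a collective; attaching a debugger (or a stack sampler such as py-spy) to those processes should show which MPI call they are blocked in.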

cyrush (Member) commented Sep 23, 2024

Sorry that did not help us. Can you share your actions?

Also, can you try running a very simple action:

-
  action: "save_info"

This will create a yaml file (if successful) that might help us.
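If it is easier to drive this from Python than from an ascent_actions.yaml file, here is a sketch of the equivalent actions node built with the Conduit Python module (assuming the official ascent/conduit bindings rather than the ctypes wrapper):

import conduit
import ascent.mpi

# build an actions list equivalent to the yaml above
actions = conduit.Node()
add_act = actions.append()
add_act["action"] = "save_info"

a = ascent.mpi.Ascent()
# ... a.open(opts) and a.publish(mesh) as usual ...
a.execute(actions)
a.close()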

FrankFrank9 (Author) commented Sep 24, 2024

@cyrush

Thanks a lot for the help you're providing.

After adding save_info, the last .yaml Ascent produces before hanging is the one attached:
out_ascent.txt

Hope this is useful. Even with exceptions forwarding active, nothing happens and no errors are thrown. (On 1 rank locally, exceptions are correctly forwarded.)

FrankFrank9 (Author)

@cyrush
An update on this: the hang manifests only on multi-node runs. Any idea where to look? Unfortunately, I don't have the opportunity to test elsewhere.

cyrush (Member) commented Sep 30, 2024

@FrankFrank9

Sorry this mystery continues. I see some NaNs in some of our camera info outputs, but I don't think that would be the source of a hang.

When it hangs, do you get any of the three images you are trying to render? (Trying to narrow down where to look.)
The contour for the q-criterion is the most complex pipeline.

Can you share how many MPI tasks + job nodes?

Can we coach you through writing a set of HDF5 files out via an Ascent extract, so we can see if we can reproduce?

FrankFrank9 (Author)

@cyrush

Can you share how many MPI tasks + job nodes?

It is 80 MPI tasks over 10 nodes, but this happens whenever I use more than one node.

When it hangs, do you get any of the three images you are trying to render? (Trying to narrow down where to look.)
The contour for the q-criterion is the most complex pipeline.

To be honest, it seems more or less random, but I noticed it is less frequent in scenes where only one render is called. Is there any blocking MPI operation when multiple renders are triggered on a scene?

Can we coach you through writing a set of HDF5 files out via an Ascent extract, so we can see if we can reproduce?

Yes, sure, let me know.

nicolemarsaglia (Contributor) commented Oct 3, 2024

@FrankFrank9

Here is an example ascent_actions.yaml for generating an extract of the data.

-
  action: "add_extracts"
  extracts:
    e1:
      type: "relay"
      params:
        path: "your_name_for_extract"
        protocol: "blueprint/mesh/hdf5"

This should generate a root file and a folder of HDF5 files (or just the root file if the data is small enough). Then we can hopefully use this extract to replicate your error.
