Ascent hanging with MPI #1384

Open
FrankFrank9 opened this issue Sep 15, 2024 · 10 comments
Comments

FrankFrank9 commented Sep 15, 2024

Hello,

I have an issue where Ascent hangs my simulation when running with MPI across multiple cluster nodes.

I compile with:
env enable_mpi=ON ./build_ascent.sh

Has this ever happened before?
Do you have any recommendations?

Best

cyrush (Member) commented Sep 16, 2024

@FrankFrank9 can you share what actions you are using?

Are you passing the mpi comm handle id as an option during ascent.open()?

FrankFrank9 (Author) commented Sep 17, 2024

Hi @cyrush, thanks for reaching out.

I'm using PyFR, where ascent_mpi.so is wrapped with ctypes; the Ascent wrappers are in ascent.py in the plugin directory.
The simulation hangs only when multiple ranks/nodes are required to perform a render operation. I believe there is something wrong in the way I compile and link against the MPI implementation on the cluster.
The actions I use are q-criterion, iso-values, and pseudocolor.

Are you passing the mpi comm handle id as an option during ascent.open()?

In PyFR the MPI communicator is passed to Ascent as:
self.ascent_config['mpi_comm'] = comm.py2f()
where self.ascent_config is a ctypes wrapper around conduit.so.
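For comparison, here is a minimal sketch of the same handoff using the official Ascent/Conduit Python modules plus mpi4py (this is not PyFR's ctypes path, just a reference for what Ascent expects in the open() options):

import conduit
import ascent.mpi
from mpi4py import MPI

opts = conduit.Node()
# Ascent expects the Fortran handle of the communicator, not the Python object
opts["mpi_comm"] = MPI.COMM_WORLD.py2f()
# forward errors as exceptions instead of failing silently
opts["exceptions"] = "forward"

a = ascent.mpi.Ascent()
a.open(opts)
# ... a.publish(mesh) / a.execute(actions) ...
a.close()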

Any ideas?

Also, does specifying the VTK-m backend matter?
In PyFR there is currently:
self.ascent_config['runtime/vtkm/backend'] = vtkm_backend
with a 'serial' default value.

Best regards

cyrush (Member) commented Sep 17, 2024

@FrankFrank9

We do provide and test Python modules for Ascent and Conduit. There are some extra checks in there with respect to MPI vs non-MPI use; however, since you are using ascent_mpi directly, there should not be any confusion there.

The backend should not matter.

It is possible that you have an error on an MPI task and we aren't seeing it.

Can you try the following:

self.ascent_config['exceptions'] = "forward"

That will allow exceptions to flow up, and it will likely crash instead of hanging.
If that happens, we know we have an error case rather than an algorithm hang.

When compiling Ascent with enable_mpi=ON, it's important that the same MPI modules you will use with PyFR are loaded.

FrankFrank9 (Author) commented Sep 21, 2024

@cyrush

self.ascent_config['exceptions'] = "forward"

I tried it, but I don't get any exceptions. Is that the correct syntax?

When compiling ascent with enable_mpi=ON it's important for the same modules you will use with PyFR to be loaded.

I did that, but the hang persists. I also tried building with enable_find_mpi="${enable_find_mpi:=ON}", but nothing changed.

An update on this: the hang happens in the execute call, specifically self.lib.ascent_execute(self.ascent_ptr, self.actions).
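To narrow down which ranks get stuck, a small per-rank tracing sketch around that call might help (traced_execute is a hypothetical helper; lib, ascent_ptr, and actions stand for the PyFR ctypes objects mentioned above):

from mpi4py import MPI

def traced_execute(lib, ascent_ptr, actions):
    # log entry/exit per rank so hung ranks are identifiable
    rank = MPI.COMM_WORLD.Get_rank()
    print(f"[rank {rank}] entering ascent_execute", flush=True)
    lib.ascent_execute(ascent_ptr, actions)
    print(f"[rank {rank}] ascent_execute returned", flush=True)

Ranks that print the first line but never the second are the ones waiting inside a collective; attaching a debugger (or a stack sampler such as py-spy) to those processes should show which MPI call they are blocked in.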

cyrush (Member) commented Sep 23, 2024

Sorry that did not help us. Can you share your actions?

Also, can you try running a very simple action:

-
  action: "save_info"

This will create a yaml file (if successful) that might help us.
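If it is easier to drive this from Python than from an ascent_actions.yaml file, here is a sketch of the equivalent actions node built with the Conduit Python module (assuming the official ascent/conduit bindings rather than the ctypes wrapper):

import conduit
import ascent.mpi

# build an actions list equivalent to the yaml above
actions = conduit.Node()
add_act = actions.append()
add_act["action"] = "save_info"

a = ascent.mpi.Ascent()
# ... a.open(opts) and a.publish(mesh) as usual ...
a.execute(actions)
a.close()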

FrankFrank9 (Author) commented Sep 24, 2024

@cyrush

Thanks a lot for the help you're providing.

After adding save_info, the last .yaml Ascent produces before hanging is the one attached:
out_ascent.txt

Hope this is useful. Even with exceptions forwarding active, nothing happens and no errors are thrown. (On 1 rank locally, exceptions are correctly forwarded.)

FrankFrank9 (Author)

@cyrush
An update on this: the hang manifests only on multi-node runs. Any idea where to look? Unfortunately, I don't have the opportunity to test elsewhere.

cyrush (Member) commented Sep 30, 2024

@FrankFrank9

Sorry this mystery continues. I see some NaNs in some of our camera info outputs, but I don't think that would be the source of a hang.

When it hangs, do you get any of the three images you are trying to render? (Trying to narrow down where to look.)
The contour for the q-criterion is the most complex pipeline.

Can you share how many MPI tasks + job nodes?

Can we coach you through writing a set of HDF5 files out via an Ascent extract, so we can see if we can reproduce?

FrankFrank9 (Author)

@cyrush

Can you share how many MPI tasks + job nodes?

It is 80 MPI tasks over 10 nodes, but this happens whenever I use more than one node.

When it hangs, do you get any of the three images you are trying to render? (Trying to narrow down where to look.)
The contour for the q-criterion is the most complex pipeline.

To be honest, it seems more or less random, but I noticed it is less frequent in scenes where only one render is called. Is there any blocking MPI operation when multiple renders are triggered on a scene?

Can we coach you through writing a set of HDF5 files out via an Ascent extract, so we can see if we can reproduce?

Yes, sure, let me know.

nicolemarsaglia (Contributor) commented Oct 3, 2024

@FrankFrank9

Here is an example ascent_actions.yaml for generating an extract of the data.

-
  action: "add_extracts"
  extracts:
    e1:
      type: "relay"
      params:
        path: "your_name_for_extract"
        protocol: "blueprint/mesh/hdf5"

This should generate a root file and a folder of HDF5 files (or just the root file if the data is small enough). Then we can hopefully use this extract to replicate your error.
