Serialization issues when using MPI gather #771
Comments
What is the type of the object you're
A vector of relatively heavy mutable structs: https://github.com/JuliaTrustworthyAI/CounterfactualExplanations.jl/blob/8be3c7652e64e94ef59b291a5736812c7cd5e794/src/counterfactuals/core_struct.jl#L4
Maybe there is a type that has different definitions on different MPI ranks for some reason? Function references, especially anonymous ones, are also supposed to be a little dicey with serialization. Maybe you can create a minimal example with smaller and smaller subsets of the struct to locate the member that causes the issue. Anything containing Ptr is also not going to work (but that would not lead to an error like that, I guess).
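One way to start narrowing this down is sketched below. It walks the fields of a single object, flags function references and raw pointers, and tries a local serialize/deserialize round trip on everything else. The helper name `suspicious_fields` is hypothetical and not part of MPI.jl or CounterfactualExplanations.jl; only the `Serialization` standard library is used. It only inspects top-level fields, so nested structs would need to be checked one level at a time.

```julia
using Serialization

# Hypothetical helper (an assumption for illustration, not a library function):
# list the fields of `x` that are likely to cause trouble when serialized.
function suspicious_fields(x)
    report = Pair{Symbol,String}[]
    for name in fieldnames(typeof(x))
        # Skip fields that were never assigned.
        isdefined(x, name) || continue
        value = getfield(x, name)
        if value isa Function
            # Covers named and anonymous functions alike.
            push!(report, name => "function reference")
        elseif value isa Ptr
            push!(report, name => "raw pointer")
        else
            # Local round trip through an in-memory buffer.
            try
                io = IOBuffer()
                serialize(io, value)
                seekstart(io)
                deserialize(io)
            catch err
                push!(report, name => "round trip failed: $(typeof(err))")
            end
        end
    end
    return report
end
```

A local round trip succeeding does not prove that the same bytes can be deserialized on another rank, so this is a first filter rather than a full reproduction of the gather.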
Thanks @lukas-weber! I've gradually removed all anonymous functions from my code; they did indeed cause issues.
Different ranks are indeed receiving different
So I'm happy for this to be closed.
From the related thread on Discourse: https://discourse.julialang.org/t/serialization-error-when-using-mpi-gather/103802:
That didn't do the trick. @vchuravy kindly suggested to refrain from using multiple threads but unfortunately that also didn't help.
I absolutely would not be surprised if this is just an issue on my end (not placing barriers where they should be is something I can think of). As I mentioned on Discourse:
Additionally, I was previously calling the part of the code that uses the CounterfactualExplanations.paralellize function repeatedly in a loop. Since I've removed the loop, the code seems to run without issues.
The code is currently in a private repo but I am happy to give access to anyone who wants to look into this.
Thanks in advance!