
Fatal error in internal_Init_thread: Other MPI error #824

Closed
Blumenkranz opened this issue Mar 10, 2024 · 10 comments · Fixed by #825
@Blumenkranz

I got this error on my M2 Mac with Julia 1.10.2:

Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(67)...........: MPI_Init_thread(argc=0x0, argv=0x0, required=2, provided=0x16db94160) failed
MPII_Init_thread(234)..............: 
MPID_Init(67)......................: 
init_world(171)....................: channel initialization failed
MPIDI_CH3_Init(84).................: 
MPID_nem_init(314).................: 
MPID_nem_tcp_init(175).............: 
MPID_nem_tcp_get_business_card(397): 
GetSockInterfaceAddr(370)..........: gethostbyname failed, bogon (errno 0)
@giordano
Member

Can you please show the output of the command MPI.versioninfo() and the code you're running? It's hard to guess what's going on without this basic information.
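
A minimal way to collect that information, assuming MPI.jl is already installed in the active project:

    # Print the selected binary/ABI, package versions, and MPI library details
    using MPI
    MPI.versioninfo()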

@Blumenkranz
Author

Blumenkranz commented Mar 10, 2024

The output of MPI.versioninfo() is

  binary:  MPICH_jll
  abi:     MPICH

Package versions
  MPI.jl:             0.20.18
  MPIPreferences.jl:  0.1.10
  MPICH_jll:          4.1.2+0

Library information:
  libmpi:  /Users/gelongqing/.julia/artifacts/f99c980548677ee7ea55b4fb5a14c9036e7ce0b6/lib/libmpi.12.dylib
  libmpi dlpath:  /Users/gelongqing/.julia/artifacts/f99c980548677ee7ea55b4fb5a14c9036e7ce0b6/lib/libmpi.12.dylib
  MPI version:  4.0.0
  Library version:  
    MPICH Version:      4.1.2
    MPICH Release date: Wed Jun  7 15:22:45 CDT 2023
    MPICH ABI:          15:1:3
    MPICH Device:       ch3:nemesis
    MPICH configure:    --prefix=/workspace/destdir --build=x86_64-linux-musl --host=aarch64-apple-darwin20 --enable-shared=yes --enable-static=no --with-device=ch3 --disable-dependency-tracking --enable-fast=all,O3 --docdir=/tmp --mandir=/tmp --disable-opencl FFLAGS=-fallow-argument-mismatch FCFLAGS=-fallow-argument-mismatch
    MPICH CC:           cc   -fno-common  -DNDEBUG -DNVALGRIND -O3
    MPICH CXX:          c++   -DNDEBUG -DNVALGRIND -O3
    MPICH F77:          gfortran -fallow-argument-mismatch  -O3
    MPICH FC:           gfortran -fallow-argument-mismatch  -O3

The error appears even when I just run MPI.Init() with mpiexecjl --project -n 2 julia ./test.jl. The exact same code worked yesterday; the only change is that I turned on my newly bought Mac... Maybe the error is caused by VS Code synchronization?
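
For reference, a minimal test.jl along these lines is enough to trigger it (a sketch, since the failure happens inside MPI.Init() itself, before any communication):

    # test.jl -- minimal reproducer; the error is raised during MPI.Init()
    using MPI

    MPI.Init()
    rank = MPI.Comm_rank(MPI.COMM_WORLD)
    println("Hello from rank $rank")
    MPI.Finalize()

launched with mpiexecjl --project -n 2 julia ./test.jl as above.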

@Blumenkranz
Author

Well, when I cut off the internet, the code works... It seems that the other Mac is being treated as a processor?

@giordano
Member

Ok, this seems similar to the error reported at idaholab/moose#23610, for which there's a suggested workaround at https://mooseframework.inl.gov/help/troubleshooting.html (quoted below; a quick way to verify the fix is sketched after the quoted steps):

gethostbyname failed, localhost (errno 3)

This is a fairly common occurrence which happens when your internal network stack/route is not correctly configured for the local loopback device. Thankfully, there is an easy fix:

  • Obtain your hostname:
    $ hostname
    mycoolname
  • Linux & Macintosh: add the result of hostname to your /etc/hosts file, like so:
    $ sudo vi /etc/hosts
    
    127.0.0.1  localhost
    
    # The following lines are desirable for IPv6 capable hosts
    ::1        localhost ip6-localhost ip6-loopback
    ff02::1    ip6-allnodes
    ff02::2    ip6-allrouters
    
    127.0.0.1  mycoolname  # <--- add this line to the end of your hosts file
    
    Everyone's hosts file is different, but the result of adding the line described above will be the same.
  • Macintosh only, 2nd method:
    sudo scutil --set HostName mycoolname
    
    We have received reports that the second method sometimes does not work.
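
As a quick sanity check after editing /etc/hosts, you can ask the system resolver for the local hostname from Julia (a sketch using the Sockets standard library; getaddrinfo roughly mirrors the gethostbyname lookup that fails in the stack trace above):

    # Verify that the local hostname ("bogon" in the report above) resolves
    using Sockets

    host = gethostname()
    try
        ip = getaddrinfo(host)              # system resolver lookup
        println("$host resolves to $ip")
    catch err
        err isa Sockets.DNSError || rethrow()
        println("$host does not resolve; add it to /etc/hosts as described above")
    end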

@giordano
Member

This also looks similar to pmodels/mpich#6547. pmodels/mpich#6547 (comment) suggested exporting the environment variables

MPIR_CVAR_OFI_SKIP_IPV6=0
FI_PROVIDER=tcp

as an alternative workaround, without touching the hostname configuration. That bug was reportedly fixed in MPICH by pmodels/mpich#6558, which first appeared in v4.2.0, but you're using v4.1.2.
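
If you want to try that, the variables can also be set from inside Julia, as long as it happens before MPI.Init() is called (a sketch; exporting them in the shell before mpiexecjl should work just as well):

    # MPICH/libfabric read these only at initialization time,
    # so they must be set before MPI.Init()
    ENV["MPIR_CVAR_OFI_SKIP_IPV6"] = "0"
    ENV["FI_PROVIDER"] = "tcp"

    using MPI
    MPI.Init()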

@Blumenkranz
Author

> This also looks similar to pmodels/mpich#6547. pmodels/mpich#6547 (comment) suggested exporting the environment variables
>
> MPIR_CVAR_OFI_SKIP_IPV6=0
> FI_PROVIDER=tcp
>
> as an alternative workaround, without touching the hostname configuration. That bug was reportedly fixed in MPICH by pmodels/mpich#6558, which first appeared in v4.2.0, but you're using v4.1.2.

Unfortunately, this workaround failed for me, but the first one works. Thanks a lot for your reply!

@giordano
Member

Can you try to revert your changes to /etc/hosts and update MPICH_jll to v4.2.0 (]add MPICH_jll@4.2.0, if ]up doesn't upgrade it automatically)? That should also do the trick.
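
The same upgrade via the Pkg API, in case that's more convenient (a sketch):

    using Pkg
    Pkg.add(name="MPICH_jll", version="4.2.0")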

@Blumenkranz
Author

> Can you try to revert your changes to /etc/hosts and update MPICH_jll to v4.2.0 (]add MPICH_jll@4.2.0, if ]up doesn't upgrade it automatically)? That should also do the trick.

It doesn't work.

@giordano
Member

Alright, I'll open a PR to add a known issue to the documentation, but strictly speaking this is not a bug in MPI.jl but rather in your MPI/system configuration, since other independent projects have run into it as well.

@Blumenkranz
Author

> Alright, I'll open a PR to add a known issue to the documentation, but strictly speaking this is not a bug in MPI.jl but rather in your MPI/system configuration, since other independent projects have run into it as well.

I see. Thanks again!
