Interference with Distributed.jl on Windows #205
Comments
Do we have any known version bounds on the issue? |
I saw it all the way through 1.8.0 in the SymbolicRegression.jl tests - I couldn't test earlier due to compatibility requirements with other packages |
Another interesting thing I might have just discovered is that the issue only appears when you try to create more worker processes than there are cores. With |
@mkitti I wonder if the issue is actually FunctionWrappers.jl, which is a dependency of ReverseDiff.jl (and not of any other package used by SymbolicRegression.jl). It seems to do more low-level things that could be broken by Distributed.jl. |
Nope, it's just ReverseDiff.jl. FunctionWrappers.jl by itself doesn't trigger the error: https://github.com/MilesCranmer/SymbolicRegression.jl/runs/7960936888?check_suite_focus=true#step:6:232. There aren't any other dependencies which are not also imported by SymbolicRegression, so it is definitely some code in this repo... |
I just tried this and it works fine for me.
using Pkg
using Distributed
pkg"activate --temp"
pkg"add [email protected]"
import ReverseDiff
procs = addprocs(Sys.CPU_THREADS ÷ 2)
# Activate env on workers:
project_path = splitdir(Pkg.project().path)[1]
@everywhere procs begin
Base.MainInclude.eval(
quote
using Pkg
Pkg.activate($$project_path)
end,
)
end
# Import package on workers:
@everywhere procs begin
Base.MainInclude.eval(import ReverseDiff)
end
In this case:
julia> length(procs)
48
julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab7 (2022-08-17 13:38 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 48 × Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, cascadelake)
Threads: 1 on 96 virtual cores |
Can you try |
The below also works.
using Pkg
using Distributed
pkg"activate --temp"
pkg"add ReverseDiff"
import ReverseDiff
procs = addprocs(Sys.CPU_THREADS ÷ 2)
# Activate env on workers:
project_path = splitdir(Pkg.project().path)[1]
@everywhere procs begin
Base.MainInclude.eval(
quote
using Pkg
Pkg.activate($$project_path)
end,
)
end
# Import package on workers:
@everywhere procs begin
Base.MainInclude.eval(import ReverseDiff)
end
julia> pkg"st"
Status `C:\Users\kittisopikulm\AppData\Local\Temp\jl_qGGQF9\Project.toml`
[37e2e3b7] ReverseDiff v1.14.1 |
This crashes.
using Pkg
using Distributed
pkg"activate --temp"
pkg"add ReverseDiff"
import ReverseDiff
procs = addprocs(Sys.CPU_THREADS)
# Activate env on workers:
project_path = splitdir(Pkg.project().path)[1]
@everywhere procs begin
Base.MainInclude.eval(
quote
using Pkg
Pkg.activate($$project_path)
end,
)
end
# Import package on workers:
@everywhere procs begin
Base.MainInclude.eval(import ReverseDiff)
end |
The weird thing is that it works fine if you have |
I just had to reboot my computer since it crashed to a black screen. Switching to Julia 1.7.3 and trying
using Pkg
using Distributed
pkg"activate --temp"
pkg"add [email protected]"
import ReverseDiff
procs = addprocs(Sys.CPU_THREADS)
# Activate env on workers:
project_path = splitdir(Pkg.project().path)[1]
@everywhere procs begin
Base.MainInclude.eval(
quote
using Pkg
Pkg.activate($$project_path)
end,
)
end
# Import package on workers:
@everywhere procs begin
Base.MainInclude.eval(import ReverseDiff)
end
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
This works! |
Now trying
using Pkg
using Distributed
pkg"activate --temp"
pkg"add ReverseDiff"
import ReverseDiff
procs = addprocs(Sys.CPU_THREADS)
# Activate env on workers:
project_path = splitdir(Pkg.project().path)[1]
@everywhere procs begin
Base.MainInclude.eval(
quote
using Pkg
Pkg.activate($$project_path)
end,
)
end
# Import package on workers:
@everywhere procs begin
Base.MainInclude.eval(import ReverseDiff)
end
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
This also works! |
Trying Julia 1.8 again:
using Pkg
using Distributed
pkg"activate --temp"
pkg"add ReverseDiff"
import ReverseDiff
using LinearAlgebra
LinearAlgebra.BLAS.set_num_threads(1)
procs = addprocs(Sys.CPU_THREADS)
# Activate env on workers:
project_path = splitdir(Pkg.project().path)[1]
@everywhere procs begin
Base.MainInclude.eval(
quote
using Pkg
Pkg.activate($$project_path)
using LinearAlgebra
LinearAlgebra.BLAS.set_num_threads(1)
end,
)
end
# Import package on workers:
@everywhere procs begin
Base.MainInclude.eval(import ReverseDiff)
end
Crashed with the following trace
Previously it would crash Windows completely, so this is an improvement. |
Trying with Julia 1.8...
using Pkg
using Distributed
pkg"activate --temp"
pkg"add ReverseDiff"
import ReverseDiff
using LinearAlgebra
LinearAlgebra.BLAS.set_num_threads(1)
procs = addprocs(Sys.CPU_THREADS ÷ 3 * 2) # 64 processes
# Activate env on workers:
project_path = splitdir(Pkg.project().path)[1]
@everywhere procs begin
Base.MainInclude.eval(
quote
using Pkg
Pkg.activate($$project_path)
using LinearAlgebra
LinearAlgebra.BLAS.set_num_threads(1)
end,
)
end
# Import package on workers:
@everywhere procs begin
Base.MainInclude.eval(import ReverseDiff)
end
This works! |
Trying Julia 1.8 with more BLAS threads.
using Pkg
using Distributed
pkg"activate --temp"
pkg"add ReverseDiff"
import ReverseDiff
using LinearAlgebra
LinearAlgebra.BLAS.set_num_threads(8)
procs = addprocs(Sys.CPU_THREADS ÷ 3 * 2) # 64 processes
# Activate env on workers:
project_path = splitdir(Pkg.project().path)[1]
@everywhere procs begin
Base.MainInclude.eval(
quote
using Pkg
Pkg.activate($$project_path)
using LinearAlgebra
LinearAlgebra.BLAS.set_num_threads(8)
end,
)
end
# Import package on workers:
@everywhere procs begin
Base.MainInclude.eval(import ReverseDiff)
end
This works! |
I tried Julia 1.8 with 32 BLAS threads per process and 64 processes and it crashed once. The second time I tried, it worked.
|
Note that I have two sockets and hyperthreading. Also, the memory issues are not due to a lack of RAM. @MilesCranmer, my guess is that we are hitting resource contention with a high number of processes. Julia 1.8 starts more BLAS threads by default (32 versus 8). Perhaps this package fails because it is the first one to use LinearAlgebra? The heuristic, therefore, is to reduce the number of processes on Windows. Also consider reducing the number of BLAS threads, especially when using Julia 1.8 or greater. There is a |
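One way to act on the BLAS part of this heuristic, as a minimal sketch only: choose a per-process BLAS thread count from the number of processes. The helper name `blas_threads_per_process` and the rule it applies (keep processes × BLAS threads within the logical core count) are assumptions for illustration, not something established in this thread.

```julia
using LinearAlgebra

# Hypothetical helper (not part of any package): pick a BLAS thread count per
# process so that nprocs * threads does not exceed the logical core count.
function blas_threads_per_process(nprocs::Integer; cores::Integer = Sys.CPU_THREADS)
    nprocs >= 1 || throw(ArgumentError("nprocs must be at least 1"))
    return max(1, cores ÷ nprocs)
end

# Assumed figures: 64 processes on a machine with 96 logical cores.
n = blas_threads_per_process(64; cores = 96)

# Apply the setting in the current process. Each worker would need the same
# call, for example inside an `@everywhere` block after `using LinearAlgebra`.
LinearAlgebra.BLAS.set_num_threads(n)
```

With 64 processes on 96 logical cores this rule gives one BLAS thread per process, which matches the 64-process configuration reported as working above.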
I'm not sure... I've tried importing SymbolicRegression this way (with twice as many procs as there are cores available), and it, along with its whole tree of ~100 dependencies, imports just fine with no crashes. Presumably all of those dependencies would need more resources than ReverseDiff.jl does, right? Is there some specific thing in loading ReverseDiff.jl that is very expensive? Maybe try doing a |
The other question: if you have two instances of Julia running, and each launches 4 processes on a 4-core machine, bringing the total to 8 processes, does that mean Julia will crash? This feels like a bug even if it is a resource contention issue - having more processes than cores should still work, even if it is slower. |
My guess is that it might have the same result. The question is what is the resource in contention. We are combining multiprocessing with multithreading here and oversubscribing each core with at least 64 BLAS threads per core. Another issue may be OpenBLAS. It may be useful to try the Intel MKL instead. I also suspect that hyperthreading is an issue here. |
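As a rough worked example of the oversubscription described above: the figures below assume 96 worker processes (one per logical core), the default BLAS thread counts quoted earlier in the thread (32 and 8), and 48 physical cores, which is inferred from the `versioninfo()` output rather than stated outright. The helper name `blas_threads_per_core` is made up for illustration.

```julia
# Hypothetical helper: total BLAS threads across all workers, per physical core.
blas_threads_per_core(nprocs, blas_threads, physical_cores) =
    nprocs * blas_threads / physical_cores

# Assumed figures: 96 workers, 32 BLAS threads each, 48 physical cores.
println(blas_threads_per_core(96, 32, 48))  # prints 64.0

# Same machine with 8 BLAS threads per process, the smaller default cited above.
println(blas_threads_per_core(96, 8, 48))   # prints 16.0
```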
This reminds me of issues I've seen on Windows before. For a while the tests of AbstractMCMC segfaulted on Windows, which I couldn't understand until I came across JuliaLang/Pkg.jl#2323 and JuliaLang/Pkg.jl#2366. My impression was that there is an upstream bug and that it is not safe to use too many tasks on Windows. I used the same approach as in Pkg to limit the number of processes in our tests: https://github.com/TuringLang/AbstractMCMC.jl/blob/a9ba4d3b1c0314393532dbd792befa55750d9a0f/test/sample.jl#L224 It seems that fixed the issues. |
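A minimal sketch of that kind of cap, not the code used in Pkg or AbstractMCMC: the helper name `capped_nworkers` is hypothetical, and the default cap of half the logical cores is an assumption taken from the runs in this thread that succeeded, not a documented limit.

```julia
# Hypothetical helper: limit the number of worker processes on Windows only.
# `is_windows` is a keyword so the behaviour can be checked on any platform.
function capped_nworkers(requested::Integer;
                         cap::Integer = max(1, Sys.CPU_THREADS ÷ 2),
                         is_windows::Bool = Sys.iswindows())
    return is_windows ? min(requested, cap) : requested
end

# Intended use (left as a comment so this sketch does not spawn processes):
#   using Distributed
#   procs = addprocs(capped_nworkers(Sys.CPU_THREADS))
```

On other operating systems the request passes through unchanged, so a test suite would keep its full worker count there.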
Very interesting! Did that test fail right at the import step too, or after running something? (Also, does it implicitly depend on ReverseDiff.jl at all?) |
No, it's not related to ReverseDiff. Only to Windows + |
Following some detective work with the help of @rikhuijzer and @ChrisRackauckas on https://discourse.julialang.org/t/github-action-mysteriously-starts-breaking-on-windows/86048, it seems as though ReverseDiff.jl is interfering with Distributed.jl and causing some issues.
When I import ReverseDiff.jl into SymbolicRegression.jl (either directly, or if it is imported in a dependency of a dependency) the unit tests on Windows break. These unit tests use Distributed.jl to dynamically allocate worker processes. No other operating systems are affected.
This bug can be reproduced with the following code, which dynamically allocates some worker processes, activates the current environment on each, and then imports ReverseDiff.jl on each.
This code works on Ubuntu and macOS, and works for every other package I've tested it on (all the dependencies of SymbolicRegression.jl are loaded this way). It is just the combination of Windows + ReverseDiff.jl that produces an error for some weird reason. You can see the error log here: https://github.com/MilesCranmer/SymbolicRegression.jl/runs/7943853737?check_suite_focus=true#step:6:232.
I really have no idea what could be causing this. Any clue? Does ReverseDiff.jl change any core functions in any way that would impact this?
Thanks!
Miles