Consider linking with mimalloc in release executables? #5561
Very interesting! Overall this makes me think that maybe the issues we've seen with multithreading overhead are due to malloc contention between threads, like these: emscripten-core/emscripten#15727 #2740 It might be good to investigate two things here:
Looks like Google publishes a heap profiler that might be useful for this: https://gperftools.github.io/gperftools/heapprofile.html
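For reference, a minimal sketch of attaching the gperftools heap profiler to an unmodified binary on Linux; the library path, binary name, and profile location are illustrative assumptions rather than anything from this thread:

```sh
# Preload tcmalloc and ask it to dump heap profiles; no relinking of wasm-opt needed.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
  HEAPPROFILE=/tmp/wasm-opt.hprof \
  ./bin/wasm-opt -O3 input.wasm -o output.wasm

# Inspect a dumped profile with pprof (installed as google-pprof on some distros).
pprof --text ./bin/wasm-opt /tmp/wasm-opt.hprof.0001.heap
```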
Linking with mimalloc to replace the libc builtin allocator only requires special link-time configuration, and doesn't require changing C/C++ source code at all. When targeting wasm, you don't need to do anything special, just live with the original libc allocator. https://github.com/rui314/mold/blob/main/CMakeLists.txt#L138 is a good example of properly linking against mimalloc. Though it's even possible to not modify the cmake config at all, just specify
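To illustrate the idea, a rough sketch of two common ways to get mimalloc into a Linux executable without touching the C/C++ sources; the paths and file names below are assumptions, not taken from the mold setup linked above:

```sh
# Option 1: link the static mimalloc archive ahead of libc at link time, so its
# malloc/free definitions override the builtin allocator.
c++ -O2 -o wasm-opt <your object files> /usr/local/lib/libmimalloc.a -lpthread

# Option 2: leave the build untouched and inject the shared library at run time.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./bin/wasm-opt --version
```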
If I am not mistaken, mimalloc supports WebAssembly, but only with WASI.
You don't need mimalloc at all when targeting wasm.
I think it could be useful in multithreaded wasm builds? Just like for native ones.
That's correct, although the mimalloc codebase currently really just supports single-threaded wasm32-wasi.
I see, makes sense. I'd be ok with just enabling mimalloc for non-wasm for now then, if we want to go that route.
If anyone wants to give the mimalloc flavour a try, I've created a statically-linked x86_64-linux binary release of version_112 at https://nightly.link/type-dance/binaryen/actions/artifacts/593257094.zip. The build script is available at https://github.com/type-dance/binaryen/blob/main/build-alpine.sh.
Note: This issue is relevant for #4165

@tlively I looked into tcmalloc to profile our mallocs. I found some possible improvements and will open PRs, but I'm not sure how big an impact they will have. One issue is that tcmalloc measures the size of allocations, not the number of malloc calls, and we might have very many small allocations, or quickly-freed ones, that don't show up in that type of profiling.
I just found
Interesting! It says this:
So I tried both with the normal system malloc (which it seems it may not be able to profile) and with a userspace malloc (tcmalloc). The results did not change much. I guess that is consistent with malloc contention not actually being an issue on some machines, perhaps because their mallocs have low contention (like the default Linux one on my machine, and tcmalloc). So to really dig into malloc performance we'd need to run mutrace on a system that sees the slowdown. @TerrorJack perhaps you can try that?

But the results on my machine are interesting when it comes to non-malloc mutexes. There is one top mutex by far:
That mutex 10 is
Which is the Type mutex used in (the one after it is the thread pool mutex,
(that is a profile on running
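For anyone who wants to collect the same kind of measurement, a minimal sketch of a mutrace run follows; the binary path and flags are placeholders, and the exact invocation used above was not shown:

```sh
# mutrace intercepts pthread mutexes via LD_PRELOAD and prints a contention summary
# (lock counts, contended counts, total wait time per mutex) when the process exits.
mutrace ./bin/wasm-opt -O3 input.wasm -o output.wasm
```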
Makes sense, this confirms a suspicion I had that the global type cache is extremely contended. We frequently do things like
As another datapoint, I also ran with plain dlmalloc, which doesn't have any complex per-thread pools AFAIK. But the results are the same as with my system allocator and tcmalloc. So somehow I just don't see malloc contention on my machine...
Just another data point to consider. I did some crude timing tests of wasm-opt using LD_PRELOAD and measured the mimalloc vs jemalloc vs glibc allocators. The pre-optimized wasm file is ~117MB.
Here is output running with `BINARYEN_PASS_DEBUG=1`
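As a sketch of how such an LD_PRELOAD comparison can be driven (the exact commands behind the numbers above were not posted; the library paths below are typical Ubuntu locations and the flags are placeholders):

```sh
WASM=input.wasm   # placeholder; the file measured above was ~117MB

# glibc baseline
time ./bin/wasm-opt -O3 "$WASM" -o out.glibc.wasm

# mimalloc, injected at run time without relinking
time env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 \
  ./bin/wasm-opt -O3 "$WASM" -o out.mimalloc.wasm

# jemalloc, same approach
time env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
  ./bin/wasm-opt -O3 "$WASM" -o out.jemalloc.wasm
```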
@arsnyder16 Thanks for conducting the experiment. Have you actually confirmed that mimalloc is being picked up? For example:

```
$ env MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./wasm-opt --version
wasm-opt version 112 (version_112)
```
Seems to be working fine for me
This saves the work of freeing and allocating for all the other maps. This is a code path that is used by several passes so it showed up in profiling for #5561
This makes the pass 2-3% faster in some measurements I did locally. Noticed when profiling for WebAssembly#5561 (comment) Helps WebAssembly#4165
I've seen a huge performance improvement when `wasm-opt` is linked with `mimalloc` and optimizes a big wasm module on a many-cores machine!

Result sans `mimalloc`:

Result with `mimalloc`:

The sans-mimalloc case is `version_112`, while the with-mimalloc case is compiled from the same version, but linked with mimalloc v2.0.9. The command is `wasm-opt -Oz hello.wasm -o hello.opt.wasm`, and `hello.wasm` is 26MB. Everything was run in an `ubuntu:22.10` container on a server with an AMD EPYC 7401P (48 logical cores) and 128GB of memory.