
Consider linking with mimalloc in release executables? #5561

Open · TerrorJack opened this issue Mar 10, 2023 · 18 comments

@TerrorJack (Contributor)

I've seen a huge performance improvement when wasm-opt is linked with mimalloc and optimizes a big wasm module on a many-core machine!

Result sans mimalloc:

$ time bench ./test.sh
benchmarking ./test.sh
time                 221.9 s    (218.3 s .. 224.6 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 219.3 s    (218.1 s .. 220.3 s)
std dev              1.411 s    (1.155 s .. 1.566 s)
variance introduced by outliers: 19% (moderately inflated)


real    58m35.860s
user    129m50.133s
sys     2395m41.639s

Result with mimalloc:

$ time bench ./test.sh 
benchmarking ./test.sh
time                 14.06 s    (13.38 s .. 14.86 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 13.94 s    (13.82 s .. 14.03 s)
std dev              123.8 ms   (76.09 ms .. 151.4 ms)
variance introduced by outliers: 19% (moderately inflated)


real    3m43.584s
user    45m38.783s
sys     0m40.349s
  • The sans-mimalloc case uses the official Linux x64 binaries of version_112, while the with-mimalloc case is compiled from the same version, but linked with mimalloc v2.0.9
  • The command is wasm-opt -Oz hello.wasm -o hello.opt.wasm; hello.wasm is 26MB
  • The benchmark is run with bench, which runs the same command multiple times and outputs the statistics above.
  • The test is conducted in an ubuntu:22.10 container on a server with an AMD EPYC 7401P (48 logical cores) and 128GB of memory.
@kripken (Member) commented Mar 10, 2023

Very interesting!

Overall this makes me think that maybe the issues we've seen with multithreading overhead are malloc contention between threads, like these: emscripten-core/emscripten#15727 #2740

It might be good to investigate two things here:

  • How easy mimalloc integration is (if it's a single file, and has a wasm port - which we need for binaryen.js - that would be ideal).
  • Whether we can reduce our malloc contention. We allocate Expression objects very efficiently in arenas, so this must just be the other random small allocations that we do all over the place... if so, then using more SmallSet/SmallVector might help. Perhaps there is a tool that can find which stack traces lead to most of these contending allocations.

@tlively (Member) commented Mar 10, 2023

Looks like Google publishes a heap profiler that might be useful for this: https://gperftools.github.io/gperftools/heapprofile.html
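For reference, a typical gperftools heap-profiling run could look roughly like the following (a sketch only; the library path, binary location, and input file are assumptions, and the exact dump filename depends on the run):

```shell
# Load tcmalloc via LD_PRELOAD and enable heap profiling via HEAPPROFILE
# (paths are illustrative; adjust for your system).
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
HEAPPROFILE=/tmp/wasm-opt.hprof \
./bin/wasm-opt -Oz hello.wasm -o hello.opt.wasm

# Then inspect one of the resulting dumps, e.g. as a text listing of
# the top allocation sites:
pprof --text ./bin/wasm-opt /tmp/wasm-opt.hprof.0001.heap
```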

@TerrorJack (Contributor, Author)

Linking with mimalloc to replace the libc builtin allocator only requires special link-time configuration, and doesn't require changing the C/C++ source code at all. When targeting wasm, you don't need to do anything special; just live with the original libc allocator.

https://github.com/rui314/mold/blob/main/CMakeLists.txt#L138 is a good example of properly linking against mimalloc. It's even possible to not modify the CMake config at all: just specify -DCMAKE_EXE_LINKER_FLAGS="-Wl,--push-state,--whole-archive,path/to/libmimalloc.a,--pop-state" for Linux or -DCMAKE_EXE_LINKER_FLAGS="-Wl,-force_load,path/to/libmimalloc.a" for macOS at configure time.
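Concretely, the configure-time approach on Linux could look like this (a sketch; the mimalloc install path /opt/mimalloc is an assumption, substitute your own):

```shell
# Force-link a static mimalloc into the executables without editing any
# CMake files. --whole-archive makes the linker keep every object in
# libmimalloc.a, so its malloc/free definitions override the libc ones.
cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_EXE_LINKER_FLAGS="-Wl,--push-state,--whole-archive,/opt/mimalloc/lib/libmimalloc.a,--pop-state"
cmake --build build -j
```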

@MaxGraey (Contributor)

When targeting wasm, you don't need to do anything special; just live with the original libc allocator.

If I am not mistaken, mimalloc supports WebAssembly, but only with WASI.

@TerrorJack (Contributor, Author)

If I am not mistaken, mimalloc supports WebAssembly, but only with WASI.

You don't need mimalloc at all when targeting wasm.

@kripken (Member) commented Mar 10, 2023

You don't need mimalloc at all when targeting wasm

I think it could be useful in multithreaded wasm builds? Just like for native ones.

@TerrorJack (Contributor, Author)

I think it could be useful in multithreaded wasm builds? Just like for native ones.

That's correct, although the mimalloc codebase currently only supports single-threaded wasm32-wasi.

@kripken (Member) commented Mar 10, 2023

I see, makes sense.

I'd be ok with just enabling mimalloc for non-wasm for now then, if we want to go that route.

@TerrorJack (Contributor, Author)

If anyone wants to give the mimalloc flavour a try, I've created a statically-linked x86_64-linux binary release of version_112 at https://nightly.link/type-dance/binaryen/actions/artifacts/593257094.zip. The build script is available at https://github.com/type-dance/binaryen/blob/main/build-alpine.sh.

@kripken (Member) commented Mar 15, 2023

Note: This issue is relevant for #4165

@tlively I looked into tcmalloc to profile our mallocs. I found some possible improvements and will open PRs, but I'm not sure how big an impact they will have. An issue is that tcmalloc measures the size of allocations, not the number of malloc calls, and we might have very many small allocations or quickly-freed ones that don't show up in that type of profiling.

@tlively (Member) commented Mar 15, 2023

I just found mutrace for profiling lock contention specifically. I'd be very interested to see the results here!

http://0pointer.de/blog/projects/mutrace.html
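Usage appears to be simple, something like the following (a sketch; it assumes mutrace is installed and that wasm-opt and the input file are at these hypothetical paths):

```shell
# mutrace interposes on pthread mutex calls via LD_PRELOAD and prints a
# table of the most contended mutexes when the process exits.
mutrace ./bin/wasm-opt -Oz hello.wasm -o hello.opt.wasm
```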

@kripken (Member) commented Mar 15, 2023

Interesting!

It says this:

Due to the way mutrace works we cannot profile mutexes that are used internally in glibc, such as those used for synchronizing stdio and suchlike.

So I tried both with the normal system malloc (which it seems it may not be able to profile) and with a userspace malloc (tcmalloc). The results did not change much. I guess that is consistent with malloc contention not actually being an issue on some machines, perhaps because their mallocs have low contention (like the default Linux one on my machine, and tcmalloc).

So to really dig into malloc performance we'd need to run mutrace on a system that sees the slowdown. @TerrorJack perhaps you can try that?

But the results on my machine are interesting regarding non-malloc mutexes. One mutex stands out by far:

mutrace: Showing 10 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
      10 41885864 16900980 10783056     7334.222        0.000       36.992 M-.--.
      16     2470     2286      713      114.601        0.046       24.002 M-.--.
       4      734      366        0    68689.275       93.582    19386.295 M-.--.

That mutex 10 is

Mutex #10 (0x0x7f54728ffca0) first referenced by:
	libmutrace.so(pthread_mutex_lock+0x46) [0x7f54729c1576]
	libbinaryen.so(+0x9b61d7) [0x7f54723b61d7]
	libbinaryen.so(_ZN4wasm4TypeC1ENS_8HeapTypeENS_11NullabilityE+0x41) [0x7f54723b6ae1]
	libbinaryen.so(_ZN4wasm17WasmBinaryBuilder12getBasicTypeEiRNS_4TypeE+0x113) [0x7f547233aed3]
	libbinaryen.so(+0x93d529) [0x7f547233d529]
	libbinaryen.so(_ZN4wasm17WasmBinaryBuilder9readTypesEv+0x4dd) [0x7f5472340f9d]
	libbinaryen.so(_ZN4wasm17WasmBinaryBuilder4readEv+0x748) [0x7f547235fe48]
	libbinaryen.so(_ZN4wasm12ModuleReader14readBinaryDataERSt6vectorIcSaIcEERNS_6ModuleENSt7__cxx1112basic_stringIcSt11char_traitsIcES2_EE+0x5c) [0x7f54723732cc]
	libbinaryen.so(_ZN4wasm12ModuleReader10readBinaryENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_6ModuleES6_+0x73) [0x7f5472373503]
	libbinaryen.so(_ZN4wasm12ModuleReader4readENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_6ModuleES6_+0x17a) [0x7f5472373e1a]
	wasm-opt(+0x29a06) [0x5644f215fa06]
	libc.so.6(+0x2718a) [0x7f547144618a]

Which is the Type mutex used in wasm::Type::Type(wasm::HeapType, wasm::Nullability). Perhaps that can be improved @tlively ?

(the one after it is the thread pool mutex, wasm::ThreadPool::initialize(unsigned long), which I doubt we can improve, but also it's orders of magnitude less frequent)

@kripken (Member) commented Mar 15, 2023

(that is a profile on running wasm-opt -g -all --closed-world -tnh -O3 --type-ssa --gufa -O3 --type-merging on a large Dart testcase of Wasm GC, so it does stress type optimizations I guess)

@tlively (Member) commented Mar 15, 2023

Makes sense. This confirms a suspicion I had that the global type cache is extremely contended. We frequently do things like type == Type(heapType, Nullable) or Type::isSubType(type, Type(heapType, Nullable)), and the creation of those temporary Type objects requires taking the lock. I'll take an action item to try to purge these patterns from the code base.

@kripken (Member) commented Mar 15, 2023

As another data point, I also ran with plain dlmalloc, which doesn't have any complex per-thread pools AFAIK. But the results are the same as with my system allocator and tcmalloc. So somehow I just don't see malloc contention on my machine...

@arsnyder16 (Contributor) commented Mar 16, 2023

Just another data point to consider. I did some crude timing tests of wasm-opt using LD_PRELOAD, comparing the mimalloc, jemalloc, and glibc allocators.

The input (pre-optimization) wasm file is ~117MB.

mimalloc

real    1m28.478s
user    13m48.665s
sys     0m1.572s
jemalloc

real    1m0.543s
user    7m59.951s
sys     0m0.931s
glibc

real    1m25.956s
user    9m40.555s
sys     0m1.791s
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 
echo "mimalloc"
time /root/emsdk/upstream/bin/wasm-opt --strip-dwarf --post-emscripten -Os  --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --strip-debug --strip-producers  \
    perf.wasm -o perf-mimalloc.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 
echo "jemalloc"
time /root/emsdk/upstream/bin/wasm-opt --strip-dwarf --post-emscripten -Os  --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --strip-debug --strip-producers  \
    perf.wasm -o perf-jemalloc.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext
unset LD_PRELOAD
echo "glibc"
time /root/emsdk/upstream/bin/wasm-opt --strip-dwarf --post-emscripten -Os  --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --strip-debug --strip-producers  \
    perf.wasm -o perf.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext

Here is output running with BINARYEN_PASS_DEBUG=1
output.txt

@TerrorJack (Contributor, Author)

@arsnyder16 Thanks for conducting the experiment. Have you actually confirmed mimalloc is used at runtime by setting MIMALLOC_VERBOSE=1? Dynamic override via LD_PRELOAD doesn't seem to work at all for some reason:

$ env MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./wasm-opt --version
wasm-opt version 112 (version_112)
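One thing worth checking here (a sketch; the binary path is whichever build is being tested): LD_PRELOAD only takes effect for dynamically linked executables, so a statically linked wasm-opt build would silently ignore it.

```shell
# If these print "statically linked" / "not a dynamic executable",
# LD_PRELOAD-based allocator overrides cannot work on this binary.
file ./wasm-opt
ldd ./wasm-opt
```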

@arsnyder16 (Contributor)

Seems to be working fine for me

# MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 /root/emsdk/upstream/bin/wasm-opt --version
mimalloc: option 'show_errors': 1
mimalloc: option 'show_stats': 0
mimalloc: option 'eager_commit': 1
mimalloc: option 'deprecated_eager_region_commit': 0
mimalloc: option 'deprecated_reset_decommits': 0
mimalloc: option 'large_os_pages': 0
mimalloc: option 'reserve_huge_os_pages': 0
mimalloc: option 'reserve_huge_os_pages_at': -1
mimalloc: option 'reserve_os_memory': 0
mimalloc: option 'segment_cache': 0
mimalloc: option 'page_reset': 0
mimalloc: option 'abandoned_page_decommit': 0
mimalloc: option 'deprecated_segment_reset': 0
mimalloc: option 'eager_commit_delay': 1
mimalloc: option 'decommit_delay': 25
mimalloc: option 'use_numa_nodes': 0
mimalloc: option 'limit_os_alloc': 0
mimalloc: option 'os_tag': 100
mimalloc: option 'max_errors': 16
mimalloc: option 'max_warnings': 16
mimalloc: option 'allow_decommit': 1
mimalloc: option 'segment_decommit_delay': 500
mimalloc: option 'decommit_extend_delay': 2
mimalloc: process init: 0x7f1a07b41f40
mimalloc: debug level : 2
mimalloc: secure level: 0
mimalloc: using 1 numa regions
wasm-opt version 112 (version_112-45-g9dcdd47a2)
heap stats:    peak      total      freed    current       unit      count
normal   1:    552 B      552 B      552 B        0          8 B       69      ok
normal   4:    4.7 KiB    4.8 KiB    4.8 KiB     32 B       32 B      155      not all freed!
normal   6:   35.6 KiB   48.2 KiB   37.5 KiB   10.7 KiB     48 B      1.0 K    not all freed!
normal   8:    9.4 KiB   22.6 KiB   16.8 KiB    5.7 KiB     64 B      361      not all freed!
normal   9:    7.2 KiB   14.2 KiB   10.4 KiB    3.8 KiB     80 B      182      not all freed!
normal  10:    2.7 KiB    4.7 KiB    3.2 KiB    1.5 KiB     96 B       50      not all freed!
normal  11:    1.8 KiB    3.9 KiB    2.7 KiB    1.2 KiB    112 B       36      not all freed!
normal  12:   18.6 KiB   19.5 KiB   18.5 KiB    1.0 KiB    128 B      156      not all freed!
normal  13:    1.5 KiB    3.1 KiB    2.1 KiB    960 B      160 B       20      not all freed!
normal  14:    768 B      1.3 KiB    960 B      384 B      192 B        7      not all freed!
normal  15:    672 B      1.7 KiB    1.3 KiB    448 B      224 B        8      not all freed!
normal  16:    512 B      512 B      256 B      256 B      256 B        2      not all freed!
normal  17:    320 B      320 B      320 B        0        320 B        1      ok
normal  18:    768 B      768 B      768 B        0        384 B        2      ok
normal  19:    448 B      896 B      896 B        0        448 B        2      ok
normal  21:    640 B      640 B      640 B        0        640 B        1      ok
normal  23:    1.7 KiB    3.5 KiB    3.5 KiB      0        896 B        4      ok
normal  25:    1.2 KiB    2.5 KiB    1.2 KiB    1.2 KiB    1.2 KiB      2      not all freed!
normal  27:    3.5 KiB    7.0 KiB    7.0 KiB      0        1.7 KiB      4      ok
normal  29:    2.5 KiB    2.5 KiB    2.5 KiB      0        2.5 KiB      1      ok
normal  31:   10.5 KiB   14.0 KiB   14.0 KiB      0        3.5 KiB      4      ok
normal  33:    5.0 KiB    5.0 KiB    5.0 KiB      0        5.0 KiB      1      ok
normal  35:   14.0 KiB   14.0 KiB   14.0 KiB      0        7.0 KiB      2      ok
normal  37:   10.0 KiB   10.0 KiB   10.0 KiB      0       10.0 KiB      1      ok
normal  41:   20.0 KiB   20.0 KiB   20.0 KiB      0       20.0 KiB      1      ok
normal  45:   40.1 KiB   40.1 KiB      0       40.1 KiB   40.1 KiB      1      not all freed!

heap stats:    peak      total      freed    current       unit      count
    normal:  142.7 Ki   231.1 Ki   166.9 Ki    64.2 Ki     112 B      2.1 K    not all freed!
     large:      0          0          0          0                            ok
      huge:      0          0          0          0                            ok
     total:  142.7 KiB  231.1 KiB  166.9 KiB   64.2 KiB                        not all freed!
malloc req:  128.6 KiB  206.6 KiB  147.9 KiB   58.6 KiB                        not all freed!

  reserved:   64.0 MiB   64.0 MiB      0       64.0 MiB                        not all freed!
 committed:   64.0 MiB   64.0 MiB      0       64.0 MiB                        not all freed!
     reset:      0          0          0          0                            ok
   touched:  357.5 KiB  379.8 KiB   99.5 KiB  280.3 KiB                        not all freed!
  segments:      1          1          0          1                            not all freed!
-abandoned:      0          0          0          0                            ok
   -cached:      0          0          0          0                            ok
     pages:     23         29         16         13                            not all freed!
-abandoned:      0          0          0          0                            ok
 -extended:     48
 -noretire:     22
     mmaps:      1
   commits:      0
   threads:      0          0          0          0                            ok
  searches:     0.3 avg
numa nodes:       1
   elapsed:       0.002 s
   process: user: 0.002 s, system: 0.000 s, faults: 0, rss: 9.0 MiB, commit: 64.0 MiB
mimalloc: process done: 0x7f1a07b41f40

kripken added a commit that referenced this issue Mar 21, 2023
This makes the pass 2-3% faster in some measurements I did locally.

Noticed when profiling for #5561 (comment)

Helps #4165
kripken added a commit that referenced this issue Apr 5, 2023
This saves the work of freeing and allocating for all the other maps. This is a
code path that is used by several passes so it showed up in profiling for
#5561
radekdoulik pushed a commit to dotnet/binaryen that referenced this issue Jul 12, 2024
This makes the pass 2-3% faster in some measurements I did locally.

Noticed when profiling for WebAssembly#5561 (comment)

Helps WebAssembly#4165
radekdoulik pushed a commit to dotnet/binaryen that referenced this issue Jul 12, 2024
This saves the work of freeing and allocating for all the other maps. This is a
code path that is used by several passes so it showed up in profiling for
WebAssembly#5561