
Consider linking with mimalloc in release executables? #5561

Open · TerrorJack opened this issue Mar 10, 2023 · 18 comments

@TerrorJack (Contributor)

I've seen a huge performance improvement when wasm-opt is linked with mimalloc and optimizes a big wasm module on a many-core machine!

Result sans mimalloc:

$ time bench ./test.sh
benchmarking ./test.sh
time                 221.9 s    (218.3 s .. 224.6 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 219.3 s    (218.1 s .. 220.3 s)
std dev              1.411 s    (1.155 s .. 1.566 s)
variance introduced by outliers: 19% (moderately inflated)


real    58m35.860s
user    129m50.133s
sys     2395m41.639s

Result with mimalloc:

$ time bench ./test.sh 
benchmarking ./test.sh
time                 14.06 s    (13.38 s .. 14.86 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 13.94 s    (13.82 s .. 14.03 s)
std dev              123.8 ms   (76.09 ms .. 151.4 ms)
variance introduced by outliers: 19% (moderately inflated)


real    3m43.584s
user    45m38.783s
sys     0m40.349s
  • The sans-mimalloc case uses the official Linux x64 binaries of version_112, while the with-mimalloc case is compiled from the same version, but linked with mimalloc v2.0.9
  • The command is wasm-opt -Oz hello.wasm -o hello.opt.wasm; hello.wasm is 26MB
  • The benchmark is run with bench, which runs the same command multiple times and outputs the statistics above.
  • The test is conducted in an ubuntu:22.10 container on a server with an AMD EPYC 7401P (48 logical cores) and 128GB of memory.
@kripken (Member) commented Mar 10, 2023

Very interesting!

Overall this makes me think that maybe the issues we've seen with multithreading overhead are malloc contention between threads, like these: emscripten-core/emscripten#15727 #2740

It might be good to investigate two things here:

  • How easy mimalloc integration is (if it's a single file, and has a wasm port - which we need for binaryen.js - that would be ideal).
  • Whether we can reduce our malloc contention. We allocate Expression objects very efficiently in arenas, so this must just be the other random small allocations that we do all over the place... if so, then using more SmallSet/SmallVector might help. Perhaps there is a tool that can find which stack traces lead to most of these contending allocations.

@tlively (Member) commented Mar 10, 2023

Looks like Google publishes a heap profiler that might be useful for this: https://gperftools.github.io/gperftools/heapprofile.html
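For reference, a typical gperftools heap-profiling run could look roughly like the following (a sketch only; the library path, binary location, and input file are assumptions, and the exact dump filename depends on the run):

```shell
# Load tcmalloc via LD_PRELOAD and enable heap profiling via HEAPPROFILE
# (paths are illustrative; adjust for your system).
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
HEAPPROFILE=/tmp/wasm-opt.hprof \
./bin/wasm-opt -Oz hello.wasm -o hello.opt.wasm

# Then inspect one of the resulting dumps, e.g. as a text listing of
# the top allocation sites:
pprof --text ./bin/wasm-opt /tmp/wasm-opt.hprof.0001.heap
```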

@TerrorJack (Contributor, Author)

Linking with mimalloc to replace the libc builtin allocator only requires special link-time configuration, and doesn't require changing the C/C++ source code at all. When targeting wasm, you don't need to do anything special; just live with the original libc allocator.

https://github.com/rui314/mold/blob/main/CMakeLists.txt#L138 is a good example of properly linking against mimalloc. It's even possible to not modify the CMake config at all: just specify -DCMAKE_EXE_LINKER_FLAGS="-Wl,--push-state,--whole-archive,path/to/libmimalloc.a,--pop-state" for Linux or -DCMAKE_EXE_LINKER_FLAGS="-Wl,-force_load,path/to/libmimalloc.a" for macOS at configure time.
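Concretely, the configure-time approach on Linux could look like this (a sketch; the mimalloc install path /opt/mimalloc is an assumption, substitute your own):

```shell
# Force-link a static mimalloc into the executables without editing any
# CMake files. --whole-archive makes the linker keep every object in
# libmimalloc.a, so its malloc/free definitions override the libc ones.
cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_EXE_LINKER_FLAGS="-Wl,--push-state,--whole-archive,/opt/mimalloc/lib/libmimalloc.a,--pop-state"
cmake --build build -j
```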

@MaxGraey (Contributor)

When targeting wasm, you don't need to do anything special; just live with the original libc allocator.

If I am not mistaken, mimalloc supports WebAssembly, but only with WASI.

@TerrorJack (Contributor, Author)

If I am not mistaken, mimalloc supports WebAssembly, but only with WASI.

You don't need mimalloc at all when targeting wasm.

@kripken (Member) commented Mar 10, 2023

You don't need mimalloc at all when targeting wasm

I think it could be useful in multithreaded wasm builds? Just like for native ones.

@TerrorJack (Contributor, Author)

I think it could be useful in multithreaded wasm builds? Just like for native ones.

That's correct, although the mimalloc codebase currently only supports single-threaded wasm32-wasi.

@kripken (Member) commented Mar 10, 2023

I see, makes sense.

I'd be ok with just enabling mimalloc for non-wasm for now then, if we want to go that route.

@TerrorJack (Contributor, Author)

If anyone wants to give the mimalloc flavour a try, I've created a statically-linked x86_64-linux binary release of version_112 at https://nightly.link/type-dance/binaryen/actions/artifacts/593257094.zip. The build script is available at https://github.com/type-dance/binaryen/blob/main/build-alpine.sh.

@kripken (Member) commented Mar 15, 2023

Note: This issue is relevant for #4165

@tlively I looked into tcmalloc to profile our mallocs. I found some possible improvements and will open PRs, but I'm not sure how big an impact they will have. An issue is that tcmalloc measures the size of allocations, not the number of malloc calls, and we might have very many small allocations or quickly-freed ones that don't show up in that type of profiling.

@tlively (Member) commented Mar 15, 2023

I just found mutrace for profiling lock contention specifically. I'd be very interested to see the results here!

http://0pointer.de/blog/projects/mutrace.html
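Usage appears to be simple, something like the following (a sketch; it assumes mutrace is installed and that wasm-opt and the input file are at these hypothetical paths):

```shell
# mutrace interposes on pthread mutex calls via LD_PRELOAD and prints a
# table of the most contended mutexes when the process exits.
mutrace ./bin/wasm-opt -Oz hello.wasm -o hello.opt.wasm
```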

@kripken (Member) commented Mar 15, 2023

Interesting!

It says this:

Due to the way mutrace works we cannot profile mutexes that are used internally in glibc, such as those used for synchronizing stdio and suchlike.

So I tried both with the normal system malloc (which it seems it may not be able to profile) and with a userspace malloc (tcmalloc). The results did not change much. I guess that is consistent with malloc contention not actually being an issue on some machines, perhaps because their mallocs have low contention (like the default Linux one on my machine, and tcmalloc).

So to really dig into malloc performance we'd need to run mutrace on a system that sees the slowdown. @TerrorJack perhaps you can try that?

But the results on my machine are interesting regarding non-malloc mutexes. One mutex stands out by far:

mutrace: Showing 10 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
      10 41885864 16900980 10783056     7334.222        0.000       36.992 M-.--.
      16     2470     2286      713      114.601        0.046       24.002 M-.--.
       4      734      366        0    68689.275       93.582    19386.295 M-.--.

That mutex 10 is

Mutex #10 (0x0x7f54728ffca0) first referenced by:
	libmutrace.so(pthread_mutex_lock+0x46) [0x7f54729c1576]
	libbinaryen.so(+0x9b61d7) [0x7f54723b61d7]
	libbinaryen.so(_ZN4wasm4TypeC1ENS_8HeapTypeENS_11NullabilityE+0x41) [0x7f54723b6ae1]
	libbinaryen.so(_ZN4wasm17WasmBinaryBuilder12getBasicTypeEiRNS_4TypeE+0x113) [0x7f547233aed3]
	libbinaryen.so(+0x93d529) [0x7f547233d529]
	libbinaryen.so(_ZN4wasm17WasmBinaryBuilder9readTypesEv+0x4dd) [0x7f5472340f9d]
	libbinaryen.so(_ZN4wasm17WasmBinaryBuilder4readEv+0x748) [0x7f547235fe48]
	libbinaryen.so(_ZN4wasm12ModuleReader14readBinaryDataERSt6vectorIcSaIcEERNS_6ModuleENSt7__cxx1112basic_stringIcSt11char_traitsIcES2_EE+0x5c) [0x7f54723732cc]
	libbinaryen.so(_ZN4wasm12ModuleReader10readBinaryENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_6ModuleES6_+0x73) [0x7f5472373503]
	libbinaryen.so(_ZN4wasm12ModuleReader4readENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_6ModuleES6_+0x17a) [0x7f5472373e1a]
	wasm-opt(+0x29a06) [0x5644f215fa06]
	libc.so.6(+0x2718a) [0x7f547144618a]

Which is the Type mutex used in wasm::Type::Type(wasm::HeapType, wasm::Nullability). Perhaps that can be improved @tlively ?

(the one after it is the thread pool mutex, wasm::ThreadPool::initialize(unsigned long), which I doubt we can improve, but also it's orders of magnitude less frequent)

@kripken (Member) commented Mar 15, 2023

(that is a profile on running wasm-opt -g -all --closed-world -tnh -O3 --type-ssa --gufa -O3 --type-merging on a large Dart testcase of Wasm GC, so it does stress type optimizations I guess)

@tlively (Member) commented Mar 15, 2023

Makes sense. This confirms a suspicion I had that the global type cache is extremely contended. We frequently do things like type == Type(heapType, Nullable) or Type::isSubType(type, Type(heapType, Nullable)), and the creation of those temporary Type objects requires taking the lock. I'll take an action item to try to purge these patterns from the code base.

@kripken (Member) commented Mar 15, 2023

As another data point, I also ran with plain dlmalloc, which doesn't have any complex per-thread pools AFAIK. But the results are the same as with my system allocator and tcmalloc. So somehow I just don't see malloc contention on my machine...

@arsnyder16 (Contributor) commented Mar 16, 2023

Just another data point to consider. I did some crude timing tests of wasm-opt using LD_PRELOAD, comparing the mimalloc, jemalloc, and glibc allocators.

The input (pre-optimization) wasm file is ~117MB.

mimalloc

real    1m28.478s
user    13m48.665s
sys     0m1.572s
jemalloc

real    1m0.543s
user    7m59.951s
sys     0m0.931s
glibc

real    1m25.956s
user    9m40.555s
sys     0m1.791s
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 
echo "mimalloc"
time /root/emsdk/upstream/bin/wasm-opt --strip-dwarf --post-emscripten -Os  --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --strip-debug --strip-producers  \
    perf.wasm -o perf-mimalloc.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 
echo "jemalloc"
time /root/emsdk/upstream/bin/wasm-opt --strip-dwarf --post-emscripten -Os  --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --strip-debug --strip-producers  \
    perf.wasm -o perf-jemalloc.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext
unset LD_PRELOAD
echo "glibc"
time /root/emsdk/upstream/bin/wasm-opt --strip-dwarf --post-emscripten -Os  --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --strip-debug --strip-producers  \
    perf.wasm -o perf.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext

Here is output running with BINARYEN_PASS_DEBUG=1
output.txt

@TerrorJack (Contributor, Author)

@arsnyder16 Thanks for conducting the experiment. Have you actually confirmed mimalloc is used at runtime by setting MIMALLOC_VERBOSE=1? Dynamic override via LD_PRELOAD doesn't seem to work at all for some reason:

$ env MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./wasm-opt --version
wasm-opt version 112 (version_112)
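One thing worth checking here (a sketch; the binary path is whichever build is being tested): LD_PRELOAD only takes effect for dynamically linked executables, so a statically linked wasm-opt build would silently ignore it.

```shell
# If these print "statically linked" / "not a dynamic executable",
# LD_PRELOAD-based allocator overrides cannot work on this binary.
file ./wasm-opt
ldd ./wasm-opt
```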

@arsnyder16 (Contributor)

Seems to be working fine for me

# MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 /root/emsdk/upstream/bin/wasm-opt --version
mimalloc: option 'show_errors': 1
mimalloc: option 'show_stats': 0
mimalloc: option 'eager_commit': 1
mimalloc: option 'deprecated_eager_region_commit': 0
mimalloc: option 'deprecated_reset_decommits': 0
mimalloc: option 'large_os_pages': 0
mimalloc: option 'reserve_huge_os_pages': 0
mimalloc: option 'reserve_huge_os_pages_at': -1
mimalloc: option 'reserve_os_memory': 0
mimalloc: option 'segment_cache': 0
mimalloc: option 'page_reset': 0
mimalloc: option 'abandoned_page_decommit': 0
mimalloc: option 'deprecated_segment_reset': 0
mimalloc: option 'eager_commit_delay': 1
mimalloc: option 'decommit_delay': 25
mimalloc: option 'use_numa_nodes': 0
mimalloc: option 'limit_os_alloc': 0
mimalloc: option 'os_tag': 100
mimalloc: option 'max_errors': 16
mimalloc: option 'max_warnings': 16
mimalloc: option 'allow_decommit': 1
mimalloc: option 'segment_decommit_delay': 500
mimalloc: option 'decommit_extend_delay': 2
mimalloc: process init: 0x7f1a07b41f40
mimalloc: debug level : 2
mimalloc: secure level: 0
mimalloc: using 1 numa regions
wasm-opt version 112 (version_112-45-g9dcdd47a2)
heap stats:    peak      total      freed    current       unit      count
normal   1:    552 B      552 B      552 B        0          8 B       69      ok
normal   4:    4.7 KiB    4.8 KiB    4.8 KiB     32 B       32 B      155      not all freed!
normal   6:   35.6 KiB   48.2 KiB   37.5 KiB   10.7 KiB     48 B      1.0 K    not all freed!
normal   8:    9.4 KiB   22.6 KiB   16.8 KiB    5.7 KiB     64 B      361      not all freed!
normal   9:    7.2 KiB   14.2 KiB   10.4 KiB    3.8 KiB     80 B      182      not all freed!
normal  10:    2.7 KiB    4.7 KiB    3.2 KiB    1.5 KiB     96 B       50      not all freed!
normal  11:    1.8 KiB    3.9 KiB    2.7 KiB    1.2 KiB    112 B       36      not all freed!
normal  12:   18.6 KiB   19.5 KiB   18.5 KiB    1.0 KiB    128 B      156      not all freed!
normal  13:    1.5 KiB    3.1 KiB    2.1 KiB    960 B      160 B       20      not all freed!
normal  14:    768 B      1.3 KiB    960 B      384 B      192 B        7      not all freed!
normal  15:    672 B      1.7 KiB    1.3 KiB    448 B      224 B        8      not all freed!
normal  16:    512 B      512 B      256 B      256 B      256 B        2      not all freed!
normal  17:    320 B      320 B      320 B        0        320 B        1      ok
normal  18:    768 B      768 B      768 B        0        384 B        2      ok
normal  19:    448 B      896 B      896 B        0        448 B        2      ok
normal  21:    640 B      640 B      640 B        0        640 B        1      ok
normal  23:    1.7 KiB    3.5 KiB    3.5 KiB      0        896 B        4      ok
normal  25:    1.2 KiB    2.5 KiB    1.2 KiB    1.2 KiB    1.2 KiB      2      not all freed!
normal  27:    3.5 KiB    7.0 KiB    7.0 KiB      0        1.7 KiB      4      ok
normal  29:    2.5 KiB    2.5 KiB    2.5 KiB      0        2.5 KiB      1      ok
normal  31:   10.5 KiB   14.0 KiB   14.0 KiB      0        3.5 KiB      4      ok
normal  33:    5.0 KiB    5.0 KiB    5.0 KiB      0        5.0 KiB      1      ok
normal  35:   14.0 KiB   14.0 KiB   14.0 KiB      0        7.0 KiB      2      ok
normal  37:   10.0 KiB   10.0 KiB   10.0 KiB      0       10.0 KiB      1      ok
normal  41:   20.0 KiB   20.0 KiB   20.0 KiB      0       20.0 KiB      1      ok
normal  45:   40.1 KiB   40.1 KiB      0       40.1 KiB   40.1 KiB      1      not all freed!

heap stats:    peak      total      freed    current       unit      count
    normal:  142.7 Ki   231.1 Ki   166.9 Ki    64.2 Ki     112 B      2.1 K    not all freed!
     large:      0          0          0          0                            ok
      huge:      0          0          0          0                            ok
     total:  142.7 KiB  231.1 KiB  166.9 KiB   64.2 KiB                        not all freed!
malloc req:  128.6 KiB  206.6 KiB  147.9 KiB   58.6 KiB                        not all freed!

  reserved:   64.0 MiB   64.0 MiB      0       64.0 MiB                        not all freed!
 committed:   64.0 MiB   64.0 MiB      0       64.0 MiB                        not all freed!
     reset:      0          0          0          0                            ok
   touched:  357.5 KiB  379.8 KiB   99.5 KiB  280.3 KiB                        not all freed!
  segments:      1          1          0          1                            not all freed!
-abandoned:      0          0          0          0                            ok
   -cached:      0          0          0          0                            ok
     pages:     23         29         16         13                            not all freed!
-abandoned:      0          0          0          0                            ok
 -extended:     48
 -noretire:     22
     mmaps:      1
   commits:      0
   threads:      0          0          0          0                            ok
  searches:     0.3 avg
numa nodes:       1
   elapsed:       0.002 s
   process: user: 0.002 s, system: 0.000 s, faults: 0, rss: 9.0 MiB, commit: 64.0 MiB
mimalloc: process done: 0x7f1a07b41f40

kripken added a commit that referenced this issue Mar 21, 2023
This makes the pass 2-3% faster in some measurements I did locally.

Noticed when profiling for #5561 (comment)

Helps #4165
kripken added a commit that referenced this issue Apr 5, 2023
This saves the work of freeing and allocating for all the other maps. This is a
code path that is used by several passes so it showed up in profiling for
#5561
radekdoulik pushed a commit to dotnet/binaryen that referenced this issue Jul 12, 2024
This makes the pass 2-3% faster in some measurements I did locally.

Noticed when profiling for WebAssembly#5561 (comment)

Helps WebAssembly#4165
radekdoulik pushed a commit to dotnet/binaryen that referenced this issue Jul 12, 2024
This saves the work of freeing and allocating for all the other maps. This is a
code path that is used by several passes so it showed up in profiling for
WebAssembly#5561