Make compilation faster 🚀 #4165
I think precompute-propagate could also be sped up if the interpreter avoided exceptions (#2917 (comment)) and used a monadic Expect / Result instead.
Another option is to rewrite the simple constant propagation algorithm as Sparse Conditional Constant Propagation (the SCCP pass in LLVM), which should be much faster.
@MaxGraey Good idea about precompute-propagate. I checked that now.
isSSA is not called anywhere. See #4165
Precompute has a mode in which it propagates results from local.sets to local.gets. That constructs a LocalGraph which is a non-trivial amount of work. We used to run multiple iterations of this, but investigation shows that such opportunities are extremely rare, as doing just a single propagation iteration has no effect on the entire emscripten benchmark suite, nor on j2cl output. Furthermore, we run this pass twice in the normal pipeline (once early, once late) so even if there are such opportunities they may be optimized already. And, --converge is a way to get additional iterations of all passes if a user wants that, so it makes sense not to do costly work for more iterations automatically. In effect, 99.99% of the time before this pass we would create the LocalGraph twice: once the first time, then a second time only to see that we can't actually optimize anything further. This PR makes us only create it once, which makes precompute-propagate 10% faster on j2cl and even faster on other things like poppler (33%) and LLVM (29%). See the change in the test suite for an example of a case that does require more than one iteration to be optimized. Note that even there, we only manage to get benefit from a second iteration by doing something that overlaps with another pass (optimizing out an if with condition 0), which shows even more how unnecessary the extra work was. See #4165
By mistake the recent partial inlining work introduced quadratic time into the compiler: erasing a function from the list of functions takes linear time, which is why we have removeFunctions that does a group at a time. This isn't noticeable on small programs, but on j2cl output this makes the inlining-optimizing step 2x faster. See #4165
Hi there. I have some statistics for a 500MB wasm under Windows 10 (sadly 2 times slower than Linux).
Most passes run as single-threaded...
Thanks for the data @andrewevstyukhin , looks like by far the largest factor is asyncify:
I think that is because it flattens the IR, which is very costly by itself, does the transform, and then optimizes afterward to unflatten it. Asyncify will eventually be replaced by stack switching, so this will get resolved that way. But it could be improved a lot meanwhile, I think - currently Flatten generates many more locals than it needs to.
Yes, 1.39.7: wasm-ld best time is 481.723 seconds (asyncify 185.695 seconds, 39%). C++ assertions are really hard for recent versions in "release" builds:
I can try to make an empty function with EMSCRIPTEN_KEEPALIVE, include the function in ASYNCIFY_ADD, and then add calls...
@andrewevstyukhin I know there has been some discussion on the LLVM side about adding a way to annotate things in the source so that they show up in a wasm custom section. We could use that for asyncify, and also things like noinline. But I don't think anyone is working on it atm. I don't follow the LLVM side super-closely, though.
I'm in doubt about annotations nowadays... Last week I did good enough settings just using
I wish to use C++20 coroutines instead of ASYNCIFY in a future codebase. Asyncify consumes 40% of wasm-opt time.
For the user interface stuff, maybe open a new issue on this in emscripten?
That is definitely recommended - Asyncify causes a large amount of overhead, and we hope to not need it once people can use either C++20 coroutines, or a future wasm stack switching proposal.
Note: the output above, stuff like
is emitted by |
FWIW, I did an evening of profiling for our use case (i.e. linking the main .wasm of the Wonderland Engine runtime) and found that using a fake_set (a std::vector with the member functions of a set) instead of std::set yields a notable improvement. Disclaimer: this is very ugly, hacky code, just to prove a point.
... will yield the following, an almost 25-30% improvement for our use case on Windows, built with MSVC 2019 in the RelWithDebInfo config: $ ./run-wasm-opt.sh
wasm-opt:
real 0m30.060s
user 0m0.000s
sys 0m0.015s
wasm-opt-fake_set:
real 0m21.832s
user 0m0.015s
sys 0m0.015s
Squareys@[...] /c/Repos/binaryen (main)
$ diff output-wasm-opt.wasmstripped output-wasm-opt-fake_set.wasmstripped
Squareys@[...] /c/Repos/binaryen (main)
$
Explanation: The sets are usually empty (or seem to have few elements). This is, of course, very scrappy "benchmarking" (if one can even call it that), and there was no testing apart from the output diff. So take it with many grains of salt, but I think the main takeaway could still be that red-black-tree based containers (e.g. std::map/std::set) allocate in the default constructor, and for few elements it might be worth using a simpler container. Similar optimizations might apply in other locations of the code. The execution times were checked as follows:
echo "wasm-opt:"
time ./build-ninja/bin/wasm-opt --strip-dwarf --post-emscripten -O3 --low-memory-unused --zero-filled-memory --strip-debug --strip-producers WonderlandRuntime-loader-physx.wasm -o WonderlandRuntime-loader-physx.wasmstripped --mvp-features --enable-mutable-globals
mv WonderlandRuntime-loader-physx.wasmstripped output-wasm-opt.wasmstripped
echo "wasm-opt-fake_set:"
time ./wasm-opt-fake_set --strip-dwarf --post-emscripten -O3 --low-memory-unused --zero-filled-memory --strip-debug --strip-producers WonderlandRuntime-loader-physx.wasm -o WonderlandRuntime-loader-physx.wasmstripped --mvp-features --enable-mutable-globals
mv WonderlandRuntime-loader-physx.wasmstripped output-wasm-opt-fake_set.wasmstripped
Could you try small_set in ...
@MaxGraey That the heap allocation is the culprit was found through a sampling profiler (i.e. "Superluminal"); it's not the count() or insert() functions that are taking the time here. In most cases the set is empty.
Hmm, in this case I guess it makes sense to use it.
@MaxGraey Definitely, that would probably also improve other code that already uses small_set 👍 There also was a bunch of heap allocations around ... Is the change for small_set, and the use of small_set in EffectsAnalyzer, something that would be accepted as a PR? (@kripken?)
Also instead of |
@Squareys The MSVC STL does an allocation in the default constructor of std::set. It might be worth measuring your code on another STL, just as a comparison point. But even if it just speeds up MSVC, that might be enough of a reason to do this - 25%+ is a lot! As for how to do this, I am a little worried about the fake-set approach since it has linear behavior at large sizes. Those large sizes are not happening on your testcase, I guess, but in general they are possible. Some possible options:
@kripken Yes, the stack trace is as follows, with links to the Microsoft STL repository:
I honestly have trouble reading the style of code there, though, so please point out in case I'm misinterpreting this here. I'll attempt to contribute the SmallSet change; it should be fairly straightforward. An alternative could be a custom allocator that stores the first allocated value as a member and copies that into the first larger allocation, but that's really complex compared to the SmallSet.
I found another interesting bottleneck where all threads seem to be blocked by I/O in one of the threads, a millisecond or more at a time, alternating between all threads saturated and this entirely blocked state. Interestingly, the "I/O" here is ... I'll try to apply this change and let you know how that goes.
Yeah, exceptions definitely cause perf penalties. See this comment
Replacing the trap() (that throws the exception) with ...
I think that's a good idea. One thing that will need to be done, though, is to handle it in the fuzzer: we need to log such traps (so we can compare them). But I think that might be a small addition. The LocalGraph change sounds very interesting as well! I'd need to read the PR carefully, but I think it can work.
Amazing, I will have time to create those PRs tomorrow evening CEST then!
While not throwing an exception saves a few milliseconds and makes the profiles look much nicer, it's not really significant compared to the 16000 - 20000 milliseconds the entire wasm-opt pass is taking. I wonder if it's worth risking potentially breaking something here. I'll create a PR, but feel free to discard it if you see any risk.
And the preallocation, while in theory it should be an improvement, is not significant enough to escape the variance in the timing results 🤔
After some more hours of investigation, I found another thing ... Printing the sizes of the sets and maps in the destructor shows that they are always empty? I saw some uses in the code, though, so I'm not sure what's going on here. They might just never be hit for this binary. Anyway, the solution of replacing the Microsoft STL seems more and more useful to me 😅
In #4261 the idea came up to use a more efficient hashmap, which could be smaller and faster than the default stdlib one that we use atm. That would be great to look into. |
#6042 has some notes on LocalGraph improvements that could help a particularly large testcase. |
Looking at profiles, it seems that allocations on the main traversal stack may be significant: Line 357 in f868137
And other structures in that file too, just because they are the core of all of our walks. Perhaps we can arena-allocate those stacks and keep reusing them. |
This makes the pass 2-3% faster in some measurements I did locally. Noticed when profiling for WebAssembly#5561 (comment). Helps WebAssembly#4165.
Over the last month I focused on compile times, and as a result I see an over 2x speedup on ... Some details (in reverse chronological order):
@kripken Thanks for these! I am going to get some numbers on where my wasm file stands before (3.1.65) and after (3.1.68), using BINARYEN_PASS_DEBUG=1. One thing I am noticing is that the validation steps take a considerable amount of time, but there doesn't appear to be a way to show the timings those take. Has there been any optimizing of the validation code, or is anything planned? I am disabling validation here.
3.1.65
```
++ ls -l 3.1.65.wasm
-rw-r--r-- 1 root root 99147167 Oct 1 08:52 3.1.65.wasm
++ BINARYEN_PASS_DEBUG=1
++ time /root/emsdk/upstream/bin/wasm-opt --no-validation --strip-target-features --post-emscripten -Os --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --no-stack-ir 3.1.65.wasm -o 3.1.65after.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-multivalue --enable-mutable-globals --enable-reference-types --enable-sign-ext
[PassRunner] running passes
[PassRunner] running pass: strip-target-features... 0.000540104 seconds.
[PassRunner] running pass: post-emscripten... 1.49346 seconds.
[PassRunner] passes took 1.494 seconds.
[PassRunner] running passes
[PassRunner] running pass: duplicate-function-elimination... 3.63673 seconds.
[PassRunner] running pass: remove-unused-module-elements... 1.28464 seconds.
[PassRunner] running pass: memory-packing... 0.247454 seconds.
[PassRunner] running pass: once-reduction... 0.184166 seconds.
[PassRunner] running pass: ssa-nomerge... 11.8826 seconds.
[PassRunner] running pass: dce... 3.7628 seconds.
[PassRunner] running pass: remove-unused-names... 0.570026 seconds.
[PassRunner] running pass: remove-unused-brs... 3.11129 seconds.
[PassRunner] running pass: remove-unused-names... 0.526151 seconds.
[PassRunner] running pass: optimize-instructions... 2.87017 seconds.
[PassRunner] running pass: pick-load-signs... 0.740516 seconds.
[PassRunner] running pass: precompute... 1.71788 seconds.
[PassRunner] running pass: optimize-added-constants-propagate... 23.1096 seconds.
[PassRunner] running pass: code-pushing... 1.90392 seconds.
[PassRunner] running pass: tuple-optimization... 0.025315 seconds.
[PassRunner] running pass: simplify-locals-nostructure... 5.94543 seconds.
[PassRunner] running pass: vacuum... 2.0397 seconds.
[PassRunner] running pass: reorder-locals... 0.771755 seconds.
[PassRunner] running pass: remove-unused-brs... 2.13625 seconds.
[PassRunner] running pass: coalesce-locals... 5.68479 seconds.
[PassRunner] running pass: local-cse... 3.06765 seconds.
[PassRunner] running pass: simplify-locals... 7.11051 seconds.
[PassRunner] running pass: vacuum... 1.78366 seconds.
[PassRunner] running pass: reorder-locals... 0.745307 seconds.
[PassRunner] running pass: coalesce-locals... 9.89549 seconds.
[PassRunner] running pass: reorder-locals... 0.72067 seconds.
[PassRunner] running pass: vacuum... 1.67271 seconds.
[PassRunner] running pass: code-folding... 4.17729 seconds.
[PassRunner] running pass: merge-blocks... 1.37191 seconds.
[PassRunner] running pass: remove-unused-brs... 2.49119 seconds.
[PassRunner] running pass: remove-unused-names... 0.520844 seconds.
[PassRunner] running pass: merge-blocks... 1.26686 seconds.
[PassRunner] running pass: precompute... 1.63496 seconds.
[PassRunner] running pass: optimize-instructions... 2.50066 seconds.
[PassRunner] running pass: rse... 6.76445 seconds.
[PassRunner] running pass: vacuum... 1.80028 seconds.
[PassRunner] running pass: dae-optimizing... 23.9783 seconds.
[PassRunner] running pass: inlining-optimizing... 20.7845 seconds.
[PassRunner] running pass: duplicate-function-elimination... 0.787577 seconds.
[PassRunner] running pass: duplicate-import-elimination... 0.0406488 seconds.
[PassRunner] running pass: simplify-globals-optimizing... 0.243489 seconds.
[PassRunner] running pass: remove-unused-module-elements... 1.02584 seconds.
[PassRunner] running pass: reorder-globals... 0.00628154 seconds.
[PassRunner] running pass: directize... 0.145466 seconds.
[PassRunner] passes took 166.688 seconds.
[PassRunner] running passes
[PassRunner] passes took 0 seconds.
459.78user 0.97system 2:53.95elapsed 264%CPU (0avgtext+0avgdata 2059088maxresident)k
0inputs+82992outputs (0major+444664minor)pagefaults 0swaps
++ ls -l 3.1.65after.wasm
-rw-r--r-- 1 root root 42489720 Oct 1 09:10 3.1.65after.wasm
```
3.1.68
```
++ ls -l 3.1.68.wasm
-rwxr-xr-x 1 root root 99154714 Oct 1 09:13 3.1.68.wasm
++ BINARYEN_PASS_DEBUG=1
++ time /root/emsdk/upstream/bin/wasm-opt --no-validation --strip-target-features --post-emscripten -Os --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --no-stack-ir 3.1.68.wasm -o 3.1.68after.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-multivalue --enable-mutable-globals --enable-reference-types --enable-sign-ext
[PassRunner] running passes
[PassRunner] running pass: strip-target-features... 0.000651804 seconds.
[PassRunner] running pass: post-emscripten... 1.49796 seconds.
[PassRunner] passes took 1.49861 seconds.
[PassRunner] running passes
[PassRunner] running pass: duplicate-function-elimination... 3.2706 seconds.
[PassRunner] running pass: remove-unused-module-elements... 1.13238 seconds.
[PassRunner] running pass: memory-packing... 0.256579 seconds.
[PassRunner] running pass: once-reduction... 0.200214 seconds.
[PassRunner] running pass: ssa-nomerge... 11.7251 seconds.
[PassRunner] running pass: dce... 3.35628 seconds.
[PassRunner] running pass: remove-unused-names... 0.535297 seconds.
[PassRunner] running pass: remove-unused-brs... 3.16159 seconds.
[PassRunner] running pass: remove-unused-names... 0.549171 seconds.
[PassRunner] running pass: optimize-instructions... 2.80716 seconds.
[PassRunner] running pass: pick-load-signs... 0.748634 seconds.
[PassRunner] running pass: precompute... 1.60599 seconds.
[PassRunner] running pass: optimize-added-constants-propagate... 15.8 seconds.
[PassRunner] running pass: code-pushing... 1.85391 seconds.
[PassRunner] running pass: tuple-optimization... 0.0240201 seconds.
[PassRunner] running pass: simplify-locals-nostructure... 5.95946 seconds.
[PassRunner] running pass: vacuum... 1.88777 seconds.
[PassRunner] running pass: reorder-locals... 0.847257 seconds.
[PassRunner] running pass: remove-unused-brs... 2.23691 seconds.
[PassRunner] running pass: coalesce-locals... 10.0595 seconds.
[PassRunner] running pass: local-cse... 3.32373 seconds.
[PassRunner] running pass: simplify-locals... 7.99739 seconds.
[PassRunner] running pass: vacuum... 1.9047 seconds.
[PassRunner] running pass: reorder-locals... 0.765433 seconds.
[PassRunner] running pass: coalesce-locals... 9.59543 seconds.
[PassRunner] running pass: reorder-locals... 0.797222 seconds.
[PassRunner] running pass: vacuum... 2.02627 seconds.
[PassRunner] running pass: code-folding... 4.30972 seconds.
[PassRunner] running pass: merge-blocks... 1.46414 seconds.
[PassRunner] running pass: remove-unused-brs... 2.57263 seconds.
[PassRunner] running pass: remove-unused-names... 0.532943 seconds.
[PassRunner] running pass: merge-blocks... 1.25059 seconds.
[PassRunner] running pass: precompute... 1.57593 seconds.
[PassRunner] running pass: optimize-instructions... 2.57699 seconds.
[PassRunner] running pass: rse... 6.46657 seconds.
[PassRunner] running pass: vacuum... 1.78285 seconds.
[PassRunner] running pass: dae-optimizing... 26.9038 seconds.
[PassRunner] running pass: inlining-optimizing... 12.3328 seconds.
[PassRunner] running pass: duplicate-function-elimination... 0.716017 seconds.
[PassRunner] running pass: duplicate-import-elimination... 0.0248484 seconds.
[PassRunner] running pass: simplify-globals-optimizing... 0.243064 seconds.
[PassRunner] running pass: remove-unused-module-elements... 0.98064 seconds.
[PassRunner] running pass: reorder-globals... 0.00607164 seconds.
[PassRunner] running pass: directize... 0.128283 seconds.
[PassRunner] passes took 158.296 seconds.
[PassRunner] running passes
[PassRunner] passes took 0 seconds.
370.26user 1.29system 2:44.84elapsed 225%CPU (0avgtext+0avgdata 2037692maxresident)k
0inputs+82992outputs (1major+427131minor)pagefaults 0swaps
++ ls -l 3.1.68after.wasm
-rw-r--r-- 1 root root 42487858 Oct 1 09:16 3.1.68after.wasm
```
I noticed
I could be misled here; that validation might only run during ...
@arsnyder16 Yes, the validation times can be ignored there. Looks like overall there is a slight improvement, but it might be worth measuring more to reduce the noise. Specifically, I see some passes improved as I'd expect (inlining, optimize-added-constants), but it is odd that coalesce-locals regressed. Perhaps this is noise, though?
I did some investigation on the speed of compilation. This is getting important because of things like j2cl output which is a 50MB real-world file.
Slowest passes (in -O3; note that some passes run more than once): ... perf reports almost 15% of time is spent in malloc/free methods - quite a lot. I counted how many mallocs (note: not the size of them) are done in each pass; here are the heaviest (note: timings here include the overhead of counting mallocs atomically among threads, so those matter less). Help in speeding things up would be very welcome!