Make compilation faster 🚀 #4165
I think precompute-propagate could also be sped up if the interpreter avoided exceptions (#2917 (comment)) and used a monadic Expect / Result instead.
Another option is to rewrite the simple constant propagation algorithm as Sparse Conditional Constant Propagation (the SCCP pass in LLVM), which should be much faster.
@MaxGraey Good idea about precompute-propagate. I checked that now.
isSSA is not called anywhere. See #4165
Precompute has a mode in which it propagates results from local.sets to local.gets. That constructs a LocalGraph which is a non-trivial amount of work. We used to run multiple iterations of this, but investigation shows that such opportunities are extremely rare, as doing just a single propagation iteration has no effect on the entire emscripten benchmark suite, nor on j2cl output. Furthermore, we run this pass twice in the normal pipeline (once early, once late) so even if there are such opportunities they may be optimized already. And, --converge is a way to get additional iterations of all passes if a user wants that, so it makes sense not to do costly work for more iterations automatically. In effect, 99.99% of the time before this pass we would create the LocalGraph twice: once the first time, then a second time only to see that we can't actually optimize anything further. This PR makes us only create it once, which makes precompute-propagate 10% faster on j2cl and even faster on other things like poppler (33%) and LLVM (29%). See the change in the test suite for an example of a case that does require more than one iteration to be optimized. Note that even there, we only manage to get benefit from a second iteration by doing something that overlaps with another pass (optimizing out an if with condition 0), which shows even more how unnecessary the extra work was. See #4165
By mistake the recent partial inlining work introduced quadratic time into the compiler: erasing a function from the list of functions takes linear time, which is why we have removeFunctions that does a group at a time. This isn't noticeable on small programs, but on j2cl output this makes the inlining-optimizing step 2x faster. See #4165
Hi there. I have some statistics for a 500MB wasm under Windows 10 (sadly 2 times slower than Linux).
Most passes run as single-threaded...
Thanks for the data @andrewevstyukhin , looks like by far the largest factor is asyncify:
I think that is because it flattens the IR, which is very costly by itself, does the transform, and then optimizes afterward to unflatten it. Asyncify will eventually be replaced by stack switching, so this will get resolved that way. But it could be improved a lot meanwhile, I think - currently Flatten generates many more locals than it needs to.
Yes, 1.39.7: wasm-ld best time is 481.723 seconds (asyncify 185.695 seconds, 39%). C++ assertions are really hard for recent versions in "release" builds:
I can try to make an empty function with EMSCRIPTEN_KEEPALIVE, include the function in ASYNCIFY_ADD, and then add calls...
@andrewevstyukhin I know there has been some discussion on the LLVM side about adding a way to annotate things in the source so that they show up in a wasm custom section. We could use that for asyncify, and also things like noinline. But I don't think anyone is working on it atm. I don't follow the LLVM side super-closely, though.
I'm in doubt about annotations nowadays... Last week I did good enough settings just using
I wish to use C++20 coroutines instead of ASYNCIFY in a future codebase. Asyncify consumes 40% of wasm-opt time.
For the user interface stuff, maybe open a new issue on this in emscripten?
That is definitely recommended - Asyncify causes a large amount of overhead, and we hope to not need it once people can use either C++20 coroutines, or a future wasm stack switching proposal.
Note: the output above, stuff like
is emitted by |
FWIW, I did an evening of profiling for our use case (i.e. linking the main .wasm of the Wonderland Engine runtime) and found that using a fake_set (a std::vector with the member functions of a set) instead of std::set yields a notable improvement. Disclaimer: this is very ugly, hacky code, just to prove a point.
... will yield the following, an almost 25-30% improvement for our use case on Windows, built with MSVC 2019 in the RelWithDebInfo config: $ ./run-wasm-opt.sh
wasm-opt:
real 0m30.060s
user 0m0.000s
sys 0m0.015s
wasm-opt-fake_set:
real 0m21.832s
user 0m0.015s
sys 0m0.015s
Squareys@[...] /c/Repos/binaryen (main)
$ diff output-wasm-opt.wasmstripped output-wasm-opt-fake_set.wasmstripped
Squareys@[...] /c/Repos/binaryen (main)
$
Explanation: The sets are usually empty (or seem to have few elements). This is, of course, very scrappy "benchmarking" (if one can even call it that), and there was no testing apart from the output diff. So take it with many grains of salt, but I think the main takeaway could still be that red-black-tree based containers (e.g. std::map/std::set) allocate in the default constructor, and for few elements it might be worth using a simpler container. Similar optimizations might apply in other locations of the code. The execution times were checked as follows:
echo "wasm-opt:"
time ./build-ninja/bin/wasm-opt --strip-dwarf --post-emscripten -O3 --low-memory-unused --zero-filled-memory --strip-debug --strip-producers WonderlandRuntime-loader-physx.wasm -o WonderlandRuntime-loader-physx.wasmstripped --mvp-features --enable-mutable-globals
mv WonderlandRuntime-loader-physx.wasmstripped output-wasm-opt.wasmstripped
echo "wasm-opt-fake_set:"
time ./wasm-opt-fake_set --strip-dwarf --post-emscripten -O3 --low-memory-unused --zero-filled-memory --strip-debug --strip-producers WonderlandRuntime-loader-physx.wasm -o WonderlandRuntime-loader-physx.wasmstripped --mvp-features --enable-mutable-globals
mv WonderlandRuntime-loader-physx.wasmstripped output-wasm-opt-fake_set.wasmstripped
Could you try small_set in ...
@MaxGraey That the heap allocation is the culprit was found through a sampling profiler (i.e. "Superluminal"); it's not the count() or insert() functions that are taking the time here. In most cases the set is empty.
Hmm, in this case I guess it makes sense to use it.
@MaxGraey Definitely, that would probably also improve other code that already uses small_set 👍 There also was a bunch of heap allocations around ... Is the change for small_set, and the use of small_set in EffectsAnalyzer, something that would be accepted as a PR? (@kripken?)
Also instead of |
@Squareys The MSVC STL does an allocation in the default constructor of std::set. It might be worth measuring your code on another STL, just as a comparison point. But even if it just speeds up MSVC, that might be enough of a reason to do this - 25%+ is a lot! As for how to do this, I am a little worried about the fake-set approach since it has linear behavior at large sizes. Those large sizes are not happening on your testcase, I guess, but in general they are possible. Some possible options:
@kripken Yes, the stack trace is as follows, with links to the Microsoft STL repository:
I honestly have trouble reading the style of code there, though, so please point out in case I'm misinterpreting this here. I'll attempt to contribute the SmallSet change; it should be fairly straightforward. An alternative could be a custom allocator that stores the first allocated value as a member and copies that into the first larger allocation, but that's really complex compared to the SmallSet.
I found another interesting bottleneck where all threads seem to be blocked by I/O in one of the threads, a millisecond or more at a time, alternating between all threads saturated and this entirely blocked state. Interestingly, the "I/O" here is ... I'll try to apply this change and let you know how that goes.
Yeah, exceptions definitely cause perf penalties. See this comment
Replacing the trap() (that throws the exception) with ...
I think that's a good idea. One thing that will need to be done, though, is to handle it in the fuzzer: we need to log such traps (so we can compare them). But I think that might be a small addition. The LocalGraph change sounds very interesting as well! I'd need to read the PR carefully, but I think it can work.
Amazing, I will have time to create those PRs tomorrow evening CEST then!
While not throwing an exception saves a few milliseconds and makes the profiles look much nicer, it's not really significant compared to the 16000 - 20000 milliseconds the entire wasm-opt pass is taking. I wonder if it's worth risking potentially breaking something here. I'll create a PR, but feel free to discard it if you see any risk.
And the preallocation, while in theory it should be an improvement, is not significant enough to escape the variance in the timing results 🤔
After some more hours of investigation, I found another thing ... Printing the sizes of the sets and maps in the destructor shows that they are always empty? I saw some uses in the code, though, so I'm not sure what's going on here. They might just never be hit for this binary. Anyway, the solution of replacing the Microsoft STL seems more and more useful to me 😅
In #4261 the idea came up to use a more efficient hashmap, which could be smaller and faster than the default stdlib one that we use atm. That would be great to look into. |
#6042 has some notes on LocalGraph improvements that could help a particularly large testcase. |
Looking at profiles, it seems that allocations on the main traversal stack may be significant: Line 357 in f868137
And other structures in that file too, just because they are the core of all of our walks. Perhaps we can arena-allocate those stacks and keep reusing them. |
This makes the pass 2-3% faster in some measurements I did locally. Noticed when profiling for WebAssembly#5561 (comment). Helps WebAssembly#4165.
Over the last month I focused on compile times, and as a result I see an over 2x speedup on ... Some details (in reverse chronological order):
@kripken Thanks for these! I am going to get some numbers on where my wasm file stands before (3.1.65) and after (3.1.68), using BINARYEN_PASS_DEBUG=1. One thing I am noticing is that the validation steps take a considerable amount of time, but there doesn't appear to be a way to show the timings those take. Has there been any optimizing of the validation code, or is anything planned? I am disabling validation here.
3.1.65
```
++ ls -l 3.1.65.wasm
-rw-r--r-- 1 root root 99147167 Oct 1 08:52 3.1.65.wasm
++ BINARYEN_PASS_DEBUG=1
++ time /root/emsdk/upstream/bin/wasm-opt --no-validation --strip-target-features --post-emscripten -Os --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --no-stack-ir 3.1.65.wasm -o 3.1.65after.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-multivalue --enable-mutable-globals --enable-reference-types --enable-sign-ext
[PassRunner] running passes
[PassRunner] running pass: strip-target-features... 0.000540104 seconds.
[PassRunner] running pass: post-emscripten... 1.49346 seconds.
[PassRunner] passes took 1.494 seconds.
[PassRunner] running passes
[PassRunner] running pass: duplicate-function-elimination... 3.63673 seconds.
[PassRunner] running pass: remove-unused-module-elements... 1.28464 seconds.
[PassRunner] running pass: memory-packing... 0.247454 seconds.
[PassRunner] running pass: once-reduction... 0.184166 seconds.
[PassRunner] running pass: ssa-nomerge... 11.8826 seconds.
[PassRunner] running pass: dce... 3.7628 seconds.
[PassRunner] running pass: remove-unused-names... 0.570026 seconds.
[PassRunner] running pass: remove-unused-brs... 3.11129 seconds.
[PassRunner] running pass: remove-unused-names... 0.526151 seconds.
[PassRunner] running pass: optimize-instructions... 2.87017 seconds.
[PassRunner] running pass: pick-load-signs... 0.740516 seconds.
[PassRunner] running pass: precompute... 1.71788 seconds.
[PassRunner] running pass: optimize-added-constants-propagate... 23.1096 seconds.
[PassRunner] running pass: code-pushing... 1.90392 seconds.
[PassRunner] running pass: tuple-optimization... 0.025315 seconds.
[PassRunner] running pass: simplify-locals-nostructure... 5.94543 seconds.
[PassRunner] running pass: vacuum... 2.0397 seconds.
[PassRunner] running pass: reorder-locals... 0.771755 seconds.
[PassRunner] running pass: remove-unused-brs... 2.13625 seconds.
[PassRunner] running pass: coalesce-locals... 5.68479 seconds.
[PassRunner] running pass: local-cse... 3.06765 seconds.
[PassRunner] running pass: simplify-locals... 7.11051 seconds.
[PassRunner] running pass: vacuum... 1.78366 seconds.
[PassRunner] running pass: reorder-locals... 0.745307 seconds.
[PassRunner] running pass: coalesce-locals... 9.89549 seconds.
[PassRunner] running pass: reorder-locals... 0.72067 seconds.
[PassRunner] running pass: vacuum... 1.67271 seconds.
[PassRunner] running pass: code-folding... 4.17729 seconds.
[PassRunner] running pass: merge-blocks... 1.37191 seconds.
[PassRunner] running pass: remove-unused-brs... 2.49119 seconds.
[PassRunner] running pass: remove-unused-names... 0.520844 seconds.
[PassRunner] running pass: merge-blocks... 1.26686 seconds.
[PassRunner] running pass: precompute... 1.63496 seconds.
[PassRunner] running pass: optimize-instructions... 2.50066 seconds.
[PassRunner] running pass: rse... 6.76445 seconds.
[PassRunner] running pass: vacuum... 1.80028 seconds.
[PassRunner] running pass: dae-optimizing... 23.9783 seconds.
[PassRunner] running pass: inlining-optimizing... 20.7845 seconds.
[PassRunner] running pass: duplicate-function-elimination... 0.787577 seconds.
[PassRunner] running pass: duplicate-import-elimination... 0.0406488 seconds.
[PassRunner] running pass: simplify-globals-optimizing... 0.243489 seconds.
[PassRunner] running pass: remove-unused-module-elements... 1.02584 seconds.
[PassRunner] running pass: reorder-globals... 0.00628154 seconds.
[PassRunner] running pass: directize... 0.145466 seconds.
[PassRunner] passes took 166.688 seconds.
[PassRunner] running passes
[PassRunner] passes took 0 seconds.
459.78user 0.97system 2:53.95elapsed 264%CPU (0avgtext+0avgdata 2059088maxresident)k
0inputs+82992outputs (0major+444664minor)pagefaults 0swaps
++ ls -l 3.1.65after.wasm
-rw-r--r-- 1 root root 42489720 Oct 1 09:10 3.1.65after.wasm
```
3.1.68
```
++ ls -l 3.1.68.wasm
-rwxr-xr-x 1 root root 99154714 Oct 1 09:13 3.1.68.wasm
++ BINARYEN_PASS_DEBUG=1
++ time /root/emsdk/upstream/bin/wasm-opt --no-validation --strip-target-features --post-emscripten -Os --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --no-stack-ir 3.1.68.wasm -o 3.1.68after.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-multivalue --enable-mutable-globals --enable-reference-types --enable-sign-ext
[PassRunner] running passes
[PassRunner] running pass: strip-target-features... 0.000651804 seconds.
[PassRunner] running pass: post-emscripten... 1.49796 seconds.
[PassRunner] passes took 1.49861 seconds.
[PassRunner] running passes
[PassRunner] running pass: duplicate-function-elimination... 3.2706 seconds.
[PassRunner] running pass: remove-unused-module-elements... 1.13238 seconds.
[PassRunner] running pass: memory-packing... 0.256579 seconds.
[PassRunner] running pass: once-reduction... 0.200214 seconds.
[PassRunner] running pass: ssa-nomerge... 11.7251 seconds.
[PassRunner] running pass: dce... 3.35628 seconds.
[PassRunner] running pass: remove-unused-names... 0.535297 seconds.
[PassRunner] running pass: remove-unused-brs... 3.16159 seconds.
[PassRunner] running pass: remove-unused-names... 0.549171 seconds.
[PassRunner] running pass: optimize-instructions... 2.80716 seconds.
[PassRunner] running pass: pick-load-signs... 0.748634 seconds.
[PassRunner] running pass: precompute... 1.60599 seconds.
[PassRunner] running pass: optimize-added-constants-propagate... 15.8 seconds.
[PassRunner] running pass: code-pushing... 1.85391 seconds.
[PassRunner] running pass: tuple-optimization... 0.0240201 seconds.
[PassRunner] running pass: simplify-locals-nostructure... 5.95946 seconds.
[PassRunner] running pass: vacuum... 1.88777 seconds.
[PassRunner] running pass: reorder-locals... 0.847257 seconds.
[PassRunner] running pass: remove-unused-brs... 2.23691 seconds.
[PassRunner] running pass: coalesce-locals... 10.0595 seconds.
[PassRunner] running pass: local-cse... 3.32373 seconds.
[PassRunner] running pass: simplify-locals... 7.99739 seconds.
[PassRunner] running pass: vacuum... 1.9047 seconds.
[PassRunner] running pass: reorder-locals... 0.765433 seconds.
[PassRunner] running pass: coalesce-locals... 9.59543 seconds.
[PassRunner] running pass: reorder-locals... 0.797222 seconds.
[PassRunner] running pass: vacuum... 2.02627 seconds.
[PassRunner] running pass: code-folding... 4.30972 seconds.
[PassRunner] running pass: merge-blocks... 1.46414 seconds.
[PassRunner] running pass: remove-unused-brs... 2.57263 seconds.
[PassRunner] running pass: remove-unused-names... 0.532943 seconds.
[PassRunner] running pass: merge-blocks... 1.25059 seconds.
[PassRunner] running pass: precompute... 1.57593 seconds.
[PassRunner] running pass: optimize-instructions... 2.57699 seconds.
[PassRunner] running pass: rse... 6.46657 seconds.
[PassRunner] running pass: vacuum... 1.78285 seconds.
[PassRunner] running pass: dae-optimizing... 26.9038 seconds.
[PassRunner] running pass: inlining-optimizing... 12.3328 seconds.
[PassRunner] running pass: duplicate-function-elimination... 0.716017 seconds.
[PassRunner] running pass: duplicate-import-elimination... 0.0248484 seconds.
[PassRunner] running pass: simplify-globals-optimizing... 0.243064 seconds.
[PassRunner] running pass: remove-unused-module-elements... 0.98064 seconds.
[PassRunner] running pass: reorder-globals... 0.00607164 seconds.
[PassRunner] running pass: directize... 0.128283 seconds.
[PassRunner] passes took 158.296 seconds.
[PassRunner] running passes
[PassRunner] passes took 0 seconds.
370.26user 1.29system 2:44.84elapsed 225%CPU (0avgtext+0avgdata 2037692maxresident)k
0inputs+82992outputs (1major+427131minor)pagefaults 0swaps
++ ls -l 3.1.68after.wasm
-rw-r--r-- 1 root root 42487858 Oct 1 09:16 3.1.68after.wasm
```
I noticed
I could be misled here; that validation might only run during ...
@arsnyder16 Yes, the validation times can be ignored there. Looks like overall there is a slight improvement, but it might be worth measuring more to reduce the noise. Specifically, I see some passes improved as I'd expect (inlining, optimize-added-constants), but it is odd that coalesce-locals regressed. Perhaps this is noise, though?
I did some investigation on the speed of compilation. This is getting important because of things like j2cl output which is a 50MB real-world file.
Slowest passes (in -O3; note that some passes run more than once): ... perf reports almost 15% of time is spent in malloc/free methods - quite a lot. I counted how many mallocs (note: not the size of them) are done in each pass; here are the heaviest (note: timings here include the overhead of counting mallocs atomically among threads, so those matter less). Help in speeding things up would be very welcome!