Make compilation faster 🚀 #4165

Open · kripken opened this issue Sep 17, 2021 · 35 comments
@kripken commented Sep 17, 2021

I did some investigation on the speed of compilation. This is getting important because of things like j2cl output, which is a 50MB real-world file.

Slowest passes: (in -O3; note that some passes run more than once)

[PassRunner]   running pass: precompute-propagate...           14.23460 seconds.
[PassRunner]   running pass: dae...                            10.91160 seconds.
[PassRunner]   running pass: heap2local...                     10.44350 seconds.
[PassRunner]   running pass: precompute-propagate...            9.07261 seconds.
[PassRunner]   running pass: inlining...                        8.02837 seconds.
[PassRunner]   running pass: once-reduction...                  8.02484 seconds.
[PassRunner]   running pass: vacuum...                          5.16162 seconds.
[PassRunner]   running pass: vacuum...                          3.64088 seconds.
[PassRunner]   running pass: vacuum...                          3.58191 seconds.
[PassRunner]   running pass: vacuum...                          3.61897 seconds.

perf reports that almost 15% of the time is spent in malloc/free - quite a lot. I counted how many mallocs (note: the count, not their size) are done in each pass; here are the heaviest (note: timings here include the overhead of counting mallocs atomically across threads, so the times matter less):

    once-reduction...                 9.51466 seconds.  141558196 mallocs. 
    precompute-propagate...           14.7426 seconds.   61611533 mallocs. 
    dae...                            11.3789 seconds.   53792948 mallocs. 
    precompute-propagate...           9.69919 seconds.   39531493 mallocs. 
    heap2local...                     10.5116 seconds.   38582302 mallocs. 
    local-cse...                      4.70765 seconds.   33454468 mallocs. 
    cfp...                            2.43619 seconds.   32292682 mallocs. 
    ssa-nomerge...                    4.92099 seconds.   26378117 mallocs. 
    simplify-locals-nostructure...    6.03281 seconds.   22663794 mallocs. 
    vacuum...                         5.10813 seconds.   23880789 mallocs. 
    inlining...                       107.781 seconds.   20940365 mallocs. 
    dce...                            3.84722 seconds.   20025273 mallocs. 
    local-subtyping...                3.44374 seconds.   19067529 mallocs. 

and vacuum runs 3 more times:

    vacuum...                         3.71648 seconds.   16648663 mallocs. 
    vacuum...                         3.86917 seconds.   16302767 mallocs. 
    vacuum...                         3.89855 seconds.   16162915 mallocs. 

Help in speeding things up would be very welcome!

@MaxGraey:

I think precompute-propagate could also be sped up if the interpreter avoided exceptions (#2917 (comment)) and used a monadic Expect / Result type instead.
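A minimal sketch of that idea, using std::optional as a stand-in for an Expected/Result type (illustrative names, not Binaryen's interpreter API): encode "not a constant" in the return value instead of throwing, so the hot path never unwinds the stack.

```cpp
#include <cstdint>
#include <optional>

// nullopt means "not a compile-time constant" - the case that would
// previously have been signaled by throwing an exception.
using EvalResult = std::optional<int64_t>;

EvalResult evalAdd(EvalResult left, EvalResult right) {
  if (!left || !right) {
    return std::nullopt; // propagate failure without unwinding
  }
  return *left + *right;
}
```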

@MaxGraey:

Another option is to rewrite the simple constant propagation algorithm as Sparse Conditional Constant Propagation (the SCCP pass in LLVM), which should be much faster.
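For reference, SCCP runs on a three-level lattice and a worklist; since each value can only move down the lattice twice (Top to Const to Bottom), each instruction is revisited only a bounded number of times. Here is a hedged sketch of just the lattice (illustrative names, not LLVM's actual code):

```cpp
#include <cstdint>

// Top = no information yet, Const = a single known value,
// Bottom = provably not a constant.
enum class Kind { Top, Const, Bottom };

struct LatticeValue {
  Kind kind = Kind::Top;
  int64_t value = 0; // only meaningful when kind == Kind::Const
};

// "meet" merges facts arriving from multiple reaching definitions; values
// can only move downward, which bounds the total work.
LatticeValue meet(const LatticeValue& a, const LatticeValue& b) {
  if (a.kind == Kind::Top) return b;
  if (b.kind == Kind::Top) return a;
  if (a.kind == Kind::Const && b.kind == Kind::Const && a.value == b.value) {
    return a;
  }
  return {Kind::Bottom, 0}; // disagreement, or either side is Bottom
}
```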

@kripken commented Sep 19, 2021

@MaxGraey Good idea about precompute-propagate. I checked that now: running precompute instead (which still uses exceptions, but avoids the propagation using LocalGraph) takes just 2-3 seconds. So it looks like exceptions are not a significant factor there.

kripken added a commit that referenced this issue Sep 22, 2021
isSSA is not called anywhere.

See #4165
kripken added a commit that referenced this issue Sep 24, 2021
Precompute has a mode in which it propagates results from local.sets to
local.gets. That constructs a LocalGraph which is a non-trivial amount of
work. We used to run multiple iterations of this, but investigation shows that
such opportunities are extremely rare, as doing just a single propagation
iteration has no effect on the entire emscripten benchmark suite, nor on
j2cl output. Furthermore, we run this pass twice in the normal pipeline (once
early, once late) so even if there are such opportunities they may be
optimized already. And, --converge is a way to get additional iterations of
all passes if a user wants that, so it makes sense not to do costly work for more
iterations automatically.

In effect, 99.99% of the time before this pass we would create the LocalGraph
twice: once the first time, then a second time only to see that we can't
actually optimize anything further. This PR makes us only create it once, which
makes precompute-propagate 10% faster on j2cl and even faster on other things
like poppler (33%) and LLVM (29%).

See the change in the test suite for an example of a case that does require
more than one iteration to be optimized. Note that even there, we only manage
to get benefit from a second iteration by doing something that overlaps with
another pass (optimizing out an if with condition 0), which shows even more
how unnecessary the extra work was.

See #4165
kripken added a commit that referenced this issue Sep 27, 2021
By mistake the recent partial inlining work introduced quadratic time into
the compiler: erasing a function from the list of functions takes linear time,
which is why we have removeFunctions that does a group at a time.

This isn't noticeable on small programs, but on j2cl output this makes the
inlining-optimizing step 2x faster.

See #4165
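The group-at-a-time removal mentioned in that commit can be done with a single erase-remove pass; a hedged sketch with hypothetical types (Binaryen's actual removeFunctions may differ):

```cpp
#include <algorithm>
#include <memory>
#include <unordered_set>
#include <vector>

struct Function {};

// Erasing k functions one at a time from a vector costs O(k * n) moves;
// a single erase-remove pass removes the whole group in O(n).
void removeFunctions(std::vector<std::unique_ptr<Function>>& functions,
                     const std::unordered_set<Function*>& toRemove) {
  functions.erase(
    std::remove_if(functions.begin(), functions.end(),
                   [&](const std::unique_ptr<Function>& f) {
                     return toRemove.count(f.get()) > 0;
                   }),
    functions.end());
}
```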
@andrewevstyukhin:

Hi there.

I have some statistics for a 500MB wasm under Windows 10 (sadly, 2 times slower than Linux).
-Os -flto -fomit-frame-pointer

[PassRunner] running pass: duplicate-function-elimination... 32.0065 seconds.
[PassRunner] running pass: simplify-locals-nostructure... 31.395 seconds.
[PassRunner] running pass: simplify-locals... 34.4849 seconds.
[PassRunner] running pass: dae-optimizing... 33.4361 seconds.
[PassRunner] running pass: inlining-optimizing... 48.7857 seconds.
[PassRunner] running pass: optimize-stack-ir... 27.0179 seconds.
[PassRunner] running pass: asyncify... 1022.96 seconds.
[PassRunner] passes took 1428.02 seconds.

Most passes run single-threaded...

-flto=thin occasionally breaks execution

wasm-ld: warning: function signature mismatch: ...
defined as () -> void in ...
defined as (i32, i32, i32, i32, i32) -> void in lto.tmp

-Os -flto=thin -fomit-frame-pointer

[PassRunner] running pass: duplicate-function-elimination... 32.5 seconds.
[PassRunner] running pass: simplify-locals-nostructure... 37.1433 seconds.
[PassRunner] running pass: simplify-locals... 42.461 seconds.
[PassRunner] running pass: dae-optimizing... 116.364 seconds.
[PassRunner] running pass: inlining-optimizing... 67.1076 seconds.
[PassRunner] running pass: optimize-stack-ir... 32.2354 seconds.
[PassRunner] running pass: asyncify... 1547.16 seconds.
[PassRunner] passes took 2106.44 seconds.

@kripken commented Sep 28, 2021

Thanks for the data @andrewevstyukhin - it looks like by far the largest factor is asyncify:

[PassRunner] running pass: asyncify... 1022.96 seconds.

I think that is because it flattens the IR, which is very costly by itself, does the transform, and then optimizes after to unflatten it. Asyncify will be replaced by stack switching eventually, so this will get resolved that way. But it could be improved a lot meanwhile, I think - currently Flatten generates many more locals than it needs to.

@andrewevstyukhin:

Yes, asyncify is heavy. It also doubles the wasm size, though the increase in entropy is not as large:

1.39.7: wasm-ld 45 357 52 (8 541 334 in 7z) -> wasm-opt 67 428 723 (8 579 733 in 7z)
2.0.14: wasm-ld 44 989 373 (8 374 314 in 7z) -> wasm-opt 69 631 668 (8 717 509 in 7z)

Best time is 481.723 seconds (asyncify 185.695 seconds, 39%).

C++ assertions hit recent versions really hard in "release" builds:
1.39.7: passes took 627.178 seconds (running pass: asyncify... 238.017 seconds)
2.0.14: passes took 1428.02 seconds (running pass: asyncify... 1022.96 seconds)

@andrewevstyukhin commented Oct 1, 2021

asyncify: I wonder, could we have an attribute to mark functions in C++ code, in addition to ASYNCIFY_ADD? It would avoid writing extern "C" and be handier than manual ASYNCIFY_ADVISE.

[asyncify] invoke_iiii is an import that can change the state
[asyncify] std::__2::__function::__value_func<void\20\28bool\29>::~__value_func\28\29 can change the state due to initial scan
[asyncify] bool\20boost::spirit::karma::string_inserterboost::spirit::unused_type\2c\20boost::spirit::unused_type::call<boost::spirit::karma::detail::output_iterator<std::__2::back_insert_iterator<std::__2::basic_string<char\2c\20std::__2::char_traits\2c\20std::2::allocator\20>\20>\2c\20mpl::int<0>\2c\20boost::spirit::unused_type>\2c\20char\20const*>\28boost::spirit::karma::detail::output_iterator<std::__2::back_insert_iterator<std::__2::basic_string<char\2c\20std::__2::char_traits\2c\20std::2::allocator\20>\20>\2c\20mpl::int<0>\2c\20boost::spirit::unused_type>&\2c\20char\20const*\29 can change the state due to

I can try to make an empty function with EMSCRIPTEN_KEEPALIVE, include that function in ASYNCIFY_ADD, and then add calls...

Found that -s ASYNCIFY_ONLY=['*'] breaks execution at legalstub$dynCall_iij

@kripken commented Oct 6, 2021

@andrewevstyukhin I know there has been some discussion on the LLVM side about adding a way to annotate things in the source so that they show up in a wasm custom section. We could use that for asyncify, and also things like noinline. But I don't think anyone is working on it atm. I don't follow the LLVM side super-closely, though.

@andrewevstyukhin:

I have doubts about annotations nowadays... Last week I got good enough settings just by using ASYNCIFY_REMOVE + ASYNCIFY_ADD in 2.0.14. What could be helpful, from my perspective:

  1. ASYNCIFY_ONLY and the previous ASYNCIFY_WHITELIST don't work with a big codebase. Simply setting * breaks calls.
  2. Templates<> and even spaces are impossible to pass on the command line - maybe @ could help by reading settings from a file.
  3. std::function and [](){} can become configurable only with a response file. They are handy things from modern C++.
  4. Maybe multiple passes of remove & add could allow precise configuration in hard cases.
  5. ASYNCIFY_IGNORE_INDIRECT breaks a big codebase. I'm still unsure why. I want to better configure dynCalls and ignore the disabled emscripten_sleep in SDL2.
  6. A robust pattern: remove _*,A*-Z*,a*-z* but d*,e*,i*,l* should stay untouched to support internal emscripten things (dynCall*, emscripten_, invoke_, legal*).

I'd like to use C++20 coroutines instead of ASYNCIFY in a future codebase. Asyncify consumes 40% of wasm-opt time.

@kripken commented Oct 12, 2021

@andrewevstyukhin

For the user interface stuff, maybe open a new issue on this in emscripten?

I wish to use C++20 coroutines instead of ASYNCIFY in future codebase.

That is definitely recommended - Asyncify causes a large amount of overhead, and we hope to not need it once people can use either C++20 coroutines, or a future wasm stack switching proposal.

@kripken commented Apr 27, 2022

Note: the output above, stuff like

[PassRunner]   running pass: precompute-propagate...           14.23460 seconds.

is emitted by BINARYEN_PASS_DEBUG=1 wasm-opt --no-validation. Pass-debug mode enables the timing, and disabling validation avoids it also validating after each pass (which can be slow, and is not what we're focused on in this issue).

@Squareys commented Jul 3, 2022

FWIW, I did an evening of profiling for our use case (i.e. linking the main .wasm of the Wonderland Engine runtime) and found that using the following structure instead of the 6 std::sets in EffectsAnalyzer...

fake_set (a std::vector with the member functions of a set)

Disclaimer: this is very ugly hacky code, just to prove a point.

```cpp
#include <cstddef>
#include <vector>

namespace std {

// A vector masquerading as a set: unordered, and insert() does not
// deduplicate, but for mostly-empty sets that never matters.
template<typename T>
struct fake_set : vector<T> {

    fake_set() {}

    void insert(const T& e) {
        this->push_back(e);
    }

    void erase(const T& e) {
        // Iterate backwards so erasing does not invalidate the indices
        // we still have to visit.
        for (int i = (int)this->size() - 1; i >= 0; --i) {
            auto it = this->begin() + i;
            if (*it == e) vector<T>::erase(it);
        }
    }

    size_t count(const T& k) const {
        size_t c = 0;
        for (const auto& e : *this) {
            if (e == k) ++c;
        }
        return c;
    }
};

}
```

... yields the following, an almost 25-30% improvement for our use case on Windows, built with MSVC 2019 in the RelWithDebInfo config:

$ ./run-wasm-opt.sh
wasm-opt:

real    0m30.060s
user    0m0.000s
sys     0m0.015s
wasm-opt-fake_set:

real    0m21.832s
user    0m0.015s
sys     0m0.015s

Squareys@[...] /c/Repos/binaryen (main)
$ diff output-wasm-opt.wasmstripped output-wasm-opt-fake_set.wasmstripped

Squareys@[...] /c/Repos/binaryen (main)
$

Explanation: the sets are usually empty (or seem to have few elements), and while EffectsAnalyzer is always constructed on the stack, e.g. for SimplifyLocals, std::set in the MSVC STL seems to always do a heap allocation in its default constructor, which takes about 90% of the time in that pass.

This is, of course, very scrappy "benchmarking" (if one can even call it that), with no testing apart from the output diff. So take it with many grains of salt, but I think the main takeaway could still be that red-black-tree-based containers (e.g. std::map/std::set) allocate in the default constructor, and for few elements it might be worth using a simpler container.

Similar optimizations might apply in other locations of the code.

The execution times were checked as follows:
echo "wasm-opt:"
time ./build-ninja/bin/wasm-opt --strip-dwarf --post-emscripten -O3 --low-memory-unused --zero-filled-memory --strip-debug --strip-producers WonderlandRuntime-loader-physx.wasm -o WonderlandRuntime-loader-physx.wasmstripped --mvp-features --enable-mutable-globals
mv WonderlandRuntime-loader-physx.wasmstripped output-wasm-opt.wasmstripped

echo "wasm-opt-fake_set:"
time ./wasm-opt-fake_set --strip-dwarf --post-emscripten -O3 --low-memory-unused --zero-filled-memory --strip-debug --strip-producers WonderlandRuntime-loader-physx.wasm -o WonderlandRuntime-loader-physx.wasmstripped --mvp-features --enable-mutable-globals
mv WonderlandRuntime-loader-physx.wasmstripped output-wasm-opt-fake_set.wasmstripped

@MaxGraey commented Jul 4, 2022

Could you try small_set in EffectsAnalyzer instead of your custom fake_set?

@Squareys commented Jul 4, 2022

@MaxGraey small_set calls the default constructor of std::set, so it wouldn't remove the heap allocation 🤔

(That the heap allocation is the culprit was found with a sampling profiler ("Superluminal"); it's not the count() or insert() functions that are taking the time here. In most cases the set is empty. The difference with std::vector is that it does not allocate in the default constructor for 0 elements.)

@MaxGraey commented Jul 4, 2022

Hmm, in that case I guess it makes sense to use small_set instead of std::set in EffectsAnalyzer, but initialize/allocate them lazily (by wrapping them in std::unique_ptr, for example).
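A minimal sketch of that lazy-initialization idea, with hypothetical field and method names (EffectsAnalyzer's real members differ):

```cpp
#include <cstdint>
#include <memory>
#include <set>

using Index = uint32_t;

struct EffectsLike {
  // No heap allocation at construction; the set is built on first use.
  std::unique_ptr<std::set<Index>> localsRead;

  void noteLocalRead(Index i) {
    if (!localsRead) {
      localsRead = std::make_unique<std::set<Index>>();
    }
    localsRead->insert(i);
  }

  bool readsLocal(Index i) const {
    return localsRead && localsRead->count(i) > 0;
  }
};
```

Since the sets are usually empty, the common path then does no allocation at all, at the cost of one pointer check per access.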

@Squareys commented Jul 4, 2022

@MaxGraey Definitely, that would probably also improve other code that already uses small_set 👍

There also were a bunch of heap allocations around BasicBlock, btw; I think the factory function there always allocates on the heap.

Is the change to small_set, and the use of small_set in EffectsAnalyzer, something that would be accepted as a PR? (@kripken?)
I can't promise to get to it soon, but since the change could be fairly high impact, I think it'd be great to do (happy to sponsor it otherwise, if someone's up for that).

@MaxGraey commented Jul 4, 2022

Also, instead of std::set<Index> or small_set<Index>, it's probably better to use bit-packed sets like std::bitset / dynamic_bitset. That should be faster. Perhaps create a small_bitset.
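A hedged sketch of a fixed-capacity small_bitset along those lines (hypothetical, and only suitable when indices are dense and bounded by N):

```cpp
#include <bitset>
#include <cstddef>

template <size_t N>
struct small_bitset {
  std::bitset<N> bits;

  // Set-like interface over packed bits: O(1) operations, no allocation.
  void insert(size_t i) { bits.set(i); }
  void erase(size_t i) { bits.reset(i); }
  size_t count(size_t i) const { return bits.test(i) ? 1 : 0; }
  size_t size() const { return bits.count(); }
};
```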

@kripken commented Jul 6, 2022

@Squareys The MSVC STL does an allocation in the default constructor of std::set? That's surprising...

It might be worth measuring your code on another STL, just as a comparison point. But, even if it just speeds up MSVC that might be enough of a reason to do this - 25%+ is a lot!

As for how to do this, I am a little worried about the fake-set approach since it has linear behavior on large sizes. Those large sizes are not happening on your testcase I guess, but in general they are possible. Some possible options:

  • A limited-size-fake-set, that aborts if the size grows beyond a point. That would only be usable on things that have a known maximum size, or where the optimizer knows that beyond some size it will bail out anyhow.
  • Wrap the std::set in SmallSet with std::optional. Then its constructor is only called when actually used. But, that might have a cost.
  • Avoid using the system STL. We could just bundle libc++ and get uniform behavior everywhere.
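A minimal sketch of the second option above, assuming a SmallSet-like shape (the real Binaryen SmallSet differs): N inline slots, plus a std::optional fallback whose std::set constructor only runs on overflow.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <set>

template <typename T, size_t N>
struct LazySmallSet {
  std::array<T, N> fixed{};
  size_t used = 0;
  std::optional<std::set<T>> flexible; // engaged only past N elements

  void insert(const T& x) {
    if (flexible) {
      flexible->insert(x);
      return;
    }
    for (size_t i = 0; i < used; i++) {
      if (fixed[i] == x) return; // already present inline
    }
    if (used < N) {
      fixed[used++] = x;
      return;
    }
    // The first (and only) allocation happens here, on overflow.
    flexible.emplace(fixed.begin(), fixed.end());
    flexible->insert(x);
  }

  size_t count(const T& x) const {
    if (flexible) return flexible->count(x);
    for (size_t i = 0; i < used; i++) {
      if (fixed[i] == x) return 1;
    }
    return 0;
  }
};
```

The linear scan is confined to at most N elements, so the worst case stays that of std::set, while the empty/small case never touches the heap.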

@Squareys commented Aug 6, 2022

The MSVC STL does an allocation in the default constructor of std::set? That's surprising...

@kripken Yes, the stack trace is as follows, with links to the Microsoft STL repository:

I honestly have trouble reading the style of code there, though, so please point out in case I'm misinterpreting this here.
One way or another: the allocations were found through profiling with Superluminal, so they are certainly happening.

I'll attempt to contribute the SmallSet change; it should be fairly straightforward. An alternative could be a custom allocator that stores the first allocated value as a member and copies it into the first larger allocation, but that's really complex compared to the SmallSet.

@Squareys commented Aug 7, 2022

I found another interesting bottleneck where all threads seem to be blocked by I/O in one of the threads, a millisecond or more at a time, alternating between all threads saturated and this entirely blocked state:

[profiler screenshot]

Interestingly, the "I/O" here is trap("unreachable"), which throws an exception, and the catching seems to be causing a massive stall in the entire application. I'm unsure if there's a more hidden reason an exception needs to be thrown here; since it's converted to a Flow(NONCONSTANT_FLOW) later anyway, returning that instead should be totally valid and sufficient as far as I can tell.

I'll try to apply this change and let you know how that goes.

@MaxGraey commented Aug 7, 2022

Yeah, exceptions definitely cause perf penalties. See this comment.

@Squareys commented Aug 7, 2022

Replacing the trap() (that throws the exception) with return Flow(NONCONSTANT_FLOW) seems to work perfectly fine and removes all those stalls.

@Squareys commented Aug 7, 2022

Finally, it seems like there are a couple of chunks of 1+ seconds where all the threads wait for work to complete on one of the workers. I'm investigating whether there's a way to make that faster or balance the workload better:
[profiler screenshot]

It looks like a significant portion of time is spent in a heap allocation in doVisitLocalGet in LocalGraph.cpp while emplacing an element in the self->locations map.

=> Allocating LocalGraph::locations upfront in the constructor (and changing it from std::map to std::unordered_map in the process; I'm unsure if that is actually valid to do) with the number of gets and sets expected seems to reduce these stalls by ~30%. I did this by creating a CountAll analogous to FindAll and running it on the func->body for set and get in MergeLocals, passing the count to a new LocalGraph parameter used to reserve locations.
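A hedged sketch of that preallocation idea with hypothetical names (LocalGraph's real structures differ): count first, reserve once, then emplace during the walk.

```cpp
#include <unordered_map>
#include <vector>

struct Expression {};

// With reserve(), the emplace calls during the traversal no longer trigger
// incremental rehashing and repeated allocation.
std::unordered_map<Expression*, size_t>
buildLocations(const std::vector<Expression*>& getsAndSets) {
  std::unordered_map<Expression*, size_t> locations;
  locations.reserve(getsAndSets.size()); // one up-front allocation
  size_t index = 0;
  for (auto* expr : getsAndSets) {
    locations.emplace(expr, index++);
  }
  return locations;
}
```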

Happy to open a PR for this and the exception change, if you determine it valuable.

@kripken commented Aug 8, 2022

Replacing the trap() (that throws the except) with return Flow(NONCONSTANT_FLOW) seems to work perfectly fine and removes all those stalls.

I think that's a good idea. One thing that will need to be done though is to handle it in the fuzzer, I think: we need to log such traps (so we can compare them). But I think that might be a small addition in execution-results.h. A PR would be great!

The LocalGraph change sounds very interesting as well! I'd need to read the PR carefully but I think it can work.

@Squareys commented Aug 8, 2022

Amazing, I'll have time to create those PRs tomorrow evening (CEST) then!

@Squareys commented Aug 9, 2022

While not throwing an exception saves a few milliseconds and makes the profiles look much nicer, it's not really significant compared to the 16,000-20,000 milliseconds the entire wasm-opt run is taking. I wonder if it's worth risking potentially breaking something here. I'll create a PR, but feel free to discard it if you see any risk.

@Squareys commented Aug 9, 2022

And the preallocation, while it should in theory be an improvement, is not significant enough to escape the variance in the timing results 🤔

@Squareys commented Aug 9, 2022

After some more hours of investigation, I found another thing: Function in wasm.h uses 7 std::set and std::unordered_map members, which for my case get allocated 4799 times (potentially more often; I counted by incrementing a static counter in the destructor, and I'm not sure that's thread safe).

That's around 4799 function instantiations * 7 heap allocations * 0.15 ms/allocation = 5 seconds of cost for the default constructors alone (parallelized on multiple threads, though).

Printing the sizes of the sets and maps in the destructor shows that they are always empty? I saw some uses in the code, though, so I'm not sure what's going on here. They might just never be hit for this binary.

Anyway, the option of replacing the Microsoft STL seems more and more useful to me 😅

kripken added a commit that referenced this issue Mar 21, 2023
This makes the pass 2-3% faster in some measurements I did locally.

Noticed when profiling for #5561 (comment)

Helps #4165
@kripken commented Oct 11, 2023

In #4261 the idea came up to use a more efficient hashmap, which could be smaller and faster than the default stdlib one that we use atm. That would be great to look into.

@kripken commented Oct 23, 2023

#6042 has some notes on LocalGraph improvements that could help a particularly large testcase.

@kripken commented Feb 27, 2024

Looking at profiles, it seems that allocations for the main traversal stack may be significant:

SmallVector<Task, 10> stack; // stack of tasks

The same goes for other structures in that file, just because they are the core of all of our walks.

Perhaps we can arena-allocate those stacks and keep reusing them.
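A minimal sketch of the reuse idea (hypothetical; not how Binaryen's walkers are structured today): keep one thread-local buffer whose capacity persists across walks.

```cpp
#include <vector>

struct Task {
  void* currp = nullptr; // placeholder payload for the sketch
};

// clear() keeps the previously grown capacity, so repeated walks on the
// same thread stop paying for stack reallocation.
std::vector<Task>& reusableTaskStack() {
  thread_local std::vector<Task> stack;
  stack.clear();
  return stack;
}
```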

@kripken commented Sep 24, 2024

Over the last month I focused on compile times, and as a result I see an over 2x speedup on wasm-opt -O3 on two large testcases I've been looking at, one C++ (Wasm MVP) and one Java (WasmGC). The improvements are pretty general so I hope this speeds up most builds.

Some details (in reverse chronological order):

@arsnyder16 commented Oct 1, 2024

@kripken Thanks for these! I am going to get some numbers on where my wasm file stands before (3.1.65) and after (3.1.68), using BINARYEN_PASS_DEBUG=1. One thing I am noticing is that the validation steps take a considerable amount of time, but there doesn't appear to be a way to show the timings those take.

Has there been any optimization of the validation code, or is anything planned?

I am disabling validation here

3.1.65:

```
++ ls -l 3.1.65.wasm
-rw-r--r-- 1 root root 99147167 Oct 1 08:52 3.1.65.wasm
++ BINARYEN_PASS_DEBUG=1
++ time /root/emsdk/upstream/bin/wasm-opt --no-validation --strip-target-features --post-emscripten -Os --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --no-stack-ir 3.1.65.wasm -o 3.1.65after.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-multivalue --enable-mutable-globals --enable-reference-types --enable-sign-ext
[PassRunner] running passes
[PassRunner] running pass: strip-target-features... 0.000540104 seconds.
[PassRunner] running pass: post-emscripten... 1.49346 seconds.
[PassRunner] passes took 1.494 seconds.
[PassRunner] running passes
[PassRunner] running pass: duplicate-function-elimination... 3.63673 seconds.
[PassRunner] running pass: remove-unused-module-elements... 1.28464 seconds.
[PassRunner] running pass: memory-packing... 0.247454 seconds.
[PassRunner] running pass: once-reduction... 0.184166 seconds.
[PassRunner] running pass: ssa-nomerge... 11.8826 seconds.
[PassRunner] running pass: dce... 3.7628 seconds.
[PassRunner] running pass: remove-unused-names... 0.570026 seconds.
[PassRunner] running pass: remove-unused-brs... 3.11129 seconds.
[PassRunner] running pass: remove-unused-names... 0.526151 seconds.
[PassRunner] running pass: optimize-instructions... 2.87017 seconds.
[PassRunner] running pass: pick-load-signs... 0.740516 seconds.
[PassRunner] running pass: precompute... 1.71788 seconds.
[PassRunner] running pass: optimize-added-constants-propagate... 23.1096 seconds.
[PassRunner] running pass: code-pushing... 1.90392 seconds.
[PassRunner] running pass: tuple-optimization... 0.025315 seconds.
[PassRunner] running pass: simplify-locals-nostructure... 5.94543 seconds.
[PassRunner] running pass: vacuum... 2.0397 seconds.
[PassRunner] running pass: reorder-locals... 0.771755 seconds.
[PassRunner] running pass: remove-unused-brs... 2.13625 seconds.
[PassRunner] running pass: coalesce-locals... 5.68479 seconds.
[PassRunner] running pass: local-cse... 3.06765 seconds.
[PassRunner] running pass: simplify-locals... 7.11051 seconds.
[PassRunner] running pass: vacuum... 1.78366 seconds.
[PassRunner] running pass: reorder-locals... 0.745307 seconds.
[PassRunner] running pass: coalesce-locals... 9.89549 seconds.
[PassRunner] running pass: reorder-locals... 0.72067 seconds.
[PassRunner] running pass: vacuum... 1.67271 seconds.
[PassRunner] running pass: code-folding... 4.17729 seconds.
[PassRunner] running pass: merge-blocks... 1.37191 seconds.
[PassRunner] running pass: remove-unused-brs... 2.49119 seconds.
[PassRunner] running pass: remove-unused-names... 0.520844 seconds.
[PassRunner] running pass: merge-blocks... 1.26686 seconds.
[PassRunner] running pass: precompute... 1.63496 seconds.
[PassRunner] running pass: optimize-instructions... 2.50066 seconds.
[PassRunner] running pass: rse... 6.76445 seconds.
[PassRunner] running pass: vacuum... 1.80028 seconds.
[PassRunner] running pass: dae-optimizing... 23.9783 seconds.
[PassRunner] running pass: inlining-optimizing... 20.7845 seconds.
[PassRunner] running pass: duplicate-function-elimination... 0.787577 seconds.
[PassRunner] running pass: duplicate-import-elimination... 0.0406488 seconds.
[PassRunner] running pass: simplify-globals-optimizing... 0.243489 seconds.
[PassRunner] running pass: remove-unused-module-elements... 1.02584 seconds.
[PassRunner] running pass: reorder-globals... 0.00628154 seconds.
[PassRunner] running pass: directize... 0.145466 seconds.
[PassRunner] passes took 166.688 seconds.
[PassRunner] running passes
[PassRunner] passes took 0 seconds.
459.78user 0.97system 2:53.95elapsed 264%CPU (0avgtext+0avgdata 2059088maxresident)k
0inputs+82992outputs (0major+444664minor)pagefaults 0swaps
++ ls -l 3.1.65after.wasm
-rw-r--r-- 1 root root 42489720 Oct 1 09:10 3.1.65after.wasm
```
3.1.68:

```
++ ls -l 3.1.68.wasm
-rwxr-xr-x 1 root root 99154714 Oct 1 09:13 3.1.68.wasm
++ BINARYEN_PASS_DEBUG=1
++ time /root/emsdk/upstream/bin/wasm-opt --no-validation --strip-target-features --post-emscripten -Os --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --no-stack-ir 3.1.68.wasm -o 3.1.68after.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-multivalue --enable-mutable-globals --enable-reference-types --enable-sign-ext
[PassRunner] running passes
[PassRunner] running pass: strip-target-features... 0.000651804 seconds.
[PassRunner] running pass: post-emscripten... 1.49796 seconds.
[PassRunner] passes took 1.49861 seconds.
[PassRunner] running passes
[PassRunner] running pass: duplicate-function-elimination... 3.2706 seconds.
[PassRunner] running pass: remove-unused-module-elements... 1.13238 seconds.
[PassRunner] running pass: memory-packing... 0.256579 seconds.
[PassRunner] running pass: once-reduction... 0.200214 seconds.
[PassRunner] running pass: ssa-nomerge... 11.7251 seconds.
[PassRunner] running pass: dce... 3.35628 seconds.
[PassRunner] running pass: remove-unused-names... 0.535297 seconds.
[PassRunner] running pass: remove-unused-brs... 3.16159 seconds.
[PassRunner] running pass: remove-unused-names... 0.549171 seconds.
[PassRunner] running pass: optimize-instructions... 2.80716 seconds.
[PassRunner] running pass: pick-load-signs... 0.748634 seconds.
[PassRunner] running pass: precompute... 1.60599 seconds.
[PassRunner] running pass: optimize-added-constants-propagate... 15.8 seconds.
[PassRunner] running pass: code-pushing... 1.85391 seconds.
[PassRunner] running pass: tuple-optimization... 0.0240201 seconds.
[PassRunner] running pass: simplify-locals-nostructure... 5.95946 seconds.
[PassRunner] running pass: vacuum... 1.88777 seconds.
[PassRunner] running pass: reorder-locals... 0.847257 seconds.
[PassRunner] running pass: remove-unused-brs... 2.23691 seconds.
[PassRunner] running pass: coalesce-locals... 10.0595 seconds.
[PassRunner] running pass: local-cse... 3.32373 seconds.
[PassRunner] running pass: simplify-locals... 7.99739 seconds.
[PassRunner] running pass: vacuum... 1.9047 seconds.
[PassRunner] running pass: reorder-locals... 0.765433 seconds.
[PassRunner] running pass: coalesce-locals... 9.59543 seconds.
[PassRunner] running pass: reorder-locals... 0.797222 seconds.
[PassRunner] running pass: vacuum... 2.02627 seconds.
[PassRunner] running pass: code-folding... 4.30972 seconds.
[PassRunner] running pass: merge-blocks... 1.46414 seconds.
[PassRunner] running pass: remove-unused-brs... 2.57263 seconds.
[PassRunner] running pass: remove-unused-names... 0.532943 seconds.
[PassRunner] running pass: merge-blocks... 1.25059 seconds.
[PassRunner] running pass: precompute... 1.57593 seconds.
[PassRunner] running pass: optimize-instructions... 2.57699 seconds.
[PassRunner] running pass: rse... 6.46657 seconds.
[PassRunner] running pass: vacuum... 1.78285 seconds.
[PassRunner] running pass: dae-optimizing... 26.9038 seconds.
[PassRunner] running pass: inlining-optimizing... 12.3328 seconds.
[PassRunner] running pass: duplicate-function-elimination... 0.716017 seconds.
[PassRunner] running pass: duplicate-import-elimination... 0.0248484 seconds.
[PassRunner] running pass: simplify-globals-optimizing... 0.243064 seconds.
[PassRunner] running pass: remove-unused-module-elements... 0.98064 seconds.
[PassRunner] running pass: reorder-globals... 0.00607164 seconds.
[PassRunner] running pass: directize... 0.128283 seconds.
[PassRunner] passes took 158.296 seconds.
[PassRunner] running passes
[PassRunner] passes took 0 seconds.
370.26user 1.29system 2:44.84elapsed 225%CPU (0avgtext+0avgdata 2037692maxresident)k
0inputs+82992outputs (1major+427131minor)pagefaults 0swaps
++ ls -l 3.1.68after.wasm
-rw-r--r-- 1 root root 42487858 Oct 1 09:16 3.1.68after.wasm
```

I noticed dae-optimizing might have gotten a little slower. I took a crack at this pass a few years ago but don't have time to revisit; one of these did land but caused a regression that had to be rolled back:

#4623
#4641
#4629

@arsnyder16:

I could be misled here - maybe that validation only runs under BINARYEN_PASS_DEBUG?

@kripken commented Oct 1, 2024

@arsnyder16 Yes, the validation times can be ignored in BINARYEN_PASS_DEBUG. A lot more validation is done in that debug mode - both additional costly checks, and also they are done after every pass. Note that it is still slower than normal even with --no-validation because each pass is run to completion before continuing to the next (which is bad for cache locality).

Looks like overall there is a slight improvement, but it might be worth measuring more to reduce the noise. I do perf stat -r 10 (on Linux, and without BINARYEN_PASS_DEBUG), which does an average and reports the deviation.

Specifically I see some passes improved as I'd expect (inlining, optimize-added-constants) but it is odd that coalesce-locals regressed. Perhaps this is noise though?
