
Document some perf numbers #13

Open
conartist6 opened this issue Jan 11, 2021 · 25 comments

Comments

@conartist6
Member

I found some test results which benchmark C regex engines. It includes a corpus and some realistic expressions of varying complexity, so I think I can run the test directly.

I'd like to see how this engine stacks up for speed.

If I care enough to try to do an apples-to-apples comparison I'll also need to build the tests for the other engines so they can run on the same hardware. Really, though, if I'm within an order of magnitude I'll be happy for now, given that I've done all this in a scripting language.

@conartist6 conartist6 changed the title Do perf test Document some perf numbers Jan 11, 2021
@conartist6
Member Author

I should say that I did run the test suite, and this engine is really quite slow. It was a while ago, but IIRC it was maybe 100x slower than a native implementation. At first I thought I had the wrong computational complexity somewhere, but after I stared at the profiling output for a while (and fixed a few issues) it seemed clear to me that time was being spent in the places I expect it to be spent. I guess it's not ideal to write a regex engine in a scripting language? Who knew.

Anyway, I'm not too worried. 100x slower sounds like a lot, but this is not a tool for searching a huge corpus with a complicated expression. Those exist! This is about matching on streams, which was not previously possible in the browser. When your data is a stream, even 100x slower regex parsing may be a trivial cost compared with being unable to do any work until an input stream terminates. A stream may never terminate!

@conartist6
Member Author

I expect to see some very big perf gains from #35. Will measure the speedup soon.

@conartist6
Member Author

It might be worth getting rid of the remaining string comparison in favor of symbols, e.g. use Symbol('cont') vs 'cont'. We know that symbols can always be equality checked in a single operation, whereas string comparisons need to go into character-by-character comparison if strings are not reference-equal.
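A minimal sketch of the idea. The `cont`/`done` tags and the state shape here are hypothetical, standing in for whatever result values the engine actually uses:

```javascript
// Hypothetical sketch: tagging engine results with symbols instead of strings.
// The `cont`/`done` names and the state shape are illustrative, not the real API.
const cont = Symbol('cont');
const done = Symbol('done');

function step(state) {
  // A symbol comparison is a single reference check; no character-by-character walk.
  return state.remaining > 0
    ? { type: cont, remaining: state.remaining - 1 }
    : { type: done, remaining: 0 };
}

let state = { type: cont, remaining: 2 };
while (state.type === cont) {
  state = step(state);
}
```

Since every `state.type === cont` check compares object identities, it costs the same regardless of how long the tag's description is.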

@conartist6
Member Author

The next really hot section will be getSeq. Anything I could do to optimize that would be well worth it.

@conartist6
Member Author

conartist6 commented Mar 30, 2022

It might speed things up to combine adjacent non-match expressions when possible. It could be done in state.explode, probably relatively easily, given that expressions have little functionality or data of their own.
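As a rough analogue of this kind of node-folding pass (the actual expression shapes in state.explode will differ), here is a sketch that merges adjacent single-literal matchers into one wider literal:

```javascript
// Illustrative sketch only: fold adjacent 'literal' matchers into one node
// so the engine takes one step instead of several. Node shapes are assumed.
function combineLiterals(matchers) {
  const out = [];
  for (const m of matchers) {
    const prev = out[out.length - 1];
    if (prev && prev.type === 'literal' && m.type === 'literal') {
      prev.value += m.value; // merge into the previous literal node
    } else {
      out.push({ ...m }); // copy so the input list is left untouched
    }
  }
  return out;
}

const combined = combineLiterals([
  { type: 'literal', value: 'h' },
  { type: 'literal', value: 't' },
  { type: 'literal', value: 'tp' },
  { type: 'charClass', value: '[0-9]' },
]);
// combined is [{ literal 'http' }, { charClass '[0-9]' }]
```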

@conartist6
Member Author

conartist6 commented Mar 30, 2022

I was able to run the benchmark. I'm cheating a bit by using better hardware, but I'm much closer to the average performance now than I was before. I'd say maybe 20x slower instead of 100x, so a 5x speedup from my changes, which seems plausible to me. That incorporates a ~2x fudge factor for the newer hardware, and assumes the average runtime of native libraries on the URI pattern and provided corpus to be ~2s (I didn't do the math).

@conartist6
Member Author

I made some more perf improvements but I forget where I put the benchmarking script :(

@conartist6
Member Author

conartist6 commented May 24, 2022

Oh, I figured out what happened. I was partway through the perf test stuff and got distracted while implementing head-to-head perf testing against native JavaScript regex. The results are... not encouraging.

const corpus = readFile('corpus/howtosmall'); // 4MB of text

let exp;

// URI (protocol://server/path)
exp = /([a-zA-Z][a-zA-Z0-9]*):\/\/([^ /]+)(\/[^ ]*)?/g;
execGlobal(exp, corpus); // 3371 ms
exp.exec(corpus); // 42 ms
// slowdown: 80x

// Email (name@server)
exp = /([^ @]+)@([^ @]+)/g;
execGlobal(exp, corpus); // 3325 ms
exp.exec(corpus); // 66 ms
// slowdown: 50x

// Date (month/day/year)
exp = /([0-9][0-9]?)\/([0-9][0-9]?)\/([0-9][0-9]([0-9][0-9])?)/g;
execGlobal(exp, corpus); // 2184 ms
exp.exec(corpus); // 1 ms
// slowdown: 2184x

// URI|Email
exp = /([a-zA-Z][a-zA-Z0-9]*):\/\/([^ /]+)(\/[^ ]*)?|([^ @]+)@([^ @]+)/g;
execGlobal(exp, corpus); // 6413 ms
exp.exec(corpus); // 98 ms
// slowdown: 65x

@conartist6
Member Author

The 2000x slowdown is the eye-popping one. I assume the native engine has implemented an optimization for certain kinds of extremely simple patterns which allows it to avoid the cost and overhead of a full engine. I think I could easily do something similar for this project.
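One common form of such an optimization (assumed here, not what any particular native engine actually does) is a literal prefilter: scan with `indexOf` for a character the pattern must contain, and only run the full engine at candidate positions. For the date pattern, every match contains a `/`:

```javascript
// Sketch of a literal prefilter: find offsets of a required literal with
// indexOf, so the full engine only starts at plausible positions.
function findCandidates(corpus, literal) {
  const starts = [];
  let i = corpus.indexOf(literal);
  while (i !== -1) {
    starts.push(i);
    i = corpus.indexOf(literal, i + 1);
  }
  return starts;
}

const candidates = findCandidates('born 3/14/1592, died 1/1/1700', '/');
// Each candidate marks where a full match attempt would begin
// (offset backward as needed to cover the part before the literal).
```

`indexOf` runs in heavily optimized native code, so on a 4MB corpus this skips the vast majority of positions without ever touching the engine.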

@conartist6
Member Author

The rest of the patterns execute an average of 70x slower than native. Likely closing the gap on those would require a combination of algorithmic optimizations and getting closer to the hardware. Fortunately it is likely that both of these things are possible.

@conartist6
Member Author

conartist6 commented May 25, 2022

Something to think about: Rust has a streaming regex engine based on RE2. I think I could just ship that as WASM code with bindings. I wonder how big it would be? It has pretty much all the features I want and need except for lookahead.

@conartist6
Member Author

I've been noticing deopts on certain termination cases. These cases may not be hit until the engine is otherwise very hot and well optimized, at which point it deopts. I think the thing to do might be to run a few small matches as a warm-up to ensure that the ICs for those lines contain data.
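A warm-up like that could look something like this. The sketch takes the match function as a parameter so it stands alone; the inputs are illustrative, chosen to hit both the match and no-match termination paths:

```javascript
// Sketch: run a few tiny matches up front so the inline caches along the
// termination paths hold type feedback before any large input arrives.
// `execGlobal` is passed in; the warm-up inputs are illustrative.
function warmUp(execGlobal) {
  const exp = /a/g;
  for (const input of ['a', 'b', 'aa', '']) {
    execGlobal(exp, input); // exercises both the match and no-match exits
  }
}
```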

@conartist6
Member Author

I guess I think for the moment it's worth exploring the optimizations that are still possible in the scripted code. That includes storing nextSeq in the engine's data structure so that it can be accessed like a linked list and caching DFA representations for some parts of the graph.

@conartist6
Member Author

conartist6 commented May 25, 2022

Also possible: keeping pools of expression and sequence nodes to avoid some GC costs? (Think React events)
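A minimal sketch of such a pool (names and node shape are illustrative): release nodes back instead of letting them become garbage, and acquire reuses them:

```javascript
// Sketch of a simple node pool (analogous to React's old event pooling):
// recycle sequence nodes instead of allocating fresh ones. Shape is assumed.
class SeqPool {
  constructor() {
    this.free = [];
  }
  acquire(value) {
    // Reuse a released node if one exists; otherwise allocate.
    const node = this.free.pop() || { value: null, next: null };
    node.value = value;
    node.next = null;
    return node;
  }
  release(node) {
    node.value = null; // drop references so pooled nodes don't pin memory
    node.next = null;
    this.free.push(node);
  }
}

const pool = new SeqPool();
const a = pool.acquire(1);
pool.release(a);
const b = pool.acquire(2); // reuses the same object; no new allocation
```

The trade-off is the same one React hit: a released node must never be used after release, which is easy to get wrong.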

@conartist6
Member Author

OK, I think I still have some REALLY 👏 DUMB 👏 SHIT going on which I can probably gain a lot of perf by fixing. Most importantly, I'm not actually eliminating all the worse patterns when a better pattern succeeds. This is likely to be a real perf-killer for global expressions. It isn't just a matter of semantics: I really need to represent all the sequences being considered as a single linked list, best -> worst. Then seq.removeWorse() will simply remove its link to the next node and ALL worse sequences will be gone, including those belonging to parent expressions, which we currently miss!

In this design the engine doesn't really know about expressions at all -- and why should it? They only really matter in terms of operator precedence and capturing, and both those things are handled elsewhere.

I know I sound like I'm having a major revelation, and I am, but in having it I'm also realizing that I'm the very last person to come to a pretty obvious conclusion, and that in retrospect many of the other tools whose source code I read do exactly this. That is why what I call an engine is often termed a "VM": its principal occupation is maintaining a stack data structure.

I suspect my inability to see this design when I initially built the engine resulted from my "let's just start tinkering" approach, where neither captures nor the engine were features of the simple blog-post engine I used as my starting point. Thus I was designing and building those parts of the system when I had the least understanding of the problem space. In fact, I think it wasn't until I started writing tests that I fully understood the match priority mechanics. At first I thought a longer match was always better! Oops.

With that approach also comes trivial iteration over the active states, making the potential perf improvement, well, quite significant.
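The best -> worst list can be sketched in a few lines (node shape and names are illustrative, not the engine's actual types):

```javascript
// Sketch of a best -> worst singly linked list of candidate sequences.
// When a better sequence succeeds, cutting one link drops all worse ones.
function makeSeq(id) {
  return { id, worse: null };
}

function removeWorse(seq) {
  // Severing this single link abandons every worse sequence in the chain.
  seq.worse = null;
}

function listLength(seq) {
  let n = 0;
  for (let s = seq; s !== null; s = s.worse) n++;
  return n;
}

const best = makeSeq('uri');
best.worse = makeSeq('email');
best.worse.worse = makeSeq('fallback');

const before = listLength(best); // three sequences under consideration
removeWorse(best);
const after = listLength(best); // only the best survives
```

The point is that elimination is O(1) regardless of how many worse sequences exist, and nothing about the structure needs to know which expression each sequence came from.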

@conartist6
Member Author

conartist6 commented May 27, 2022

OK, I reworked the engine. I have the test suite passing again and I don't have any deopts, but while the code is significantly better now, the rework didn't help perf as much as I thought it might. Some cases got a modest speedup.

The numbers are:

// URI (protocol://server/path)
exp = /([a-zA-Z][a-zA-Z0-9]*):\/\/([^ /]+)(\/[^ ]*)?/g;
execGlobal(exp, corpus); // 2926 ms

// Email (name@server)
exp = /([^ @]+)@([^ @]+)/g;
execGlobal(exp, corpus); // 2960 ms

// Date (month/day/year)
exp = /([0-9][0-9]?)\/([0-9][0-9]?)\/([0-9][0-9]([0-9][0-9])?)/g;
execGlobal(exp, corpus); // 5107 ms

// URI|Email
exp = /([a-zA-Z][a-zA-Z0-9]*):\/\/([^ /]+)(\/[^ ]*)?|([^ @]+)@([^ @]+)/g;
execGlobal(exp, corpus); // 6413 ms

That said, this is a complete rework of the engine code, and there may still be some gains to be had from fine-tuning, as well as from other optimization possibilities my changes have opened up.

@conartist6
Member Author

conartist6 commented May 27, 2022

I'm very curious to know how much of the current burden is on the GC. That will inform which direction I go next. I'm also fairly sure I can do better on monomorphism of the engine data structure. I'm still using class extension in it, even though I've recently learned that class extension is a known perf-killer in high-performance data structures. I didn't want to change the code style at the same time I was changing the design. It was a big enough job already!
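One common alternative to subclassing in hot data structures (sketched here with assumed names) is a single object shape with a tag field, so every node shares one hidden class and property-access sites stay monomorphic:

```javascript
// Sketch: instead of Width0Matcher extends Matcher / Width1Matcher extends
// Matcher, use one factory producing a single shape with a `kind` tag.
// Field names are illustrative.
function makeNode(kind, value) {
  // Every node gets the same fields in the same order: one hidden class.
  return { kind, value, next: null };
}

const nodes = [makeNode('width0', null), makeNode('width1', 'a')];
// A loop over `nodes` only ever sees one shape, so `node.kind` access
// stays on the engine's fast monomorphic path.
const kinds = nodes.map((node) => node.kind);
```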

@conartist6
Member Author

I've merged the work from my branch. The current numbers as of 196239e are:

// URI (protocol://server/path)
exp = /([a-zA-Z][a-zA-Z0-9]*):\/\/([^ /]+)(\/[^ ]*)?/g;
execGlobal(exp, corpus); // 3141 ms

// Email (name@server)
exp = /([^ @]+)@([^ @]+)/g;
execGlobal(exp, corpus); // 3142 ms

// Date (month/day/year)
exp = /([0-9][0-9]?)\/([0-9][0-9]?)\/([0-9][0-9]([0-9][0-9])?)/g;
execGlobal(exp, corpus); // 2197 ms

// URI|Email
exp = /([a-zA-Z][a-zA-Z0-9]*):\/\/([^ /]+)(\/[^ ]*)?|([^ @]+)@([^ @]+)/g;
execGlobal(exp, corpus); // 5488 ms

@conartist6
Member Author

conartist6 commented May 27, 2022

Currently bench.js matches URI|Email a bit faster than perf.test.js does -- 4s vs. 5.4s. During that 4s, 228 ms are spent in the garbage collector. Not bad, I'd say.

@conartist6
Member Author

OK, yep, I think this is all pretty close to the correct perf costs. There are still some smaller optimizations to make: I don't think I'm doing myself any favors by allowing null as a result, since I just end up having to check for it in a lot of places. But that might get me 5-10%, and I still need ways to reduce costs by orders of magnitude. That likely means it's time to shift the focus toward algorithmic optimizations like DFA construction and perhaps special cases for non-branching patterns.

@conartist6
Member Author

I would also like to get some more parser-ish test cases, since my current benchmark focuses on a type of use case that isn't my primary concern to support. I want more perf feedback on matching small patterns with sticky matching.

@conartist6
Member Author

OK I can't wring any more water from this stone. I think it's time to look at the dreaded DFA optimization. I've heard it can be an order of magnitude speedup, and that's what I need.

I found a nice post that talks about the implementation of this optimization in formal terms that I am not very familiar with: https://nitely.github.io/2020/01/19/a-dfa-for-submatches-extraction.html

Fortunately now that the code is the way it is I'm fairly confident that I understand (or will be able to understand) what is meant. What I've dubbed "matchers" are transitions, and my distinction between width1 matchers and width0 matchers is the difference between transitions and epsilon (ε) transitions.
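In those terms, the core operation DFA construction needs is the epsilon closure: the set of states reachable from a state set via width0 (ε) transitions alone. A self-contained sketch, with the edge map as an assumed representation:

```javascript
// Sketch: epsilon closure of a state set. `epsilonEdges` maps a state to
// the states its width0 (epsilon) matchers reach; representation is assumed.
function epsilonClosure(states, epsilonEdges) {
  const closure = new Set(states);
  const stack = [...states];
  while (stack.length > 0) {
    const s = stack.pop();
    for (const t of epsilonEdges.get(s) || []) {
      if (!closure.has(t)) {
        closure.add(t);
        stack.push(t); // newly reached states may have epsilon edges too
      }
    }
  }
  return closure;
}

// e.g. state 0 -ε-> 1, 1 -ε-> 2
const edges = new Map([[0, [1]], [1, [2]]]);
const closure = epsilonClosure([0], edges);
```

Each DFA state is then a closed set like this, and width1 matchers define the transitions between such sets.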

@conartist6
Member Author

I think the most useful resource of all will be https://github.com/rust-lang/regex/blob/master/src/dfa.rs

They've based this code on RE2, including RE2's "DFA caching" algorithm. Plus it's aggressively commented, and in a language that's a) less messed up than C++ and b) one I should learn anyway.

@conartist6
Member Author

OK, biggest question first. Rust is portable to JS via WASM. Should I just try to use the Rust engine as written? I don't really know how much work it would be, nor how fast it would be. Unlike RE2 it would be running natively, even in a browser. It would need to ping-pong back and forth between JS and WASM once or twice per input character, but I suspect that cost is not likely to be the dominant one.
