Investigate interpreter dispatch methods #97
Mike Pall, creator of LuaJIT, talked about writing a fast interpreter with control-flow graph optimization. The control-flow graph of an interpreter with C switch-based dispatch looks like this:
Each individual instruction execution looks like this:
We're talking here about dozens of instructions and hundreds of slow paths. The compiler has to deal with the whole mess and gets into trouble:
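For concreteness, a switch-based dispatch loop can be sketched for a hypothetical three-opcode stack machine (PUSH/ADD/HALT, not rv32emu's actual instruction set); every handler funnels back through the single dispatch point at the top of the loop:

```c
#include <stdint.h>

/* Hypothetical 3-opcode stack machine, for illustration only. */
enum { OP_PUSH, OP_ADD, OP_HALT };

static int run_switch(const uint8_t *code)
{
    int stack[16], sp = 0;
    for (;;) {
        switch (*code++) { /* the single shared dispatch point */
        case OP_PUSH:      /* push the next byte as an immediate */
            stack[sp++] = *code++;
            break;
        case OP_ADD:       /* pop two values, push their sum */
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_HALT:      /* return the top of the stack */
            return stack[sp - 1];
        }
    }
}
```

For example, the program `{OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT}` evaluates 2 + 3 and returns 5.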
We can use a direct or indirect-threaded interpreter even in C, e.g. with the computed goto ('&&label') feature of GCC:
This effectively replicates the load and the dispatch, which helps the CPU branch predictors. But it has its own share of problems:
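A computed-goto version (a GCC/Clang extension) can be sketched on a hypothetical three-opcode stack machine (PUSH/ADD/HALT); each handler ends with its own indirect jump, which is what replicates the load and dispatch:

```c
#include <stdint.h>

/* Hypothetical 3-opcode stack machine, for illustration only. */
enum { OP_PUSH, OP_ADD, OP_HALT };

static int run_threaded(const uint8_t *code)
{
    /* GCC "labels as values": one label per opcode handler. */
    static const void *dispatch[] = { &&do_push, &&do_add, &&do_halt };
    int stack[16], sp = 0;

/* Each handler performs its own dispatch, so the CPU sees one
   indirect branch per handler instead of one shared branch. */
#define NEXT() goto *dispatch[*code++]
    NEXT();
do_push:
    stack[sp++] = *code++;
    NEXT();
do_add:
    sp--;
    stack[sp - 1] += stack[sp];
    NEXT();
do_halt:
    return stack[sp - 1];
#undef NEXT
}
```

For example, the program `{OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT}` returns 5.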
If you write an interpreter loop in assembler, you can do much better. Here's how this would look:
You can get this down to just a few machine code instructions.
Fast VMs without assembly - speeding up the interpreter loop: threaded interpreter, Duff's device, JIT, Nostradamus distributor by the author of the Bochs x86 emulator.
Virtual Machine Dispatch Experiments in Rust
JamVM was an efficient interpreter-only Java virtual machine that used a code-copying technique.
Our experimental results show that the more instructions a basic block contains, the greater the impact of TCO. We also found that if a branch instruction ends a basic block, each block contains only a few instructions. To enlarge the number of instructions in a basic block, we relax the definition of a basic block and end a block only at a jump or call instruction, rather than at any branch instruction. The related implementation is in branch wip/enlarge_insn_in_block.
CoreMark results
Comparison of the number of instructions in a basic block. Model: Core i7-8700, compiler: clang-15
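A tail-call-based dispatch in the spirit of #95 can be sketched as follows, on a hypothetical three-opcode stack machine (PUSH/ADD/HALT); it relies on the compiler performing sibling-call optimization (e.g. GCC/Clang at -O2) so each `return table[...](...)` compiles to a jump rather than a call:

```c
#include <stdint.h>

/* Hypothetical 3-opcode stack machine, for illustration only. */
enum { OP_PUSH, OP_ADD, OP_HALT };

struct vm { int stack[16]; int sp; };
typedef int (*handler_t)(struct vm *, const uint8_t *);

static int op_push(struct vm *, const uint8_t *);
static int op_add(struct vm *, const uint8_t *);
static int op_halt(struct vm *, const uint8_t *);

static const handler_t table[] = { op_push, op_add, op_halt };

/* Tail-call the handler of the next opcode; with sibling-call
   optimization this is a jump, so the C stack does not grow. */
#define NEXT(vm, code) return table[*(code)]((vm), (code) + 1)

static int op_push(struct vm *vm, const uint8_t *code)
{
    vm->stack[vm->sp++] = *code; /* code points at the immediate */
    NEXT(vm, code + 1);          /* skip the immediate operand */
}

static int op_add(struct vm *vm, const uint8_t *code)
{
    vm->sp--;
    vm->stack[vm->sp - 1] += vm->stack[vm->sp];
    NEXT(vm, code);
}

static int op_halt(struct vm *vm, const uint8_t *code)
{
    (void) code;
    return vm->stack[vm->sp - 1];
}

static int run_tailcall(const uint8_t *code)
{
    struct vm vm = { .sp = 0 };
    NEXT(&vm, code);
}
```

For example, the program `{OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT}` returns 5.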
Web49 claims to be a faster WebAssembly interpreter, using "one big function" with computed goto.
This discussion covers two limitations of computed gotos. For the first, grouping instructions that use the same registers into their own functions would help (arithmetic expressions tend to generate sequences like this). The second limitation is the inability to derive the address of a label from outside the function: you always end up with some amount of superfluous conditional code for selecting the address inside the function, or indexing through a table. One solution proposed in the discussion is to export the goto labels directly using inline assembly. Further, inline assembly can now represent control flow, so you can define the labels in inline assembly and place the computed jump at the end of an opcode handler; that is fairly robust to compiler transforms.
To investigate dispatch via machine-code inlining, I intend to follow the machine-code inlining technique in JamVM and rewrite the dispatch function of
Superinstructions are a well-known technique for improving interpreter performance: they eliminate jumps between virtual machine operations (interpreter dispatch) and enable more optimizations in the merged code. Quote from Towards Superinstructions for Java Interpreters:
See also: Threaded code and quick instructions for Kaffe
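As a minimal sketch of the superinstruction idea, a peephole pass can rewrite a common opcode pair into one fused opcode before execution, removing one dispatch per pair; the names here (OP_PUSH_ADD, fuse) are hypothetical and the bytecode is a toy PUSH/ADD/HALT machine:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy bytecode plus one fused superinstruction. */
enum { OP_PUSH, OP_ADD, OP_HALT, OP_PUSH_ADD /* PUSH imm; ADD */ };

/* Rewrite every "PUSH imm; ADD" pair into "PUSH_ADD imm".
   Returns the length of the rewritten program in 'out'. */
static size_t fuse(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        if (in[i] == OP_PUSH && i + 2 < n && in[i + 2] == OP_ADD) {
            out[o++] = OP_PUSH_ADD; /* fused opcode */
            out[o++] = in[i + 1];   /* keep the immediate */
            i += 3;
        } else if (in[i] == OP_PUSH) {
            out[o++] = in[i++];     /* opcode */
            out[o++] = in[i++];     /* immediate */
        } else {
            out[o++] = in[i++];
        }
    }
    return o;
}
```

The interpreter would then provide one handler for OP_PUSH_ADD that does the push and the add with a single dispatch, and the merged handler body gives the compiler a larger window to optimize.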
US patent USRE36204E, "Method and apparatus for resolving data references in generated code", filed by Sun Microsystems, has expired.
This branch is for investigating code-copying dispatch. However, there are some issues now, as we cannot reuse the copied page for emulation. It can pass some fundamental tests, such as
Can you use GCC's
Only the function
Using inline assembly to push the right return address and jump to the function
In the most recent commit of this branch, we can reuse the copied page to emulate some fundamental tests, but some tests still need to be fixed. There is an important issue with memory page size: some basic blocks in our arch-test are so large that they require approximately 95537 bytes of memory. This can be solved by increasing the memory page size to more than 95537 bytes, but that wastes memory, because most basic blocks require less than 8152 bytes.
The problem was resolved by roughly estimating how many memory pages a basic block needs before allocating it.
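A rough sketch of such a per-block estimate, with a hypothetical helper name and assuming a fixed page size, is just a round-up division over an upper bound of the bytes the block will emit:

```c
#include <stddef.h>

/* Hypothetical sizing helper: round a block's estimated size up to
   whole pages, so large blocks get enough room while small blocks
   do not force a huge fixed page size for everyone. */
static size_t pages_needed(size_t block_bytes, size_t page_size)
{
    return (block_bytes + page_size - 1) / page_size;
}
```

With 4096-byte pages, a 95537-byte block needs 24 pages, while a typical 8152-byte block needs only 2.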
This branch is used to investigate code-copying dispatch, and the latest commit, compiled with
In the latest commit, I converted all functions called from the copied code into function pointers stored in a structure
It looks reasonably good because there are many small blocks. Next, extend the scope of a block by changing how blocks are separated.
The Cacao interpreter is accompanied by several novel research papers along with its open-source work.
Closing in favor of the baseline JIT compiler.
It would still make sense to consolidate the existing interpreter as the foundation of tiered compilation before we actually develop the JIT compiler (#81). See A look at the internals of 'Tiered JIT Compilation' in .NET Core for context. Although #95 uses tail-call optimization (TCO) to reduce interpreter dispatch cost, we still need to investigate several interpreter dispatch techniques before deciding how to move forward with further performance improvements and code maintenance.
The author of wasm3 provides an interesting project, interp, which implements the following methods:
Preliminary experiments on an Intel Xeon CPU E5-2650 v4 @ 2.20GHz with bench.
[ Calls Loop ]
[ Switching ]
[ Direct Threaded Code ]
[ Token (Indirect) Threaded Code ]
[ Tail Calls ]
[ Machine Code Inlining ]
After #95 is merged, we are concerned about
Reference: