Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework IR as a graph of extended basic blocks #104

Closed
jserv opened this issue Jan 4, 2023 · 10 comments
Closed

Rework IR as a graph of extended basic blocks #104

jserv opened this issue Jan 4, 2023 · 10 comments
Assignees

Comments

@jserv
Copy link
Contributor

jserv commented Jan 4, 2023

The intermediate representation (IR) function used in wip/jit branch is a graph of basic blocks. Extended basic block (EBB; or superblock) is a collection of BBs with one label at the beginning and internal labels, each of which is the target of only one internal jump and no external jumps. EBB may contain internal branch instructions and is closer to how machine code works.

LLVM uses phi instructions in its SSA representation. Cranelift passes arguments to EBBs instead. The two representations are equivalent, but the EBB arguments are better suited to handle EBBs that may contain multiple branches to the same destination block with different arguments. Passing arguments to an EBB looks a lot like passing arguments to a function call, and the register allocator treats them very similarly. Arguments are assigned to registers or stack locations.

The definition of Control flow graph (CFG):

  • A rooted directed graph G = (N,E), where N is given by the set of basic blocks + two special BBs: entry and exit.
  • And edge connects two basic blocks b1 and b2 if control can pass from b1 to b2.
  • An edge(s) from entry node to the initial basic block(s?)
  • From each final basic blocks (with no successors) to exit BB.

Extended basic block

  • a maximal sequence of instructions beginning with a leader that contains no join nodes other than its first node.
  • Has a single entry, but possible multiple exit points.
  • Some optimizations are more effective on extended basic blocks.

Extended basic block

We can identify loops by using dominators

  • a node A in the flowgraph dominates a node B if every path from entry node to B includes A.
  • This relations is antisymmetric, reflexive, and transitive.
  • back edge: An edge in the flow graph, whose head dominates its tail
  • A loop consists of all nodes dominated by its entry node (head of the back edge) and having exactly one back edge in it.

identify loops

The goal of dominators and postdominators is to determine loops in the flowgraph.

Use case: ARMware, an ARMv4 / Compaq iPAQ emulator, has a built-in threaded code engine which will cache an EBB (extended basic block) of ARM codes, so that it can increase the execution speed. Further more, ARMware has a built-in dynamic compiler which will translate an EBB of ARM codes into a block of x86 machine codes, so that it can increase the runtime performance dramatically. The optimization techniques implemented in this dynamic compiler include:

  • Redundant condition code calculation elimination
  • Global grouping conditional execution instruction
  • Redundant jump elimination
  • Dead code elimination
  • Constant folding
  • Global Common Subexpression Elimination
  • Global redundant memory operation elimination
  • Algebraic canonicalization
  • Global SSA form based linear scan register allocation

Reference:

@qwe661234
Copy link
Collaborator

According to our strategy for extended basic blocks, if the branch instruction of previous basic block end is not taken, this branch instruction would not be regarded as the end of basic block. Therefore, we continue to add the instructions to the previous basic block to create an extended basic block. However, when we emulate the same block again, the control flow can alter, thus we need to add some mechanism to check this circumstance. The implementation of this approach will be added to rvemu later.

@qwe661234
Copy link
Collaborator

Another method we're planning to use to reduce memory usage is to discard basic blocks or extended basic blocks after emulation rather than adding them to cache if the number of instructions in them is less than a predetermined threshold.

@qwe661234
Copy link
Collaborator

qwe661234 commented Mar 17, 2023

The commit 9f9e41b implements the strategy of extended basic blocks. With extended basic blocks, the perfomance increases more than 7.6 % and the memory usage decreases more than 16%.

  • Baseline
Model Compiler Iterations / Sec Memory uasge
Core i7-8700 clang-15 971.951 42,778,469
Core i7-8700 gcc-12 963.336 43,172,197
  • Extend basic blocks
Model Compiler Iterations / Sec Memory uasge Speedup
Core i7-8700 clang-15 1045.885 36,872,501 +7.8%
Core i7-8700 gcc-12 1038.956 36,257,301 +7.6%

The commit 07892d2 implements the strategy of discarding basic blocks contain less intructions. According to the data, performance declines as we discard more blocks. As a result, less memory is needed. In code-copying and JIT, we think that block discarding can be more effective.

  • Discard blocks
Test Case Compiler Iterations / Sec Memory uasge
Baseline clang-15 971.951 42,778,469
Baseline gcc-12 963.336 43,172,197
Retain the number of instruction in a block >= 2 clang-15 883.651 39,259,477
gcc-12 885.56 39,382,517
Retain the number of instruction in a block >= 3 clang-15 771.202 34,485,525
gcc-12 763.199 33,821,109
Retain the number of instruction in a block >= 5 clang-15 611.651 15,069,813
gcc-12 583.227 15,094,421
Retain the number of instruction in a block >= 10 clang-15 358.093 5,521,909
gcc-12 322.733 5,595,733
  • Extend basic blocks + Discard blocks
Test Case Compiler Iterations / Sec Memory uasge
Baseline clang-15 971.951 42,778,469
Baseline gcc-12 963.336 43,172,197
Retain the number of instruction in a block >= 2 clang-15 986.645 34,042,581
gcc-12 969.388 39,382,517
Retain the number of instruction in a block >= 3 clang-15 858.206 20,040,629
gcc-12 852.348 19,868,373
Retain the number of instruction in a block >= 5 clang-15 702.116 13,445,685
gcc-12 675.100 13,470,293

@qwe661234
Copy link
Collaborator

The results demonstrate the performance of the extended basic block(EBB) and the basic block(BB) in various benchmark tools.

dhrystone

  • Model: Core i7-8700
Test Compiler DMIPS SpeedUp
BB clang-15 871
BB gcc-12 897
EBB clang-15 841 -3.5%
EBB gcc-12 873 -2.7%

nqueens

  • Model: Core i7-8700
Test Compiler msec SpeedUp
BB clang-15 9763.4
BB gcc-12 9843.38
EBB clang-15 9760.16 +0%
EBB gcc-12 9473.54 +3.9%

richards

  • Model: Core i7-8700
Test Compiler us SpeedUp
BB clang-15 364194
BB gcc-12 354272
EBB clang-15 355469 +2.5%
EBB gcc-12 344543 +2.8%

fib45

  • Model: Core i7-8700
Test Compiler msec SpeedUp
BB clang-15 16,4703.04
BB gcc-12 16,2448.84
EBB clang-15 15,4621.30 +6.5%
EBB gcc-12 15,7108.38 +3.4%

@qwe661234
Copy link
Collaborator

The results demonstrate the performance of the extended basic block with unconditional jump(EBB) and the basic block(BB) in various benchmark tools.

  • CoreMark
  • Model: Core i7-8700
Test Compiler Iterations / Sec Memory uasge Speedup
BB clang-15 971.951 42,778,469
BB gcc-12 963.336 43,172,197
EBB clang-15 1118.954 31,040,405 +15.1%
EBB gcc-12 1106.232 32,861,397 +14.8%

dhrystone

  • Model: Core i7-8700
Test Compiler DMIPS SpeedUp
BB clang-15 871
BB gcc-12 897
EBB clang-15 858 -1.5%
EBB gcc-12 843 -6.4%

nqueens

  • Model: Core i7-8700
Test Compiler msec SpeedUp
BB clang-15 9763.4
BB gcc-12 9843.38
EBB clang-15 9178.82 +6.4%
EBB gcc-12 9358.53 +5.2%

richards

  • Model: Core i7-8700
Test Compiler us SpeedUp
BB clang-15 364194
BB gcc-12 354272
EBB clang-15 320206 +13.7%
EBB gcc-12 333502 +6.2%

fib45

  • Model: Core i7-8700
Test Compiler msec SpeedUp
BB clang-15 16,4703.04
BB gcc-12 16,2448.84
EBB clang-15 15,3074.11 +7.6%
EBB gcc-12 15,4818.54 +4.9%

@jserv
Copy link
Contributor Author

jserv commented Mar 19, 2023

Real cases for extended basic block (or super block):

  • QEMU: See TCG Intermediate Representation
    • A TCG extended basic block is a single entry, multiple exit region which corresponds to a list of instructions terminated by a label or an unconditional branch. Specifically, an extended basic block is a sequence of basic blocks connected by the fall-through paths of zero or more conditional branch instructions.
  • libvex (part of Valgrind): See libvex_ir.h. The code is broken into small code blocks ("superblocks", type: IRSB). Each code block typically represents from 1 to perhaps 50 instructions. IRSBs are single-entry, multiple-exit code blocks. Each IRSB contains three things:
    • a type environment, which indicates the type of each temporary
      value present in the IRSB
    • a list of statements, which represent code
    • a jump that exits from the end the IRSB

For high level descriptions of libvex IR, check PyVEX.

@qwe661234
Copy link
Collaborator

qwe661234 commented Mar 22, 2023

Because the previous strategy's performance improvement is limited, we propose a new mechanism for implementing extended basic blocks. For new strategy, any block whose usage times exceed the threshold we set would be extended. We extend basic block by appending the basic block of branch not taken path and branch taken path to it, allowing us to emulate a basic block and its branch target with an internal jump rather than an external jump.

The commit 3e463e0 implements this strategy. With extended basic blocks, the perfomance of runnung CoreMark increases more than 12 %.

Results

  • Model: Core i7-8700
Test Compiler Iterations / Sec Speedup
BB clang-15 971.951
BB gcc-12 963.336
EBB clang-15 1087.284 +11.8%
EBB gcc-12 1110.705 +15.3%

Extended Basic Block Statistics vs Basic Block Statistics

The first statistic image records the number of the original basic block which using times more than 100000 and the number of instruciton it contains. We can see that most blocks have few instructions.

The second statistic image records the number of the extended basic block which using times more than 100000 and the number of instruciton it contains. Because we only extend the basic blocks end with conditional jump instruction, we discovered that all of the extended basic blocks with instructions less than 5 end with unconditional jump instruction. Nevertheless, we still resolve the problem that most blocks have few instructions.

runtime

runtime

@jserv
Copy link
Contributor Author

jserv commented Mar 24, 2023

The second statistic image records the number of the extended basic block which using times more than 100000 and the number of instruciton it contains. Because we only extend the basic blocks end with conditional jump instruction, we discovered that all of the extended basic blocks with instructions less than 5 end with unconditional jump instruction. Nevertheless, we still resolve the problem that most blocks have few instructions.

What are your objectives with respect to blocks with few instructions? I mean, we do need solid algorithms rather than unintentional trials.

@qwe661234
Copy link
Collaborator

Based on our results, we can conclude that the more instructions in a block, the better the performance. However, we do not want the number of instructions in a block to become too large, as this would make it unsuitable for translating to machine code. Moreover, only blocks that are frequently used need to be extended.

  • We only extend a block once to keep it from becoming too large.
  • We set a threshold for only translating blocks that are frequently used. We can only extend it if the usage times exceed the threshold.

We extend basic block by appending the basic block of branch not taken path and branch taken path to it, allowing us to emulate a basic block and its branch target with an internal jump rather than an external jump.

  • External jumps have a high overhead because they involve finding or decoding a basic block. A internal jump, on the other hand, only includes adding an offset to the target IR.
  • This strategy can efficently extend a loop body. We can see the following control flow graph of BB and its EBB after extending, we save two external jump in these two cases.

Case1

Case2

@jserv
Copy link
Contributor Author

jserv commented Apr 19, 2023

We might revisit EBB while implementing JIT compiler. At the moment, let's concentrate on #81.

@jserv jserv closed this as completed Apr 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants