Skip to content

02. Table‐driven decoder

flysand7 edited this page Sep 21, 2024 · 1 revision

Table-driving

Last time we have come up with a textual representation of instruction encoding. This wasn't just to understand what is really needed to decode an isntruction, but also to set us up for table-driving our disassembler.

image

The reason we want to table-drive our disassembler should be quite obvious. It's incredibly hard to write a disassembler using a giant switch statement, and it gets much, much worse the more instruction sets you add into the mix.

Table-driven code is one of the beautiful manifestations of the idea that data can act as code. x86-disasm uses a textual format not too different from the one we've come up with as the input to a table generator that outputs source code that contains tables in a machine-readable format.

But overall there are a few approaches to table-driving we could use alternatively:

  1. Have the disassembler read the textual format and use the parsed file to decode instructions;
  2. Have a generator that will convert the textual format into source code that parses the necessary opcodes;
  3. Have a generator that will convert the textual format into source code file that contains tables in a machine-readable format.

Approach 1 could be useful for a kind of decoder that doesn't need recompilation before retrying the table. This approach can be useful in early stages of development, but not much else there is about it.

Approach 2 can be better in some ways than approach 3, because compilers can optimize your code, they can't optimize your data. But I'm not sure, I haven't tried doing that approach. It sounds like a lot of pain.

Two-stage tables

Let's think about the simplest way to put all of the inforation from our textual table into a machine-readable format. The first thing that comes to mind is an array of 256 entries, each indexed by the opcode and containing all of the information from the textual table. Great!

Wait, no, there's RX-extended opcodes. So when we encounter the opcode, we do want to dereference some kind of opcode table, check if we have the mod/rm byte and rx-extended opcode.

In case of rx-extended opcodes we can add another table, that will contain all rx extensions. One rx extension is an array of 8 elements, indexed by rx field of the mod/rm byte, storing the index to the rest of the encoding.

In case of non-rx-extended opcodes, the opcode table directly stores the index to the rest of the encoding.

Schematically, the table would look like this:

image

Some parts of the encoding, like the encoding kind, extra operand kind and data size override will be present in the first stage (opcode table). Because this is all the information we need to obtain the length of instruction without fully decoding it, which we'll want to have for later (foreshadowing).

The parts of encoding that dictate how the instruction should be interpreted are stored in the second stage of the table.

Code for decoding the table

disasm_one :: proc(bytes: []u8) -> (res: Instruction, idx: int, ok: bool) {
    /* ... Parse prefixes ... */
    opcode := bytes[idx]
    // Decode stage 1
    stage1_entry := stage1_table[opcode]
    (stage1_entry.entry_idx != 0) or_return
    if stage1_entry.force_ds != DS_DEFAULT {
        ds = stage1_entry.force_ds // Apply data size override
    }
    /* ... Parse mod/rm ... */
    /* ... Parse extra operands ... */
    // Decode stage 2
    encoding := Encoding {}
    if stage1_entry.kind == .Rx_Extend {
        stage2_idx := rx_ext_table[stage1_entry.entry_idx][modrm_byte.rx]
        encoding = stage2_table[stage2_idx]
    } else {
        encoding = stage2_table[stage1_entry.entry_idx]
    }
    /* ... Interpret rx and rm from parsed modrm ... */
    /* ... Apply instruction flags ... */
    res = Instruction {
        mnemonic = encoding.mnemonic,
        flags = flags,
        rx_op = rx,
        rm_op = rm,
        extra_op = eop,
    }
    ok = true
    return
}

Generating the tables

Here's me after spending 6 hours writing code to come from textual format into the machine-readable format:

image

Well, it's nothing that hard, because it only took 6 hours and a little bit of debugging. Our format is simple enough that it can be easily parsed and transformed.

The actual code for table generation acts more like a CLI tool. In order to speed up debugging I added flags to print the parsed output.

flysand7@~/x86-disasm(main)$ ./table-gen .\tables\16bit.txt disasm/table_gen.odin -line:10
mov a2 rx=GPReg(0) rm=GPReg eop=Disp ds=1 
+-> Stage1 [0xa2] = Stage1_Encoding{kind = "None", entry_idx = 9, eop = "Disp", force_ds = 1}
+-> Stage2: [162] = Encoding{mnemonic = "mov", flags = bit_set[Table_Entry_Flag]{}, rx_value = 0, rx_kind = "GPReg", rm_kind = "GPReg"}

We can print the whole thing, only a mnemonic, a specific opcode or an entry coming from a specific line of the text file. In the snipped above I show an example of printing the encoded information about the mov instruction on line 10. This is incredibly helpful for debugging table generator.