Skip to content

01. Core ideas in 8086 encoding

flysand7 edited this page Sep 18, 2024 · 6 revisions

Ever since Odin has gotten the long-awaited bit_field feature I started to want to return to working on this disassembler project. The reason being, when I first started working on it, I wanted to make it fast and to make it fast I wanted to compress my decoding tables, which resulted in me simulating bitfields with procedures, which was a lot of effort, especially when I needed to change the amount of bits allocated per field. There were also some scenarios where having bitfields would be "nice", for example not having to read out individual bits of Mod/RM bytes and REX prefixes. In any case, now I feel like writing a dissassembler is a bit easier than it was before.

The other part of the reason why I returned was one of the favours I received to break down and explain my table-driven x86 decoder. I like explaining stuff, so here we go.

The series of articles posted here are aimed to explain the x86 decoding to folks who are curious how the x86 encoding is organized and how one can use the hidden assumptions to simplify the process of decoding. There are very few resources that can break it down in an easy-to-digest format. The reason for that is quite obvious: x86 encoding seems complicated. Well, at first anyway. Understanding the x86 encoding on a deep enough level to explain it to someone else well is a serious time dedication.

The way we're going to study the x86 encodings will be straightforward. We'll from the simplest possible subset of x86, in 16-bit code. At the start we'll only consider 8086 and gradually build up our understanding of the encoding as we're introducing new concepts. During the course of these articles I will be introducing my own terminology, because, well, intels' is confusing and verbose.

Introduction

The x86 instruction set architecture has evolved over the years, in a backwards compatible way. What's interesting to us is that within each new instruction set added into x86, the ideas about encoding stay consistent across all instructions, with some exceptions. Therefore studying one instruction set also sets us up to understanding newer instruction sets.

For our disassembler we will be covering the following instruction sets:

  • 16-bit mode
  • 32-bit mode
  • FPU instructions
  • 64-bit mode
  • MMX
  • SSE, SSE2, SSE3, SSSE3
  • AVX, AVX2
  • Maybe some other extensions...

AVX512 will be left out on purpose.

This can sound like a long list, but really, we only have to care about the first two modes, the 64-bit mode, SSE and AVX, because those are the major points where new ideas are introduced in any major capacity. All other instruction sets are simple extensions of the original ideas.

1. Instruction encoding

x86 instructions are encoded using variable-length byte stream, meaning a single instruction can take up a variable amount of bytes in memory. An instruction encoding consists of several parts:

  • Optional prefix bytes
  • Opcode (upto 3 bytes)
  • Mod/RM byte
  • Immediate data

In terms of disassebling here's what that means. We first would parse the prefix bytes, independent of the kind of instruction we're going to encounter next. Then we read the opcode and based on the opcode determine whether mod/rm byte is present and what kind of immediate data follows the instruction.

Operand encoding

In 8086, an instruction can have upto three operands. We'll give each one of these operands a name, and whatever encoding we'll encounter, it's generally going to fall into one of these types. The operands kinds are:

  • rx (A register operand)
  • rm (A register or memory operand)
  • eop (Extra operand)

An instruction can not have two operands of the same kind. The reason we make this distinction is to greatly simplify our in-memory representation of the parsed instruction and more easily define the encoding table. Because these operand types have a strict ordering within an instruction.

Syntax Operand Order
Intel rm, rx, eop
AT&T eop, rx, rm

Note that rm is the default destination for instructions, and rx is the default source. Depending on instruction encoding, these two operands can swap places, so rm will be the source and rx will be the destination.

These operands also have specific places where they are encoded in the instruction:

  • rx: Within the low 3 bits of the opcode, in the rx field of the mod/rm byte or it's implicit.
  • rm: Within the rm field of the mod/rm byte, or as a special case of extra operand.
  • eop: Either follows the opcode and mod/rm byte, if present, or implicit.

2. The mod/rm encoding

The way x86 manages these operands is different depending on the instruction. Most of the time instructions encodings use a special byte, following the opcode, called the mod/rm byte, which stores information about upto two operands. One operand of the pair is the rx operand, the other one is the rm operand.

The format of the mod/rm byte is as follows:

2 bits 3 bits 3 bits
mod rx rm

This is where the terminology of the two types of operands comes into place. rx operands are typically encoded in the rx field of the mod/rm byte, or in the opcode, and rm operands are always encoded in the rm field. There are a few exceptions in case of implicit operands, but for now think of these operands as if they had came from the mod/rm byte.

Sometimes mod/rm is used with one operand instructions. In this case the rx field contains a 3-bit extension to the opcode and rm field stores information about the sole operand of the instruction.

If the instruction has no operands, the mod/rm byte will typically not be present. But sometimes it is present and fully extends the opcode. It may become confusing, but for now this shouldn't come up.

The mod field contains the interpretation of the rm operand, as well as the size of displacement. The rx operand is always a register operand, so the rx field directly stores the register number (i.e. 0 for ax, 1 for cx and so on). The rm field can refer to either a register or a memory operand, dictated by mod operand.

The memory operands under 8086 can be one of the following:

  • [disp]
  • [base (+disp)]
  • [base + index (+disp)]

Where disp is an 8-bit or a 16-bit signed integer, and base and index are registers. The base register can be one of bx, bp, si, di, and the index register can be either si or di (they can't repeat, so no SI+DI addressing). When base or index registers are used an optional displacement can be used. Here's how mod field decides the displacement as well as interpretation of rm:

mod Interpretation
0b00 Memory, no displacement (*)
0b01 Memory, 8-bit displacement
0b10 Memory, 16 bit displacement
0b11 Register

If mod is 0b11, the rm field stores the register number. Otherwise rm represents a pair of registers base+index or base depending on its value. When mod represents a memory operand, rm will represent a pair of base and index registers, depending on its value:

rm base + index
0b000 bx + si
0b001 bx + di
0b010 bp + si
0b011 bp + di
0b100 si
0b101 di
0b110 bp (*)
0b111 bx

(*) If mod is 0b00 and rm is 0b110, the memory is represented as [disp16], without base and index registers. This also means that [bp] address is not encodeable directly and needs to be encoded as [bp+0].

Here's the data flow diagram for how mod/rm byte can be decoded in case of a 2-operand instruction:

image

In the diagram you can see the basic logic behind mod/rm decoding. For reference, here's a snippet of how 16-bit mod/rm byte could be decoded (the ds parameter is going to be always 16, because in 8086 you can't override the data size):

ModRM_Byte :: bit_field u8 {
    rm:  u8 | 3,
    rx:  u8 | 3,
    mod: u8 | 2,
}

rx_gpreg :: proc(size: u8, reg: u8) -> RX_Op {
    return RX_Op {
        kind = .GPReg,
        size = size,
        reg = reg,
    }
}

rm_gpreg :: proc(size: u8, reg: u8) -> RM_Op {
    return RM_Op {
        kind = .GPReg,
        size = size,
        reg = reg,
    }
}

rm_mem16 :: proc(size: u8, base_reg: u8, index_reg: u8, disp: i32) -> RM_Op {
    return RM_Op {
        kind = .Mem_Addr16,
        size = 2,
        base_reg = base_reg,
        index_reg = index_reg,
        scale = 1,
        disp = disp,
    }
}

decode_modrm_addr16 :: proc(bytes: []u8, modrm: ModRM_Byte, ds: u8) -> (RX_Op, RM_Op, int, bool) {
    addr16_rm_table := []struct { base: u8, index: u8, } {
        { base = REG_BX, index = REG_SI },
        { base = REG_BX, index = REG_DI },
        { base = REG_BP, index = REG_SI },
        { base = REG_BP, index = REG_DI },
        { base = REG_SI, index = REG_NONE },
        { base = REG_DI, index = REG_NONE },
        { base = REG_BP, index = REG_NONE },
        { base = REG_BX, index = REG_NONE },
    }
    rx_op := rx_gpreg(ds, modrm.rx)
    modrm_size := 0
    // Early return on mod=0b11
    if modrm.mod == 0b11 {
        rm_op := rm_gpreg(ds, modrm.rm)
        return rx_op, rm_op, modrm_size, true
    }
    entry := addr16_rm_table[modrm.rm]
    base, index := entry.base, entry.index
    // Find out displacement size
    disp_size := 0
    switch modrm.mod {
    case 0b00:
        if modrm.rm == 0b110 {
            base = REG_NONE
            index = REG_NONE
            disp_size = 2
        }
    case 0b01: disp_size = 1
    case 0b10: disp_size = 2
    }
    if len(bytes) < disp_size {
        return {}, {}, 0, false
    }
    // Parse displacement
    disp := i32(0)
    switch disp_size {
    case 1: disp = cast(i32) ((cast(^i8) &bytes[modrm_size])^)
    case 2: disp = cast(i32) ((cast(^i16le) &bytes[modrm_size])^)
    }
    modrm_size += disp_size
    rm_op := rm_mem16(ds, base, index, disp)
    return rx_op, rm_op, modrm_size, true
}

Instruction encoding

Let's look at how x86 encodes various mov instructions. Here are the ones we're interested in right now.

Encoding Mnemonic
89 /r mov r/m16, r16
8b /r mov r16, r/m16
b8+ imm16 mov r16, imm16
a1 mov ax, moffs16
a3 mov moffs16, ax
c7 /0 mov r/m16, imm16

First, we'll look at the "Encoding" column. The format of the column is opcode[+] [/modrm] [extra operand]. If the opcode is followed by the + sign, it means that the RX operand of the instruction is encoded in the low 3 bits of the encoding. So b8 would have AX as the operand, b9 would have CX as the operand and so on.

The /[something] syntax specifies the mod/rm byte interpretation. First of all, it means that instruction has a mod/rm byte. In the section above we discussed that mod/rm can encode two operands, rx and rm. If the mod/rm is spelled as /r, that means the opcode has a mod/rm byte that encodes two operands. If it's written as /n, where n is a number between 0 and 7, then this says that an instruction has a mod/rm byte but it does not have the rx operand. Instead those 3 bits are used to extend the opcode. You can also see how in the "Mnemonic" column for c7 /0, there's only one rm operand.

An instruction can only have one extra operand. The extra operands in the table above are imm16 (16-bit immediate) and moffs16 (16-bit memory offset, aka displacement). A keen reader would notice that you don't need a separate encoding for mov ax, moffs16, because mov rx, rm encoding handles this case, however the reason this encoding exists is for us to be able to save one extra mod/rm byte when we're moving from a specific offset into ax. Which I'm not sure is useful, but whatever.

All you need is encoding

The next thing I would like to show is that the "mnemonic" column is not needed to decode an instruction. The reason I'm doing that is to emphasize how simple the encoding really is and to set us up for future, where we table drive our decoder. The way we're going to do that is we're going to take the table above and look at the differences, then either move a part from right side of the column to the left side, or make a simplifying assumption.

Mnemonic

First, we'll move the mnemonic to the other side of the table, that's how we'll tell which mnemonic to use for a given encoding.

Encoding Operands
mov 89 /r r/m16, r16
mov 8b /r r16, r/m16
mov b8+ imm16 r16, imm16
mov a1 ax, moffs16
mov a3 moffs16, ax
mov c7 /0 r/m16, imm16

Direction

Let's look at the following table entries:

Encoding Operands
mov 89 /r r/m16, r16
mov 8b /r r16, r/m16

If you flip the 3rd bit of the opcode, you would change the ordering of the operands. This is called the direction bit, which is present in some of the opcodes that use the mod/rm byte. We cant represent the direction bits directly, so let's add another concept to our encoding column. Encoding flags, that will tell us some additional information about the direction.

The rm operand is the first (destination) by default, and rx as the second (source) by default. If the direction flag is present the order of the operands is flipped. This matches the direciton bit in the opcode (see mov 89, direction bit not set).

We'll have all encoding flags starting with +, and for direction bit we'll define a +d flag. Our table would look like this:

Encoding Operand List
mov 89 /r r/m16, r16
mov 8b /r +d r/m16, r16

Implicit operands

Now let's look at the following encodings:

Encoding Operand List
mov a1 disp +d moffs16, ax
mov a3 disp moffs16, ax

I had added disp as an extra operand, to make it clear that it is present. I will explain why this is not a "disp16" in the next chapters, but for now let's just accept that we don't need to know what it is when specifying the encoding. Ideally, we want to say that ax is the rx operand, disp16 is the rm operand and that a direction bit can flip the two, which I also added into the table.

Now all that's left to do is to say that rx implicitly represents the ax register, which we'll do with a new flag: +rx=ax. This would make the table above look like this:

Encoding Operand List
mov a1 disp16 +rx=ax +d moffs16, ax
mov a3 disp16 +rx=ax moffs16, ax

With that you can deduce the contents of the right column from the left column. So the right column will be removed and whatever's left in the left column is all of the information we need to decode an instruction.

Final touches

Encoding
mov 89 /gr
mov 8b /gr +d
mov b8+ imm16
mov a1 disp +rx=ax +d
mov a3 disp +rx=ax
mov c7 /0

This iteration replaces /r with /gr (stands for General-purpose Register), and that's pretty much it. The reason for the change will be discussed later, but note that there are also mov instructions that encode segment registers in mod/rm byte.

The table we had just created can be parsed with tools, and compiled into code / data for our disassembler that we can use to table-drive our decoding. However right now the patterns may still be unclear so we don't rush in with the table right now. Since we want to extend our table a little bit more, before we start hardcoding it.

To recap, we added instruction flags for direction bit and implicit operands, and the mnemonic to the encoding column, and this is all we needed to do to eliminate differences between the two columns! This sounds pretty powerful if you ask me, and we will definately use this style of notation going forward, and to create our decoding tables.

Clone this wiki locally