
01. Core ideas in 8086 encoding

flysand7 edited this page Sep 21, 2024 · 6 revisions

Introduction

So, you want to know about x86 encoding, right? You've come to the right place. It's surprisingly painful to scroll through manuals and blog posts that are very abstract and hard to read because of their terse language.

Which is why I decided to put my thoughts on paper (or at least, on the screen) and write a series of simple explanations that distill a lot of ideas from x86 encoding, so that you're better equipped to understand how x86 disasm's table decoder (and maybe some other disassemblers, such as NASM's) works.

By the end of this series of posts we'll have built our own model for encoding that translates nicely to a table-driven decoder, and we'll understand some of the patterns x86 uses for its instruction encoding.

I will assume basic familiarity with the x86 instruction set. You don't need to know how to write assembly, nor do you need to know the intricacies of the x86 instruction set.

1. Backwards-compatibility is our friend

We don't need to know much about the x86 architecture to work on a disassembler. History is boring anyways. Let's compress everything that's happened into one image.

image

The x86 instruction set architecture has evolved over the years in a backwards-compatible way. Instructions weren't replaced; instead, new ones were added, or locked behind different CPU modes or reserved flags. This makes each feature set fairly orthogonal.

The x86 instruction set feels like a fridge stuffed to the extreme, with almost no empty space left. The milk lies sideways on the top shelf, a pot of soup sits on top of it, and tomatoes are scattered around wherever there's room. Occasionally you find dirty socks lying around: someone put them there by mistake, and now nobody can remove them, because what if someone still expects to find them there?

All this to say that whatever gets into x86 stays in x86 and doesn't switch places, most of the time. So for our purposes, we can start from the simplest subset of x86 and build it up.

This allows us to keep the project simple and grow it step by step, refining our assumptions and code along the way. And each subsequent step will be easier than the previous one because, well, x86 does reuse ideas a lot.

2. Instruction encoding

In x86 there's no fixed number of bytes per instruction. Usually, an instruction encoding consists of several parts:

image

So, how do we disassemble this? The prefixes are fixed and are the same for every instruction, so we can parse them regardless of which instruction we're parsing. Then we read the opcode, do a table lookup to see which other fields should be present for this opcode, and parse those fields. And voilà!
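Here's a minimal sketch of the very first step, skipping over prefixes to find the opcode. The procedure names are made up for illustration; the prefix values are the 8086 ones (lock, repne, rep and the four segment overrides).

```odin
package main

import "core:fmt"

// 8086 prefix bytes: lock, repne/rep, and the four segment overrides.
is_prefix :: proc(b: u8) -> bool {
    switch b {
    case 0xf0, 0xf2, 0xf3:
        return true // lock, repne, rep
    case 0x26, 0x2e, 0x36, 0x3e:
        return true // es:, cs:, ss:, ds:
    }
    return false
}

// Returns how many leading bytes are prefixes; the opcode sits right after them.
count_prefixes :: proc(bytes: []u8) -> int {
    n := 0
    for n < len(bytes) && is_prefix(bytes[n]) {
        n += 1
    }
    return n
}

main :: proc() {
    // 26 8b 07 is "mov ax, es:[bx]": one prefix byte, then opcode 8b.
    code := []u8{0x26, 0x8b, 0x07}
    n := count_prefixes(code)
    fmt.printf("%d prefix byte(s), opcode = %02x\n", n, code[n])
}
```

Everything after the opcode depends on the table lookup, which is what the rest of this page is about.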

Operand encoding

In 8086, an instruction can have up to three operands. We'll give each one of these operands a name, and whatever encoding we encounter, it's generally going to fall into one of these types:

  • rx (A register operand)
  • rm (A register or memory operand)
  • eop (Extra operand)

An instruction cannot have two operands of the same kind. The reason we make this distinction is to greatly simplify our in-memory representation of the parsed instruction and to make the encoding table easier to define, because these operand types have a strict ordering within an instruction.

| Syntax | Operand order |
|--------|---------------|
| Intel  | rm, rx, eop   |
| AT&T   | eop, rx, rm   |

If you ever felt that the difference between the Intel and AT&T operand orders is confusing, well, here it is. AT&T is literally just the Intel order backwards.

Note that rm is the default destination for instructions, and rx is the default source. Depending on instruction encoding, these two operands can swap places, so rm will be the source and rx will be the destination.

These operands also have specific places where they are encoded in the instruction:

  • rx: In the low 3 bits of the opcode, in the rx field of the mod/rm byte, or implicit.
  • rm: In the rm field of the mod/rm byte, or as a special case of the extra operand.
  • eop: Either follows the opcode (and the mod/rm byte, if present), or is implicit.

2. The mod/rm encoding

The way x86 manages these operands is different depending on the instruction. Most of the time, instruction encodings use a special byte following the opcode, called the mod/rm byte, which stores information about up to two operands. One operand of the pair is the rx operand, the other one is the rm operand.

The format of the mod/rm byte is as follows:

| 2 bits | 3 bits | 3 bits |
|--------|--------|--------|
| mod    | rx     | rm     |

This is where the names of the two operand types come from. rx operands are typically encoded in the rx field of the mod/rm byte, or in the opcode, and rm operands are always encoded in the rm field. There are a few exceptions in the case of implicit operands, but for now think of these operands as if they had come from the mod/rm byte.

Sometimes mod/rm is used with one-operand instructions. In this case the rx field contains a 3-bit extension to the opcode, and the rm field stores information about the sole operand of the instruction.

If the instruction has no operands, the mod/rm byte will typically not be present. But sometimes it is present and fully extends the opcode. That can get confusing, but for now it shouldn't come up.

The mod field determines how the rm field is interpreted, as well as the size of the displacement. The rx operand is always a register operand, so the rx field directly stores the register number (i.e. 0 for ax, 1 for cx and so on). The rm field can refer to either a register or a memory operand, as dictated by the mod field.
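To make the bit layout concrete, here's a small sketch that pulls the three fields out of a mod/rm byte with shifts and masks. The `ModRM_Fields` name is just for illustration, and mod is assumed to occupy the top two bits of the byte.

```odin
package main

import "core:fmt"

ModRM_Fields :: struct {
    mod: u8, // top 2 bits
    rx:  u8, // middle 3 bits
    rm:  u8, // low 3 bits
}

split_modrm :: proc(b: u8) -> ModRM_Fields {
    return ModRM_Fields{
        mod = b >> 6,
        rx  = (b >> 3) & 0b111,
        rm  = b & 0b111,
    }
}

main :: proc() {
    // 0xd8 = 0b11_011_000: mod = 0b11 (register), rx = 3 (bx), rm = 0 (ax)
    f := split_modrm(0xd8)
    fmt.println(f.mod, f.rx, f.rm)
}
```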

The memory operands under 8086 can be one of the following:

  • [disp]
  • [base (+disp)]
  • [base + index (+disp)]

Where disp is an 8-bit or a 16-bit signed integer, and base and index are registers. The base register can be one of bx, bp, si, di, and the index register can be either si or di (they can't repeat, so no SI+DI addressing). When base or index registers are used, a displacement is optional. Here's how the mod field decides the displacement size as well as the interpretation of rm:

| mod  | Interpretation              |
|------|-----------------------------|
| 0b00 | Memory, no displacement (*) |
| 0b01 | Memory, 8-bit displacement  |
| 0b10 | Memory, 16-bit displacement |
| 0b11 | Register                    |

If mod is 0b11, the rm field stores the register number. Otherwise the operand is in memory, and rm selects either a base + index pair or a single base register, depending on its value:

| rm    | base + index |
|-------|--------------|
| 0b000 | bx + si      |
| 0b001 | bx + di      |
| 0b010 | bp + si      |
| 0b011 | bp + di      |
| 0b100 | si           |
| 0b101 | di           |
| 0b110 | bp (*)       |
| 0b111 | bx           |

(*) If mod is 0b00 and rm is 0b110, the memory operand is [disp16], without base and index registers. This also means that the [bp] address is not directly encodable and needs to be encoded as [bp+0].
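As a quick way to see both tables and the (*) special case working together, here's a sketch that turns mod, rm and an already-read displacement into text. The `format_mem16` name and the output syntax are my own choices for illustration, and it only handles the memory forms (mod other than 0b11).

```odin
package main

import "core:fmt"

// Register combination selected by rm when mod is not 0b11.
rm_names := [8]string{"bx+si", "bx+di", "bp+si", "bp+di", "si", "di", "bp", "bx"}

// Formats a 16-bit memory operand. `disp` must already be read and
// sign-extended by the caller; `mod` is assumed to be 0b00, 0b01 or 0b10.
format_mem16 :: proc(mod, rm: u8, disp: i16) -> string {
    if mod == 0b00 && rm == 0b110 {
        // The special case: a direct address with no registers at all.
        return fmt.tprintf("[0x%04x]", u16(disp))
    }
    name := rm_names[rm & 0b111]
    switch {
    case mod == 0b00:
        return fmt.tprintf("[%s]", name)
    case disp < 0:
        return fmt.tprintf("[%s-%d]", name, -i32(disp))
    }
    return fmt.tprintf("[%s+%d]", name, disp)
}

main :: proc() {
    fmt.println(format_mem16(0b00, 0b110, 0x1234)) // direct address
    fmt.println(format_mem16(0b01, 0b110, 0))      // how [bp] has to be encoded
    fmt.println(format_mem16(0b10, 0b000, -2))     // base + index + disp16
}
```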

Here's the data flow diagram for how mod/rm byte can be decoded in case of a 2-operand instruction:

image

In the diagram you can see the basic logic behind mod/rm decoding. For reference, here's a snippet showing how a 16-bit mod/rm byte could be decoded (the data size, ds, is always going to be 16 here, because in 8086 you can't override the data size):

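// Assumed definitions, not shown in the original snippet: the mod/rm byte
// already split into its fields, and register numbers in x86 order.
ModRM_Byte :: struct {
    mod: u8,
    rx:  u8,
    rm:  u8,
}
REG_BX   :: 3
REG_BP   :: 5
REG_SI   :: 6
REG_DI   :: 7
REG_NONE :: 0xff // assumed sentinel for "no register"

Parsed_ModRM :: struct {
    size: u8, // if 0, specifies register, otherwise memory
    base: u8,
    index: u8,
    scale: u8,
    disp: i32,
}

// `bytes` starts right after the mod/rm byte. The returned int is the number
// of displacement bytes consumed from it.
decode_modrm_addr16 :: proc(bytes: []u8, modrm: ModRM_Byte) -> (Parsed_ModRM, int, bool) {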
    Addr16_RM_Entry :: struct {
        base: u8,
        index: u8,
    }
    addr16_rm_table := []Addr16_RM_Entry {
        { base = REG_BX, index = REG_SI },
        { base = REG_BX, index = REG_DI },
        { base = REG_BP, index = REG_SI },
        { base = REG_BP, index = REG_DI },
        { base = REG_SI, index = REG_NONE },
        { base = REG_DI, index = REG_NONE },
        { base = REG_BP, index = REG_NONE },
        { base = REG_BX, index = REG_NONE },
    }
    modrm_size := 0
    // Early return on mod=0b11
    if modrm.mod == 0b11 {
        parsed := Parsed_ModRM {
            size = 0,
            base = modrm.rm,
        }
        return parsed, modrm_size, true
    }
    entry := addr16_rm_table[modrm.rm]
    base := entry.base
    index := entry.index
    // Find out displacement size
    disp_size := 0
    switch modrm.mod {
    case 0b00:
        if modrm.rm == 0b110 {
            base = REG_NONE
            index = REG_NONE
            disp_size = 2
        }
    case 0b01: disp_size = 1
    case 0b10: disp_size = 2
    }
    if len(bytes) < disp_size {
        return {}, 0, false
    }
    // Parse displacement
    disp := i32(0)
    if disp_size == 1 {
        disp = cast(i32) ((cast(^i8) &bytes[modrm_size])^)
    } else if disp_size == 2 {
        disp = cast(i32) ((cast(^i16le) &bytes[modrm_size])^)
    }
    modrm_size += disp_size
    parsed := Parsed_ModRM {
        size = 2,
        base = base,
        index = index,
        scale = 1,
        disp = disp,
    }
    return parsed, modrm_size, true
}

3. The official encoding notation

If you look in the second volume of the Intel Software Developer's Manual... wait no, I'll assume you're a lazy person. If you google "MOV x86" you'll see a website with a table like this:

| Opcode | Mnemonic           | Description                       |
|--------|--------------------|-----------------------------------|
| 88 /r  | mov r/m8, r8       | Move r8 to r/m8.                  |
| 89 /r  | mov r/m16, r16     | Move r16 to r/m16.                |
| 8a /r  | mov r8, r/m8       | Move r/m8 to r8.                  |
| 8b /r  | mov r16, r/m16     | Move r/m16 to r16.                |
| 8c /r  | mov r/m16, Sreg**  | Move segment register to r/m16.   |
| 8e /r  | mov Sreg, r/m16**  | Move r/m16 to segment register.   |
| a0     | mov AL, moffs8*    | Move byte at (seg:offset) to AL.  |
| a1     | mov AX, moffs16*   | Move word at (seg:offset) to AX.  |
| a2     | mov moffs8*, AL    | Move AL to (seg:offset).          |
| a3     | mov moffs16*, AX   | Move AX to (seg:offset).          |
| b0+ rb | mov r8, imm8       | Move imm8 to r8.                  |
| b8+ rw | mov r16, imm16     | Move imm16 to r16.                |
| c6 /0  | mov r/m8, imm8     | Move imm8 to r/m8.                |
| c7 /0  | mov r/m16, imm16   | Move imm16 to r/m16.              |

First, let me explain some of the terms that appear in this table:

/r and /[n]

The slash in the Opcode column specifies the presence of the mod/rm byte. That's pretty neat because we can tell whether we need a mod/rm byte right away.

What follows the slash is a description of how the rx field of that byte is used. If /r is specified, the mod/rm byte encodes two operands: rx and rm. If the /[n] format is used (n is a digit between 0 and 7), then the rx field is used as an opcode extension and rm stores the rm operand of the instruction.
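A classic example of the /[n] form is opcode 80, where the digit picks one of eight arithmetic operations. Here's a small sketch of that lookup; the names are illustrative, and the mnemonic order is the /0 through /7 order of that opcode group.

```odin
package main

import "core:fmt"

// For opcode 80 the rx field is an opcode extension (/0 through /7)
// that selects the operation.
group1_mnemonics := [8]string{"add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"}

group1_mnemonic :: proc(modrm: u8) -> string {
    rx := (modrm >> 3) & 0b111
    return group1_mnemonics[rx]
}

main :: proc() {
    // 80 c3 05: mod/rm 0xc3 has mod = 0b11, rx = 0, rm = 3 (bl) -> add bl, 5
    fmt.println(group1_mnemonic(0xc3))
    // 80 fb 05: mod/rm 0xfb has rx = 7 -> cmp bl, 5
    fmt.println(group1_mnemonic(0xfb))
}
```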

Register types

You can see, even when a mod/rm byte is used, the types of registers are different. Consider:

| Opcode | Mnemonic         | Description                     |
|--------|------------------|---------------------------------|
| 88 /r  | mov r/m8, r8     | Move r8 to r/m8.                |
| 8b /r  | mov r16, r/m16   | Move r/m16 to r16.              |
| 8c /r  | mov r/m16, Sreg  | Move segment register to r/m16. |

One of these encodes an 8-bit general purpose register, the second one encodes a 16-bit general purpose register, and the third one encodes a segment register.

So, here's the takeaway: just knowing whether the mod/rm byte exists is not enough. We'd also like to store some information about what the mod/rm byte encodes, so that we can interpret register types correctly.
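In code, this boils down to the same 3-bit number meaning different registers depending on the register type. A minimal sketch, with a made-up `Reg_Kind` enum standing in for whatever the table ends up storing:

```odin
package main

import "core:fmt"

Reg_Kind :: enum {
    GR8,  // 8-bit general-purpose register
    GR16, // 16-bit general-purpose register
    SR,   // segment register
}

reg_name :: proc(kind: Reg_Kind, n: u8) -> string {
    gr8  := [8]string{"al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"}
    gr16 := [8]string{"ax", "cx", "dx", "bx", "sp", "bp", "si", "di"}
    sr   := [4]string{"es", "cs", "ss", "ds"}
    name := "?"
    switch kind {
    case .GR8:
        name = gr8[n & 0b111]
    case .GR16:
        name = gr16[n & 0b111]
    case .SR:
        // The 8086 only has four segment registers.
        if n < 4 {
            name = sr[n]
        }
    }
    return name
}

main :: proc() {
    // The same register number 3, three different meanings.
    fmt.println(reg_name(.GR8, 3), reg_name(.GR16, 3), reg_name(.SR, 3))
}
```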

moffs and imm

Let's look at these now:

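| Opcode | Mnemonic          | Description                      |
|--------|-------------------|----------------------------------|
| b8+ rw | mov r16, imm16    | Move imm16 to r16.               |
| a1     | mov AX, moffs16*  | Move word at (seg:offset) to AX. |

In the first case, we have an extra operand, which is an immediate 16-bit integer that follows the opcode. Careful with the rw after the opcode, though: in Intel's notation it belongs to the + sign (covered next) and says that a word-sized register is encoded in the opcode. The immediate itself isn't mentioned in the Opcode column of this table at all; newer revisions of the manual spell it out as iw.

In the second entry, we have a moffs16, specifying a 16-bit displacement that follows the opcode. Now, I don't have any idea why, but for some reason they don't put it in the Opcode column either. Truly bizarre and weird.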

The + sign?

There's also entries like this:

| Opcode | Mnemonic        | Description        |
|--------|-----------------|--------------------|
| b8+ rw | mov r16, imm16  | Move imm16 to r16. |

We can see the immediate value, but where is r16 encoded? Well, it's encoded in the low 3 bits of the opcode. If you will, we can expand this one single entry into eight entries:

| Opcode | Mnemonic       | Description       |
|--------|----------------|-------------------|
| b8 rw  | mov ax, imm16  | Move imm16 to AX. |
| b9 rw  | mov cx, imm16  | Move imm16 to CX. |
| ba rw  | mov dx, imm16  | Move imm16 to DX. |
| bb rw  | mov bx, imm16  | Move imm16 to BX. |
| bc rw  | mov sp, imm16  | Move imm16 to SP. |
| bd rw  | mov bp, imm16  | Move imm16 to BP. |
| be rw  | mov si, imm16  | Move imm16 to SI. |
| bf rw  | mov di, imm16  | Move imm16 to DI. |

I broke my hands typing this, so I'll keep the + shortcut, it's actually useful. But keep in mind, it's just a shorthand for a group of encodings. In an ideal table-based decoder we'll make a separate entry for each one of these encodings.
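Luckily a computer doesn't mind the typing. Here's a sketch of how a + entry could be expanded into its eight concrete entries when building the table; the `Entry` struct and `expand_plus` are hypothetical names, not something from an existing decoder.

```odin
package main

import "core:fmt"

Entry :: struct {
    opcode: u8,
    reg:    string,
}

// Expands a "+" opcode such as b8+ into eight concrete entries,
// one per 16-bit register, in register-number order.
expand_plus :: proc(base_opcode: u8) -> [8]Entry {
    gr16 := [8]string{"ax", "cx", "dx", "bx", "sp", "bp", "si", "di"}
    entries: [8]Entry
    for name, i in gr16 {
        entries[i] = Entry{opcode = base_opcode + u8(i), reg = name}
    }
    return entries
}

main :: proc() {
    entries := expand_plus(0xb8)
    for e in entries {
        fmt.printf("%02x  mov %s, imm16\n", e.opcode, e.reg)
    }
}
```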

4. Let's make our own table format

There are a few adjustments I want to make to the default way of representing encodings, to get a format with less duplicated information. Our ideal format would describe each encoding unambiguously.

Let's start by asking a simple question. What's really needed for the disassembler, to know how to "mark up" an instruction? That is, to tell which bytes are present and tell the general structure of the instruction, without going further to interpret its operands or its data size?

The answer to this question is:

  1. Opcode
  2. Presence of the mod/rm byte
  3. The size of the extra operand
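To check that these three things really are enough, here's a sketch that computes the length of an instruction from them (prefixes not included). `Markup_Info` and the procedure names are illustrative assumptions; the displacement rule is the 16-bit mod/rm one from earlier.

```odin
package main

import "core:fmt"

// The three things needed to mark up an instruction.
Markup_Info :: struct {
    opcode:    u8,
    has_modrm: bool,
    eop_size:  int, // size of the extra operand in bytes, 0 if none
}

// Number of displacement bytes implied by a 16-bit mod/rm byte.
modrm_disp_size :: proc(modrm: u8) -> int {
    mod := modrm >> 6
    rm  := modrm & 0b111
    switch {
    case mod == 0b01:
        return 1
    case mod == 0b10:
        return 2
    case mod == 0b00 && rm == 0b110:
        return 2
    }
    return 0
}

// Total length in bytes of an instruction without prefixes.
// Assumes `bytes` starts at the opcode and is long enough to hold
// the mod/rm byte when one is expected.
instruction_length :: proc(info: Markup_Info, bytes: []u8) -> int {
    length := 1 // the opcode itself
    if info.has_modrm {
        length += 1 + modrm_disp_size(bytes[1])
    }
    return length + info.eop_size
}

main :: proc() {
    // c7 06 34 12 78 56 is "mov word [0x1234], 0x5678":
    // opcode + mod/rm + disp16 + imm16
    info := Markup_Info{opcode = 0xc7, has_modrm = true, eop_size = 2}
    code := []u8{0xc7, 0x06, 0x34, 0x12, 0x78, 0x56}
    fmt.println(instruction_length(info, code))
}
```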

Okay, let's put all of those into our table, then:

| Encoding  | Opcode | Mnemonic           | Description                      |
|-----------|--------|--------------------|----------------------------------|
| 88 /      | 88 /r  | mov r/m8, r8       | Move r8 to r/m8.                 |
| 89 /      | 89 /r  | mov r/m16, r16     | Move r16 to r/m16.               |
| 8a /      | 8a /r  | mov r8, r/m8       | Move r/m8 to r8.                 |
| 8b /      | 8b /r  | mov r16, r/m16     | Move r/m16 to r16.               |
| 8c /      | 8c /r  | mov r/m16, Sreg**  | Move segment register to r/m16.  |
| 8e /      | 8e /r  | mov Sreg, r/m16**  | Move r/m16 to segment register.  |
| a0 disp   | a0     | mov AL, moffs8*    | Move byte at (seg:offset) to AL. |
| a1 disp   | a1     | mov AX, moffs16*   | Move word at (seg:offset) to AX. |
| a2 disp   | a2     | mov moffs8*, AL    | Move AL to (seg:offset).         |
| a3 disp   | a3     | mov moffs16*, AX   | Move AX to (seg:offset).         |
| b0+ imm8  | b0+ rb | mov r8, imm8       | Move imm8 to r8.                 |
| b8+ imm   | b8+ rw | mov r16, imm16     | Move imm16 to r16.               |
| c6/0 imm8 | c6 /0  | mov r/m8, imm8     | Move imm8 to r/m8.               |
| c7/0 imm  | c7 /0  | mov r/m16, imm16   | Move imm16 to r/m16.             |

I added the Encoding column in front, which contains the parts of the notation that are useful for marking up an instruction. You can see that I used / to signify the presence of the mod/rm byte, and the /n format to signify an opcode extension using a mod/rm byte. I also replaced rw and rb with the more explicit imm and imm8, and added disp as the extra operand to the a0..a3 opcodes.

One stylistic choice I made is that when an opcode extension is present, the mod/rm part is written without a space after the opcode byte. This makes it look even more like an opcode extension, since the digit is also part of the opcode.

Now, let's think about how we can encode the operand types. Since we know that mod/rm contains rx and rm operands, we can just add the specification of their types into our table:

| Encoding               | Opcode | Mnemonic           | Description                      |
|------------------------|--------|--------------------|----------------------------------|
| 88 / +rx=gr8 +rm=gr8   | 88 /r  | mov r/m8, r8       | Move r8 to r/m8.                 |
| 89 / +rx=gr16 +rm=gr16 | 89 /r  | mov r/m16, r16     | Move r16 to r/m16.               |
| 8a / +rx=gr8 +rm=gr8   | 8a /r  | mov r8, r/m8       | Move r/m8 to r8.                 |
| 8b / +rx=gr16 +rm=gr16 | 8b /r  | mov r16, r/m16     | Move r/m16 to r16.               |
| 8c / +rx=sr +rm=gr16   | 8c /r  | mov r/m16, Sreg**  | Move segment register to r/m16.  |
| 8e / +rx=sr +rm=gr16   | 8e /r  | mov Sreg, r/m16**  | Move r/m16 to segment register.  |
| a0 disp +rx=0          | a0     | mov AL, moffs8*    | Move byte at (seg:offset) to AL. |
| a1 disp +rx=0          | a1     | mov AX, moffs16*   | Move word at (seg:offset) to AX. |
| a2 disp +rx=0          | a2     | mov moffs8*, AL    | Move AL to (seg:offset).         |
| a3 disp +rx=0          | a3     | mov moffs16*, AX   | Move AX to (seg:offset).         |
| b0+ imm8 +rx=gr8       | b0+ rb | mov r8, imm8       | Move imm8 to r8.                 |
| b8+ imm +rx=gr16       | b8+ rw | mov r16, imm16     | Move imm16 to r16.               |
| c6/0 imm8 +rm=gr8      | c6 /0  | mov r/m8, imm8     | Move imm8 to r/m8.               |
| c7/0 imm +rm=gr16      | c7 /0  | mov r/m16, imm16   | Move imm16 to r/m16.             |

The + symbol marks a part of the encoding that we'll use to interpret the marked-up instruction. gr means a general-purpose register (optionally with a size suffix, as in gr8 or gr16), and sr means a segment register. You can also see how I added +rx=0 where there's no mod/rm byte and an implicit "A" register. This way we've specified an implicit parameter.

When an instruction specifies an opcode extension, we don't specify the rx type, because there is no rx operand.

Now, specifying register sizes each time is a little verbose, especially when there are so many instructions where the types of rx and rm are the same.

I suggest we use the /[register_type] notation to specify the type of rm, and assume rx has the same type as rm unless specified otherwise. We'll also assume that if neither the rx nor the rm kind is specified, they refer to general-purpose registers.

Let's also assume that the operand size is 16 by default, and use +ds=8 to signal a different size of registers.

Now our table looks even more concise:

| Encoding             | Opcode | Mnemonic           | Description                      |
|----------------------|--------|--------------------|----------------------------------|
| 88 /gr +ds=8         | 88 /r  | mov r/m8, r8       | Move r8 to r/m8.                 |
| 89 /gr               | 89 /r  | mov r/m16, r16     | Move r16 to r/m16.               |
| 8a /gr +ds=8         | 8a /r  | mov r8, r/m8       | Move r/m8 to r8.                 |
| 8b /gr               | 8b /r  | mov r16, r/m16     | Move r/m16 to r16.               |
| 8c /gr +rx=sr        | 8c /r  | mov r/m16, Sreg**  | Move segment register to r/m16.  |
| 8e /gr +rx=sr        | 8e /r  | mov Sreg, r/m16**  | Move r/m16 to segment register.  |
| a0 disp +rx=0 +ds=8  | a0     | mov AL, moffs8*    | Move byte at (seg:offset) to AL. |
| a1 disp +rx=0        | a1     | mov AX, moffs16*   | Move word at (seg:offset) to AX. |
| a2 disp +rx=0 +ds=8  | a2     | mov moffs8*, AL    | Move AL to (seg:offset).         |
| a3 disp +rx=0        | a3     | mov moffs16*, AX   | Move AX to (seg:offset).         |
| b0+ imm +rx=gr +ds=8 | b0+ rb | mov r8, imm8       | Move imm8 to r8.                 |
| b8+ imm +rx=gr       | b8+ rw | mov r16, imm16     | Move imm16 to r16.               |
| c6/0 imm +ds=8       | c6 /0  | mov r/m8, imm8     | Move imm8 to r/m8.               |
| c7/0 imm             | c7 /0  | mov r/m16, imm16   | Move imm16 to r/m16.             |

Note how adding the data size override also changed the mov r8, imm8 encoding. We no longer need to specify the size of the immediate directly; instead, a single +ds=8 flag sets both the size of the register and the size of the immediate.

The only thing that's left is to specify the order of parameters. You may have noticed that opcodes with opposite operand orders differ by a single bit (88 and 8a, 89 and 8b, a0 and a2, etc.). This bit is called the direction bit, because within the opcode it specifies which order the rx and rm operands go in.

However, for our purposes we don't care which opcodes have this bit, or where in the opcode it's located, so we'll just treat it as a property of how the opcode is interpreted, and add a +d flag to mark the encodings where the direction is reversed.

If +d is not specified, rm is the first operand; if it is specified, rx comes first.
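In a decoder this rule is a single branch. A tiny sketch, with operands as plain strings and a hypothetical `order_operands` procedure, just to show the swap:

```odin
package main

import "core:fmt"

// Returns the operands in Intel order (first operand is the destination).
// `d` is the +d flag from the encoding table.
order_operands :: proc(rx, rm: string, d: bool) -> (first, second: string) {
    if d {
        return rx, rm
    }
    return rm, rx
}

main :: proc() {
    // 89 /gr    : mov r/m16, r16  (no +d, rm comes first)
    a, b := order_operands("ax", "[bx]", false)
    fmt.printf("mov %s, %s\n", a, b)
    // 8b /gr +d : mov r16, r/m16  (+d, rx comes first)
    a, b = order_operands("ax", "[bx]", true)
    fmt.printf("mov %s, %s\n", a, b)
}
```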

| Encoding                | Opcode | Mnemonic           | Description                      |
|-------------------------|--------|--------------------|----------------------------------|
| 88 /gr +ds=8            | 88 /r  | mov r/m8, r8       | Move r8 to r/m8.                 |
| 89 /gr                  | 89 /r  | mov r/m16, r16     | Move r16 to r/m16.               |
| 8a /gr +d +ds=8         | 8a /r  | mov r8, r/m8       | Move r/m8 to r8.                 |
| 8b /gr +d               | 8b /r  | mov r16, r/m16     | Move r/m16 to r16.               |
| 8c /gr +rx=sr           | 8c /r  | mov r/m16, Sreg**  | Move segment register to r/m16.  |
| 8e /gr +d +rx=sr        | 8e /r  | mov Sreg, r/m16**  | Move r/m16 to segment register.  |
| a0 disp +d +rx=0 +ds=8  | a0     | mov AL, moffs8*    | Move byte at (seg:offset) to AL. |
| a1 disp +d +rx=0        | a1     | mov AX, moffs16*   | Move word at (seg:offset) to AX. |
| a2 disp +rx=0 +ds=8     | a2     | mov moffs8*, AL    | Move AL to (seg:offset).         |
| a3 disp +rx=0           | a3     | mov moffs16*, AX   | Move AX to (seg:offset).         |
| b0+ imm +rx=gr +ds=8    | b0+ rb | mov r8, imm8       | Move imm8 to r8.                 |
| b8+ imm +rx=gr          | b8+ rw | mov r16, imm16     | Move imm16 to r16.               |
| c6/0 imm +ds=8          | c6 /0  | mov r/m8, imm8     | Move imm8 to r/m8.               |
| c7/0 imm                | c7 /0  | mov r/m16, imm16   | Move imm16 to r/m16.             |

Turns out that we've almost completely made our table a single-column table. All we need now is to move the mnemonic over to our encoding column, and we can completely abandon the old Intel table.

| Encoding                    | Mnemonic           |
|-----------------------------|--------------------|
| mov 88 /gr +ds=8            | mov r/m8, r8       |
| mov 89 /gr                  | mov r/m16, r16     |
| mov 8a /gr +d +ds=8         | mov r8, r/m8       |
| mov 8b /gr +d               | mov r16, r/m16     |
| mov 8c /gr +rx=sr           | mov r/m16, Sreg**  |
| mov 8e /gr +d +rx=sr        | mov Sreg, r/m16**  |
| mov a0 disp +d +rx=0 +ds=8  | mov AL, moffs8*    |
| mov a1 disp +d +rx=0        | mov AX, moffs16*   |
| mov a2 disp +rx=0 +ds=8     | mov moffs8*, AL    |
| mov a3 disp +rx=0           | mov moffs16*, AX   |
| mov b0+ imm +rx=gr +ds=8    | mov r8, imm8       |
| mov b8+ imm +rx=gr          | mov r16, imm16     |
| mov c6/0 imm +ds=8          | mov r/m8, imm8     |
| mov c7/0 imm                | mov r/m16, imm16   |

The first column of the above table completely describes everything we need to decode and interpret an instruction. Let's try!
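To back that claim up a little, here's a rough sketch of what a few of those rows could look like as data, plus a decoder that only uses that data. The `Encoding` struct and its field names are assumptions for illustration, and to keep it short it only covers 88..8b with register operands (mod = 0b11).

```odin
package main

import "core:fmt"

// One row of the table, already parsed into fields.
Encoding :: struct {
    mnemonic: string,
    opcode:   u8,
    d:        bool, // +d: rx is the first operand
    ds:       u8,   // data size in bits: 16 by default, 8 with +ds=8
}

mov_table := [4]Encoding{
    {"mov", 0x88, false, 8},  // mov 88 /gr +ds=8
    {"mov", 0x89, false, 16}, // mov 89 /gr
    {"mov", 0x8a, true, 8},   // mov 8a /gr +d +ds=8
    {"mov", 0x8b, true, 16},  // mov 8b /gr +d
}

// Name of general-purpose register `n` for the given data size.
gr_name :: proc(n, ds: u8) -> string {
    gr8  := [8]string{"al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"}
    gr16 := [8]string{"ax", "cx", "dx", "bx", "sp", "bp", "si", "di"}
    if ds == 8 {
        return gr8[n & 0b111]
    }
    return gr16[n & 0b111]
}

// Decodes register-to-register forms only; anything else reports failure.
decode :: proc(bytes: []u8) -> (text: string, ok: bool) {
    if len(bytes) < 2 {
        return "", false
    }
    for e in mov_table {
        if e.opcode != bytes[0] {
            continue
        }
        modrm := bytes[1]
        if (modrm >> 6) != 0b11 {
            return "", false // memory operands are left out of this sketch
        }
        rx := gr_name((modrm >> 3) & 0b111, e.ds)
        rm := gr_name(modrm & 0b111, e.ds)
        if e.d {
            return fmt.tprintf("%s %s, %s", e.mnemonic, rx, rm), true
        }
        return fmt.tprintf("%s %s, %s", e.mnemonic, rm, rx), true
    }
    return "", false
}

main :: proc() {
    text, ok := decode([]u8{0x8b, 0xc3}) // 8b /gr +d, rx = ax, rm = bx
    fmt.println(text, ok)
}
```

Note that the decoder never looks at the opcode bits to figure out sizes or operand order. Everything comes from the table row, which is the whole point of the format.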

5. The table works

I'll give you one encoding written in our table format, and let's work out which instruction it describes. This should show that the table format we've come up with is versatile enough:

lea 8d /gr +d

This encoding tells us that:

  1. Mnemonic is lea (load effective address)
  2. Opcode is 8d
  3. It has a mod/rm byte
  4. The type of rm operand is a general-purpose register
  5. The type of rx is unspecified, so it's also a general-purpose register.
  6. Data size is not set explicitly so it'll be a 16-bit operation.
  7. The direction flag is specified, meaning rx operand is first.

This encoding is for the lea r16, r/m16 instruction.

Let's try again.

add 80/0 imm +ds=8

This encoding tells us that:

  1. Mnemonic is add (the one that adds numbers, yeah)
  2. Opcode is 80
  3. It has a mod/rm byte, and the rx field should be 0.
  4. There's no need to specify a direction flag, since there's no rx operand.
  5. It has an extra immediate operand.
  6. Since the rm type is unspecified, we assume a general-purpose register.
  7. Since +ds=8 is specified, both the rm operand and the immediate are 8 bits wide.

This encoding is for the add r/m8, imm8 instruction.


We have studied Intel's format for instruction encoding and come up with our own way of representing instructions, one that is much more concise and ready to be used in a disassembler.

Coming up with a nice textual representation is useful, as it's the first step towards a table-driven decoder.

I dunno about you, but I actually love the conciseness of it. I've seen worse tables, where you don't really know what to do with them. This one is relatively straightforward if you know the format.

Well, that's it for today.