-
Notifications
You must be signed in to change notification settings - Fork 0
01. Core ideas in 8086 encoding
Ever since Odin has gotten the long-awaited bit_field
feature I started to want to return to
working on this disassembler project. The reason being, when I first started working on it, I
wanted to make it fast and to make it fast I wanted to compress my decoding tables, which resulted
in me simulating bitfields with procedures, which was a lot of effort, especially when I needed
to change the amount of bits allocated per field. There were also some scenarios where having
bitfields would be "nice", for example not having to read out individual bits of Mod/RM bytes
and REX prefixes. In any case, now I feel like writing a dissassembler is a bit easier than
it was before.
The other part of the reason why I returned was one of the favours I received to break down and explain my table-driven x86 decoder. I like explaining stuff, so here we go.
The series of articles posted here are aimed to explain the x86 decoding to folks who are curious how the x86 encoding is organized and how one can use the hidden assumptions to simplify the process of decoding. There are very few resources that can break it down in an easy-to-digest format. The reason for that is quite obvious: x86 encoding seems complicated. Well, at first anyway. Understanding the x86 encoding on a deep enough level to explain it to someone else well is a serious time dedication.
The way we're going to study the x86 encodings will be straightforward. We'll from the simplest possible subset of x86, in 16-bit code. At the start we'll only consider 8086 and gradually build up our understanding of the encoding as we're introducing new concepts. During the course of these articles I will be introducing my own terminology, because, well, intels' is confusing and verbose.
The x86 instruction set architecture has evolved over the years, in a backwards compatible way. What's interesting to us is that within each new instruction set added into x86, the ideas about encoding stay consistent across all instructions, with some exceptions. Therefore studying one instruction set also sets us up to understanding newer instruction sets.
For our disassembler we will be covering the following instruction sets:
- 16-bit mode
- 32-bit mode
- FPU instructions
- 64-bit mode
- MMX
- SSE, SSE2, SSE3, SSSE3
- AVX, AVX2
- Maybe some other extensions...
AVX512 will be left out on purpose.
This can sound like a long list, but really, we only have to care about the first two modes, the 64-bit mode, SSE and AVX, because those are the major points where new ideas are introduced in any major capacity. All other instruction sets are simple extensions of the original ideas.
x86 instructions are encoded using variable-length byte stream, meaning a single instruction can take up a variable amount of bytes in memory. An instruction encoding consists of several parts:
- Optional prefix bytes
- Opcode (upto 3 bytes)
- Mod/RM byte
- Immediate data
In terms of disassebling here's what that means. We first would parse the prefix bytes, independent of the kind of instruction we're going to encounter next. Then we read the opcode and based on the opcode determine whether mod/rm byte is present and what kind of immediate data follows the instruction.
In 8086, an instruction can have upto three operands. We'll give each one of these operands a name, and whatever encoding we'll encounter, it's generally going to fall into one of these types. The operands kinds are:
-
rx
(A register operand) -
rm
(A register or memory operand) -
eop
(Extra operand)
An instruction can not have two operands of the same kind. The reason we make this distinction is to greatly simplify our in-memory representation of the parsed instruction and more easily define the encoding table. Because these operand types have a strict ordering within an instruction.
Syntax | Operand Order |
---|---|
Intel |
rm , rx , eop
|
AT&T |
eop , rx , rm
|
Note that rm
is the default destination for instructions, and rx
is the default source.
Depending on instruction encoding, these two operands can swap places, so rm
will be the source
and rx
will be the destination.
These operands also have specific places where they are encoded in the instruction:
-
rx
: Within the low 3 bits of the opcode, in therx
field of the mod/rm byte or it's implicit. -
rm
: Within therm
field of the mod/rm byte, or as a special case of extra operand. -
eop
: Either follows the opcode and mod/rm byte, if present, or implicit.
The way x86 manages these operands is different depending on the instruction. Most of the time
instructions encodings use a special byte, following the opcode, called the mod/rm byte, which
stores information about upto two operands. One operand of the pair is the rx
operand, the other
one is the rm
operand.
The format of the mod/rm byte is as follows:
2 bits | 3 bits | 3 bits |
---|---|---|
mod | rx | rm |
This is where the terminology of the two types of operands comes into place. rx
operands are
typically encoded in the rx
field of the mod/rm byte, or in the opcode, and rm
operands
are always encoded in the rm
field. There are a few exceptions in case of implicit operands,
but for now think of these operands as if they had came from the mod/rm byte.
Sometimes mod/rm is used with one operand instructions. In this case the rx
field contains
a 3-bit extension to the opcode and rm
field stores information about the sole operand
of the instruction.
If the instruction has no operands, the mod/rm byte will typically not be present. But sometimes it is present and fully extends the opcode. It may become confusing, but for now this shouldn't come up.
The mod
field contains the interpretation of the rm
operand, as well as the size of
displacement. The rx
operand is always a register operand, so the rx
field directly stores the
register number (i.e. 0 for ax
, 1 for cx
and so on). The rm
field can refer to
either a register or a memory operand, dictated by mod
operand.
The memory operands under 8086 can be one of the following:
[disp]
[base (+disp)]
[base + index (+disp)]
Where disp
is an 8-bit or a 16-bit signed integer, and base
and index
are
registers. The base
register can be one of bx
, bp
, si
, di
, and the
index
register can be either si
or di
(they can't repeat, so no
SI+DI
addressing). When base or index registers are used an optional
displacement can be used. Here's how mod
field decides the displacement as
well as interpretation of rm
:
mod |
Interpretation |
---|---|
0b00 | Memory, no displacement (*) |
0b01 | Memory, 8-bit displacement |
0b10 | Memory, 16 bit displacement |
0b11 | Register |
If mod
is 0b11
, the rm
field stores the register number. Otherwise rm
represents a pair of registers base+index
or base
depending on its value.
When mod
represents a memory operand, rm
will represent a pair of base
and index
registers, depending on its value:
rm |
base + index
|
---|---|
0b000 |
bx + si
|
0b001 |
bx + di
|
0b010 |
bp + si
|
0b011 |
bp + di
|
0b100 | si |
0b101 | di |
0b110 |
bp (*) |
0b111 | bx |
(*) If mod is 0b00
and rm
is 0b110
, the memory is represented as
[disp16]
, without base and index registers. This also means that [bp]
address is not encodeable directly and needs to be encoded as [bp+0]
.
Here's the data flow diagram for how mod/rm byte can be decoded in case of a 2-operand instruction:
In the diagram you can see the basic logic behind mod/rm decoding. For reference, here's a snippet
of how 16-bit mod/rm byte could be decoded (the ds
parameter is going to be always 16, because
in 8086 you can't override the data size):
ModRM_Byte :: bit_field u8 {
rm: u8 | 3,
rx: u8 | 3,
mod: u8 | 2,
}
rx_gpreg :: proc(size: u8, reg: u8) -> RX_Op {
return RX_Op {
kind = .GPReg,
size = size,
reg = reg,
}
}
rm_gpreg :: proc(size: u8, reg: u8) -> RM_Op {
return RM_Op {
kind = .GPReg,
size = size,
reg = reg,
}
}
rm_mem16 :: proc(size: u8, base_reg: u8, index_reg: u8, disp: i32) -> RM_Op {
return RM_Op {
kind = .Mem_Addr16,
size = 2,
base_reg = base_reg,
index_reg = index_reg,
scale = 1,
disp = disp,
}
}
decode_modrm_addr16 :: proc(bytes: []u8, modrm: ModRM_Byte, ds: u8) -> (RX_Op, RM_Op, int, bool) {
addr16_rm_table := []struct { base: u8, index: u8, } {
{ base = REG_BX, index = REG_SI },
{ base = REG_BX, index = REG_DI },
{ base = REG_BP, index = REG_SI },
{ base = REG_BP, index = REG_DI },
{ base = REG_SI, index = REG_NONE },
{ base = REG_DI, index = REG_NONE },
{ base = REG_BP, index = REG_NONE },
{ base = REG_BX, index = REG_NONE },
}
rx_op := rx_gpreg(ds, modrm.rx)
modrm_size := 0
// Early return on mod=0b11
if modrm.mod == 0b11 {
rm_op := rm_gpreg(ds, modrm.rm)
return rx_op, rm_op, modrm_size, true
}
entry := addr16_rm_table[modrm.rm]
base, index := entry.base, entry.index
// Find out displacement size
disp_size := 0
switch modrm.mod {
case 0b00:
if modrm.rm == 0b110 {
base = REG_NONE
index = REG_NONE
disp_size = 2
}
case 0b01: disp_size = 1
case 0b10: disp_size = 2
}
if len(bytes) < disp_size {
return {}, {}, 0, false
}
// Parse displacement
disp := i32(0)
switch disp_size {
case 1: disp = cast(i32) ((cast(^i8) &bytes[modrm_size])^)
case 2: disp = cast(i32) ((cast(^i16le) &bytes[modrm_size])^)
}
modrm_size += disp_size
rm_op := rm_mem16(ds, base, index, disp)
return rx_op, rm_op, modrm_size, true
}
Let's look at how x86 encodes various mov
instructions. Here are the ones we're interested in right now.
Encoding | Mnemonic |
---|---|
89 /r |
mov r/m16, r16 |
8b /r |
mov r16, r/m16 |
b8+ imm16 |
mov r16, imm16 |
a1 |
mov ax, moffs16 |
a3 |
mov moffs16, ax |
c7 /0 |
mov r/m16, imm16 |
First, we'll look at the "Encoding" column. The format of the column is opcode[+] [/modrm] [extra operand]
. If the opcode is followed by the +
sign, it means that the RX operand of the instruction is encoded in the low 3 bits of the encoding. So b8
would have AX as the operand, b9
would have CX as the operand and so on.
The /[something]
syntax specifies the mod/rm byte interpretation. First of all, it means that instruction has a mod/rm byte. In the section above we discussed that mod/rm can encode two operands, rx
and rm
. If the mod/rm is spelled as /r
, that means the opcode has a mod/rm byte that encodes two operands. If it's written as /n
, where n
is a number between 0 and 7, then this says that an instruction has a mod/rm byte but it does not have the rx
operand. Instead those 3 bits are used to extend the opcode. You can also see how in the "Mnemonic" column for c7 /0
, there's only one rm
operand.
An instruction can only have one extra operand. The extra operands in the table above are imm16
(16-bit immediate) and moffs16
(16-bit memory offset, aka displacement). A keen reader would notice that you don't need a separate encoding for mov ax, moffs16
, because mov rx, rm
encoding handles this case, however the reason this encoding exists is for us to be able to save one extra mod/rm byte when we're moving from a specific offset into ax
. Which I'm not sure is useful, but whatever.
The next thing I would like to show is that the "mnemonic" column is not needed to decode an instruction. The reason I'm doing that is to emphasize how simple the encoding really is and to set us up for future, where we table drive our decoder. The way we're going to do that is we're going to take the table above and look at the differences, then either move a part from right side of the column to the left side, or make a simplifying assumption.
First, we'll move the mnemonic to the other side of the table, that's how we'll tell which mnemonic to use for a given encoding.
Encoding | Operands |
---|---|
mov 89 /r |
r/m16, r16 |
mov 8b /r |
r16, r/m16 |
mov b8+ imm16 |
r16, imm16 |
mov a1 |
ax, moffs16 |
mov a3 |
moffs16, ax |
mov c7 /0 |
r/m16, imm16 |
Let's look at the following table entries:
Encoding | Operands |
---|---|
mov 89 /r |
r/m16, r16 |
mov 8b /r |
r16, r/m16 |
If you flip the 3rd bit of the opcode, you would change the ordering of the operands. This is called the direction bit, which is present in some of the opcodes that use the mod/rm byte. We cant represent the direction bits directly, so let's add another concept to our encoding column. Encoding flags, that will tell us some additional information about the direction.
The rm
operand is the first (destination) by default, and rx
as the second (source) by default. If the direction flag is present the order of the operands is flipped. This matches the direciton bit in the opcode (see mov 89
, direction bit not set).
We'll have all encoding flags starting with +
, and for direction bit we'll define a +d
flag. Our table would look like this:
Encoding | Operand List |
---|---|
mov 89 /r |
r/m16, r16 |
mov 8b /r +d |
r/m16, r16 |
Now let's look at the following encodings:
Encoding | Operand List |
---|---|
mov a1 disp +d |
moffs16, ax |
mov a3 disp |
moffs16, ax |
I had added disp
as an extra operand, to make it clear that it is present. I will explain why this is not a "disp16" in the next chapters, but for now let's just accept that we don't need to know what it is when specifying the encoding. Ideally, we want to say that ax
is the rx
operand, disp16
is the rm
operand and that a direction bit can flip the two, which I also added into the table.
Now all that's left to do is to say that rx
implicitly represents the ax
register, which we'll do with a new flag: +rx=ax
. This would make the table above look like this:
Encoding | Operand List |
---|---|
mov a1 disp16 +rx=ax +d |
moffs16, ax |
mov a3 disp16 +rx=ax |
moffs16, ax |
With that you can deduce the contents of the right column from the left column. So the right column will be removed and whatever's left in the left column is all of the information we need to decode an instruction.
Encoding |
---|
mov 89 /gr |
mov 8b /gr +d |
mov b8+ imm16 |
mov a1 disp +rx=ax +d |
mov a3 disp +rx=ax |
mov c7 /0 |
This iteration replaces /r
with /gr
(stands for General-purpose Register), and that's pretty much it. The reason for the change will be discussed later, but note that there are also mov
instructions that encode segment registers in mod/rm byte.
The table we had just created can be parsed with tools, and compiled into code / data for our disassembler that we can use to table-drive our decoding. However right now the patterns may still be unclear so we don't rush in with the table right now. Since we want to extend our table a little bit more, before we start hardcoding it.
To recap, we added instruction flags for direction bit and implicit operands, and the mnemonic to the encoding column, and this is all we needed to do to eliminate differences between the two columns! This sounds pretty powerful if you ask me, and we will definately use this style of notation going forward, and to create our decoding tables.