My university’s approach to teaching x86 assembly was, to put it mildly, outdated even when I first encountered it around 2008-2009. While 64-bit processors were becoming increasingly common, we were still grappling with DOS, real-mode, memory segmentation – concepts from a bygone era.
Despite the antiquated curriculum, I managed to grasp enough to decipher compiler outputs, a skill that proved surprisingly useful over the years. However, I’d never undertaken a significant x86 assembly project from scratch. With some unexpected downtime (thanks to a global pandemic), I decided to remedy that and delve into the world of assembly programming.
My focus shifted to x86-64 architecture, deliberately sidestepping legacy complexities irrelevant to modern systems. As my exploration deepened, I felt compelled to share my learning journey through tutorials on this blog, recognizing a potential demand for accessible material on learning assembly.
This series will center on crafting standard 64-bit Windows programs. Windows is my primary OS outside of work, and when you’re coding at the assembly level, the operating system becomes an unavoidable factor. My aim is to start as close to “bare metal” as possible – eschewing external libraries and interacting directly with the operating system for core functionalities.
In this introductory installment (yes, I’m embarking on a series, a decision I might later regret!), we’ll cover the essential tools, their usage, my perspective on assembly programming, and constructing what might be the most minimal yet functional Windows program.
Essential Tools for Assembly Programming
Two tools are indispensable for this series: an assembler and a debugger.
The Assembler: Translating Assembly to Machine Code
CPUs operate on machine code – a binary instruction set optimized for processor efficiency but utterly incomprehensible to humans. Assembly language serves as a human-readable counterpart to this machine code. An assembler is the program that bridges this gap, converting assembly language into executable machine code.
The x86-64 assembly language landscape isn’t governed by a single, universal standard. Numerous assemblers exist, each with its own nuances, feature sets, and syntax variations, although they share fundamental similarities. The choice of assembler therefore matters. Throughout this series, we’ll employ Flat Assembler (FASM). I favor FASM for its compact size, ease of acquisition and use, powerful macro capabilities, and integrated editor. This makes it an excellent choice for those learning assembly.
The Debugger: Peering into Program Execution
A debugger is the second critical tool in our arsenal. Debuggers allow us to scrutinize the internal state of our programs during execution. While Visual Studio offers debugging capabilities, a standalone debugger excels when focusing on disassembly, memory inspection, and register examination. Historically, I’ve relied on OllyDbg for such tasks. However, OllyDbg lacks 64-bit support, so we’ll use WinDbg instead. The Windows Store version is a modernized iteration of this classic tool with an improved interface. Alternatively, a non-Store version is available as part of the Windows 10 SDK; if you go that route, select only WinDbg during SDK installation. For our purposes, both versions are largely interchangeable.
Assembly Programming Mindset
With our toolkit assembled, let’s delve into fundamental concepts crucial for anyone who wants to learn assembly. These tutorials assume a basic familiarity with languages like C or C++, meaning some concepts might be review for many readers.
Assembly Programming: The Big Picture
CPUs possess a limited repertoire of actions. An “instruction set” defines the complete set of operations a CPU is designed to perform. An “instruction” is simply one of these CPU-executable actions. Most instructions are parameterized and inherently simple. Examples include “write an 8-bit value to a specific memory location” or “multiply the 16-bit signed integer values in registers A and B, storing the result in register A.”
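To make the two example instructions above concrete, here is a minimal FASM sketch of what they could look like on x86-64. The label name cell and the chosen values are my own illustrative assumptions, and the surrounding directives (format, entry, section) are explained later in this post.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
        mov     byte [cell], 0x2A   ; write the 8-bit value 0x2A to the memory location labelled "cell"
        mov     ax, -3              ; "register A": 16-bit signed value -3
        mov     bx, 7               ; "register B": 16-bit signed value 7
        imul    ax, bx              ; signed multiply, result stored back in ax (ax = -21)
        ret                         ; return to the OS code that started us

section '.data' data readable writeable
cell    db 0                        ; one byte of writable memory (hypothetical name)
```

Each line in the code section maps to exactly one CPU instruction, which is the whole point: there is nothing hidden between what you write and what the processor executes.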
Here’s a simplified model of the architecture we’ll start with, essential for anyone seeking to learn assembly.
Simplified CPU Architecture Diagram Illustrating Registers, Memory, and Control Unit
This model omits considerable complexity (multi-core processors, cache hierarchies, etc.), but it provides a solid foundation. Effective low-level programming and debugging hinge on understanding how high-level concepts translate to this fundamental model. Grasping this mapping is key to learning assembly effectively.
Registers: CPU’s Speedy Scratchpad
Registers are specialized memory units embedded directly within the CPU. They are extremely small but offer lightning-fast access speeds. x86-64 architecture boasts numerous register types. For now, we’ll focus on sixteen general-purpose registers, each 64 bits wide. Notably, the lower byte, word, and double-word of each register can be addressed individually (1 word = 2 bytes, 1 double-word = 4 bytes).
Register | Lower byte | Lower word | Lower dword |
---|---|---|---|
rax | al | ax | eax |
rbx | bl | bx | ebx |
rcx | cl | cx | ecx |
rdx | dl | dx | edx |
rsp | spl | sp | esp |
rsi | sil | si | esi |
rdi | dil | di | edi |
rbp | bpl | bp | ebp |
r8 | r8b | r8w | r8d |
r9 | r9b | r9w | r9d |
r10 | r10b | r10w | r10d |
r11 | r11b | r11w | r11d |
r12 | r12b | r12w | r12d |
r13 | r13b | r13w | r13d |
r14 | r14b | r14w | r14d |
r15 | r15b | r15w | r15d |
Furthermore, bits 8 to 15 of rax, rbx, rcx, and rdx (the upper byte of the lower word) are accessible as ah, bh, ch, and dh.
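The aliasing described by the table is easiest to see by writing to the sub-registers and watching rax in a debugger. The following is a minimal sketch, assuming the same program skeleton used for the first program in this post; the constants are arbitrary. One behavior shown here is not covered in the text above: writing to a 32-bit register such as eax clears the upper 32 bits of the full 64-bit register, while 8-bit and 16-bit writes leave the remaining bits untouched.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
        int3                            ; hardcoded breakpoint so we can single-step from here
        mov     rax, 0x1122334455667788 ; rax = 1122334455667788
        mov     al, 0xFF                ; only bits 0-7 change:   rax = 11223344556677FF
        mov     ah, 0xEE                ; only bits 8-15 change:  rax = 112233445566EEFF
        mov     ax, 0xABCD              ; only bits 0-15 change:  rax = 112233445566ABCD
        mov     eax, 0x12345678         ; dword write zeroes bits 32-63: rax = 0000000012345678
        ret
```

Stepping through this one instruction at a time and comparing the registers window with the comments is a quick way to internalize the table.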
Despite being termed “general-purpose,” register usage isn’t entirely arbitrary. Certain instructions mandate specific registers, and some registers have designated roles. For instance, rsp is the stack pointer, crucial for push, pop, call, and ret instructions. rsi and rdi serve as source and destination indices for string manipulation instructions. The one-operand multiplication instructions also exhibit register specificity, requiring one operand in rax and writing the result to the rdx:rax register pair (high half in rdx, low half in rax).
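As a small illustration of that register specificity, here is a sketch using the one-operand mul instruction. The choice of rcx for the second operand and the values themselves are arbitrary assumptions on my part; the skeleton is the same as the first program in this post.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
        int3
        mov     rax, 0xFFFFFFFFFFFFFFFF ; the first operand is implicitly rax
        mov     rcx, 2                  ; the second operand can be any general-purpose register
        mul     rcx                     ; unsigned multiply: the 128-bit product lands in rdx:rax
                                        ; rdx = 1 (high 64 bits), rax = FFFFFFFFFFFFFFFE (low 64 bits)
        ret
```

Note that rdx gets overwritten even though it never appears in the source line, which is exactly the kind of implicit behavior worth keeping in mind when choosing where to keep your values.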
Beyond these, rip (instruction pointer) and rflags (flags register) are special registers we’ll encounter. rip stores the address of the next instruction, modified by control flow instructions like call and jmp. rflags holds status flags reflecting program state, such as zero, sign, and carry flags resulting from arithmetic operations. Instruction behavior often depends on these flags, and many instructions update them. The flags register can also be read and written directly using dedicated instructions.
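Here is a hedged sketch of how flags tie instructions together: an arithmetic instruction sets the zero flag, a conditional jump consumes it, and pushfq (one of the dedicated instructions for accessing the flags) copies rflags to the stack. The label was_zero and the values are illustrative assumptions.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
        int3
        mov     eax, 5
        sub     eax, 5      ; the result is 0, so the zero flag (ZF) is set
        jz      was_zero    ; jz jumps only when ZF = 1, so this jump is taken
        mov     ecx, 1      ; skipped
was_zero:
        pushfq              ; push the 64-bit rflags register onto the stack
        pop     rdx         ; rdx now holds a copy of rflags (bit 6 is ZF)
        ret
```

While stepping, watch rip: after jz it moves straight to was_zero instead of the following instruction. The exact value that ends up in rdx can differ when single-stepping under a debugger, so it is best to look only at the individual bits of interest.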
x86-64 boasts many more registers, primarily for SIMD and floating-point operations, which are outside the scope of this introductory series.
Memory and Addresses: The Program’s Workspace
Imagine memory as a vast array of byte-sized “cells,” sequentially numbered from zero upwards. These numbers are “memory addresses.” Simple concept, right?
However, memory addressing was historically complex, especially in older x86 architectures. Early x86 processors used 16-bit registers, limiting direct addressability to 64 kilobytes. While hardware could handle 20-bit addresses, it relied on segment registers holding “base” addresses. The 16-bit segment value was multiplied by 16 and added to a 16-bit offset supplied by the instruction to form the final 20-bit “linear” address. Separate segment registers managed code, data, and stack segments, which could overlap, making memory management intricate.
Modern x86-64 simplifies this significantly. Segment registers for code, data, and stack still exist and are loaded with special values, but user-space programmers generally don’t need to manage them directly. For practical purposes, assume all segments start at address 0 and span the entire addressable memory range. Thus, in x86-64, programs perceive memory as a single, contiguous array of bytes with sequential addresses starting at 0: the “flat memory model.” This model makes assembly considerably easier to learn.
However, this is a simplification. Complete freedom to read and write any memory byte would lead to programs overwriting each other’s data and code – a real problem in older systems. Modern operating systems and CPUs implement protection mechanisms to prevent this. While in-depth exploration is OS developer territory, a brief overview is helpful:
Each process gets its own “virtual address space,” the flat address space described earlier. The OS manages a mapping between these virtual addresses and actual physical memory addresses for each process using paging. This mapping is enforced by hardware, dynamically translating virtual addresses to physical addresses during runtime. This is how process isolation is achieved – the same virtual address (e.g., 0x410F119C) can point to different physical memory locations in different processes.
Finally, it’s crucial to understand that instructions and the data they manipulate reside in the same memory. This is characteristic of the von Neumann architecture, contrasting with the Harvard architecture where instructions and data are in separate memories (e.g., AVR microcontrollers in Arduinos). Keep this shared memory space in mind whenever you work with x86-64 assembly.
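A quick way to convince yourself that code is just bytes in memory is to have a program read its own instructions as data. The sketch below assumes the same skeleton as the first program in this post; it relies on the fact that int3 is encoded as the single byte 0xCC and that the code section is marked readable.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
        int3                        ; encoded as the single byte 0xCC
        movzx   eax, byte [start]   ; read the first byte of our own code as data: eax = 0xCC
        ret
```

After stepping over the movzx, eax should show CC, the machine-code form of the very instruction that paused the program.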
Your First Assembly Program: Minimal and Functional
Hopefully, you’ve downloaded FASM and are eager to write code. Our first program will be deliberately simple: it will load and immediately exit. The primary goal is to familiarize ourselves with the tools.
Here’s the assembly code for our first x86-64 program:
format PE64 NX GUI 6.0
entry start
section '.text' code readable executable
start:
int3
ret
Code Breakdown: Line by Line
Let’s dissect this code line by line.
- format PE64 NX GUI 6.0: This FASM directive specifies the output binary format: Portable Executable (PE), the standard for Windows programs. PE64 selects the 64-bit variant, NX marks the image as compatible with data execution prevention, GUI selects the Windows GUI subsystem, and 6.0 is the subsystem version. We’ll delve into PE format details later.
- entry start: Defines the program’s entry point, the starting execution address. “start” is a label, a symbolic name for an address within our program. This line tells the assembler that program execution begins at the address marked by the “start” label. Labels can be referenced even before their definition in the code.
- section '.text' code readable executable: This directive initiates a new section within the PE file, named ‘.text’, designated for executable code. Sections are fundamental to the PE format, as we’ll see later.
- The line start: defines the “start” label, marking the program’s entry point as specified in the entry directive. Labels themselves don’t generate machine code; they are merely markers within the executable’s address space.
- int3: This is a special instruction that raises a breakpoint exception. When executed under a debugger, int3 pauses the program, allowing inspection of its state and step-by-step execution. This is the mechanism behind breakpoints: debuggers replace an instruction byte with the int3 opcode. We’re hardcoding a breakpoint at the entry point for convenience, avoiding manual breakpoint setup in the debugger every time.
- ret: The ret instruction (return) pops an address from the top of the stack and transfers program execution to that address. In our case, it will return control to the OS code that initially launched our program.
Launch FASMW.EXE, paste this code into the editor, save the file, and press Ctrl+F9. Congratulations, you’ve assembled your first assembly program! Now, let’s load it into a debugger and step through its execution to see it in action.
Debugging Your First Program with WinDbg
Open WinDbg. In the “View” tab, ensure “Disassembly,” “Registers,” “Stack,” “Memory,” and “Command” windows are visible. Go to “File” > “Launch Executable” and select the executable you just built with FASM. Your WinDbg workspace should now resemble this:
WinDbg Initial View Before Program Execution
The disassembly window displays the code currently being executed. Initially, it’s OS loader code, responsible for loading our program into memory and eventually transferring control to our entry point. WinDbg starts with a breakpoint before any of this occurs.
The registers window shows the current values of the x86-64 registers we discussed earlier.
The memory window displays raw memory contents around a specified virtual address, useful for data inspection later on.
The stack window shows the current call stack (initially within ntdll.dll, the Windows NT DLL).
The command window allows entering text commands and displays debugger output.
Pressing F5 at this point continues program execution until the next breakpoint. The next breakpoint will be the int3 we hardcoded. Press F5, and you should see something like this:
WinDbg View Paused at the INT3 Breakpoint in Our Program
You should recognize the int3 and ret instructions from our code in the disassembly window. To execute the next instruction, press F10 (step over) or F8 (step into); for these simple instructions both behave identically. Observe the registers window; the rip register updates as you step through instructions (WinDbg highlights changed registers in red).
After executing ret, program control returns to the OS code that invoked our program’s entry point.
WinDbg View After Stepping Over the RET Instruction
As shown, the next action is a call to RtlExitUserThread, a function with a self-explanatory name. Pressing F5 now will allow your program’s main thread to terminate, and the program will exit. Or will it?
Using ret is a shortcut. On Windows, a process terminates under either of these conditions:
- A thread explicitly calls the WinAPI function ExitProcess.
- All threads within the process have exited.
While our main thread is exiting, there’s no guarantee that Windows hasn’t initiated other background threads (for DLL loading, etc.) within our process. It appears that in this simple case, the main thread is indeed the only thread (checking confirms the process terminates). However, this behavior isn’t guaranteed and could change. A robust Windows program should always explicitly call ExitProcess when it intends to terminate.
To call WinAPI functions like ExitProcess, we need to understand the Portable Executable file format, DLL loading, and calling conventions.
PE Format and DLL Imports: Calling External Functions
The ExitProcess function resides within KERNEL32.DLL (KERNEL32 is indeed the 64-bit library name; 32-bit compatibility versions are in SysWOW64 – not a typo!). To use it, we must import it into our program.
We won’t exhaustively cover the Portable Executable (PE) format here. The Microsoft documentation provides comprehensive details. The key PE format aspects for our purposes are:
- PE files are structured into sections. We’ve already seen the ‘.text’ section for code. Sections can hold various data types.
- Import information – which symbols are imported from which DLLs – is stored in the ‘.idata’ section.
Let’s examine the ‘.idata’ section structure. As per the PE format documentation, the ‘.idata’ section starts with an Import Directory Table (IDT). Each IDT entry corresponds to a DLL, is 20 bytes long, and contains these fields:
- A 4-byte Relative Virtual Address (RVA) of the Import Lookup Table (ILT), holding imported function names.
- A 4-byte timestamp (typically 0).
- A 4-byte forwarder chain index (typically 0).
- A 4-byte RVA of a null-terminated DLL name string.
- A 4-byte RVA of the Import Address Table (IAT). The IAT structure mirrors the ILT. However, the loader modifies the IAT at runtime, overwriting each entry with the actual address of the imported function. Theoretically, ILT and IAT can point to the same memory. Setting the ILT pointer to zero can also work, though official support is uncertain.
The IDT is terminated by a zero-filled entry. The ILT/IAT is a null-terminated array of 64-bit values. When importing by name, the lower 31 bits of each entry are the RVA of a hint/name table entry containing the imported function name (the top bit of the entry, if set, selects import by ordinal instead). During runtime, IAT entries are replaced with the actual function addresses.
The hint/name table contains entries, each even-boundary aligned. Each entry begins with a 2-byte hint (ignored for now) and a null-terminated imported function name string, followed by a null byte (if needed) for even alignment.
With this PE format understanding, let’s define our executable’s ‘.idata’ section in FASM:
section '.idata' import readable writeable
idt: ; import directory table starts here
; entry for KERNEL32.DLL
dd rva kernel32_iat
dd 0
dd 0
dd rva kernel32_name
dd rva kernel32_iat
; NULL entry - end of IDT
dd 5 dup(0)
name_table: ; hint/name table
_ExitProcess_Name:
dw 0
db "ExitProcess", 0, 0
kernel32_name:
db "KERNEL32.DLL", 0
kernel32_iat: ; import address table for KERNEL32.DLL
ExitProcess:
dq rva _ExitProcess_Name
dq 0 ; end of KERNEL32's IAT
The section directive is familiar. Here, ‘.idata’ is for import data and must be writeable because imported function addresses will be written into it during program loading.
The db, dw, dd, and dq directives emit raw byte, word, double-word, and quad-word values, respectively. The rva operator yields the Relative Virtual Address of its operand. For example, dd rva kernel32_iat emits a 4-byte value equal to the RVA of the kernel32_iat label.
We’ve used FASM directives to describe the ‘.idata’ section content byte by byte, which is what lets our program interact with external libraries.
64-bit Windows Calling Convention: Function Communication
We’re nearing the point of calling ExitProcess. But first, how do function calls actually work at the assembly level? The call instruction pushes the current rip value (the address of the instruction following the call) onto the stack and jumps to the target address. The ret instruction pops an address from the stack and jumps there. However, argument passing and return value handling aren’t inherently defined by these instructions. These are governed by calling conventions, agreements between caller and callee. Conventions dictate rules like:
- Caller pushes arguments onto the stack (last to first).
- Callee removes parameters from the stack before returning.
- Callee places return values in the eax register.
A calling convention is such a set of rules. Many conventions exist. When calling functions in assembly, knowing the expected calling convention is essential.
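Before adding arguments to the mix, it helps to watch call and ret on their own. The sketch below uses a hypothetical helper named get_answer (my own invention, not part of any API) that takes no arguments and leaves a value in eax, mirroring the example rule above.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
        int3
        call    get_answer  ; pushes the address of the next instruction, then jumps to get_answer
        ret                 ; back to the OS code that called our entry point

get_answer:                 ; hypothetical helper, not part of any API
        mov     eax, 42     ; leave the return value in eax
        ret                 ; pops the return address pushed by call and jumps to it
```

If you step into the call in a debugger, rsp should decrease by 8 as the return address is pushed, and increase by 8 again when the helper’s ret pops it.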
Fortunately, 64-bit Windows primarily uses one calling convention: the Microsoft x64 calling convention. It’s more complex than older conventions, passing initial arguments in registers (for performance) rather than solely on the stack.
For our purposes, key aspects of the x64 calling convention are:
- The stack pointer must be 16-byte aligned at the point of each call instruction.
- The first four integer/pointer arguments are passed in registers rcx, rdx, r8, and r9. The first four floating-point arguments are passed in xmm0 to xmm3. A register is tied to the argument’s position, so a floating-point second argument goes in xmm1 and rdx is left unused. Subsequent arguments are passed on the stack.
- Even for register-passed arguments, the caller must allocate 32 bytes of “shadow space” on the stack. This applies even with fewer than four arguments.
- The caller is responsible for stack cleanup.
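To see these rules working together, here is a sketch that calls a function with five integer arguments, so that the fifth one has to go on the stack. The function sum5 is a hypothetical helper I made up for illustration; it is not a WinAPI function, and the argument values are arbitrary.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
        int3
        sub     rsp, 8 * 5          ; 32 bytes of shadow space + 8 bytes for the fifth argument;
                                    ; subtracting 40 also restores 16-byte alignment
        mov     rcx, 1              ; argument 1
        mov     rdx, 2              ; argument 2
        mov     r8, 3               ; argument 3
        mov     r9, 4               ; argument 4
        mov     qword [rsp + 32], 5 ; argument 5 sits just above the shadow space
        call    sum5                ; on return, rax = 15
        add     rsp, 8 * 5          ; the caller cleans up the stack
        ret

; sum5 (hypothetical): returns the sum of its five integer arguments in rax.
sum5:
        ; On entry: [rsp] = return address, [rsp+8 .. rsp+39] = shadow space,
        ; [rsp+40] = fifth argument.
        mov     rax, rcx
        add     rax, rdx
        add     rax, r8
        add     rax, r9
        add     rax, [rsp + 40]
        ret
```

The offsets differ by 8 between caller and callee because call pushes the return address in between: what the caller wrote at rsp + 32 is found by the callee at rsp + 40.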
Armed with this knowledge, we can finally call ExitProcess:
format PE64 NX GUI 6.0
entry start
section '.text' code readable executable
start:
int3
sub rsp, 8 * 5 ; adjust stack ptr and allocate shadow space
xor rcx, rcx ; The first and only argument (exit code) in rcx.
call [ExitProcess]
section '.idata' import readable writeable
idt: ; import directory table starts here
; entry for KERNEL32.DLL
dd rva kernel32_iat
dd 0
dd 0
dd rva kernel32_name
dd rva kernel32_iat
; NULL entry - end of IDT
dd 5 dup(0)
name_table: ; hint/name table
_ExitProcess_Name:
dw 0
db "ExitProcess", 0, 0
kernel32_name:
db "KERNEL32.DLL", 0
kernel32_iat: ; import address table for KERNEL32.DLL
ExitProcess:
dq rva _ExitProcess_Name
dq 0 ; end of KERNEL32's IAT
New lines explained:
- sub rsp, 8 * 5: The sub instruction subtracts the second operand from the first, storing the result in the first. Here, we subtract 40 from the stack pointer (rsp). Stacks grow downwards in memory, so subtraction allocates stack space. This line aligns the stack to a 16-byte boundary and allocates shadow space for the first four arguments simultaneously. Before the OS code called our entry point, the stack pointer was 16-byte aligned. The call instruction pushed a return address (8 bytes), misaligning it. Subtracting another 8 bytes re-aligns it, and 32 more bytes account for shadow space, hence 40 (5 * 8).
- xor rcx, rcx: The first integer argument for ExitProcess (the exit code) is passed in rcx. xor rcx, rcx efficiently sets rcx to zero (bitwise XOR of a value with itself always results in zero). We’re setting the exit code to 0.
- call [ExitProcess]: This line finally calls ExitProcess. Square brackets [] denote indirection. Instead of calling the address of the ExitProcess label directly, the value at that memory address is used as the call target. The ExitProcess label points to the import table entry where the loader has placed the actual address of the ExitProcess function from KERNEL32.DLL.
Run this in WinDbg, step through it, and observe the registers (rsp and rcx) as ExitProcess is called.
WinDbg View Stepping Through the Call to ExitProcess
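As a small variation to experiment with, the sketch below exits with code 42 instead of 0 and spells out the indirection of call [ExitProcess] in two steps. The ‘.idata’ section is the same as above; the exit code and the use of rax as a temporary are arbitrary choices of mine.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
        int3
        sub     rsp, 8 * 5          ; realign the stack and reserve shadow space
        mov     ecx, 42             ; exit code 42 (a write to ecx zeroes the upper half of rcx)
        mov     rax, [ExitProcess]  ; load the function address the loader stored in the IAT slot
        call    rax                 ; same effect as call [ExitProcess]

section '.idata' import readable writeable
        ; import directory table: one entry for KERNEL32.DLL
        dd rva kernel32_iat
        dd 0
        dd 0
        dd rva kernel32_name
        dd rva kernel32_iat
        dd 5 dup(0)                 ; null entry terminates the table

_ExitProcess_Name:
        dw 0
        db "ExitProcess", 0, 0
kernel32_name:
        db "KERNEL32.DLL", 0
kernel32_iat:
ExitProcess:
        dq rva _ExitProcess_Name
        dq 0                        ; end of KERNEL32's IAT
```

After stepping over the mov rax line, rax should hold an address inside KERNEL32.DLL rather than inside our own executable, which shows that the loader really did patch the import table.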
That concludes the first part of this series. Next time, we’ll move beyond simple program exit and explore more interesting assembly programming concepts.