Kickstart Your Journey to Learn Assembly Language: A Beginner's Guide to x86-64

My university’s approach to teaching x86 assembly was, to put it mildly, outdated even when I first encountered it around 2008-2009. While 64-bit processors were becoming increasingly common, we were still grappling with DOS, real-mode, memory segmentation – concepts from a bygone era.

Despite the antiquated curriculum, I managed to grasp enough to decipher compiler outputs, a skill that proved surprisingly useful over the years. However, I’d never undertaken a significant x86 assembly project from scratch. With some unexpected downtime (thanks to a global pandemic), I decided to remedy that and delve into the world of assembly programming.

My focus shifted to x86-64 architecture, deliberately sidestepping legacy complexities irrelevant to modern systems. As my exploration deepened, I felt compelled to share what I was learning through tutorials on this blog, since there seemed to be demand for accessible material on learning assembly.

This series will center on crafting standard 64-bit Windows programs. Windows is my primary OS outside of work, and when you’re coding at the assembly level, the operating system becomes an unavoidable factor. My aim is to start as close to “bare metal” as possible – eschewing external libraries and interacting directly with the operating system for core functionalities.

In this introductory installment (yes, I’m embarking on a series, a decision I might later regret!), we’ll cover the essential tools, their usage, my perspective on assembly programming, and constructing what might be the most minimal yet functional Windows program.

Essential Tools for Assembly Programming

Two tools are indispensable for this series.

The Assembler: Translating Assembly to Machine Code

CPUs operate on machine code – a binary instruction set optimized for processor efficiency but utterly incomprehensible to humans. Assembly language serves as a human-readable counterpart to this machine code. An assembler is the program that bridges this gap, converting assembly language into executable machine code.

The x86-64 assembly language landscape isn’t governed by a single, universal standard. Numerous assemblers exist, each with its own nuances, feature sets, and syntax variations, although they share fundamental similarities. The choice of assembler therefore matters: source written for one will generally not assemble unchanged with another. Throughout this series, we’ll use Flat Assembler (FASM). I favor FASM for its compact size, ease of acquisition and use, powerful macro capabilities, and integrated editor, all of which make it a good fit for beginners.

The Debugger: Peering into Program Execution

A debugger is the second critical tool in our arsenal. Debuggers allow us to scrutinize the internal state of our programs during execution. While Visual Studio offers debugging capabilities, a standalone debugger excels when focusing on disassembly, memory inspection, and register examination. Historically, I’ve relied on OllyDbg for such tasks. However, OllyDbg lacks 64-bit support, so we’ll use WinDbg instead. The version distributed through the Windows Store is a modernized iteration of this classic tool with an improved interface. Alternatively, a non-Store version is available as part of the Windows 10 SDK; if you go that route, select only the debugging tools (WinDbg) during SDK installation. For our purposes, both versions are largely interchangeable.

Assembly Programming Mindset

With our toolkit assembled, let’s delve into fundamental concepts crucial for anyone who wants to learn assembly. These tutorials assume a basic familiarity with languages like C or C++, meaning some concepts might be review for many readers.

Assembly Programming: The Big Picture

CPUs possess a limited repertoire of actions. An “instruction set” defines the complete set of operations a CPU is designed to perform. An “instruction” is simply one of these CPU-executable actions. Most instructions are parameterized and inherently simple. Examples include “write an 8-bit value to a specific memory location” or “multiply the 16-bit signed integer values in registers A and B, storing the result in register A.”
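
To make these two examples concrete, here is a minimal sketch of how they might look in FASM syntax. It borrows the program skeleton (format, entry, section, int3, ret) that is explained later in this post. The scratch label and the '.data' section are assumptions introduced purely for illustration, and ax and cx stand in for “register A” and “register B”.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
    int3                    ; hardcoded breakpoint, so we can step in a debugger
    mov byte [scratch], 42  ; write an 8-bit value to a specific memory location
    mov ax, -3              ; "register A": a 16-bit signed value
    mov cx, 7               ; "register B"
    imul ax, cx             ; ax = ax * cx = -21 (0xFFEB); the result stays in ax
    ret

section '.data' data readable writeable
scratch db 0                ; hypothetical one-byte variable
```

Each line in the body of start is exactly one instruction, and each does one small, parameterized thing.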

Here’s a simplified model of the architecture we’ll start with.

Simplified CPU Architecture Diagram Illustrating Registers, Memory, and Control Unit

This model omits considerable complexity (multi-core processors, cache hierarchies, etc.), but it provides a solid foundation. Effective low-level programming and debugging hinge on understanding how high-level concepts translate to this fundamental model. Grasping this mapping is key to learning assembly effectively.

Registers: CPU’s Speedy Scratchpad

Registers are specialized memory units embedded directly within the CPU. They are extremely small but offer lightning-fast access speeds. x86-64 architecture boasts numerous register types. For now, we’ll focus on sixteen general-purpose registers, each 64 bits wide. Notably, the lower byte, word, and double-word of each register can be addressed individually (1 word = 2 bytes, 1 double-word = 4 bytes).

Register	Lower byte	Lower word	Lower dword
rax	al	ax	eax
rbx	bl	bx	ebx
rcx	cl	cx	ecx
rdx	dl	dx	edx
rsp	spl	sp	esp
rsi	sil	si	esi
rdi	dil	di	edi
rbp	bpl	bp	ebp
r8	r8b	r8w	r8d
r9	r9b	r9w	r9d
r10	r10b	r10w	r10d
r11	r11b	r11w	r11d
r12	r12b	r12w	r12d
r13	r13b	r13w	r13d
r14	r14b	r14w	r14d
r15	r15b	r15w	r15d

Furthermore, bits 8 through 15 of rax, rbx, rcx, and rdx (that is, the upper byte of ax, bx, cx, and dx) are accessible as ah, bh, ch, and dh.
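
A short sketch of how these aliases behave, using the program skeleton explained later in this post. The constants are arbitrary. The last instruction also shows a detail the text above does not cover: writing a 32-bit register clears the upper 32 bits of the full 64-bit register, whereas 8-bit and 16-bit writes leave the remaining bits untouched.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
    int3
    mov rax, 0x1122334455667788 ; rax = 0x1122334455667788
    mov al, 0xAA                ; rax = 0x11223344556677AA (bits 0-7 replaced)
    mov ah, 0xBB                ; rax = 0x112233445566BBAA (bits 8-15 replaced)
    mov ax, 0xCCCC              ; rax = 0x112233445566CCCC (bits 0-15 replaced)
    mov eax, 0xDDDDDDDD         ; rax = 0x00000000DDDDDDDD (upper 32 bits zeroed)
    ret
```

Stepping through this in a debugger and watching rax after each instruction is a quick way to internalize the table.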

Despite being termed “general-purpose,” register usage isn’t entirely arbitrary. Certain instructions mandate specific registers, and some registers have designated roles. For instance, rsp is the stack pointer, crucial for push, pop, call, and ret instructions. rsi and rdi serve as source and destination indices for string manipulation instructions. The widening multiplication instructions (the one-operand forms of mul and imul) are similarly tied to specific registers: one operand must be in rax, and the 128-bit result is written to the rdx:rax register pair, with the upper half in rdx.
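
As an illustration of that register specificity, here is a minimal sketch of a one-operand mul. The operand values are arbitrary, chosen so that the product overflows 64 bits; the 128-bit result is split across rdx (upper half) and rax (lower half).

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
    int3
    mov rax, -1     ; rax = 0xFFFFFFFFFFFFFFFF (implicit first operand)
    mov rcx, 2      ; explicit second operand
    mul rcx         ; unsigned multiply: rdx:rax = rax * rcx
                    ; rdx = 0x0000000000000001 (upper 64 bits)
                    ; rax = 0xFFFFFFFFFFFFFFFE (lower 64 bits)
    ret
```

Note that mul names only rcx; rax and rdx are implied by the instruction itself, and whatever rdx held before is overwritten.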

Beyond these, rip (instruction pointer) and rflags (flags register) are special registers we’ll encounter. rip stores the address of the next instruction, modified by control flow instructions like call and jmp. rflags holds status flags reflecting program state, such as zero, sign, and carry flags resulting from arithmetic operations. Instruction behavior often depends on these flags, and many instructions update them. The flags register can also be read and written directly using dedicated instructions.
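
The interplay between rflags and rip can be sketched as follows. This is an illustrative example rather than anything we will need later; pushfq is one of the dedicated instructions for reading the flags, and the bit position of the zero flag (bit 6) comes from the processor documentation, not from the text above.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
    int3
    mov eax, 5
    sub eax, 5      ; the result is zero, so the zero flag (ZF) is set
    pushfq          ; push the current rflags value onto the stack
    pop rdx         ; rdx now holds a copy of rflags (ZF is bit 6)
    jz .was_zero    ; taken, since pushfq and pop leave the flags unchanged
    mov ecx, 1      ; skipped
    ret
.was_zero:
    xor ecx, ecx    ; reached via the jump; rip was redirected here
    ret
```

When stepping through this, rip jumps past the mov ecx, 1 instruction because ZF was set by the sub.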

x86-64 has many more registers, primarily for SIMD and floating-point operations, which are outside the scope of this introductory series.

Memory and Addresses: The Program’s Workspace

Imagine memory as a vast array of byte-sized “cells,” sequentially numbered from zero upwards. These numbers are “memory addresses.” Simple concept, right?

However, memory addressing was historically complex, especially in older x86 architectures. Early x86 processors used 16-bit registers, limiting direct addressability to 64 kilobytes. The hardware could handle 20-bit addresses (1 megabyte), but only with the help of segment registers that supplied a “base” address: the 16-bit segment value was multiplied by 16 and added to the 16-bit offset used by the instruction, yielding the final 20-bit “linear” address. Separate segment registers managed code, data, and stack segments, which could overlap, making memory management intricate.
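
For the curious, the real-mode calculation was linear address = segment × 16 + offset. The sketch below merely reproduces that arithmetic in ordinary 64-bit code (it is not real-mode code, and the segment and offset values are arbitrary examples). It also shows why segments could overlap: two different segment:offset pairs yield the same linear address.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
    int3
    mov eax, 0xB800     ; segment value
    shl eax, 4          ; multiply by 16: eax = 0xB8000
    add eax, 0x0010     ; add the offset:  eax = 0xB8010

    mov edx, 0xB801     ; a different segment value
    shl edx, 4          ; edx = 0xB8010
    add edx, 0x0000     ; offset 0:        edx = 0xB8010

    cmp eax, edx        ; both pairs name the same byte, so ZF is set
    ret
```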

Modern x86-64 simplifies this significantly. Segment registers for code, data, and stack still exist and are loaded with special values, but user-space programmers generally don’t need to manage them directly. For practical purposes, assume all segments start at address 0 and span the entire addressable memory range. Thus, in x86-64, programs perceive memory as a single, contiguous array of bytes with sequential addresses starting at 0 – the “flat memory model.” This flat memory model makes assembly considerably easier to learn.

However, this is a simplification. Complete freedom to read and write any memory byte would lead to programs overwriting each other’s data and code – a real problem in older systems. Modern operating systems and CPUs implement protection mechanisms to prevent this. While in-depth exploration is OS developer territory, a brief overview is helpful:

Each process gets its own “virtual address space,” the flat address space described earlier. The OS manages a mapping between these virtual addresses and actual physical memory addresses for each process using paging. This mapping is enforced by hardware, dynamically translating virtual addresses to physical addresses during runtime. This is how process isolation is achieved – the same virtual address (e.g., 0x410F119C) can point to different physical memory locations in different processes.

Finally, it’s crucial to understand that instructions and the data they manipulate reside in the same memory. This is characteristic of the von Neumann architecture, contrasting with the Harvard architecture where instructions and data are in separate memories (e.g., AVR microcontrollers in Arduinos). This shared memory space is a fundamental property of x86-64.
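
One way to see this shared memory in action is to have a program read its own instructions as data. The sketch below uses the program skeleton introduced in the next section. The only outside fact it relies on is that int3 is encoded as the single byte 0xCC.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
    int3                        ; assembled into the single byte 0xCC
    movzx eax, byte [start]     ; read the first byte of our own code as data
                                ; eax = 0xCC
    ret
```

The load works because the code section is marked readable; the same bytes that the CPU fetches as instructions are visible to ordinary memory reads.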

Your First Assembly Program: Minimal and Functional

Hopefully, you’ve downloaded FASM and are eager to write code. Our first program will be deliberately simple: it will load and immediately exit. The primary goal is to familiarize ourselves with the tools.

Here’s the assembly code for our first x86-64 program:

format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
    int3
    ret

Code Breakdown: Line by Line

Let’s dissect this code line by line.

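format PE64 NX GUI 6.0: This FASM directive specifies the output binary format: the 64-bit variant of Portable Executable (PE), the standard format for Windows programs. NX marks the image as compatible with no-execute memory protection, GUI selects the graphical subsystem (so no console window is created), and 6.0 is the minimum subsystem version. We’ll delve into PE format details later.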
entry start: Defines the program’s entry point, the starting execution address. “start” is a label, a symbolic name for an address within our program. This line tells the assembler that program execution begins at the address marked by the “start” label. Labels can be referenced even before their definition in the code.
section '.text' code readable executable: This directive initiates a new section within the PE file, named ‘.text’, designated for executable code. Sections are fundamental to the PE format, as we’ll see later.
start:: This line defines the “start” label, marking the program’s entry point as specified in the entry directive. Labels themselves don’t generate machine code; they are merely named markers for addresses within the executable’s address space.
int3: This is a special one-byte instruction (opcode 0xCC) that raises a breakpoint exception. When the program runs under a debugger, the debugger catches this exception and pauses the program, allowing inspection of its state and step-by-step execution. This is the mechanism behind software breakpoints: the debugger temporarily replaces the first byte of an instruction with the int3 opcode. We’re hardcoding a breakpoint at the entry point for convenience, avoiding manual breakpoint setup in the debugger every time.
ret: The ret instruction (return) pops an address from the stack top and transfers program execution to that address. In our case, it will return control to the OS code that initially launched our program.
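
To underline that assembly is just a readable notation for machine code, here is a sketch of the same program with both instructions written as raw bytes. The opcode values (0xCC for int3, 0xC3 for a near ret) come from the processor's instruction reference. The db directive, covered in more detail later, emits a byte verbatim.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
    db 0xCC     ; the byte the assembler emits for int3
    db 0xC3     ; the byte the assembler emits for ret
```

This should assemble to the same two code bytes as the version above. The start label contributes nothing to the output; it only names the address of the first byte.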

Launch FASMW.EXE, paste this code into the editor, save the file, and press Ctrl+F9. Congratulations, you’ve assembled your first assembly program! Now, let’s load it into a debugger and step through its execution to see it in action.

Debugging Your First Program with WinDbg

Open WinDbg. In the “View” tab, ensure “Disassembly,” “Registers,” “Stack,” “Memory,” and “Command” windows are visible. Go to “File” > “Launch Executable” and select the executable you just built with FASM. Your WinDbg workspace should now resemble this:

WinDbg Initial View Before Program Execution

The disassembly window displays the code currently being executed. Initially, it’s OS loader code, responsible for loading our program into memory and eventually transferring control to our entry point. WinDbg starts with a breakpoint before any of this occurs.

The registers window shows the current values of the x86-64 registers we discussed earlier.

The memory window displays raw memory contents around a specified virtual address, useful for inspecting data later on.

The stack window shows the current call stack (initially within ntdll.dll, the Windows NT DLL).

The command window allows entering text commands and displays debugger output.

Pressing F5 at this point continues program execution until the next breakpoint. The next breakpoint will be the int3 we hardcoded. Press F5, and you should see something like this:

WinDbg View Paused at the INT3 Breakpoint in Our Program

You should recognize the int3 and ret instructions from our code in the disassembly window. To execute the next instruction, press F10 (step over); F8 and F11 step into calls instead, which makes no difference for the two instructions here. Observe the registers window; the rip register updates as you step through instructions (WinDbg highlights changed registers in red).

After executing ret, program control returns to the OS code that invoked our program’s entry point.

WinDbg View After Stepping Over the RET Instruction

As shown, the next action is a call to RtlExitUserThread, a function with a self-explanatory name. Pressing F5 now will allow your program’s main thread to terminate, and the program will exit. Or will it?

Using ret is a shortcut. On Windows, a process terminates when either of the following happens:

A thread explicitly calls the WinAPI function ExitProcess.
All threads within the process have exited.

While our main thread is exiting, there’s no guarantee that Windows hasn’t initiated other background threads (for DLL loading, etc.) within our process. It appears that in this simple case, the main thread is indeed the only thread (checking confirms the process terminates). However, this behavior isn’t guaranteed and could change. A robust Windows program should always explicitly call ExitProcess when intended to terminate.

To call WinAPI functions like ExitProcess, we need to understand the Portable Executable file format, DLL loading, and calling conventions.

PE Format and DLL Imports: Calling External Functions

The ExitProcess function resides within KERNEL32.DLL (KERNEL32 is indeed the 64-bit library name; 32-bit compatibility versions are in SysWOW64 – not a typo!). To use it, we must import it into our program.

We won’t exhaustively cover the Portable Executable (PE) format here. The Microsoft documentation provides comprehensive details. Key PE format aspects for our purposes include:

PE files are structured into sections. We’ve already seen the ‘.text’ section for code. Sections can hold various data types.
Import information – which symbols are imported from which DLLs – is stored in the ‘.idata’ section.

Let’s examine the ‘.idata’ section structure. As per the PE format documentation, the ‘.idata’ section starts with an Import Directory Table (IDT). Each IDT entry corresponds to a DLL, is 20 bytes long, and contains these fields:

A 4-byte Relative Virtual Address (RVA) of the Import Lookup Table (ILT), which identifies the imported functions (each entry refers to a function name).
A 4-byte timestamp (typically 0).
A 4-byte forwarder chain index (typically 0).
A 4-byte RVA of a null-terminated DLL name string.
A 4-byte RVA of the Import Address Table (IAT). The IAT structure mirrors the ILT. However, the loader modifies the IAT at runtime, overwriting each entry with the actual address of the imported function. Theoretically, ILT and IAT can point to the same memory. Setting the ILT pointer to zero can also work, though official support is uncertain.

The IDT is terminated by a zero-filled entry. The ILT/IAT is a null-terminated array of 64-bit values. When the top bit (bit 63) of an entry is clear, the function is imported by name, and the lower 31 bits of the entry are the RVA of a hint/name table entry containing the imported function name. (A set top bit means import by ordinal, which we won’t use.) At load time, IAT entries are replaced with the actual function addresses.

The hint/name table contains entries, each even-boundary aligned. Each entry begins with a 2-byte hint (ignored for now) and a null-terminated imported function name string, followed by a null byte (if needed) for even alignment.

With this PE format understanding, let’s define our executable’s ‘.idata’ section in FASM:

section '.idata' import readable writeable
idt:    ; import directory table starts here
    ; entry for KERNEL32.DLL
    dd rva kernel32_iat
    dd 0
    dd 0
    dd rva kernel32_name
    dd rva kernel32_iat
    ; NULL entry - end of IDT
    dd 5 dup(0)

name_table: ; hint/name table
_ExitProcess_Name:
    dw 0
    db "ExitProcess", 0, 0

kernel32_name:
    db "KERNEL32.DLL", 0

kernel32_iat: ; import address table for KERNEL32.DLL
ExitProcess:
    dq rva _ExitProcess_Name
    dq 0    ; end of KERNEL32's IAT

The section directive is familiar. Here, ‘.idata’ is for import data and must be writeable because imported function addresses will be written into it during program loading.

db, dw, dd, and dq directives emit raw byte, word, double-word, and quad-word values, respectively. The rva operator yields the Relative Virtual Address of its operand. dd rva kernel32_iat emits a 4-byte value equal to the RVA of the kernel32_iat label.
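
A small sketch of these directives in isolation may help. The '.data' section and every label in it are hypothetical names chosen for illustration; the code simply loads each value into a register of matching width so the results can be inspected in a debugger.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
    int3
    mov al,  [one_byte]     ; al  = 0x11
    mov cx,  [one_word]     ; cx  = 0x2222
    mov edx, [one_dword]    ; edx = 0x33333333
    mov r8,  [one_qword]    ; r8  = 0x4444444444444444
    mov r9d, [label_rva]    ; r9d = RVA of one_byte (its offset from the image base)
    ret

section '.data' data readable writeable
one_byte  db 0x11                   ; 1 byte
one_word  dw 0x2222                 ; 2 bytes
one_dword dd 0x33333333             ; 4 bytes
one_qword dq 0x4444444444444444     ; 8 bytes
label_rva dd rva one_byte           ; 4-byte RVA, as used in the import tables
zeros     dd 5 dup(0)               ; five zero dwords, 20 bytes in total
```

The last line uses the same dup form that terminates the import directory table: 20 zero bytes, exactly one empty IDT entry.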

We’ve used FASM directives to precisely describe the ‘.idata’ section content, byte by byte, which is all it takes to import a function from an external library.

64-bit Windows Calling Convention: Function Communication

We’re nearing the point of calling ExitProcess. But first, how do function calls actually work at the assembly level? The call instruction pushes the return address (the address of the instruction following the call, which is what rip holds at that moment) onto the stack and jumps to the target address. The ret instruction pops an address from the stack and jumps there. However, argument passing and return value handling aren’t defined by these instructions. They are governed by calling conventions, agreements between caller and callee. A convention might dictate rules like these, which resemble the 32-bit stdcall convention:

Caller pushes arguments onto the stack (last to first).
Callee removes parameters from the stack before returning.
Callee places return values in the eax register.

A calling convention is such a set of rules. Many conventions exist. When calling functions in assembly, knowing the expected calling convention is essential.

Fortunately, 64-bit Windows primarily uses one calling convention: the Microsoft x64 calling convention. It’s more complex than older conventions, passing initial arguments in registers (for performance) rather than solely on the stack.

For our purposes, key aspects of the x64 calling convention are:

The stack pointer must be 16-byte aligned at the point of each call instruction.
First four integer/pointer arguments are passed in registers rcx, rdx, r8, and r9. First four floating-point arguments in xmm0 to xmm3. Subsequent arguments are stack-passed.
Even for register-passed arguments, the caller must allocate 32 bytes of “shadow space” on the stack. This applies even with fewer than four arguments.
The caller is responsible for stack cleanup.
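
Before involving the operating system, these rules can be tried out on a function of our own. The sketch below is illustrative only: add_two is a hypothetical helper, and it relies on one more rule of the convention that is not listed above, namely that integer return values come back in rax. The stack adjustment follows the same reasoning as the ExitProcess example that comes next.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
    int3
    sub rsp, 8 * 5          ; realign the stack (8 bytes) and reserve shadow space (32 bytes)
    mov ecx, 2              ; first integer argument goes in rcx
    mov edx, 3              ; second integer argument goes in rdx
    call add_two            ; pushes the return address and jumps to add_two
    add rsp, 8 * 5          ; the caller releases what it allocated
    ret                     ; eax still holds the result, 5

add_two:                    ; hypothetical helper: returns rcx + rdx
    lea eax, [rcx + rdx]    ; the return value goes in rax (here its lower half, eax)
    ret                     ; pops the return address and jumps back to the caller
```

add_two never touches its shadow space, but the caller must reserve it anyway; the callee is allowed to use it as scratch storage.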

Armed with this knowledge, we can finally call ExitProcess:

format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
    int3
    sub rsp, 8 * 5  ; adjust stack ptr and allocate shadow space
    xor rcx, rcx    ; The first and only argument (exit code) in rcx.
    call [ExitProcess]

section '.idata' import readable writeable
idt:    ; import directory table starts here
    ; entry for KERNEL32.DLL
    dd rva kernel32_iat
    dd 0
    dd 0
    dd rva kernel32_name
    dd rva kernel32_iat
    ; NULL entry - end of IDT
    dd 5 dup(0)

name_table: ; hint/name table
_ExitProcess_Name:
    dw 0
    db "ExitProcess", 0, 0

kernel32_name:
    db "KERNEL32.DLL", 0

kernel32_iat: ; import address table for KERNEL32.DLL
ExitProcess:
    dq rva _ExitProcess_Name
    dq 0    ; end of KERNEL32's IAT

New lines explained:

sub rsp, 8 * 5: The sub instruction subtracts the second operand from the first, storing the result in the first. Here, we subtract 40 from the stack pointer (rsp). Stacks grow downwards in memory, so subtraction allocates stack space. This line aligns the stack to a 16-byte boundary and allocates shadow space for the first four arguments simultaneously. Before our entry point, the stack pointer was 16-byte aligned. The call instruction pushed a return address (8 bytes), misaligning it. Subtracting another 8 bytes re-aligns it, and 32 more bytes account for shadow space, hence 40 (5 * 8).
xor rcx, rcx: The first integer argument for ExitProcess (the exit code) is passed in rcx. xor rcx, rcx efficiently sets rcx to zero (bitwise XOR with itself always results in zero). We’re setting the exit code to 0.
call [ExitProcess]: This line finally calls ExitProcess. Square brackets [] denote indirection. Instead of calling the address of the ExitProcess label directly, the value at that memory address is used as the call target. The ExitProcess label points to the import table entry where the loader has placed the actual address of the ExitProcess function from KERNEL32.DLL.
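
To make the indirection explicit, the call can be split into two steps: load the function address from the import table entry into a register, then call through the register. The sketch below is an equivalent variation for illustration, not a recommended replacement. The choice of rax as the temporary is arbitrary, and the import section is repeated unchanged so that the listing is complete.

```asm
format PE64 NX GUI 6.0
entry start

section '.text' code readable executable
start:
    int3
    sub rsp, 8 * 5          ; realign the stack and allocate shadow space
    xor rcx, rcx            ; exit code 0 in rcx
    mov rax, [ExitProcess]  ; load the 8-byte address the loader stored in the IAT entry
    call rax                ; same target as call [ExitProcess]

section '.idata' import readable writeable
idt:    ; import directory table starts here
    ; entry for KERNEL32.DLL
    dd rva kernel32_iat
    dd 0
    dd 0
    dd rva kernel32_name
    dd rva kernel32_iat
    ; NULL entry - end of IDT
    dd 5 dup(0)

name_table: ; hint/name table
_ExitProcess_Name:
    dw 0
    db "ExitProcess", 0, 0

kernel32_name:
    db "KERNEL32.DLL", 0

kernel32_iat: ; import address table for KERNEL32.DLL
ExitProcess:
    dq rva _ExitProcess_Name
    dq 0    ; end of KERNEL32's IAT
```

In the debugger, rax after the mov should hold an address inside KERNEL32.DLL, which is exactly what the loader wrote over the RVA we placed at the ExitProcess label.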

Run this in WinDbg, step through it, and observe the registers (rsp and rcx) as ExitProcess is called.

WinDbg View Stepping Through the Call to ExitProcess

That concludes this first part of the series. Next time, we’ll move beyond simple program exit and explore more interesting assembly programming concepts.

Enjoyed this post? Follow me on Bluesky for more!
