KFS Series 1 - Boot sequence primitives

Building a kernel from scratch: The bare bones

In this first step, we'll create the bare bones of our kernel. This foundational work will serve as our genesis and evolve throughout this series.

From userspace to bare metal

As programmers and engineers, we typically operate in userspace, safely protected by Ring 3. Our applications interact with kernel resources through syscalls. Process memory appears completely virtual from our perspective, and a crash in our program won't bring down the entire system that's the abstraction layer doing its job.

Kernel development is fundamentally different. We're operating in Ring 0 the highest privilege level, with unrestricted access to hardware and memory. We're writing bare metal code. There are no syscalls to rely on, no concept of processes, we're the ones who will define these abstractions. We're building the foundation that everything else will stand on. That's what makes this the genesis.

To better understand where we are in the system, let’s review the boot sequence:

During boot, the BIOS/UEFI first executes POST (Power-On Self Test) to test RAM, initialize the GPU, and configure basic hardware interrupts, then reads the first sector (512 bytes) of the boot device containing the MBR which verifies the boot signature and transfers control to the bootloader. The MBR contains the partition table and a small executable code pointing to the bootloader, which then loads the kernel into memory, and finally jumps to the kernel's entry point. And here we are. But what would still be interesting would be to understand where this bare metal binary comes from, the one that serves as the kernel. By the way, a bare metal binary is an executable program designed to run directly on hardware with no operating system involved.

Cross-compiling for x86 architecture

Cross-compiling is the process of compiling code on a host machine to produce an executable that runs on a different target platform or architecture. Basically, this is what will allow us to start the development of our kernel, since executing kernel code directly would require booting it in place of the host kernel, which is unsafe and impractical during development. We need a toolchain that generates bare-metal binaries for the target architecture, independent of the host OS, which we'll then emulate using QEMU with KVM hardware acceleration.

An interesting fact: at some point your new operating system can be developed under itself. This is a process known as bootstrapping or going self-hosted. The osdev documentation (https://wiki.osdev.org/Bare_Bones) excels at setting up the GCC Cross-compiler for i686-elf in question.

Implementing the kernel

Design the realm of memory

The kernel's own memory is laid out here its position in physical memory, its sections, and its initial stack. This is the foundation from which the kernel will later manage memory for all other programs and processes. Once paging is enabled, all memory accesses including the kernel's own are translated by the MMU. The kernel doesn't bypass virtual memory, it operates within it, using page tables that map its code and data to the underlying physical frames.

boot.s - kernel entry point

This file contains the first instructions and declarations of the kernel. It includes the multiboot header. It defines and allocates the stack by reserving 16KiB of space in the .bss section to avoid unnecessarily monopolizing space while still reserving this. You can inspect starting from line 33 a kernel panic halt mechanism, if kernel main returns, which should never happen, we disable interruptions to avoid any UB (undefined behavior). Subsequently, we halt the CPU to save energy. However, it will wake up upon receiving a Non-Maskable Interrupt. With the jmp 1b, we ensure we always return to the CPU halt instead of letting undefined instructions follow in memory.

# Declare constants for the multiboot header.
.set ALIGN,    1<<0             # align loaded modules on page boundaries
.set MEMINFO,  1<<1             # provide memory map
.set FLAGS,    ALIGN | MEMINFO  # this is the Multiboot 'flag' field
.set MAGIC,    0x1BADB002       # 'magic number' lets bootloader find the header
.set CHECKSUM, -(MAGIC + FLAGS) # checksum of above, to prove we are multiboot

.section .multiboot
.align 4
.long MAGIC
.long FLAGS
.long CHECKSUM

# stack memory layout:
# stack_top <- %esp (initial stack pointer, stack grows downwards)
# ... 16 KiB of stack space ...
# stack_bottom (bottom of stack)

.section .bss
.align 16
.global stack_bottom
stack_bottom:
.skip 16384 # 16 KiB
.global stack_top
stack_top:

.section .text
.global _start
.type _start, @function
_start:
	mov $stack_top, %esp
	call kernel_main
	cli							# Disable interrupts FLAG IF = 0
1:	hlt
	jmp 1b 						# Without the JMP, the CPU would continue to execute the following in memory → undefined behavior / crash.

.size _start, . - _start

kernel.c - your actual kernel routines

Here we start and functionally initialize the kernel. In summary, we initialize the VGA terminal by writing to its memory with an expected code convention, launch the configuration of the GDT, PIC and IDT (memory/interrupt management), launch a trivial shell (hekashell) with trivial commands such as help, dmesg (to debug and read kernel-level logs), reboot and halt. We enable interruptions and loop infinitely.

https://github.com/TheHiddenShape/HekaOS_Series/blob/main/src/kernel.c

linker.ld - for linking the above files

This file serves to combine all our object files into a single executable, to indicate where to place each section in memory, and I emphasize at this level, we're operating directly on RAM. Later when we enable paging, we'll map these physical addresses to virtual addresses. Each section is aligned on 4KiB pages which is required for memory paging later on. The kernel is located at 2M, which avoids conflicts with low memory (BIOS, VGA, etc.). This is recommended by multiboot 2; previously we were limited to 1M.

ENTRY(_start)

SECTIONS
{

	/*. -> correspond to the location counter, it's like a cursor */

	. = 0x00000800;
    .gdt :
    {
        *(.gdt)
    }

	/* Kernel at 2M, to avoid overlaps with low memory */
	. = 2M;

	/* 	BLOCK(4K): Synonym for ALIGN(4K) in this context. Ensures that the section starts at an address that is a multiple of 4K.
		ALIGN(4K): Identical. Both exist for historical compatibility. You can use either one. */

	.text BLOCK(4K) : ALIGN(4K)
	{
		*(.multiboot)
		*(.text)
	}

	/* Read-only data. */
	.rodata BLOCK(4K) : ALIGN(4K)
	{
		*(.rodata)
	}

	/* Read-write data (initialized) */
	.data BLOCK(4K) : ALIGN(4K)
	{
		*(.data)
	}

	/* Read-write data (uninitialized) and stack */
	.bss BLOCK(4K) : ALIGN(4K)
	{
		/* Collects "common" symbols uninitialized global variables declared without an extern in multiple files. In C */
		*(COMMON)
		*(.bss)
	}
}

We start gathering the essential components for the kernel skeleton. Now we need to reflect what we see in memory by writing to VGA.

Screen rendering via VGA

VGA has two modes, text mode displays predefined characters from codes stored in memory and rendered via a font ROM, while graphics mode allows direct control of each pixel by writing color values to video memory.

In this kernel, we implement VGA text mode for display output and interrupts for handling hardware events such as keyboard input. VGA text mode operates via memory-mapped I/O. The video buffer is mapped at CPU-accessible address 0xB8000, allowing the processor to write directly to video memory. The VGA controller continuously reads this buffer and renders its contents on-screen, providing real-time visual feedback. For example, writing 0x0F41 to 0xB8000 displays a white 'A' on a black background at the top-left corner. The following formula converts a screen position (col, row) into a byte offset within the VGA buffer, assuming the memory pointer is a byte pointer (y * 80 + x) * 2, where x is the column (0–79) and y the row (0–24) in the 80×25 VGA text mode grid, multiplied by 2 since each character occupies 2 bytes (ASCII code + color attribute).

Interrupts

In order to implement the keyboard entries we need to set up the interrupt handling system. Interrupts are signals that temporarily suspend the CPU's current execution to handle urgent events. They're fundamental to how kernels manage hardware and respond to external events.

Below you can find the workflow of an IRQ. As you can observe, we need to implement both a GDT and an IDT. It is the kernel programmer's responsibility to define and load these structures into memory.

When IF=1, the CPU accepts hardware interrupts normally. When IF=0, it masks them they stay pending in the PIC and get serviced as soon as IF goes back to 1.

When an interrupt fires, the PIC sends a vector number to the CPU. The CPU uses it as an index into the IDT, where it finds a segment selector and an offset. It then looks up that selector in the GDT to verify it's a valid ring-0 code segment, and finally jumps to the handler at the given offset.

8042 Keyboard Controller

On x86 architecture, the PS/2 keyboard communicates with the 8042 controller (also called the keyboard controller, located in the above diagram between the hardware peripheral and the IRQ request), which is mapped to the following I/O ports:

port 0x60 (Data port (i/o buffer)): Read to retrieve the scan code that triggered the interrupt (inb(0x60)). Can also be written to send commands to the keyboard (e.g., toggling LEDs, changing scan code set).
port 0x64 (Status/command port): Read to check controller status: bit 0 indicates data is available in 0x60, bit 1 indicates the controller is ready for a command. Write to send controller commands (disable/enable keyboard, self-test).

When a scan code arrives, the 8042 controller: Stores it in its internal buffer Raises the IRQ1 line to signal the CPU that data is available. This buffer can only hold one scan code at a time. If a new scan code arrives before the CPU has read the previous one, data loss may occur (keyboard overrun).

8259A PIC

The 8259A PIC (Programmable Interrupt Controller) acts as a bridge between hardware IRQs and CPU interrupts, IRQ1 (keyboard) is mapped to interrupt vector 0x21 (assuming the PIC is programmed with offset 0x20 for master IRQ base).

The GDT and IDT

The GDT defines memory segments and privilege levels. The CPU needs it to resolve the segment selector stored in each IDT entry and to perform privilege transitions (ring 3 → ring 0) when an interrupt fires.

The IDT maps each interrupt vector to a handler. When vector 0x21 fires, the CPU looks up entry 0x21 in the IDT, retrieves the handler address and segment selector, consults the GDT to validate the segment, and jumps to the ISR.

Inside the interrupt service routine (ISR)

Once the CPU reaches the ISR wrapper, it begins by saving the execution context, the processor automatically pushes EIP, CS, and EFLAGS onto the stack (along with ESP and SS if a privilege level change occurs), and the wrapper then saves the remaining general-purpose registers using pusha. Next, the interrupt is handled by reading the scan code from port 0x60 via inb, translating it into a usable keycode, and updating the kernel's internal key buffer. After processing, an End of Interrupt (EOI) signal must be sent by writing 0x20 to port 0x20, acknowledging the interrupt to the PIC, without this step, no further IRQs will be delivered. Finally, the saved registers are restored and iret is executed to resume the interrupted code exactly where it left off.

GDT & IDT Implementation

The GDT defines segments: contiguous memory regions characterized by their base address, size, and access permissions. While the GDT describes memory segments (In reality, in modern systems, the GDT is primarily a privilege table it's what lets the CPU verify that the code it's about to execute has the right ring level to do so), the IDT (Interrupt Descriptor Table) serves a different but equally critical role, it tells the processor what to do when an interrupt or exception occurs. They share a similar packed structure and loading mechanism, but serve fundamentally different purposes.

To fully grasp how the GDT works, we first need to understand what defines an x-bit architecture matters because the GDT encodes precisely "which world" your CPU operates in. Without understanding what "x-bit" implies about memory addressing, you can't understand why segmentation exists, and therefore why the GDT exists. It's the difference between memorizing a structure and understanding its reason for being.

We will explore physical realities register width, bus capacity, and cycle mechanics which will provide useful background before exploring how the GDT was designed to work around them.

Let's dissect what defines an X-bit architecture

When we refer to a 16-bit, 32-bit, or 64-bit architecture, we're describing the width of the processor's fundamental components. At the heart of this are the registers small storage units inside the CPU that hold data being actively processed. A 16-bit architecture has 16-bit registers (2 bytes), a 32-bit architecture has 32-bit registers (4 bytes), and this width directly impacts how much data a single instruction can manipulate. But registers don't operate in isolation the CPU must communicate with RAM to fetch instructions and read or write data. This communication happens through buses, which are physical bundles of wires (traces etched into silicon and motherboard) where each wire carries one bit of information as an electrical signal, either high (1) or low (0). The data bus determines how many bits travel between CPU and memory in one transfer: 32 wires means 32 bits moved simultaneously. The address bus determines how many unique memory locations the processor can specify 32 address wires can express 2³² distinct addresses, which gives us the 4GB ceiling of 32-bit systems. These transfers occur during memory cycles, the fundamental rhythm of CPU-RAM communication, the processor places an address on the address bus, signals a read or write intent, and exchanges data via the data bus. The width of that data bus defines the memory word the chunk of data moved per cycle which is why wider architectures achieve higher throughput with fewer cycles.

Now that we've covered this concept, I'd like us to look at two operating modes of the x86 processor that are fundamentally different in how they handle memory.

Legacy model (Real Mode / Segmented model) vs Protected Flat Model

Here we are primarily contrasting the historical segmented memory model with modern linear addressing.

Legacy model

Real Mode has no memory protection, no GDT. It relies on segmentation using the following formula: physical address = (segment × 16) + offset. This calculation yields a 20-bit address space, which is perfectly aligned with the address bus of the 8086, which had exactly 20 pins. The 8086 had a 20-pin physical address bus, so it could address 2^20 = 1 MB. However, its internal registers were all 16-bit, which only gives 2^16 = 64 KB of direct addressing. The 4-bit shift (× 16) is the trick to "stretch" two 16-bit registers and reach those 20 address bits.

Protected Flat Model

The flat model provides a single contiguous address space ranging from 0x00000000 to 0xFFFFFFFF (base=0, limit=4GB). From the programmer's perspective, memory appears as a single continuous linear space, as if segmentation did not exist at all since the base is 0, the linear address simply equals the offset linear address = base (0) + offset = offset. Under the hood, Protected Mode enforces the use of the GDT, which is mandatory. Segment registers become selectors that point into the GDT/LDT, and these segment selectors point to descriptors within the GDT. This mechanism is what enables privilege separation between kernel and user space, enforced by the DPL (Descriptor Privilege Level), as well as code vs data segment separation the CPU will refuse to execute a data-flagged segment.

Loading the tables into GDTR and IDTR

Below are two assembly stubs for loading the GDT and IDT into memory. The inline comments fully document their purpose and usage.

gdt_flush.s

.global gdt_flush
.extern gp

gdt_flush:
    mov 4(%esp), %eax   # Get the pointer to the GDT, passed as a parameter
    lgdt (%eax)         # Load the new GDT pointer into GDTR

# segment register purpose reminder:
#
# CS: fetch instructions
# DS: data memory access
# SS: stack access
# FS/GS: explicit override

# Flat model, a single contiguous address space (0x00000000 - 0xFFFFFFFF (base=0, limit=4GB)).
#
# Kernel vs User space (privilege separation enforced by DPL)
# Code vs Data segment separation, the CPU refuses to execute a data-flagged segment.

    mov $0x10, %ax      # 0x10 is the offset in the GDT to our data segment
    mov %ax, %ds        # Load Data Segment
    mov %ax, %es        # Load Extra Segment
    mov %ax, %fs        # Load Additional segment 1 (free use)
    mov %ax, %gs        # Load Additional segment 2 (free use)
    mov %ax, %ss       # Load Stack segment kernel space, todo: in user space: 0x30

    # 0x08 is the offset to our code segment: Far jump!
    # load simultaneously cs = 0x08 and eip = addr flush2
    ljmp $0x08, $flush2

flush2:
    ret

idt_load.s


.global idt_load

idt_load:
    mov 4(%esp), %eax   # Get the pointer to the IDT, passed as a parameter
    lidt (%eax)         # Load the new IDT pointer into IDTR
    ret

The following diagram is intended to show how the GDT, IDT, and Segment Registers relate to each other, along with RAM and Kernel space.

The GDT is mainly used by the processor when accessing the IDT, because IDT entries don't contain full segment information. Instead, each IDT entry contains a segment selector, which is itself a reference to the GDT. So the CPU combines the segment base from the GDT with the offset from the IDT to calculate the final linear address of our interrupt handler.

By default, the Programmable Interrupt Controller (PIC) maps hardware IRQ lines to interrupt vectors 0–15, a design inherited from early x86 real-mode systems. However, in protected mode, the CPU reserves vectors 0–31 for CPU exceptions (divide by zero, page faults, general protection faults, etc.), creating a direct conflict where the CPU cannot distinguish whether a given vector represents a hardware IRQ or a CPU exception, making PIC remapping mandatory for proper interrupt handling.

In the following section, we will explore the mechanisms and concepts that operate at distinct levels of the interrupt handling chain, as well as their configuration.

Interrupt handling through the 8259 PIC

8259 PIC Reminder

The 8259A is the chip sitting between hardware peripherals and the CPU. It receives interrupt requests on its IR0–IR7 lines, resolves priority via three registers and signals the CPU through its INT output.

IRR (pending)
ISR (in-service)
IMR (masked)

How the PIC handles an incoming IRQ

The 8259A relies on three internal registers, the IRR (Interrupt Request Register) records which lines are currently asserted, the ISR (In-Service Register) tracks which interrupt the CPU is currently handling, and the IMR (Interrupt Mask Register) lets the kernel selectively disable specific lines via the data port.

The sequence: a device asserts its line (IRR bit set) → PIC checks priority against ISR and IMR → if unmasked and higher priority, asserts INT → CPU acknowledges, PIC places the vector (base offset + IRQ number) on the data bus → IRR bit clears, ISR bit sets → CPU resolves the vector through IDT/GDT and jumps to the handler → handler sends EOI → ISR bit clears, PIC is ready for the next interrupt.

Master & Slave, a cascaded architecture

A single 8259A only provides 8 IRQ lines. The IBM PC/AT extends this by cascading two chips: a Master covering IRQ0–IRQ7 (low-level hardware like the timer and keyboard) and a Slave covering IRQ8–IRQ15 (secondary hardware like the RTC, ATA disks, PS/2 mouse). The master is the only chip directly connected to the CPU's INTR pin the slave cannot signal the CPU on its own. Instead, the slave's INT output is wired into IRQ2 of the master, which acts as a relay. This cascade link consumes IRQ2, leaving 15 usable lines out of 16.

The CPU communicates with each PIC through two I/O ports, a command port for sending control instructions (initialization, EOI), and a data port for configuring behavior (masking IRQ lines, setting vector offsets). The master uses ports 0x20 (command) and 0x21 (data), while the slave uses 0xA0 and 0xA1. These port numbers are hardware-defined they correspond to the fixed I/O addresses at which each chip is mapped on the ISA bus.

Because the slave routes through the master, interrupt delivery differs depending on which chip owns the IRQ. A master IRQ goes straight to the CPU the ISR sends a single EOI to 0x20. A slave IRQ first traverses the cascade (slave → master IRQ2 → CPU), so the ISR must send two EOIs, first to the slave (0xA0), then to the master (0x20), to clear the in-service flag on both chips.

Why remapping is not optional

The 8259A was designed for 8086 real mode, where IBM mapped the master PIC to vectors 0x00–0x07 and the slave to 0x08–0x0F. When Intel introduced protected mode (80286/80386), vectors 0–31 were reserved for CPU exceptions, vector 0 for Divide Error, 8 for Double Fault, 13 for General Protection Fault, 14 for Page Fault, etc. But the PIC's default mapping remained unchanged.

The collision is not an edge case it is guaranteed. The timer (IRQ0) fires vector 0x00 (also Divide Error) dozens of times per second. The keyboard (IRQ1) fires vector 0x01 (also Debug exception). The CPU cannot distinguish a Double Fault from a timer tick. Without remapping, the system is immediately unstable.

The fix: reprogram the PIC's base offset at initialization via the ICW1–ICW4 protocol (outb writes to command and data ports). The convention is to remap master to vectors 0x20–0x27 and slave to 0x28–0x2F, safely above the exception range. After remapping, IRQ0 → vector 0x20, IRQ1 → vector 0x21 (the keyboard entry in our IDT), and hardware interrupts are cleanly separated from CPU exceptions.

Booting the kernel

To boot our kernel, we will use QEMU, which is an open-source emulator and virtualizer. The role of QEMU is to emulate hardware devices and provide the userspace for virtualization. We will use virtualization mode to leverage KVM, which uses hardware acceleration to execute native code directly on the CPU. This is where KVM comes into play.

What is KVM? KVM is a Linux kernel module that turns Linux into a hypervisor. It utilizes the hardware virtualization extensions of processors (Intel VT-x / AMD-V). KVM manages the execution of guest code directly on the physical CPU, intercepts sensitive instructions (privileged memory access, I/O), and provides hardware acceleration through /dev/kvm.

References

PreviousKFS Series NextKFS Series 2 - Memory

Last updated 2 days ago

hashtagBuilding a kernel from scratch: The bare bones

hashtagFrom userspace to bare metal

hashtagCross-compiling for x86 architecture

hashtagImplementing the kernel

hashtagDesign the realm of memory

hashtagScreen rendering via VGA

hashtagInterrupts

hashtag8042 Keyboard Controller

hashtag8259A PIC

hashtagThe GDT and IDT

hashtagInside the interrupt service routine (ISR)

hashtagGDT & IDT Implementation

hashtagLoading the tables into GDTR and IDTR

hashtagInterrupt handling through the 8259 PIC

hashtag8259 PIC Reminder

hashtagHow the PIC handles an incoming IRQ

hashtagMaster & Slave, a cascaded architecture

hashtagWhy remapping is not optional

hashtagBooting the kernel

hashtagReferences