
I still remember staring at my TM4C123GH6PM’s LEDs one evening, waiting for a simple “heartbeat” blink that never came. I thought I’d done everything right: set up SysTick, pended PendSV, and initialized my Thread Control Blocks (TCBs). But instead of a steady blink, the LEDs twitched once, then froze—mocking me. That moment crystallized what building a tiny RTOS really means: wrestling with hardware quirks, chasing elusive bugs, and patching together just enough code to make everything run smoothly. In this article, I’ll walk you through my journey of crafting a minimal, priority-based scheduler on an ARM Cortex-M4—warts, weird hacks, and late-night “aha!” moments included.

A Bit of Backstory

About a year ago, my embedded-systems class assigned us to build a simple RTOS from scratch on a TM4C123GH6PM (ARM Cortex-M4). I’d been using FreeRTOS before, but I never fully understood what went on behind the scenes. So I decided to strip it down: no fancy features, just threads, priorities, basic semaphores, and a shell over UART to peek under the hood.

Why?

I wanted to see exactly how the CPU switches from one thread to another.
I needed to learn why low-priority tasks can inadvertently starve higher-priority ones (hello, priority inversion).
I craved real debugging stories—like that time I spent half a day wondering why PendSV never ran (turns out I’d given it the same priority as SysTick).

Spoiler: My RTOS wasn’t perfect. But in chasing its flaws, I learned more about ARM internals than from any textbook.

Diving into the Cortex-M4 World

Before writing a single line of code, I had to wrap my head around how Cortex-M4 handles interrupts and context switching. Here’s what bit (pun intended) me early on:

PendSV’s “Gentle” Role
- PendSV is designed to be the lowest-priority exception, meaning it only runs after all other interrupt handlers finish. My first mistake was setting PendSV’s priority to 0 (highest). That is backwards: at priority 0, PendSV outranks every other configurable interrupt rather than waiting for them, so a context switch can cut in while another handler is still mid-flight. Once I moved it to 0xFF (the lowest urgency), scheduling finally kicked in.
- Note to self (and you): write down your NVIC priorities on a Post-it. It’s easy to confuse “0 is highest” with “0 is lowest.”
SysTick as the Heartbeat
- I aimed for a 1 ms tick. On a 16 MHz clock, that meant LOAD = 16 000 – 1.
- I initially tried to do all scheduling decisions inside the SysTick ISR, but that got messy. Instead, I now just decrement sleep counters there and set the PendSV pending bit. Let the “real” context switch happen in PendSV.
Exception Stack Frame
- When any exception fires, hardware auto-pushes R0–R3, R12, LR, PC, and xPSR. That means my “fake” initial stack for each thread must match this exact layout—else, on the very first run, the CPU will attempt to pop garbage and crash.
- I once forgot to set the Thumb bit (0x01000000) in xPSR. The result was an immediate hard fault. Lesson: that Thumb flag is non-negotiable.
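
The 1 ms tick arithmetic above generalizes to any clock and tick rate. Here is a minimal sketch; the helper name systick_reload_value is hypothetical and not part of the code in this article. It computes the LOAD value and rejects requests that do not fit SysTick's 24-bit counter:

```c
#include <assert.h>
#include <stdint.h>

#define SYSTICK_MAX_RELOAD 0x00FFFFFFu  /* SysTick LOAD is a 24-bit field */

/* Hypothetical helper: reload value for a given core clock and tick rate.
 * Returns 0 when the request cannot be represented. */
uint32_t systick_reload_value(uint32_t clock_hz, uint32_t tick_hz) {
    assert(tick_hz > 0);
    uint32_t cycles = clock_hz / tick_hz;   /* core cycles per tick */
    if (cycles < 2u || cycles - 1u > SYSTICK_MAX_RELOAD) {
        return 0;                           /* too fast or too slow for the counter */
    }
    return cycles - 1u;  /* the counter runs from LOAD down to 0, inclusive */
}
```

For the 16 MHz clock and 1 kHz tick used here, this yields 15999, matching the LOAD value above. A 1 Hz tick at 80 MHz would be rejected because it needs more than 24 bits.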

Crafting the Thread Control Block (TCB)

Every thread in my RTOS holds:

#include <stdint.h>   // uint8_t, uint32_t

typedef enum { READY, RUNNING, BLOCKED, SLEEPING } state_t;

typedef struct {
    uint32_t *stack_ptr;   // Saved PSP for context switches
    uint8_t   priority;    // 0 = highest, larger = lower priority
    state_t   state;       // READY, RUNNING, BLOCKED, or SLEEPING
    uint32_t  sleep_ticks; // How many SysTick ticks remain, if sleeping
} tcb_t;

In practice, I declared:

#define MAX_THREADS 9
#define STACK_WORDS 256

uint32_t    thread_stacks[MAX_THREADS][STACK_WORDS] __attribute__((aligned(8))); // AAPCS expects 8-byte-aligned stacks
tcb_t       tcbs[MAX_THREADS];
uint8_t     thread_count = 0;
int         current_thread = -1;

A Stack-Overflow Horror Story

When I first assigned STACK_WORDS = 128, everything seemed fine—until my “worker” thread, which did a few nested function calls, started corrupting memory. The LED would blink twice, then vanish. By filling each stack with 0xDEADBEEF at startup and checking how far it got overwritten, I discovered that 128 words weren’t enough under optimized build flags. Bumping to 256 words fixed it.
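
The fill-and-inspect trick described above can be sketched in a few lines. The function names below are hypothetical; only the 0xDEADBEEF marker comes from my setup. The sketch assumes a full-descending stack, so untouched words sit at the low-address end of the array:

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xDEADBEEFu

/* Paint the whole stack with the marker before the thread first runs. */
void stack_paint(uint32_t *stack, size_t words) {
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Count marker words still intact at the low-address end.
 * The stack grows downward, so this is the remaining headroom. */
size_t stack_unused_words(const uint32_t *stack, size_t words) {
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

A result close to zero means the thread has nearly run out of stack. One caveat: a thread that happens to store the value 0xDEADBEEF itself can make the headroom look slightly larger than it is.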

Spawning a New Thread

Creating a thread meant carving out its stack and simulating the hardware stacking that happens on exception entry. Here’s the routine, with comments on my early pitfalls:

void rtos_create_thread(void (*fn)(void), uint8_t prio) {
    if (thread_count >= MAX_THREADS) {
        return;                       // No free TCB slot; ignore the request
    }
    int id = thread_count++;
    tcb_t *t = &tcbs[id];
    // Start one past the last word: the stack is full-descending, so the
    // first pre-decrement store lands on the top word and stays 8-byte aligned.
    uint32_t *stk = &thread_stacks[id][STACK_WORDS];

    // Simulate hardware stacking (xPSR, PC, LR, R12, R3, R2, R1, R0)
    *(--stk) = 0x01000000;            // xPSR: Thumb bit set
    *(--stk) = (uint32_t)fn;          // PC → thread entry point
    *(--stk) = 0xFFFFFFFD;            // LR slot → becomes the thread's own LR, so fn must never return
    *(--stk) = 0x12121212;            // R12 (just a marker)
    *(--stk) = 0x03030303;            // R3
    *(--stk) = 0x02020202;            // R2
    *(--stk) = 0x01010101;            // R1
    *(--stk) = 0x00000000;            // R0

    // Save space for R4–R11 (popped by the context switch).
    // Written top-down, so R11 goes first and R4 ends up at the lowest address.
    for (int r = 11; r >= 4; r--) {
        *(--stk) = 0x0; // or use a pattern if you want to measure usage
    }

    t->stack_ptr   = stk;
    t->priority    = prio;
    t->state       = READY;
    t->sleep_ticks = 0;
}

Pitfalls to Watch For

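Thumb Bit in xPSR. Cortex-M cores only execute Thumb instructions, so a cleared T bit makes the very first instruction fault (a UsageFault, which escalates to a HardFault unless UsageFault is enabled).
The Magic 0xFFFFFFFD. As an EXC_RETURN value in the handler’s LR, this tells the CPU, “On exception return, use PSP and go to Thread mode.” The LR slot inside the stacked frame is a different thing: it becomes the thread’s own LR, which only matters if the thread function ever returns. I recall looking up the ARM ARM (Architecture Reference Manual) at least three times to get this right.
Stack Order. The R4–R11 block must match what the handler’s STMDB/LDMIA expect: R4 at the lowest address, R11 at the highest. When you build it by hand with descending pushes, that means writing R11 first and R4 last; get it backwards and every callee-saved register lands in the wrong slot.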
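
To make the layout in these pitfalls concrete, here is a sketch that builds the same 16-word initial frame by index instead of by successive pushes. The names frame_init and exit_addr are hypothetical; exit_addr stands in for whatever address you want a returning thread to land on. Addresses are passed as plain 32-bit values so the sketch also compiles on a desktop machine:

```c
#include <stdint.h>

#define FRAME_WORDS 16u          /* 8 software-saved + 8 hardware-stacked words */
#define XPSR_THUMB  0x01000000u  /* T bit in xPSR */

/* Word offsets from the saved stack pointer, lowest address first. */
enum {
    IDX_R4 = 0, IDX_R11 = 7,     /* software-saved block (STMDB/LDMIA order) */
    IDX_R0 = 8, IDX_R12 = 12,    /* hardware-stacked block starts at R0 */
    IDX_LR = 13, IDX_PC = 14, IDX_XPSR = 15
};

/* Build an initial frame just below stack_top and return the pointer
 * that would be stored in the TCB as the saved PSP. */
uint32_t *frame_init(uint32_t *stack_top, uint32_t entry, uint32_t exit_addr) {
    uint32_t *sp = stack_top - FRAME_WORDS;
    for (unsigned i = 0; i < FRAME_WORDS; i++) {
        sp[i] = 0;               /* R0-R12 start out as zero */
    }
    sp[IDX_LR]   = exit_addr;    /* where the thread goes if its function returns */
    sp[IDX_PC]   = entry;        /* first instruction the thread executes */
    sp[IDX_XPSR] = XPSR_THUMB;   /* Thumb state is mandatory on Cortex-M */
    return sp;
}
```

Indexing makes the order explicit: xPSR sits at the highest address and R4 at the lowest, which is exactly what the push sequence produces.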

Letting Time Pass: SysTick Handler

Here’s my final SysTick ISR, trimmed to essentials:

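void SysTick_Handler(void) {
    for (int i = 0; i < thread_count; i++) {
        if (tcbs[i].state == SLEEPING) {
            // Check for 0 first so the unsigned counter can never wrap around
            if (tcbs[i].sleep_ticks == 0 || --tcbs[i].sleep_ticks == 0) {
                tcbs[i].state = READY;
            }
        }
    }
    // Plain write: ICSR set bits are write-1-to-set, so no read-modify-write is needed
    SCB->ICSR = SCB_ICSR_PENDSVSET_Msk; // Pend PendSV for scheduling
}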

A couple of notes:

In an early version, I tried calling rtos_schedule() right inside SysTick. That led to nested interrupts and stack confusion. Now I just pend PendSV and let it handle the heavy lifting.
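I discovered that if SysTick_IRQn and PendSV_IRQn share the same priority, PendSV sometimes never ran in my setup. Always give PendSV the lowest urgency, which is the largest numeric value (i.e., NVIC_SetPriority(PendSV_IRQn, 0xFF)), and keep SysTick more urgent (e.g., 2 or 3). The TM4C123 implements only three priority bits, so the usable levels are 0 to 7; CMSIS masks a larger argument such as 0xFF down to level 7.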
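
The priority rules in these notes are easy to get backwards, so here is a small model of them. It assumes three implemented priority bits (as on the TM4C123) and ignores priority grouping and subpriorities; all function names are hypothetical. It models a raw write to an 8-bit priority field, not the CMSIS call itself:

```c
#include <stdint.h>

#define PRIO_BITS 3u   /* assumption: 3 implemented priority bits */

/* Value an 8-bit priority field holds after writing `raw`:
 * the unimplemented low bits always read back as zero. */
uint8_t prio_field_readback(uint8_t raw) {
    return (uint8_t)(raw & (uint8_t)(0xFFu << (8u - PRIO_BITS)));
}

/* Priority level encoded in a field value (0 = most urgent). */
uint8_t prio_level(uint8_t field) {
    return (uint8_t)(field >> (8u - PRIO_BITS));
}

/* A pending exception preempts an active handler only when its
 * level is strictly lower in number. Equal levels never preempt. */
int can_preempt(uint8_t pending_field, uint8_t active_field) {
    return prio_level(pending_field) < prio_level(active_field);
}
```

With this model, a PendSV field of 0xFF reads back as 0xE0 (level 7), and nothing at level 7 can preempt a handler that is already running at level 7 or better.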

The Big Switch: PendSV Handler

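When PendSV finally fires, it does the actual context switch. My implementation in GCC inline assembly (unified ARM/Thumb syntax) looks like this. Because current_thread is an index rather than a pointer, a small C helper records the outgoing PSP in the TCB and looks up the incoming one:

void rtos_schedule(void);   // Defined in the scheduler section

// Called from PendSV with the outgoing thread's saved PSP;
// returns the saved PSP of the thread the scheduler picked.
uint32_t *rtos_switch_context(uint32_t *saved_sp) {
    if (current_thread >= 0) {
        tcbs[current_thread].stack_ptr = saved_sp;
    }
    rtos_schedule();
    return tcbs[current_thread].stack_ptr;
}

__attribute__((naked)) void PendSV_Handler(void) {
    __asm volatile(
        "MRS   R0, PSP                \n" // Get current PSP
        "STMDB R0!, {R4-R11}          \n" // Push R4–R11 onto the thread stack
        "STMDB SP!, {R3, LR}          \n" // Keep EXC_RETURN across the call (R3 pads to 8 bytes)
        "BL    rtos_switch_context    \n" // R0 in: saved PSP, R0 out: next thread's PSP
        "LDMIA SP!, {R3, LR}          \n" // Restore EXC_RETURN into LR
        "LDMIA R0!, {R4-R11}          \n" // Pop R4–R11 from the next thread's stack
        "MSR   PSP, R0                \n" // Update PSP to new thread
        "BX    LR                     \n" // Exit exception, restore R0–R3, R12, LR, PC, xPSR
    );
}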

How I Tracked Down That Pesky Stack Offset

At first, my “save/restore” code left the saved stack pointer a few bytes away from where the restore path expected it. The symptom? Threads would run, then somehow come back to life in the wrong place: garbled instructions, random jumps. I added a few GPIO toggles (flipping an LED pin just before and after STMDB) to confirm the handler ran when I thought it did. In my debugger, I then compared the numerical PSP value to the tcbs[].stack_ptr I expected. R4–R11 is eight registers, so the software-saved block should be exactly 32 bytes. Sure enough, I’d accidentally used STMDB R0!, {R4-R11, R12} instead of {R4-R11}, pushing a ninth register (36 bytes in total) that the matching LDMIA never popped. Removing that extra register fixed it.

Picking the Next Thread: Scheduler Logic

My scheduler is intentionally simple. It scans all TCBs to find the highest-priority READY thread, with a tiny round-robin tweak for threads of equal priority:

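void rtos_schedule(void) {
    int best_prio = 256;   // Worse than any valid priority (0–255)

    // Pass 1: find the best (numerically lowest) priority among READY threads
    for (int i = 0; i < thread_count; i++) {
        if (tcbs[i].state == READY && tcbs[i].priority < best_prio) {
            best_prio = tcbs[i].priority;
        }
    }
    if (best_prio == 256) {
        // No READY threads: fall back to idle thread (ID 0)
        current_thread = 0;
        return;
    }

    // Pass 2: simple round-robin. Start just after the current thread and
    // take the first READY thread at that priority, wrapping around.
    for (int k = 1; k <= thread_count; k++) {
        int i = (current_thread + k) % thread_count;
        if (tcbs[i].state == READY && tcbs[i].priority == best_prio) {
            current_thread = i;
            return;
        }
    }
}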

What I Learned Here:

If two threads share priority 1 but I always pick the lower ID, the other one would never get CPU time. By favoring a thread that comes after current_thread in the table, the “twin” finally gets its turn.
The idle thread (ID 0) is a special case: always READY, always lowest priority (priority = 255), so that if nothing else is runnable, it just spins (or calls __WFI() to save power).

Kernel Primitives: Yield, Sleep, and Semaphores

After threading basics, the next step was making threads interact properly.

Yield

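void rtos_yield(void) {
    SCB->ICSR = SCB_ICSR_PENDSVSET_Msk; // Write-1-to-set, so a plain write is enough
    __DSB();                            // Make sure the write has completed...
    __ISB();                            // ...and PendSV is taken before we continue
}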

I like sprinkling rtos_yield() at the end of long loops so other threads get a fair shot. In early tests, omitting rtos_yield() meant some tasks hogged the CPU under certain priority configurations.

Sleep

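void rtos_sleep(uint32_t ticks) {
    if (ticks == 0) {        // Nothing to wait for: just give up the CPU
        rtos_yield();
        return;
    }
    tcb_t *self = &tcbs[current_thread];
    __disable_irq();         // Keep SysTick from seeing a half-updated TCB
    self->sleep_ticks = ticks;
    self->state       = SLEEPING;
    __enable_irq();
    rtos_yield();
}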

My “LED blinking” thread calls rtos_sleep(500). When I watch it blink every half-second, I know SysTick and PendSV are doing their jobs correctly.

Semaphores

Initially, I tried a naive approach:

typedef struct {
    volatile int count;
    int waiting_queue[MAX_THREADS];   // Circular buffer of blocked thread IDs
    int head, tail;
} semaphore_t;

void rtos_sem_wait(semaphore_t *sem) {
    __disable_irq();
    sem->count--;
    if (sem->count < 0) {
        sem->waiting_queue[sem->tail] = current_thread;
        sem->tail = (sem->tail + 1) % MAX_THREADS;   // Wrap instead of running off the array
        tcbs[current_thread].state = BLOCKED;
        __enable_irq();
        rtos_yield();
    } else {
        __enable_irq();
    }
}

void rtos_sem_post(semaphore_t *sem) {
    __disable_irq();
    sem->count++;
    if (sem->count <= 0) {
        int tid = sem->waiting_queue[sem->head];
        sem->head = (sem->head + 1) % MAX_THREADS;   // Same wrap on the consumer side
        tcbs[tid].state = READY;
    }
    __enable_irq();
}
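
The waiting queue in this semaphore is just a FIFO of thread IDs. A standalone sketch of that FIFO as a bounded ring buffer is shown below. The waitq_ names are hypothetical, and WAITQ_CAP is assumed to equal MAX_THREADS. The sketch leaves out interrupt masking, so real kernel code would still wrap each call in a critical section:

```c
#define WAITQ_CAP 9   /* assumption: same as MAX_THREADS */

typedef struct {
    int items[WAITQ_CAP];
    int head;    /* index of the oldest waiter */
    int tail;    /* index where the next waiter is stored */
    int count;   /* number of waiters currently queued */
} waitq_t;

/* Append a thread ID; returns 0 on success, -1 if the queue is full. */
int waitq_push(waitq_t *q, int tid) {
    if (q->count == WAITQ_CAP) {
        return -1;
    }
    q->items[q->tail] = tid;
    q->tail = (q->tail + 1) % WAITQ_CAP;   /* wrap around the array */
    q->count++;
    return 0;
}

/* Remove and return the oldest thread ID, or -1 if the queue is empty. */
int waitq_pop(waitq_t *q) {
    if (q->count == 0) {
        return -1;
    }
    int tid = q->items[q->head];
    q->head = (q->head + 1) % WAITQ_CAP;
    q->count--;
    return tid;
}
```

Tracking count explicitly keeps the full and empty cases distinct, which head == tail alone cannot do.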

Priority Inversion Nightmare

One day, I had three threads:

T0 (Priority 3): holds the semaphore.
T1 (Priority 1): waiting on that semaphore.
T2 (Priority 2): ready to run and higher than T0 but lower than T1.

Because T1 was blocked, T2 kept running—never giving T0 a chance to release the semaphore for T1. T1 starved. My quick hack was to temporarily boost T0’s priority when T1 blocked—it’s a rudimentary priority inheritance. A full solution would track which thread holds the semaphore and lift its priority automatically. That’s an exercise I left for later.
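
The "track the holder and lift its priority" idea can be sketched as follows. This is not the hack I actually shipped: it assumes a mutex-style lock with a single owner (a counting semaphore has no owner to boost), and every name in it is hypothetical. Lower numbers mean higher priority, as elsewhere in this article:

```c
#include <stdint.h>

#define NO_OWNER (-1)

typedef struct {
    uint8_t priority;       /* effective priority used by the scheduler */
    uint8_t base_priority;  /* priority assigned at creation */
} pi_thread_t;

typedef struct {
    int owner;              /* thread ID of the holder, or NO_OWNER */
} pi_mutex_t;

/* Try to take the lock. Returns 1 if acquired, 0 if the caller must block.
 * When blocking, a more urgent caller lends its priority to the owner. */
int pi_mutex_lock(pi_mutex_t *m, pi_thread_t *threads, int self) {
    if (m->owner == NO_OWNER) {
        m->owner = self;
        return 1;
    }
    pi_thread_t *owner = &threads[m->owner];
    if (threads[self].priority < owner->priority) {
        owner->priority = threads[self].priority;   /* inherit */
    }
    return 0;
}

/* Release the lock and drop back to the base priority. */
void pi_mutex_unlock(pi_mutex_t *m, pi_thread_t *threads, int self) {
    if (m->owner != self) {
        return;                                     /* only the owner may unlock */
    }
    threads[self].priority = threads[self].base_priority;
    m->owner = NO_OWNER;
}
```

While the low-priority holder runs at the borrowed priority, a medium-priority thread can no longer preempt it, so the holder reaches its unlock and the high-priority waiter proceeds. A complete version would also wake the waiter and handle a thread holding several locks.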

Kicking Everything Off: `rtos_start()`

In main(), after basic hardware init (clocks, GPIO for LEDs, UART for the shell), I did:

// 1. Initialize SysTick and PendSV priorities
systick_init(16000);        // 1 ms tick on a 16 MHz system
NVIC_SetPriority(PendSV_IRQn, 0xFF);

// 2. Create threads (ID 0 = idle thread)
rtos_create_thread(idle_thread, 255);
rtos_create_thread(shell_thread, 3);
rtos_create_thread(worker1, 1);
rtos_create_thread(worker2, 2);

// 3. Switch Thread mode to PSP (still privileged); main() continues on thread 0's stack
current_thread = 0;
__set_PSP((uint32_t)tcbs[0].stack_ptr);
__set_CONTROL(0x02); // SPSEL = 1 (use PSP), nPRIV = 0 (privileged, needed to write SCB->ICSR)
__ISB();

// 4. Pend PendSV to start first context switch
SCB->ICSR = SCB_ICSR_PENDSVSET_Msk;   // Write-1-to-set, so a plain write is enough

// 5. Idle loop
while (1) {
    __WFI();  // Save power until next interrupt
}

A couple final notes:

Idle Thread (ID 0): It simply toggles an LED at low priority. If something goes wrong, I know I at least got to the idle thread.
UART Interrupt Priority: UART’s TX interrupt needed a higher priority than PendSV; otherwise, long printf calls would get interrupted mid-transmit, mangling the output.
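
The value passed to __set_CONTROL() in step 3 packs two independent flags. A small decoding sketch makes it easier to check what a given value selects. The macro and function names are my own; the bit positions are the architectural ones:

```c
#include <stdint.h>

#define CONTROL_NPRIV (1u << 0)  /* 1 = Thread mode runs unprivileged */
#define CONTROL_SPSEL (1u << 1)  /* 1 = Thread mode uses PSP instead of MSP */

/* Nonzero when Thread mode would run on the process stack pointer. */
int control_uses_psp(uint32_t control) {
    return (control & CONTROL_SPSEL) != 0;
}

/* Nonzero when Thread mode would be unprivileged. */
int control_is_unprivileged(uint32_t control) {
    return (control & CONTROL_NPRIV) != 0;
}
```

So 0x02 means "PSP, privileged" and 0x03 means "PSP, unprivileged". Unprivileged threads cannot write SCB->ICSR or mask interrupts, so a kernel that runs threads at 0x03 needs an SVC-based path for yield, sleep, and the semaphore calls.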

The UART Shell: Peeking Under the Hood

I built a tiny shell so I could type commands over UART and inspect thread states:

#include <stdlib.h>   // atoi
#include <string.h>   // strncmp

void shell_thread(void) {
    char buf[64];
    while (1) {
        uart_print("> ");
        uart_read_line(buf, sizeof(buf));

        if (strncmp(buf, "ps", 2) == 0) {
            for (int i = 0; i < thread_count; i++) {
                uart_printf("TID %d: state=%d prio=%d\n",
                    i, tcbs[i].state, tcbs[i].priority);
            }
        } else if (strncmp(buf, "sleep ", 6) == 0) {
            int msec = atoi(&buf[6]);
            if (msec > 0) {                  // Ignore negative or non-numeric input
                rtos_sleep((uint32_t)msec);
            }
        } else if (strncmp(buf, "kill ", 5) == 0) {
            int tid = atoi(&buf[5]);
            if (tid >= 1 && tid < thread_count) { // don’t kill idle
                tcbs[tid].state = BLOCKED; // crude kill
                uart_printf("Killed thread %d\n", tid);
            }
        } else {
            uart_print("Unknown command\n");
        }
    }
}

This shell was my safety net: if the LEDs behaved strangely, I’d hop into my serial terminal, type ps, and immediately see which threads were READY, BLOCKED, or SLEEPING. It saved me hours of guesswork.
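
The command matching in the shell can be pulled out into a pure function, which makes it testable on a desktop machine without any UART. This is a sketch with hypothetical names (cmd_t, parse_command). It assumes the line arrives without a trailing newline, and it is stricter than the loop above: "psx" or "kill abc" are rejected rather than half-parsed.

```c
#include <stdlib.h>
#include <string.h>

typedef enum { CMD_UNKNOWN, CMD_PS, CMD_SLEEP, CMD_KILL } cmd_t;

/* Parse a non-negative decimal number that must fill the rest of the string. */
static int parse_number(const char *s, long *out) {
    char *end;
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0' || v < 0) {
        return 0;                       /* empty, trailing junk, or negative */
    }
    *out = v;
    return 1;
}

/* Classify one input line; numeric arguments are returned through *arg. */
cmd_t parse_command(const char *line, long *arg) {
    *arg = 0;
    if (strcmp(line, "ps") == 0) {
        return CMD_PS;
    }
    if (strncmp(line, "sleep ", 6) == 0) {
        return parse_number(line + 6, arg) ? CMD_SLEEP : CMD_UNKNOWN;
    }
    if (strncmp(line, "kill ", 5) == 0) {
        return parse_number(line + 5, arg) ? CMD_KILL : CMD_UNKNOWN;
    }
    return CMD_UNKNOWN;
}
```

The shell thread would then switch on the returned cmd_t and keep the range check on the thread ID where it already is.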

Lessons Learned Along the Way

Interrupt Priorities Are Everything. I spent an entire afternoon convinced my PendSV code was wrong, until I realized I’d set SysTick and PendSV to the same priority. Once I gave SysTick a higher priority, PendSV started firing reliably. Lesson: double- and triple-check your NVIC_SetPriority() calls.
Measure Your Stack Usage. Pre-fill each thread’s stack with a known pattern (0xDEADBEEF) at startup. After running, inspect memory to see how much of the pattern has been overwritten. If the overwritten region reaches, or nearly reaches, the lowest address of the stack, you need a bigger stack. I learned this the hard way when a deeper call chain in my worker thread caused a silent overwrite.
Use LEDs & GPIO to Debug. If you don’t have a fancy debugger, just toggle a GPIO pin (hook it to an LED). I placed one toggle at the very start of PendSV_Handler and another at its end. Watching those pulses on a logic analyzer helped me verify that the handler ran the expected number of times, and at roughly the right intervals.
Keep It Simple at First. My very first RTOS version tried to support dynamic task creation at runtime, which was a massive mistake. I ended up with memory fragmentation and weird crashes. By freezing the number of threads at startup (just nine slots in my case), I avoided a world of pain.
Document Every Magic Number. That 0xFFFFFFFD value in the “fake” stack frame? I wrote a short note: “Link register value that indicates returning to Thread mode using PSP.” Without that comment, I would’ve Googled “ARM exception return values” every time I revisited the code.

Next Steps: Where to Go from Here

If you decide to build on this foundation, here are a few ideas:

Full Priority Inheritance for Semaphores. Rather than my quick “boost-and-forget” hack, implement a proper protocol where the thread holding the semaphore inherits the priority of the highest-priority thread blocked on it until the semaphore is released.
Inter-Thread Message Queues. Let threads send small messages or pointers to each other safely. Think of it like a tiny mailbox for passing data between tasks.
Dynamic Task Creation/Deletion. Tack on a small heap manager or memory pool so you can create and kill threads on the fly, but mind the extra complexity!
Profiling Hooks. Expand your SysTick handler (or use another timer) to log how many ticks each thread runs. You could output a simple report over UART showing CPU usage by task.
Real-Time Safety Checks. Introduce stack usage limits and detect when the 0xDEADBEEF guard words at the bottom of a thread’s stack get overwritten, then trigger a safe shutdown or reset instead of crashing at random.
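
As a starting point for the profiling idea, here is a minimal tick-accounting sketch. It is not something I have implemented in this RTOS, and all names are hypothetical. PROF_THREADS is assumed to match MAX_THREADS, and prof_on_tick() would be called once per SysTick with the index of the thread that was running:

```c
#include <stdint.h>

#define PROF_THREADS 9   /* assumption: same as MAX_THREADS */

static uint32_t run_ticks[PROF_THREADS];  /* ticks charged to each thread */
static uint32_t total_ticks;              /* ticks observed overall */

/* Charge the current tick to whichever thread was running. */
void prof_on_tick(int current) {
    if (current >= 0 && current < PROF_THREADS) {
        run_ticks[current]++;
    }
    total_ticks++;
}

/* Whole-percent CPU share of one thread, rounded down. */
uint32_t prof_usage_percent(int tid) {
    if (total_ticks == 0 || tid < 0 || tid >= PROF_THREADS) {
        return 0;
    }
    /* 64-bit intermediate so the multiply cannot overflow */
    return (uint32_t)(((uint64_t)run_ticks[tid] * 100u) / total_ticks);
}
```

Tick sampling only sees which thread is running at the tick boundary, so threads that run briefly between ticks are under-counted. A free-running hardware timer read at each context switch would give finer resolution.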

Final Thoughts

Building a minimal RTOS is messy, often frustrating, and absolutely enlightening. From chasing a misplaced PendSV priority to wrestling with stack overflows, every bug forced me to understand Cortex-M4 internals more deeply. If you try this on your own TM4C or any other Cortex-M part, expect a few nights of debugging—but also the deep satisfaction when that first reliable LED blink finally appears.

If you give this a shot, please let me know which part drove you up the wall. I’d love to hear your own “LED blinking” stories or any other tricks you discovered while chasing ghosts in your scheduler. Happy hacking!

Getting Started with RTOS: A Hands-On Guide for Beginners Using Cortex-M4