I still remember staring at my TM4C123GH6PM’s LEDs one evening, waiting for a simple “heartbeat” blink that never came. I thought I’d done everything right: set up SysTick, pended PendSV, and initialized my Thread Control Blocks (TCBs). But instead of a steady blink, the LEDs twitched once, then froze—mocking me. That moment crystallized what building a tiny RTOS really means: wrestling with hardware quirks, chasing elusive bugs, and patching together just enough code to make everything run smoothly. In this article, I’ll walk you through my journey of crafting a minimal, priority-based scheduler on an ARM Cortex-M4—warts, weird hacks, and late-night “aha!” moments included.


A Bit of Backstory

About a year ago, my embedded-systems class assigned us to build a simple RTOS from scratch on a TM4C123GH6PM (ARM Cortex-M4). I’d been using FreeRTOS before, but I never fully understood what went on behind the scenes. So I decided to strip it down: no fancy features, just threads, priorities, basic semaphores, and a shell over UART to peek under the hood.

Why?

Spoiler: My RTOS wasn’t perfect. But in chasing its flaws, I learned more about ARM internals than from any textbook.


Diving into the Cortex-M4 World

Before writing a single line of code, I had to wrap my head around how Cortex-M4 handles interrupts and context switching. Here’s what bit (pun intended) me early on:

  1. PendSV’s “Gentle” Role
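    • PendSV is meant to be configured as the lowest-priority exception, so that it runs only after every other handler has finished. My first mistake was setting PendSV’s priority to 0 (highest). That is exactly backwards: at priority 0, PendSV can preempt other ISRs and switch contexts in the middle of them, which corrupts their state. Once I moved it to 0xFF, scheduling finally kicked in. (The TM4C123 implements only the top three priority bits, so 0xFF is stored as 0xE0, the lowest of its eight levels.)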
    • Note to self (and you): write down your NVIC priorities on a Post-it. It’s easy to confuse “0 is highest” with “0 is lowest.”
  2. SysTick as the Heartbeat
    • I aimed for a 1 ms tick. On a 16 MHz clock, that meant LOAD = 16 000 – 1.
    • I initially tried to do all scheduling decisions inside the SysTick ISR, but that got messy. Instead, I now just decrement sleep counters there and set the PendSV pending bit. Let the “real” context switch happen in PendSV.
  3. Exception Stack Frame
    • When any exception fires, hardware auto-pushes R0–R3, R12, LR, PC, and xPSR. That means my “fake” initial stack for each thread must match this exact layout—else, on the very first run, the CPU will attempt to pop garbage and crash.
    • I once forgot to set the Thumb bit (0x01000000) in xPSR. The result was an immediate hard fault. Lesson: that Thumb flag is non-negotiable.
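
The arithmetic behind the first two items is easy to check on a host machine. Here is a minimal sketch, assuming the three implemented priority bits of the TM4C123 (what CMSIS exposes as __NVIC_PRIO_BITS). The helper names systick_reload and priority_field are made up for illustration and are not part of my RTOS.

```c
#include <assert.h>
#include <stdint.h>

#define PRIO_BITS 3u  /* assumption: TM4C123 implements 3 priority bits */

/* Reload value for a given tick rate: the counter runs LOAD + 1 cycles per tick. */
uint32_t systick_reload(uint32_t clock_hz, uint32_t tick_hz) {
    assert(tick_hz != 0u && clock_hz >= tick_hz);
    uint32_t reload = clock_hz / tick_hz - 1u;
    assert(reload <= 0x00FFFFFFu);  /* SysTick LOAD is a 24-bit field */
    return reload;
}

/* Byte that lands in the priority register for a requested level,
   mirroring the shift-and-mask done by CMSIS NVIC_SetPriority(). */
uint8_t priority_field(uint32_t level) {
    return (uint8_t)((level << (8u - PRIO_BITS)) & 0xFFu);
}
```

With these assumptions, a 1 ms tick at 16 MHz gives a reload of 15999, and both 0xFF and 7 collapse to the same stored byte, 0xE0.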

Crafting the Thread Control Block (TCB)

Every thread in my RTOS holds:

#include <stdint.h>

typedef enum { READY, RUNNING, BLOCKED, SLEEPING } state_t;

typedef struct {
    uint32_t *stack_ptr;   // Saved PSP for context switches
    uint8_t   priority;    // 0 = highest, larger = lower priority
    state_t   state;       // READY, RUNNING, BLOCKED, or SLEEPING
    uint32_t  sleep_ticks; // How many SysTick ticks remain, if sleeping
} tcb_t;

In practice, I declared:

#define MAX_THREADS 9
#define STACK_WORDS 256

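// AAPCS expects the stack pointer to be 8-byte aligned at function calls
uint32_t    thread_stacks[MAX_THREADS][STACK_WORDS] __attribute__((aligned(8)));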
tcb_t       tcbs[MAX_THREADS];
uint8_t     thread_count = 0;
int         current_thread = -1;

A Stack-Overflow Horror Story

When I first assigned STACK_WORDS = 128, everything seemed fine—until my “worker” thread, which did a few nested function calls, started corrupting memory. The LED would blink twice, then vanish. By filling each stack with 0xDEADBEEF at startup and checking how far it got overwritten, I discovered that 128 words weren’t enough under optimized build flags. Bumping to 256 words fixed it.
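
The fill-and-check trick can be sketched in a few lines. This is a host-runnable illustration rather than my exact code: stack_paint and stack_unused_words are hypothetical names, and the scan assumes a full-descending stack, so the low addresses are the last to be touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xDEADBEEFu

/* Paint a whole stack with the fill pattern before the thread first runs. */
void stack_paint(uint32_t *stack, size_t words) {
    assert(stack != NULL);
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Count untouched words starting from the low end of the stack.
   A result near zero means the thread came close to overflowing. */
size_t stack_unused_words(const uint32_t *stack, size_t words) {
    assert(stack != NULL);
    size_t unused = 0;
    while (unused < words && stack[unused] == STACK_FILL) {
        unused++;
    }
    return unused;
}
```

A value that happens to equal the fill pattern can make the count slightly optimistic, so treat the result as an estimate and leave some margin.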


Spawning a New Thread

Creating a thread meant carving out its stack and simulating the hardware stacking that happens on exception entry. Here’s the routine, with comments on my early pitfalls:

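void rtos_create_thread(void (*fn)(void), uint8_t prio) {
    if (thread_count >= MAX_THREADS) {
        return; // no free slot left
    }
    int id = thread_count++;
    tcb_t *t = &tcbs[id];
    // Start one past the last word: every *(--stk) pre-decrements, so the top
    // word is used and the thread's initial stack top sits on an even word
    uint32_t *stk = &thread_stacks[id][STACK_WORDS];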

    // Simulate hardware stacking (xPSR, PC, LR, R12, R3, R2, R1, R0)
    *(--stk) = 0x01000000;            // xPSR: Thumb bit set
    *(--stk) = (uint32_t)fn;          // PC → thread entry point
    *(--stk) = 0xFFFFFFFD;            // LR: placeholder; a thread must never return (branching here faults)
    *(--stk) = 0x12121212;            // R12 (just a marker)
    *(--stk) = 0x03030303;            // R3
    *(--stk) = 0x02020202;            // R2
    *(--stk) = 0x01010101;            // R1
    *(--stk) = 0x00000000;            // R0

    // Save space for R4–R11 (popped by the context switch)
    for (int r = 4; r <= 11; r++) {
        *(--stk) = 0x0; // or use a pattern if you want to measure usage
    }

    t->stack_ptr   = stk;
    t->priority    = prio;
    t->state       = READY;
    t->sleep_ticks = 0;
}
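
Counting sixteen pre-decrements by hand is easy to get wrong, so one alternative is to describe the frame as a struct and let the compiler do the arithmetic. This is a sketch under assumptions, not the routine above: initial_frame_t and frame_init are hypothetical names, the entry point is passed as a plain 32-bit address so the code also runs on a host, and bit 0 of the stacked PC is cleared the way FreeRTOS’s Cortex-M ports do it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XPSR_THUMB 0x01000000u

/* A thread's initial stack contents, from low address to high address. */
typedef struct {
    uint32_t r4_r11[8];                          /* restored in software by PendSV */
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;  /* unstacked by hardware on exception return */
} initial_frame_t;

_Static_assert(sizeof(initial_frame_t) == 16 * sizeof(uint32_t),
               "frame must be exactly 16 words");

/* Build the frame directly below stack_top; the return value is the saved PSP. */
uint32_t *frame_init(uint32_t *stack_top, uint32_t entry, uint32_t exit_trap) {
    assert(stack_top != NULL);
    initial_frame_t *f = (initial_frame_t *)stack_top - 1;
    for (int i = 0; i < 8; i++) {
        f->r4_r11[i] = 0u;
    }
    f->r0 = f->r1 = f->r2 = f->r3 = f->r12 = 0u;
    f->lr   = exit_trap;      /* where the thread lands if its function returns */
    f->pc   = entry & ~1u;    /* stacked PC holds the address without the Thumb bit */
    f->xpsr = XPSR_THUMB;     /* the Thumb state lives here instead */
    return (uint32_t *)f;
}
```

On the target, entry would be (uint32_t)fn and exit_trap the address of a small function that parks or removes a finished thread.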

Pitfalls to Watch For


Letting Time Pass: SysTick Handler

Here’s my final SysTick ISR, trimmed to essentials:

void SysTick_Handler(void) {
    for (int i = 0; i < thread_count; i++) {
        if (tcbs[i].state == SLEEPING) {
            if (--tcbs[i].sleep_ticks == 0) {
                tcbs[i].state = READY;
            }
        }
    }
    SCB->ICSR = SCB_ICSR_PENDSVSET_Msk; // Pend PendSV; ICSR bits are write-1-to-set, so |= could re-pend other exceptions
}

A couple of notes:


The Big Switch: PendSV Handler

When PendSV finally fires, it does the actual context switch. My implementation in GCC inline assembly (ARM syntax) looks like this:

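// current_thread is an index, not a pointer, so the TCB lookup is done in C.
// Takes the outgoing thread's saved stack pointer, returns the incoming one.
uint32_t *rtos_switch_context(uint32_t *saved_sp) {
    tcbs[current_thread].stack_ptr = saved_sp;
    rtos_schedule();                           // updates current_thread
    return tcbs[current_thread].stack_ptr;
}

__attribute__((naked)) void PendSV_Handler(void) {
    __asm volatile(
        "MRS   R0, PSP                \n" // Get current PSP
        "STMDB R0!, {R4-R11}          \n" // Push R4–R11 onto the thread's stack
        "MOV   R4, LR                 \n" // BL overwrites LR; park EXC_RETURN in R4 (callee-saved)
        "BL    rtos_switch_context    \n" // R0 = saved PSP in, R0 = next thread's PSP out
        "MOV   LR, R4                 \n" // Bring EXC_RETURN back before R4 is reloaded
        "LDMIA R0!, {R4-R11}          \n" // Pop R4–R11 from the next thread's stack
        "MSR   PSP, R0                \n" // Update PSP to new thread
        "BX    LR                     \n" // Exit exception, restore R0–R3, R12, LR, PC, xPSR
    );
}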

How I Tracked Down That Pesky 4-Byte Offset

At first, my “save/restore” code was off by one word (4 bytes). The symptom? Threads would run, then somehow come back to life in the wrong place: garbled instructions, random jumps. I added a few GPIO toggles (an LED pin just before and after STMDB) to confirm the handler was running when I expected, then used my debugger to compare the numerical PSP value to the tcbs[].stack_ptr I expected. Sure enough, I’d accidentally used STMDB R0!, {R4-R11, R12} instead of {R4-R11}, pushing one extra register that the matching LDMIA never popped. Removing that extra register fixed it.
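
A cheap guard against this class of bug is to check, in a debug build, that the software-saved area is exactly eight words. The sketch below is host-runnable. The name context_save_ok is hypothetical, and it assumes the FPU context is not being stacked.

```c
#include <assert.h>
#include <stdint.h>

#define SW_SAVED_REGS 8u  /* R4-R11 */

/* Returns 1 if saved_sp sits exactly eight words below the PSP seen at handler entry. */
int context_save_ok(uintptr_t psp_at_entry, uintptr_t saved_sp) {
    assert(psp_at_entry >= saved_sp);  /* the stack grows downward */
    return (psp_at_entry - saved_sp) == SW_SAVED_REGS * sizeof(uint32_t);
}
```

Pushing one register too many shows up here as a difference of 36 bytes instead of 32.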


Picking the Next Thread: Scheduler Logic

My scheduler is intentionally simple. It scans all TCBs to find the highest-priority READY thread, with a tiny round-robin tweak for threads of equal priority:

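void rtos_schedule(void) {
    int best_prio = 256; // worse than any valid uint8_t priority

    // Pass 1: find the best (numerically lowest) priority among READY threads.
    // Scanning everything first avoids stopping early and missing a
    // higher-priority thread later in the array.
    for (int i = 0; i < thread_count; i++) {
        if (tcbs[i].state == READY && tcbs[i].priority < best_prio) {
            best_prio = tcbs[i].priority;
        }
    }

    // Pass 2: round-robin within that priority, starting just after the
    // current thread and wrapping around to the beginning
    int next = 0; // no READY threads: fall back to idle thread (ID 0)
    for (int k = 1; k <= thread_count; k++) {
        int i = (current_thread + k) % thread_count;
        if (tcbs[i].state == READY && tcbs[i].priority == best_prio) {
            next = i;
            break;
        }
    }
    current_thread = next;
}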
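
Selection logic like this is much easier to test on a host than on the board. Here is a minimal sketch of the same rule as a pure function: the lowest priority number wins, and ties rotate starting after the current thread. The name pick_next and its array-based signature are assumptions made for testing, not the kernel’s real interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns the index of the next thread to run, or 0 (idle) if none is ready. */
int pick_next(const uint8_t *prio, const bool *ready, int count, int current) {
    assert(count > 0 && current >= -1 && current < count);

    int best = 256;  /* worse than any uint8_t priority */
    for (int i = 0; i < count; i++) {
        if (ready[i] && prio[i] < best) {
            best = prio[i];
        }
    }
    /* Walk the threads in circular order, starting just after `current`. */
    for (int k = 1; k <= count; k++) {
        int i = (current + k) % count;
        if (ready[i] && prio[i] == best) {
            return i;
        }
    }
    return 0;
}
```

Because the walk ends on the current thread itself, a thread that is alone at the best priority is simply picked again.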

What I Learned Here:


Kernel Primitives: Yield, Sleep, and Semaphores

After threading basics, the next step was making threads interact properly.

Yield

void rtos_yield(void) {
    SCB->ICSR = SCB_ICSR_PENDSVSET_Msk; // write-1-to-set: a plain write pends PendSV only
}

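I like sprinkling rtos_yield() at the end of long loops so other threads of the same priority get a turn before the next tick. Yielding never hands the CPU to a lower-priority thread, though: the scheduler just picks the caller again, so a high-priority loop has to sleep or block to let lower-priority work run. In early tests, omitting rtos_yield() meant some tasks hogged the CPU under certain priority configurations.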

Sleep

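void rtos_sleep(uint32_t ticks) {
    if (ticks == 0) {
        rtos_yield();   // a zero count would underflow in the SysTick countdown
        return;
    }
    __disable_irq();    // keep SysTick from seeing a half-updated TCB
    tcb_t *self = &tcbs[current_thread];
    self->sleep_ticks = ticks;
    self->state       = SLEEPING;
    __enable_irq();
    rtos_yield();
}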

My “LED blinking” thread calls rtos_sleep(500). When I watch it blink every half-second, I know SysTick and PendSV are doing their jobs correctly.
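
The sleep bookkeeping can be modelled on a host to check the wake-up timing. This is a simplified sketch, not the kernel code: the sim_ names and the two-state enum are made up for the model, and a zero-tick request is treated as a plain yield.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SIM_READY, SIM_SLEEPING } sim_state_t;

typedef struct {
    sim_state_t state;
    uint32_t    sleep_ticks;
} sim_thread_t;

/* Model of a sleep request; zero ticks leaves the thread ready. */
void sim_sleep(sim_thread_t *t, uint32_t ticks) {
    assert(t != NULL);
    if (ticks == 0u) {
        return;
    }
    t->sleep_ticks = ticks;
    t->state = SIM_SLEEPING;
}

/* Model of one tick interrupt: count down every sleeper and wake those
   that reach zero. Returns how many threads woke on this tick. */
int sim_tick(sim_thread_t *threads, int count) {
    assert(threads != NULL);
    int woken = 0;
    for (int i = 0; i < count; i++) {
        if (threads[i].state == SIM_SLEEPING && --threads[i].sleep_ticks == 0u) {
            threads[i].state = SIM_READY;
            woken++;
        }
    }
    return woken;
}
```

A thread that asks for N ticks wakes on the Nth tick interrupt after the call. Since the first of those can arrive at any point in the current period, the real delay is somewhere between N-1 and N tick periods.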

Semaphores

Initially, I tried a naive approach:

typedef struct {
    volatile int count;
    int waiting_queue[MAX_THREADS];
    int head, tail;
} semaphore_t;

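// Note: count, head, and tail must be initialized (head = tail = 0) before use.
void rtos_sem_wait(semaphore_t *sem) {
    __disable_irq();
    sem->count--;
    if (sem->count < 0) {
        // Queue this thread; the index wraps so the ring buffer never overruns
        sem->waiting_queue[sem->tail] = current_thread;
        sem->tail = (sem->tail + 1) % MAX_THREADS;
        tcbs[current_thread].state = BLOCKED;
        __enable_irq();
        rtos_yield();
    } else {
        __enable_irq();
    }
}

void rtos_sem_post(semaphore_t *sem) {
    __disable_irq();
    sem->count++;
    if (sem->count <= 0) {
        // Wake the longest-waiting thread (FIFO order, regardless of priority)
        int tid = sem->waiting_queue[sem->head];
        sem->head = (sem->head + 1) % MAX_THREADS;
        tcbs[tid].state = READY;
    }
    __enable_irq();
}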
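
The waiting queue is a small ring buffer, and its wraparound behaviour is worth testing on its own. Below is a host-runnable sketch under assumptions: wait_queue_t and the wq_ functions are hypothetical names, QUEUE_SLOTS mirrors MAX_THREADS, and it tracks a head index plus a count rather than separate head and tail indices.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_SLOTS 9  /* assumption: one slot per thread, matching MAX_THREADS */

typedef struct {
    int slots[QUEUE_SLOTS];
    int head;   /* index of the oldest waiter */
    int count;  /* number of waiters currently queued */
} wait_queue_t;

void wq_init(wait_queue_t *q) {
    assert(q != NULL);
    q->head = 0;
    q->count = 0;
}

/* Append a thread ID; returns 0 on success, -1 if the queue is full. */
int wq_push(wait_queue_t *q, int tid) {
    assert(q != NULL);
    if (q->count == QUEUE_SLOTS) {
        return -1;
    }
    q->slots[(q->head + q->count) % QUEUE_SLOTS] = tid;
    q->count++;
    return 0;
}

/* Remove and return the oldest waiter, or -1 if the queue is empty. */
int wq_pop(wait_queue_t *q) {
    assert(q != NULL);
    if (q->count == 0) {
        return -1;
    }
    int tid = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_SLOTS;
    q->count--;
    return tid;
}
```

Keeping a count makes the full and empty cases unambiguous, which head and tail indices alone cannot do once every slot is in use.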

Priority Inversion Nightmare

One day, I had three threads sharing one semaphore: T0 held it, T1 was blocked waiting for it, and T2 was busy with unrelated work. T2 outranked T0, so while T1 sat blocked, T2 kept running and T0 never got a chance to release the semaphore. T1 starved, even though it was the most urgent of the three. My quick hack was to temporarily boost T0’s priority when T1 blocked, which is a rudimentary form of priority inheritance. A full solution would track which thread holds the semaphore and lift its priority automatically. That’s an exercise I left for later.
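
For the curious, the fuller version could look roughly like this. It is a sketch under stated assumptions, not something I shipped: inheritance needs a known owner, so it models a mutex-style lock rather than a counting semaphore, it handles a single lock with no nesting, and every pi_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_OWNER (-1)

typedef struct {
    uint8_t base_priority;  /* priority assigned at creation */
    uint8_t priority;       /* effective priority seen by the scheduler */
} pi_thread_t;

typedef struct {
    int owner;              /* thread ID holding the lock, or NO_OWNER */
} pi_mutex_t;

/* Try to take the lock. On contention, lend the waiter's priority to the
   owner if the waiter is more urgent. Returns 1 if acquired, 0 if the
   caller has to block. */
int pi_lock(pi_mutex_t *m, pi_thread_t *threads, int tid) {
    assert(m != NULL && threads != NULL);
    if (m->owner == NO_OWNER) {
        m->owner = tid;
        return 1;
    }
    pi_thread_t *owner = &threads[m->owner];
    if (threads[tid].priority < owner->priority) {
        owner->priority = threads[tid].priority;  /* 0 = highest */
    }
    return 0;
}

/* Release the lock and drop any borrowed priority. */
void pi_unlock(pi_mutex_t *m, pi_thread_t *threads, int tid) {
    assert(m != NULL && threads != NULL);
    assert(m->owner == tid);
    threads[tid].priority = threads[tid].base_priority;
    m->owner = NO_OWNER;
}
```

In the three-thread scenario, T0 would run at T1’s priority while it holds the lock, so T2 could no longer keep it off the CPU.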


Kicking Everything Off: rtos_start()

In main(), after basic hardware init (clocks, GPIO for LEDs, UART for the shell), I did:

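// 1. Initialize SysTick and PendSV priorities
__disable_irq();            // hold off ticks until a thread context exists
systick_init(16000);        // 1 ms tick on a 16 MHz system
NVIC_SetPriority(PendSV_IRQn, 0xFF); // lowest priority; only the implemented bits are kept

// 2. Create threads (ID 0 = idle thread)
rtos_create_thread(idle_thread, 255);
rtos_create_thread(shell_thread, 3);
rtos_create_thread(worker1, 1);
rtos_create_thread(worker2, 2);

// 3. Switch Thread mode over to PSP (still privileged)
current_thread = 0;
__set_PSP((uint32_t)tcbs[0].stack_ptr);
__set_CONTROL(0x02); // SPSEL = 1: use PSP; nPRIV = 0: stay privileged (0x03 would be unprivileged)
__ISB();

// 4. Pend PendSV to start first context switch
SCB->ICSR = SCB_ICSR_PENDSVSET_Msk; // write-1-to-set, so a plain write is enough
__enable_irq();             // PendSV fires here and saves this code as thread 0's context

// 5. Idle loop (resumed whenever thread 0 is scheduled)
while (1) {
    __WFI();  // Save power until next interrupt
}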
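
The CONTROL value is one more magic number that deserves names. A small host-runnable sketch of how its low bits decode on a Cortex-M4: the macro and function names here are my own, and CMSIS provides equivalents such as CONTROL_SPSEL_Msk and CONTROL_nPRIV_Msk.

```c
#include <assert.h>
#include <stdint.h>

#define CONTROL_NPRIV (1u << 0)  /* 1 = Thread mode is unprivileged */
#define CONTROL_SPSEL (1u << 1)  /* 1 = Thread mode uses PSP */
#define CONTROL_FPCA  (1u << 2)  /* 1 = floating-point context is active */

int control_uses_psp(uint32_t control) {
    assert((control & ~0x7u) == 0u);  /* only bits 2:0 are defined on Cortex-M4 */
    return (control & CONTROL_SPSEL) != 0u;
}

int control_is_unprivileged(uint32_t control) {
    assert((control & ~0x7u) == 0u);
    return (control & CONTROL_NPRIV) != 0u;
}
```

Staying privileged matters in this design because threads write SCB->ICSR directly in rtos_yield(), and the System Control Space is not accessible from unprivileged code.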

A couple final notes:


The UART Shell: Peeking Under the Hood

I built a tiny shell so I could type commands over UART and inspect thread states:

#include <stdlib.h>  // atoi
#include <string.h>  // strncmp

void shell_thread(void) {
    char buf[64];
    while (1) {
        uart_print("> ");
        uart_read_line(buf, sizeof(buf));

        if (strncmp(buf, "ps", 2) == 0) {
            for (int i = 0; i < thread_count; i++) {
                uart_printf("TID %d: state=%d prio=%d\n",
                    i, tcbs[i].state, tcbs[i].priority);
            }
        } else if (strncmp(buf, "sleep ", 6) == 0) {
            int msec = atoi(&buf[6]);
            if (msec > 0) {
                rtos_sleep((uint32_t)msec); // one tick = 1 ms; ignore zero or negative input
            }
        } else if (strncmp(buf, "kill ", 5) == 0) {
            int tid = atoi(&buf[5]);
            if (tid >= 1 && tid < thread_count) { // don’t kill idle
                tcbs[tid].state = BLOCKED; // crude kill
                uart_printf("Killed thread %d\n", tid);
            }
        } else {
            uart_print("Unknown command\n");
        }
    }
}

This shell was my safety net: if the LEDs behaved strangely, I’d hop into my serial terminal, type ps, and immediately see which threads were READY, BLOCKED, or SLEEPING. It saved me hours of guesswork.
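
One weak spot in a shell like this is atoi(), which quietly returns 0 for garbage input. A stricter argument parser might look like the sketch below. It assumes the line arrives without a trailing newline (I have not shown what uart_read_line does with it), and parse_arg is a hypothetical helper, not part of the shell above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Parse the decimal argument that follows `prefix` in `line`.
   Returns 1 and stores the value in *out on success. Returns 0 if the
   line does not start with the prefix, the argument is not a clean
   number, or the value falls outside [min, max]. */
int parse_arg(const char *line, const char *prefix, long min, long max, long *out) {
    assert(line != NULL && prefix != NULL && out != NULL);
    size_t n = strlen(prefix);
    if (strncmp(line, prefix, n) != 0) {
        return 0;
    }
    char *end = NULL;
    errno = 0;
    long value = strtol(line + n, &end, 10);
    if (end == line + n || *end != '\0' || errno == ERANGE) {
        return 0;  /* no digits, trailing junk, or overflow */
    }
    if (value < min || value > max) {
        return 0;
    }
    *out = value;
    return 1;
}
```

With this, the kill command could call parse_arg(buf, "kill ", 1, thread_count - 1, &tid), so "kill abc" gets rejected instead of being read as thread 0.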


Lessons Learned Along the Way

  1. Interrupt Priorities Are Everything
    • I spent an entire afternoon convinced my PendSV code was wrong, until I realized I’d set SysTick and PendSV to the same priority. Once I gave SysTick a higher priority, PendSV started firing reliably. Lesson: double- and triple-check your NVIC_SetPriority() calls.
  2. Measure Your Stack Usage
    • Pre-fill each thread’s stack with a known pattern (0xDEADBEEF) at startup. After running, inspect memory to see how much of the pattern survives. If it has been overwritten all the way down to the lowest words of the stack, you need a bigger stack. I learned this the hard way when a deeper call chain in my worker thread caused a silent overwrite.
  3. Use LEDs & GPIO to Debug
    • If you don’t have a fancy debugger, just toggle a GPIO pin (hook it to an LED). I placed one LED toggle at the very start of PendSV_Handler and another at its end. Watching those pulses on a logic analyzer helped me verify that the handler ran the expected number of times, and at roughly the right intervals.
  4. Keep It Simple at First
    • My very first RTOS version tried to support dynamic task creation at runtime, which was a massive mistake. I ended up with memory fragmentation and weird crashes. By freezing the number of threads at startup (just nine slots in my case), I avoided a world of pain.
  5. Document Every Magic Number
    • That 0xFFFFFFFD value in the “fake” stack frame? It is the EXC_RETURN code for “return to Thread mode using PSP.” In the stacked LR slot it is only a placeholder, because the EXC_RETURN that actually matters is the one sitting in LR when the handler exits. I wrote a short note saying exactly that. Without that comment, I would’ve Googled “ARM exception return values” every time I revisited the code.
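
In the spirit of that last item, here is a sketch that names the common EXC_RETURN values and decodes their low bits, as they apply to a Cortex-M4 with an FPU. The macro and function names are mine, chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define EXC_RETURN_HANDLER_MSP 0xFFFFFFF1u  /* back to Handler mode, main stack */
#define EXC_RETURN_THREAD_MSP  0xFFFFFFF9u  /* back to Thread mode, main stack */
#define EXC_RETURN_THREAD_PSP  0xFFFFFFFDu  /* back to Thread mode, process stack */

/* Bit 3: 1 = return to Thread mode, 0 = return to Handler mode. */
int exc_return_to_thread(uint32_t v) {
    assert((v & 0xFFFFFFE0u) == 0xFFFFFFE0u);
    return (v & (1u << 3)) != 0u;
}

/* Bit 2: 1 = unstack from PSP, 0 = unstack from MSP. */
int exc_return_uses_psp(uint32_t v) {
    assert((v & 0xFFFFFFE0u) == 0xFFFFFFE0u);
    return (v & (1u << 2)) != 0u;
}

/* Bit 4: 1 = basic 8-word frame, 0 = extended frame with FPU registers. */
int exc_return_basic_frame(uint32_t v) {
    assert((v & 0xFFFFFFE0u) == 0xFFFFFFE0u);
    return (v & (1u << 4)) != 0u;
}
```

If a thread uses the FPU, the core reports 0xFFFFFFED instead of 0xFFFFFFFD, and a context switch would then need to save the extra floating-point registers as well. The scheduler in this article does not handle that case.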

Next Steps: Where to Go from Here

If you decide to build on this foundation, here are a few ideas:


Final Thoughts

Building a minimal RTOS is messy, often frustrating, and absolutely enlightening. From chasing a misplaced PendSV priority to wrestling with stack overflows, every bug forced me to understand Cortex-M4 internals more deeply. If you try this on your own TM4C or any other Cortex-M part, expect a few nights of debugging—but also the deep satisfaction when that first reliable LED blink finally appears.

If you give this a shot, please let me know which part drove you up the wall. I’d love to hear your own “LED blinking” stories or any other tricks you discovered while chasing ghosts in your scheduler. Happy hacking!