This story on HackerNoon has a decentralized backup on Sia.
Transaction ID: q8Tg_PI5AoMIhqOEQhRyOaMOAqEixFxSLzzHjNxnCts

I Let Karpathy's AutoResearch Agent Run Overnight!

Written by @raviteja-nekkalapu | Published on 2026/4/12

TL;DR
Andrej Karpathy’s viral autoresearch repo automates the most tedious part of machine learning: the trial-and-error experiment loop. By simply writing research goals in a markdown file, you can set an AI agent loose to modify code, run 5-minute training batches, and log the results. Left running overnight, the agent completed 40+ experiments and made surprisingly creative architectural tweaks, suggesting that the future of ML is less about manually tuning parameters and more about writing great instructions for AI agents.

A few weeks ago, Andrej Karpathy dropped a repo called autoresearch on GitHub. Within a week, it had 60,000 stars. I have never seen a repo blow up that fast outside of maybe the original llama.cpp.


The pitch is wild: give an AI coding agent a small neural network training setup, let it modify the code, run experiments, keep what works, and throw away what doesn't. Go to sleep. Wake up to a log of findings.


I had to try it. Not because I expected it to produce breakthrough results, but because I wanted to feel what it is like when the research loop doesn't need me anymore.

What AutoResearch Actually Is

Let me be clear about what this repo is, because the hype around it can be misleading. It is NOT a general-purpose research assistant. It is not going to write your PhD thesis.


What it IS: a single-file neural network training setup (train.py) paired with a markdown file (program.md) that tells an AI coding agent how to run experiments. The agent reads the program.md, understands the codebase, modifies train.py, kicks off a 5-minute training run, checks the validation loss, and either keeps the change or reverts it.


That's it. But "that's it" turns out to be absurdly powerful.
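The core loop is simple enough to sketch. Here is a minimal, simulated version of the keep-or-revert policy; `run_experiment` and the scores are stand-ins, not the repo's actual code, which runs real 5-minute training jobs:

```python
import random

def run_experiment(proposal, rng):
    """Stand-in for a real 5-minute training run.
    Returns a simulated val_bpb for the proposed change."""
    return round(rng.uniform(1.75, 1.90), 3)

def research_loop(proposals, baseline_bpb, seed=0):
    """Keep a change if it improves val_bpb, otherwise discard it.
    Mirrors the commit/revert policy described in program.md."""
    rng = random.Random(seed)
    best_bpb = baseline_bpb
    kept = []
    for proposal in proposals:
        bpb = run_experiment(proposal, rng)
        if bpb < best_bpb:      # lower bits-per-byte is better
            best_bpb = bpb      # "commit" the change
            kept.append((proposal, bpb))
        # else: "revert" -- throw the change away
    return best_bpb, kept

best, kept = research_loop(
    ["deeper net", "cosine LR", "bigger batch", "SwiGLU"],
    baseline_bpb=1.847,
)
print(best, [p for p, _ in kept])
```

The real version swaps `run_experiment` for an actual training run and uses git commits and reverts instead of an in-memory list, but the decision rule is exactly this greedy one.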


autoresearch/
├── prepare.py     # data prep + eval utilities (DON'T touch)
├── train.py       # the file the agent modifies
├── program.md     # instructions for the agent
├── pyproject.toml # dependencies
└── analysis.ipynb # analyze results



The genius is in how constrained it is. One file to modify. One metric to optimize (val_bpb, validation bits per byte). A fixed 5-minute time budget per experiment. Git versioning for every change. No ambiguity.
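For reference, bits per byte is just the model's mean cross-entropy loss converted from nats to bits and normalized per byte of raw text. A quick sketch of the conversion; the function name and the tokens-per-byte figure here are illustrative, not taken from the repo:

```python
import math

def val_bpb(loss_nats_per_token: float, tokens_per_byte: float) -> float:
    """Convert mean cross-entropy (nats per token) to bits per byte."""
    bits_per_token = loss_nats_per_token / math.log(2)  # nats -> bits
    return bits_per_token * tokens_per_byte

# e.g. a loss of 2.56 nats/token with a BPE tokenizer that averages
# 0.5 tokens per byte works out to roughly 1.85 bits per byte
print(round(val_bpb(2.56, 0.5), 3))
```

Normalizing per byte (rather than per token) is what makes the metric comparable across experiments that change the tokenizer or sequence packing.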

Setting It Up

Requirements are surprisingly minimal: one NVIDIA GPU, Python 3.10+, and uv (Astral's package manager). I had an H100 available through a cloud instance, so I spun one up.


# install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# clone and setup
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync

# one-time data prep — downloads training data, trains BPE tokenizer
# this took about 2 minutes for me
uv run prepare.py

# test that a single training run works
uv run train.py



The single training run took exactly 5 minutes (wall clock time, excluding PyTorch compilation). It reported a baseline val_bpb score. This is the number the agent tries to beat.

The program.md: Programming the Programmer

In traditional ML research, I open train.py, tweak a hyperparameter, run the training, wait, check the loss, repeat. A good day is 10-15 experiments.


With autoresearch, I don't touch train.py at all. Instead, I edit program.md, a markdown file that describes to the AI agent what it should try, what the goals are, and what constraints to follow. Karpathy describes this as "programming the program."


The default program.md is deliberately bare-bones:


# Research Program

## Goal
Minimize val_bpb (validation bits per byte) for a small language model
trained on the FineWeb-Edu dataset.

## Rules
- Only modify train.py
- Each experiment runs for exactly 5 minutes
- If val_bpb improves, commit the change
- If val_bpb doesn't improve, revert
- Log everything

## What to try
- Architecture changes (depth, width, attention patterns)
- Optimizer tuning (learning rate, weight decay, warmup)
- Training tricks (gradient accumulation, batch size)


I stared at this for a while. Writing research instructions in a markdown file feels like leaving notes for a colleague.

Letting It Run

I pointed Cursor at the repo, told it to read program.md and start experimenting, then closed my laptop and went to sleep.


I am not going to pretend I slept soundly. I checked my phone twice, once at midnight, once around 3 am. Both times, the terminal was still going: new branches being created, training runs firing, reverts happening.


By morning, the agent had run 47 experiments in about 8 hours. Here is a cleaned-up summary of what it tried:


# parsing the git log to see what experiments ran
import subprocess

log = subprocess.check_output(
    ["git", "log", "--oneline", "--all"],
    text=True
)
experiments = [l for l in log.strip().split('\n') if 'experiment' in l.lower()]

# results from my overnight run (condensed):
# Exp 1:  baseline val_bpb = 1.847
# Exp 5:  increased depth 8->10, val_bpb = 1.831 ✓ KEPT
# Exp 9:  switched to cosine LR schedule, val_bpb = 1.824 ✓ KEPT
# Exp 14: doubled batch size, val_bpb = 1.826 ✗ REVERTED
# Exp 18: added RMSNorm pre-norm, val_bpb = 1.819 ✓ KEPT
# Exp 23: SwiGLU activation, val_bpb = 1.812 ✓ KEPT
# Exp 31: rotary position embeddings, val_bpb = 1.808 ✓ KEPT
# Exp 47: final val_bpb = 1.793

print(f"Total experiments: {len(experiments)}")
print(f"Improvement: {(1.847 - 1.793) / 1.847 * 100:.1f}%")
# about 2.9% improvement overnight



Forty-seven experiments. Eight kept. A 2.9% improvement on a system that was already using a decent architecture. All while I was literally asleep.

What Actually Surprised Me

The improvement percentage isn't what impressed me. 2.9% in one night on a small model is nice but not earth-shattering.


What surprised me was the creativity of the changes the agent tried. I expected it to just grid-search hyperparameters, tweak the learning rate up, tweak it down, done. Instead, it proposed architectural changes I wouldn't have thought to try in that combination. Swapping activation functions, changing normalization schemes, adjusting the attention window patterns.


One experiment it ran was particularly interesting: it tried replacing the standard SSSL sliding window attention pattern with a pure "L" (full) attention pattern. It got reverted because it was slower and the val_bpb didn't improve enough, but the reasoning in its commit message was sound. On my specific GPU, the memory bandwidth characteristics made the sliding window less efficient than expected. It figured that out empirically.
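To make the sliding-window-vs-full distinction concrete, here is a sketch of the two causal attention masks in plain NumPy. The window size is illustrative, and this is not the repo's implementation, just the shape of the idea:

```python
import numpy as np

def causal_mask(T: int) -> np.ndarray:
    """Full ('L') causal attention: each token attends to every prior token."""
    return np.tril(np.ones((T, T), dtype=bool))

def sliding_window_mask(T: int, window: int) -> np.ndarray:
    """Sliding-window ('S') causal attention: each token attends only to
    the previous `window` tokens (itself included)."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - window)

T, w = 8, 3
full = causal_mask(T)
sw = sliding_window_mask(T, w)
# Full attention does O(T^2) work; sliding window does O(T * w).
# But fewer FLOPs don't always mean faster wall-clock time on a given GPU.
print(full.sum(), sw.sum())  # 36 attended pairs vs 21
```

That last point is exactly what the agent stumbled onto: the theoretical FLOP savings of the windowed mask can be eaten by memory-access patterns on specific hardware.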


The agent was doing what I do. But faster, and without needing coffee.

The Bigger Picture: Agents Doing Research

There is an arXiv paper from late March 2026, "Agentic AI and the Next Intelligence Explosion," that frames this moment nicely. The authors argue that the next jump in AI capability won't come from a single giant model getting smarter. It will come from networks of agents working together, specializing, and iterating.


AutoResearch is a tiny, early version of that idea. One agent, one task, one metric. But the principle scales. What if you had a team of agents - one proposing architecture changes, one tuning hyperparameters, one trying different data augmentation strategies, one reviewing results? Each with their own program.md instructions?


Karpathy hinted at this in his tweet thread. The repo is intentionally simple. The interesting work is figuring out the "research org code" - the program.md that produces the fastest research progress. In a weird way, writing good agent instructions is becoming its own form of research.

If you would like to try it...

Some things I learned the hard way:


You don't need an H100. The repo is designed around a single NVIDIA GPU, but there are already community forks for macOS (MPS), Windows (RTX cards), and even AMD. Check the repo's README for links. If you are on a smaller GPU, Karpathy recommends using the TinyStories dataset, lowering the vocab size to 2048 or smaller, and reducing the sequence length. I would also drop DEPTH from 8 to 4.
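To see why those knobs matter, a back-of-the-envelope parameter count helps. The formula below is a standard rough estimate for a decoder-only transformer, and the width and vocab numbers are illustrative, not the repo's actual defaults:

```python
def approx_params(depth: int, d_model: int, vocab: int) -> int:
    """Rough decoder-only transformer size: ~12*d^2 per layer
    (4*d^2 for attention + 8*d^2 for the MLP) plus token embeddings."""
    per_layer = 12 * d_model ** 2
    return depth * per_layer + vocab * d_model

# Halving depth and shrinking the vocab cuts the model substantially:
big = approx_params(depth=8, d_model=512, vocab=32768)
small = approx_params(depth=4, d_model=512, vocab=2048)
print(big, small, round(big / small, 1))  # roughly a 3x reduction
```

Smaller models also mean smaller activations, which is usually the real constraint on a laptop-class GPU.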


Disable all safety confirmations. The agent needs to run commands without asking you for permission every time. If you're using Cursor or Claude, set it to auto-approve. The worst thing that happens is that a bad experiment gets reverted.


Check the git history. That's where all the value is. Every experiment is a commit (with a descriptive message) on its own branch. Kept experiments land on main. Reverted ones are still in the branch history. It is a complete experiment log.


# see all experiments, including reverted ones
git log --all --oneline --graph

# see what a specific experiment changed
git diff experiment-14^..experiment-14



Start with the default program.md. I was tempted to write elaborate instructions before my first run, but the default works fine. Let the agent do its thing first, then tighten the instructions based on what you see in the logs.

What This Means

The idea that I could define research goals in a markdown file, go to sleep, and wake up to genuine (if incremental) improvements... it genuinely changes the day-to-day reality of the work.


The human doesn't disappear from this loop. You still decide what questions to ask, what metrics matter, and what constraints the agent operates under. But the grind of "change one thing, train, wait, check, repeat" - that part is done. The agent does it cheaper, faster, and more patiently than I ever could.


Karpathy's README has this quote that stuck with me: "One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun... That era is long gone."


He's half-joking. But only half.


If you have a GPU and a few hours, try it. The repo is at github.com/karpathy/autoresearch. The most interesting thing isn't the code; it's what it feels like to watch an agent do your job better than you while you sleep.


