Andrej Karpathy is a name many people in the AI world will recognize. He taught Stanford's famous deep learning course CS231n, led Tesla's AI division, worked at OpenAI, and, in my opinion, he also produces some of the best AI educational videos available today.
His latest project, microGPT, is something I can only describe as a work of art.
microGPT is a complete GPT implementation written in only about 200 lines of Python code. And this is not some cleverly compressed or obfuscated code. On the contrary, the code is clean, well-structured, and thoroughly commented. It even avoids using external deep learning libraries. In other words, it not only contains the neural network itself, but also the minimal framework needed for training and running it.
In fact, the whole project doesn’t even have a full repository — it is simply published as a single GitHub Gist.
https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95
In this article, I will try to explain how this beautiful piece of code works in a way that anyone with basic Python knowledge can understand.
Let’s start with the name GPT.
GPT stands for Generative Pretrained Transformer. This architecture forms the foundation of most modern large language models. ChatGPT itself contains this abbreviation in its name, and most major alternatives are currently built on the same underlying idea.
A GPT model predicts the next word (or token) based on the previous words in a text.
This is essentially what all language models do. They generate responses by predicting one word at a time. First, they predict the next word, then they add that word to the text and predict the next one again, and so on.
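This predict-append-repeat loop can be sketched in a few lines. The `predict_next` function below is a hypothetical stand-in for a real model; it just continues a fixed pattern so that the generation loop itself is visible:

```python
# Toy illustration of autoregressive generation (not Karpathy's code).
# `predict_next` is a made-up stand-in for a trained language model.
def predict_next(tokens):
    # Trivial "model": always continue the pattern a, b, a, b, ...
    return "a" if len(tokens) % 2 == 0 else "b"

def generate(prompt, n_steps):
    tokens = list(prompt)
    for _ in range(n_steps):
        tokens.append(predict_next(tokens))  # predict, append, repeat
    return tokens

print(generate([], 4))  # the sequence grows one token at a time
```

A real model only replaces `predict_next`; the surrounding loop stays essentially this simple.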
At first glance, this may sound simple. But if we think about it more carefully, predicting the next word correctly requires a certain level of understanding of the text and its meaning.
In neural networks, this “thinking” process is implemented as a very large mathematical function.
The input words are converted into numbers and fed into this function, which then calculates the next word. In principle, any rule system can be represented mathematically using such functions. So the task becomes finding the right function that captures the structure of language.
This function is what we call the model.
The problem is that this function can be extremely complex. Modern language models contain billions of parameters, meaning the mathematical formula describing them can have billions of components. Such enormous formulas are impossible for humans to design or manipulate manually.
Fortunately, there is a clever workaround.
Instead of writing the exact formula ourselves, we define a “template” for the formula, a structure with many adjustable parameters. These parameters can then be tuned automatically until the model performs well.
We have several such templates, depending on the type of problem we want to solve:
- Convolutional networks are commonly used for image processing
- MLPs (multi-layer perceptrons) are used for general function approximation
- Transformers are used for language models like ChatGPT
- and so on.
These templates represent huge mathematical structures that can approximate complex rules once their parameters are properly adjusted.
The remaining question is: How do we adjust these parameters? Or, in other words: How do we “program” a neural network?
Manually adjusting the parameters of a neural network would obviously be impossible, especially when we are dealing with models that contain millions or even billions of parameters.
Fortunately, there is a way to do this automatically, using data. This process is what we call training, and the method that makes it possible is called gradient descent.
If we have a large dataset and a large mathematical formula (our model), we can compute the error of the model for each example in the dataset.
In the case of transformers, the process works roughly like this.
Words are represented as points in a high-dimensional vector space, where words with similar meanings appear close to each other.
The model receives a sequence of words as input and tries to predict the next word. Since words are represented as vectors, we can calculate how far the predicted word is from the correct one. This distance gives us a numerical measure of the error.
And now comes the magic.
Using gradient descent, we can compute how the parameters of the formula should change to reduce this error.
If we have enough data, a sufficiently large network, and we train it for long enough, the formula gradually adapts to the patterns in the dataset and produces increasingly accurate predictions.
But how exactly does this “magic” work? How does gradient descent actually find better parameters?
We can visualize the error as a function in a high-dimensional space, where each dimension corresponds to one parameter of the model.
Thinking in many dimensions is practically impossible for humans, so instead we can imagine a simpler analogy: a landscape of hills and valleys.
In this analogy:
- Every point on the landscape represents a specific combination of model parameters.
- The height of the landscape at that point represents the error of the model.
At the beginning of training, the parameters are initialized randomly. This is like being placed somewhere randomly on the hillside.
Our goal is to minimize the error, which means finding the lowest point in the landscape.
The difficulty is that we do not know what the landscape looks like. It’s as if we are trying to walk down a mountain blindfolded or in thick fog. So what can we do?
This is where gradient descent comes in.
In mathematics, there is a concept called the derivative, which describes the slope of a function at a given point.
This is incredibly useful here. The derivative tells us which direction the landscape slopes downward.
So the algorithm simply does the following:
- Measure the slope of the landscape.
- Take a small step in the direction where the error decreases.
- Repeat.
Step by step, the model gradually walks downhill until it reaches a low point in the error landscape.
This is essentially what gradient descent does.
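The measure-step-repeat loop can be made concrete with a one-parameter "landscape" whose minimum we know in advance. This is a minimal sketch, not part of Karpathy's code:

```python
# Minimal gradient descent on f(x) = (x - 3)**2, whose minimum is at x = 3.
# The slope at any point is f'(x) = 2 * (x - 3).
x = 0.0                   # random-ish starting point on the hillside
lr = 0.1                  # step size (the learning rate)
for _ in range(100):
    slope = 2 * (x - 3)   # measure the slope of the landscape
    x -= lr * slope       # take a small step downhill
print(round(x, 4))        # x has walked down to the minimum near 3.0
```

With billions of parameters the landscape has billions of dimensions, but the update rule is exactly this one line per parameter.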
The remaining question is: How do we compute the slope of such a complex function?
I won’t go into the full mathematical details here, because Karpathy already has an excellent video explaining it.
For our purposes, it is enough to know the following.
Every mathematical operation has a corresponding rule that allows us to compute its local gradient. In addition, there is a rule called the chain rule, which allows us to combine these local gradients to compute the gradient of an entire composite function.
The key idea is that we move backwards through the chain of operations, collecting gradients along the way. This process is called backpropagation.
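A tiny worked example makes the chain rule tangible. Take y = u² where u = 2x: each operation has a local gradient, and multiplying them while moving backwards gives the full derivative. The numbers below are purely illustrative:

```python
# Chain rule on a tiny composite function: y = u**2 where u = 2*x.
x = 3.0
u = 2 * x                # forward pass, inner operation
y = u ** 2               # forward pass, outer operation

dy_du = 2 * u            # local gradient of the outer operation
du_dx = 2.0              # local gradient of the inner operation
dy_dx = dy_du * du_dx    # chain rule: multiply local gradients backwards

# Sanity check against a numerical derivative
eps = 1e-6
num = ((2 * (x + eps)) ** 2 - (2 * x) ** 2) / eps
print(dy_dx, round(num, 2))
```

Backpropagation is nothing more than this multiplication, applied systematically through every operation in the network.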
All major deep learning frameworks implement this mechanism.
For example:
- TensorFlow uses a system called GradientTape, which records operations like a tape recorder so gradients can be computed afterward.
- PyTorch attaches gradient information directly to tensors. Each tensor remembers how it was created, which allows gradients to be computed automatically. This system is called autograd.
Karpathy’s implementation follows the same basic idea. Let’s see how.
```python
# Let there be Autograd to recursively apply the chain rule through a computation graph
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads') # Python optimization for memory usage
    def __init__(self, data, children=(), local_grads=()):
        self.data = data # scalar value of this node calculated during forward pass
        self.grad = 0 # derivative of the loss w.r.t. this node, calculated in backward pass
        self._children = children # children of this node in the computation graph
        self._local_grads = local_grads # local derivative of this node w.r.t. its children
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))
    def __pow__(self, other): return Value(self.data**other, (self,), (other * self.data**(other-1),))
    def log(self): return Value(math.log(self.data), (self,), (1/self.data,))
    def exp(self): return Value(math.exp(self.data), (self,), (math.exp(self.data),))
    def relu(self): return Value(max(0, self.data), (self,), (float(self.data > 0),))
    def __neg__(self): return self * -1
    def __radd__(self, other): return self + other
    def __sub__(self, other): return self + (-other)
    def __rsub__(self, other): return other + (-self)
    def __rmul__(self, other): return self * other
    def __truediv__(self, other): return self * other**-1
    def __rtruediv__(self, other): return other * self**-1
    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad
```
In Karpathy’s code, the entire gradient computation logic is implemented in the Value class, which is only about 40 lines long.
This class is essentially a wrapper around numerical values. Besides storing the data itself, it also stores:
- which values it was computed from (its children),
- and the local gradients needed for backpropagation.
If we look at the code, we can see that the standard Python operators (`__add__`, `__mul__`, and so on) are redefined.
This means that whenever we perform a mathematical operation on Value objects, the program not only computes the result, but also records which values produced it and how the gradient should be propagated.
Finally, the backward() method performs the backpropagation process. It walks backward through the chain of operations and computes the total gradient with respect to the error.
And that’s essentially the whole idea.
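To see the class in action, here is a short usage sketch. So that the snippet is self-contained, it includes a trimmed-down copy of the Value class keeping only addition and multiplication; the full class in the gist works the same way:

```python
# Trimmed-down Value (only + and *) so this snippet runs on its own;
# see the full class above for the remaining operations.
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data, self.grad = data, 0
        self._children, self._local_grads = children, local_grads
    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1, 1))
    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))
    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

# Build y = a * b + a, then backpropagate through the recorded graph
a, b = Value(2.0), Value(3.0)
y = a * b + a
y.backward()
print(y.data, a.grad, b.grad)  # dy/da = b + 1 = 4, dy/db = a = 2
```

Every intermediate result remembers its children and local gradients, so a single call to `backward()` fills in the gradient of every value in the graph.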
The other important component of training is gradient descent itself, which uses the computed gradients to update the parameters of the model.
In other words, this is the part of the code that actually walks down the hill using the slope information.
```python
# Adam optimizer update: update the model parameters based on the corresponding gradients
lr_t = learning_rate * (1 - step / num_steps) # linear learning rate decay
for i, p in enumerate(params):
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
    m_hat = m[i] / (1 - beta1 ** (step + 1))
    v_hat = v[i] / (1 - beta2 ** (step + 1))
    p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
    p.grad = 0
```
In Karpathy’s implementation, this is done with the Adam optimizer, which is one of the most widely used optimization algorithms in deep learning.
Adam is slightly more sophisticated than basic gradient descent. Instead of taking fixed-size steps, it adapts the step size dynamically based on the history of previous gradients. This makes the training process both faster and more stable.
The gradient computation described earlier and the optimization step together form what we could call the deep learning framework part of the code.
Every neural network — whether it powers a language model, an image generator, a robot controller, or a self-driving car — is trained using essentially the same principle.
These few lines of code capture the core idea behind modern AI.
Large frameworks like:
- TensorFlow
- PyTorch
- JAX
provide highly optimized implementations that can run on GPUs, TPUs, and distributed clusters, and include many additional tricks and optimizations.
But if we strip everything down to the essentials, the underlying principle is exactly the same as what we see in this tiny Python implementation.
Now that we have seen the deep learning framework part of the code, we can move on to the actual neural network, in other words, the GPT model itself.
As mentioned earlier, the goal of GPT is to predict the next token based on the tokens that came before it.
This definition is slightly simplified, because the input does not actually consist of words, but of tokens.
Tokens are very similar to words, but they are not the same. Instead of relying on a predefined dictionary, the system learns a vocabulary statistically from the training data. Tokens are, therefore, character sequences that appear frequently in the dataset.
So training does not begin with a fixed list of words where each word already has a number assigned to it. Instead, this “dictionary” emerges from the data itself during preprocessing.
This approach is especially useful for agglutinative languages, such as Hungarian, where a single word can have many different forms due to suffixes and grammatical endings.
With enough training data, a language model can learn any language, whether natural or artificial.
In fact, tokens do not even have to represent text. They can represent:
- parts of images
- fragments of audio
- sensor readings
- or practically any type of data
Because of this, transformer models are not limited to language processing. They can also be used for image generation, speech processing, robotics, and many other tasks.
Karpathy’s model is intentionally very small, so in this case, the tokens are not words but characters.
The model’s goal is therefore not to generate full sentences or answers, but simply to produce realistic-looking names.
The training dataset consists of a large list of names, and after training, we expect the model to generate new names that statistically resemble the examples.
This is obviously far from a full-scale language model like ChatGPT. But in reality, the difference is mostly one of scale.
If we enlarged this model millions of times, used tokens instead of characters, and trained it on enormous datasets collected from the internet, we would end up with something very similar to a modern large language model.
These large models are typically trained in two stages:
- Pretraining – training on massive internet datasets to learn general language patterns.
- Fine-tuning – further training using curated human-generated conversations to improve responses.
The process requires enormous computational resources and huge amounts of high-quality data, often costing millions of dollars in compute.
Since most of us do not have access to such resources, we will have to settle for generating names.
```python
# Let there be a Dataset `docs`: list[str] of documents (e.g. a list of names)
if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/988aa59/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
docs = [line.strip() for line in open('input.txt') if line.strip()]
random.shuffle(docs)
print(f"num docs: {len(docs)}")

# Let there be a Tokenizer to translate strings to sequences of integers ("tokens") and back
uchars = sorted(set(''.join(docs))) # unique characters in the dataset become token ids 0..n-1
BOS = len(uchars) # token id for a special Beginning of Sequence (BOS) token
vocab_size = len(uchars) + 1 # total number of unique tokens, +1 is for BOS
print(f"vocab size: {vocab_size}")
```
At the beginning of the code, we find the section responsible for loading the dataset of names and building the vocabulary.
In this case, the vocabulary simply consists of the list of characters that appear in the dataset.
Each character is assigned a numerical identifier, allowing the text to be converted into numbers that the neural network can process.
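The encode/decode round trip can be sketched in the same spirit as the code above. The `stoi`/`itos` helper names and the three example names below are mine, not the gist's:

```python
# Character-level tokenization sketch; example data is made up.
docs = ["emma", "olivia", "ava"]
uchars = sorted(set(''.join(docs)))            # unique characters -> token ids
stoi = {ch: i for i, ch in enumerate(uchars)}  # string-to-integer lookup
itos = {i: ch for ch, i in stoi.items()}       # integer-to-string lookup

def encode(s): return [stoi[ch] for ch in s]
def decode(ids): return ''.join(itos[i] for i in ids)

print(uchars)
print(encode("ava"), decode(encode("ava")))
```

Note that the vocabulary is derived entirely from the data: a character that never appears in the dataset simply has no token id.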
Next comes one of the most important concepts in deep learning: embeddings.
The idea is simple but powerful.
Instead of working with raw token IDs, we map each token into a point in a high-dimensional vector space. These vectors are what the neural network actually processes.
In fact, any neural network can be viewed as a function that maps vectors from one high-dimensional space into another.
For example:
- If we train a neural network to classify images as dogs or cats, the network maps the image representation into a two-dimensional space, where one dimension corresponds to “dogness” and the other to “catness”.
- If we imagine an image generator like Midjourney, it maps random noise into a high-dimensional space where each point represents an image conditioned on the prompt.
Regardless of the task, the network is always performing a vector-to-vector transformation using a large mathematical function.
The same is true for GPT.
The dimensionality of the vector space is defined in the code by the constant n_embd, which in this implementation is set to 16.
This means that each token (in this case, each character) is represented as a 16-dimensional vector.
Mathematically, this mapping is simply a matrix multiplication.
In the code, the matrix responsible for this transformation is called wte, which stands for word/token embedding.
However, knowing which characters appear in a word is not enough. Their position also matters.
For example, the meaning of a sequence changes if we rearrange the characters.
To incorporate positional information, the model uses positional embeddings, implemented using another matrix called wpe.
The code maps both the token and its position into 16-dimensional vectors, and then simply adds the two vectors together.
The result is a single vector that encodes both:
- the identity of the token
- its position within the sequence
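This lookup-and-add step can be sketched as follows. The dimension n_embd = 16 mirrors the gist; the vocabulary and context sizes below are illustrative, and the matrices are random because, as discussed next, they start out random anyway:

```python
import random
random.seed(0)

# Sketch of token + positional embeddings: wte and wpe are just lookup
# tables of vectors, one row per token id and one row per position.
n_embd, vocab_size, block_size = 16, 27, 8  # sizes chosen for illustration
wte = [[random.gauss(0, 1) for _ in range(n_embd)] for _ in range(vocab_size)]
wpe = [[random.gauss(0, 1) for _ in range(n_embd)] for _ in range(block_size)]

token_id, pos = 5, 2
x = [t + p for t, p in zip(wte[token_id], wpe[pos])]  # elementwise sum
print(len(x))  # one 16-dimensional vector per (token, position) pair
```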
Earlier, we mentioned that these vector representations must be meaningful, because later the model will compute errors based on distances between vectors.
Ideally:
- Almost correct predictions should be close to the correct vector
- Very wrong predictions should be far away from it
This raises an interesting question: How do we design a good embedding space?
The answer is surprisingly simple: We don’t.
Instead, we initialize the embedding matrices (wte and wpe) with random numbers and allow gradient descent to learn the correct representation during training.
If we have enough data, the optimization process will gradually adjust the matrices until they represent useful relationships.
This can lead to surprisingly powerful emergent properties.
For example, in the famous word2vec embedding model, vector arithmetic can capture semantic relationships. A classic example is:
king − man + woman ≈ queen
Here we can already see that the embedding space begins to represent a kind of simplified model of the world, where relationships between concepts appear as geometric relationships between vectors.
Now that we have seen how vectors are created, we can finally look at the neural network itself, the component that transforms these vectors into new vectors representing the next token.
In other words, the network maps vectors from the token embedding space back into the same space, but shifted by one token. For each token, it predicts which token should follow next. By repeatedly applying this process, the model can generate an entire sequence of text — or in this case, a name.
The architecture used for this is called the Transformer.
The Transformer was introduced in 2017 by researchers at Google in the famous paper: “Attention Is All You Need.”
The original architecture was designed for machine translation. It consisted of two main parts:
- an encoder
- a decoder
The encoder processed the input sentence, while the decoder generated the translated output sentence.
For generative models like GPT, however, we only need half of the original architecture: the decoder stack.
This is why GPT models are often described as decoder-only transformers.
The decoder receives the input tokens and repeatedly processes them through a stack of identical layers. Each layer contains two main components:
- Self-attention
- A feed-forward neural network (MLP)
These layers are repeated many times in large models. In diagrams, you often see this represented as ×N, meaning the block is stacked multiple times.
One of the key innovations of the Transformer architecture is that it processes the entire sequence at once.
Older language models, especially recurrent neural networks (RNNs), processed text one word at a time, sequentially passing information along the sequence.
Transformers work differently. They can look at all tokens simultaneously, allowing the model to learn relationships between any parts of the text.
This mechanism is called attention, which is why the original paper was titled Attention Is All You Need.
The attention mechanism calculates how relevant each token is to every other token in the sequence.
For each token vector, the model computes a set of weights describing how much attention it should pay to the other tokens. It then combines the information from those tokens accordingly.
The resulting vector therefore represents not only the token itself, but also its meaning in the context of the entire sequence.
This may sound complicated, but the intuition is straightforward.
Suppose we ask a language model: “What is the capital of France?”
If we only looked at the word “capital”, we could not determine the answer. But attention allows the model to connect the word “capital” with “France”.
The resulting representation captures the meaning of the phrase “capital of France”, making it possible for the model to produce the correct answer: Paris.
One way to think about transformers is to imagine them as a kind of soft database.
Instead of storing explicit facts, the model stores knowledge in a vector space representation. Because neural networks approximate functions rather than memorize exact rules, they can often answer questions they have never seen before.
Returning to our earlier embedding example: If the training data contains information about kings and women, the model may still be able to answer questions about queens, because the relationships between these concepts are captured in the vector space.
If we follow this database analogy, we might say:
- Attention acts like an index, helping the model locate relevant information.
- The MLP layers contain the knowledge itself.
This mental model is useful for intuition, but it is not literally correct.
In a real transformer like the one used in ChatGPT, these attention + MLP blocks are repeated many times.
Knowledge is not stored in a single location but is distributed across layers.
Additionally, each layer includes a residual connection, which mixes the original input vectors with the newly computed vectors. This allows information to flow through the network more effectively and stabilizes training.
As the vectors pass through the layers, new abstractions and meanings can emerge. By the time the final layer produces its output, the model has combined information from many different levels of representation.
The full process is far too complex to follow step by step with human intuition.
Yet despite this complexity, the system works remarkably well in practice.
Now that we have a rough intuition about the transformer architecture, let’s look at one of its most important components in more detail: attention.
```python
# 1) Multi-head Attention block
x_residual = x
x = rmsnorm(x)
q = linear(x, state_dict[f'layer{li}.attn_wq'])
k = linear(x, state_dict[f'layer{li}.attn_wk'])
v = linear(x, state_dict[f'layer{li}.attn_wv'])
keys[li].append(k)
values[li].append(v)
x_attn = []
for h in range(n_head):
    hs = h * head_dim
    q_h = q[hs:hs+head_dim]
    k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
    v_h = [vi[hs:hs+head_dim] for vi in values[li]]
    attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5 for t in range(len(k_h))]
    attn_weights = softmax(attn_logits)
    head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h))) for j in range(head_dim)]
    x_attn.extend(head_out)
x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
x = [a + b for a, b in zip(x, x_residual)]
```
In Karpathy’s implementation, attention is calculated using three matrices called:
- Q (Query)
- K (Key)
- V (Value)
These matrices perform vector projections. In other words, each token vector is mapped into three different vector spaces.
For every token we compute:
- a query vector
- a key vector
- a value vector
Once we have these vectors, we compare the query vector of one token with the key vectors of all tokens in the sequence.
Mathematically, this is done using a dot product. The dot product gives a score that represents how strongly two vectors are related.
This produces a set of numbers that represent how relevant each token is to the current token.
However, these raw scores are not yet probabilities. To convert them into a probability distribution, we apply the softmax function, which transforms the scores into values between 0 and 1 that sum to 1.
These values represent how much attention each token should receive.
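A standard softmax implementation looks like this; subtracting the maximum before exponentiating is the usual trick for numerical stability (this is a generic sketch, not copied from the gist):

```python
import math

# Softmax: turn raw scores into a probability distribution.
def softmax(logits):
    m = max(logits)                            # for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([2.0, 1.0, 0.1])
print([round(w, 3) for w in weights], sum(weights))
```

The outputs preserve the ordering of the inputs: the largest score gets the largest attention weight, and the weights always sum to 1.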
Finally, the model combines the value vectors using these attention weights, producing a new vector that contains information gathered from the entire context.
The formula for scaled dot-product attention looks like this:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Here:
- QKᵀ computes the similarity between queries and keys
- √dₖ is a scaling factor that stabilizes training
- softmax converts the scores into attention probabilities
- V provides the information that is combined according to those probabilities
The result is a new representation for each token that reflects its meaning in the context of the entire sequence.
At this point, it is worth mentioning an important concept: context length.
As we discussed earlier, transformers process the entire sequence at once. This is necessary because attention requires comparing every token with every other token.
That means the computational cost grows quadratically with the number of tokens.
If we double the context length, the amount of computation increases roughly four times.
This is one of the main limitations of transformer models.
Unlike some other architectures, transformers do not have a separate memory system. They can only “see” the tokens that fit within their context window.
Everything outside that window is effectively invisible to the model.
This is why context length is such an important property of modern language models.
In many modern AI systems, this limitation is addressed by adding an external memory mechanism.
A common approach is to use a vector database.
Instead of storing knowledge directly in the model, information can be stored externally as vector embeddings.
When the model receives a question, the system can:
- Convert the question into a vector.
- Search the vector database for related information.
- Insert the retrieved information into the model’s context.
This means the model sees both:
- the question
- and the relevant knowledge retrieved from the database
Because both appear in the context window, the model can generate an answer based on that information.
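The retrieval step above can be sketched with cosine similarity. Everything here, including the three-dimensional "embeddings", is made up for illustration; a real system would use a proper embedding model and a vector database:

```python
# Toy sketch of the retrieval step: fetch the stored document whose
# embedding is most similar to the question's embedding.
def dot(a, b): return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / ((dot(a, a) ** 0.5) * (dot(b, b) ** 0.5))

# Pretend embeddings; a real system computes these with an embedding model.
database = {
    "Paris is the capital of France.": [0.9, 0.1, 0.0],
    "The Sun is a star.":              [0.0, 0.2, 0.9],
}
question_vec = [0.8, 0.2, 0.1]  # embedding of "What is the capital of France?"

best = max(database, key=lambda doc: cosine(question_vec, database[doc]))
print(best)  # this text would be inserted into the model's context
```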
This technique is known as Retrieval-Augmented Generation (RAG) and is widely used in modern AI systems and agent frameworks.
In this setup, the language model’s main role is not to store knowledge, but to generate coherent answers based on the information available in its context.
But as we can see, this requires space in the context window, which is why context length remains so important.
Returning to Karpathy’s implementation, the model uses multi-head attention, which is an improved form of the basic attention mechanism.
Instead of computing attention using a single set of Q, K, and V matrices, the model uses multiple attention heads.
In this implementation, there are four heads.
Each head learns to focus on different kinds of relationships between tokens. For example, one head might focus on grammatical relationships, while another might capture longer-range dependencies.
Using multiple heads improves the quality of the representation.
To keep the computational cost roughly the same, the dimensionality of each head is reduced.
Earlier, we mapped vectors from a 16-dimensional space to another 16-dimensional space.
With four attention heads, each head instead works in a 4-dimensional space. The results from the heads are then combined back into a single vector.
Although each head works with lower-dimensional vectors, the combined result is typically more expressive and more accurate than using a single attention head.
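The split-and-recombine bookkeeping mirrors the slicing `q[hs:hs+head_dim]` in the attention code above; here it is isolated with a stand-in vector:

```python
# Splitting a 16-dimensional vector across 4 attention heads.
n_embd, n_head = 16, 4
head_dim = n_embd // n_head          # 4 dimensions per head

q = list(range(n_embd))              # stand-in query vector
heads = [q[h * head_dim:(h + 1) * head_dim] for h in range(n_head)]
print(heads[0], heads[3])

# After each head computes its output, the pieces are concatenated back:
combined = [value for head in heads for value in head]
assert combined == q
```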
Now that we have covered the attention mechanism, let’s move on to the second major component of the transformer block: the MLP, or feed-forward neural network.
In the code, the MLP block looks something like this:
```python
# 2) MLP block
x_residual = x
x = rmsnorm(x)
x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
x = [xi.relu() for xi in x]
x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
x = [a + b for a, b in zip(x, x_residual)]
```
An MLP is a classic neural network architecture. If we look at the structure of the weight matrices, we can interpret the rows of the matrix as neurons.
A neuron is a simple computational unit that:
- multiplies each input by a weight,
- sums the results,
- applies a nonlinear activation function to produce the output.
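The three bullet points above translate directly into code. The weights and inputs below are made-up numbers for illustration:

```python
# A single neuron: weighted sum of inputs, then a nonlinearity.
def relu(x): return max(0.0, x)

def neuron(inputs, weights):
    s = sum(i * w for i, w in zip(inputs, weights))  # multiply and sum
    return relu(s)                                   # nonlinear activation

print(neuron([1.0, 2.0], [0.5, 0.25]))  # 1*0.5 + 2*0.25 = 1.0
```

A layer is just many such neurons sharing the same inputs, which is why a weight matrix can be read row by row as a stack of neurons.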
This model was originally inspired by biological neurons in the human brain. In that sense, neural networks were historically motivated by attempts to mimic how the brain might process information.
However, modern AI systems have moved quite far from this original analogy.
In the MLP component, we can still loosely recognize something that resembles neurons. But when we look at mechanisms like attention, it becomes much harder to maintain the brain-inspired interpretation.
Because of this, it is often better to think of modern AI systems simply as trainable mathematical functions, rather than literal models of the brain.
The MLP block in the code consists of three main steps:
- a linear transformation (matrix multiplication),
- a nonlinear activation function (ReLU),
- another linear transformation.
This may look simple, but structures like this have an extremely powerful mathematical property.
They are known as universal approximators.
This means that, under certain conditions, a sufficiently large MLP can approximate any mathematical function to an arbitrary degree of accuracy.
In principle, a single enormous MLP could learn almost anything.
However, that would not be very efficient in practice. That is why transformer architectures combine multiple mechanisms, including attention and stacked layers, to distribute the computation more effectively.
The output of the network is not a single token, but a probability distribution over all possible tokens.
In other words, for each token in the vocabulary, the model outputs the probability that it should appear next in the sequence.
During generation, the algorithm then samples from this probability distribution.
This means that tokens with higher probability are more likely to be chosen, but there is still an element of randomness.
This randomness is controlled by a parameter called temperature.
The temperature parameter determines how deterministic or creative the model’s output will be.
- Low temperature: the model strongly favors the most probable tokens, producing more predictable and accurate responses.
- High temperature: the probability distribution becomes flatter, allowing less likely tokens to be selected more often, resulting in more diverse or creative outputs.
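A common way to apply temperature (schemes vary between models, so treat this as a sketch) is to divide the raw scores by the temperature before the softmax:

```python
import math, random
random.seed(42)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature):
    # Scale logits by the temperature, then sample from the distribution
    probs = softmax([l / temperature for l in logits])
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.1]
cold = softmax([l / 0.1 for l in logits])   # low temperature: sharply peaked
hot  = softmax([l / 10.0 for l in logits])  # high temperature: nearly flat
print(round(cold[0], 3), round(hot[0], 3))
print(sample(logits, 1.0))                  # index of one sampled token
```

At a low temperature the top token absorbs almost all the probability mass; at a high temperature the distribution approaches uniform.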
For example:
- If we want the model to analyze a document and answer factual questions, a low temperature is usually preferable.
- If we want the model to generate creative text or explore new ideas, a higher temperature can produce more interesting results.
This is roughly what I wanted to explain about this beautiful piece of code, and about GPT models in general.
In many places, the explanation necessarily remained somewhat superficial. My goal was to strike a balance between two things: including as much useful insight as possible, while still keeping the discussion within the scope of a single article.
For readers who found parts of the explanation a bit unclear, or who want to explore the details more deeply, I highly recommend Andrej Karpathy’s personal website, his YouTube channel, and his blog. There, you will find excellent material explaining everything needed to fully understand the concepts discussed here.
I hope this article was useful to many readers. If nothing else, perhaps it serves as an invitation to explore the fascinating world of AI.