microgpt — guided walkthrough
A 200-line GPT, taken apart and rebuilt in front of you.
Before we can begin evaluating and auditing AI systems, we have to understand them from first principles. On Feb 12, 2026, Andrej Karpathy (co-founder at OpenAI; helped build Tesla Autopilot) released a 200-line pure-Python program implementing the fundamental ideas behind GPT. I've taken his post and turned it into a lab with exercises and visuals to help us understand the concepts deeply rather than skim them. Karpathy's post is already well written — the goal is to augment it. The Python here is also rewritten in a slightly less compressed style: ~2XX lines instead of 200, but a bit easier to read. As always, feel free to work with the people at your table. You've got this.
Where to find it
- GitHub gist with the full source code: microgpt.py
- Also available on this web page: karpathy.ai/microgpt.html
- Also available as a Google Colab notebook — you can run it without installing anything
The following is a guide that steps an interested reader through the code.
Dataset
The fuel of large language models is a stream of text data, optionally separated into a set of documents. In production-grade applications, each document would be an internet web page — but for microgpt, we use a simpler example of 32,000 names, one per line:
import os
import random

# Let there be an input dataset `docs`: list[str] of documents (e.g. a dataset of names)
if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]
random.shuffle(docs)
print(f"num docs: {len(docs)}")
The dataset looks like this. Each name is a document:
emma
olivia
ava
isabella
sophia
charlotte
mia
amelia
harper
... (~32,000 names follow)
The goal of the model is to learn the patterns in the data and then generate similar new documents that share the statistical patterns within. As a preview, by the end of the script our model will generate ("hallucinate"!) new, plausible-sounding names. Skipping ahead, we'll get:
sample 1: kamon sample 8: anna sample 15: earan
sample 2: ann sample 9: areli sample 16: lenne
sample 3: karai sample 10: kaina sample 17: kana
sample 4: jaire sample 11: konna sample 18: lara
sample 5: vialan sample 12: keylen sample 19: alela
sample 6: karia sample 13: liole sample 20: anton
sample 7: yeran sample 14: alerin
It doesn't look like much, but from the perspective of a model like ChatGPT, your conversation with it is just a funny-looking "document". When you initialize the document with your prompt, the model's response from its perspective is just a statistical document completion.
Tokenizer
Under the hood, neural networks work with numbers, not characters, so we need a way to convert text into a sequence of integer token ids and back. Production tokenizers like tiktoken (used by GPT-4) operate on chunks of characters for efficiency, but the simplest possible tokenizer just assigns one integer to each unique character in the dataset:
# Let there be a Tokenizer to translate strings to discrete symbols and back
uchars = sorted(set(''.join(docs))) # unique characters become token ids 0..n-1
BOS = len(uchars) # token id for Beginning of Sequence
vocab_size = len(uchars) + 1 # total tokens, +1 for BOS
print(f"vocab size: {vocab_size}")
We collect all unique characters across the dataset (which are just the lowercase letters a–z), sort them, and each letter gets an id by its index. The integer values themselves carry no meaning — each token is just a discrete symbol. Instead of 0, 1, 2 they could be different emoji. We also create one special token, BOS (Beginning of Sequence), which acts as a delimiter: it tells the model "a new document starts/ends here". Later during training, each document gets wrapped with BOS on both sides: [BOS, e, m, m, a, BOS]. The model learns that BOS initiates a new name, and that another BOS ends it. So we have a vocabulary of 27 (26 lowercase letters + BOS).
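To make the BOS wrapping concrete, here is a minimal sketch of encoding and decoding. The two-name dataset and the helper names `encode` and `decode` are assumptions made for illustration; they are not names taken from the source.

```python
# Hypothetical two-name dataset, just to keep the vocabulary tiny
docs = ["emma", "ava"]

uchars = sorted(set(''.join(docs)))   # ['a', 'e', 'm', 'v']
BOS = len(uchars)                     # 4: one past the last character id
vocab_size = len(uchars) + 1          # 5

def encode(doc):
    """Wrap a document in BOS and map each character to its id."""
    return [BOS] + [uchars.index(ch) for ch in doc] + [BOS]

def decode(tokens):
    """Map ids back to characters, dropping the BOS delimiters."""
    return ''.join(uchars[t] for t in tokens if t != BOS)

tokens = encode("emma")
print(tokens)           # [4, 1, 2, 2, 0, 4]
print(decode(tokens))   # emma
```

With the real names dataset the same two helpers would work unchanged; only `uchars` grows to the full alphabet.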
The character "a" is the first alphabet letter, so it has id 0. What's the id of "z"? Of "BOS"? If your full name has 9 letters, how many tokens does the model see when you train on it?
Show answer
"z" has id 25 and BOS has id 26 (BOS = len(uchars) = 26 alphabet letters). A 9-letter name produces 9 + 2 = 11 tokens: BOS, the 9 letters, then BOS again.
From a neuron to a network
Before we open up gpt() and stare at multi-head attention, let's build up the underlying object — the neuron — and stack neurons into a network. The end goal of this section: by the time we hit the architecture diagram, every box in it will feel like an obvious composition of things we already understand.
Here's roughly where we're going. Don't worry about the details — file the picture mentally, then we'll build to it. (You can already play with this — drag the input sliders and watch the activations propagate.)
The simplest "neuron"
One input x, one bias b, and an output a = x + b. That's it — just an adder. No learning yet, no bend in the output. It's a useful starting object because every more complex neuron is just this one with more parts bolted on.
def neuron(x, b):
    return x + b
If x = 3 and b = -1, what does the neuron output? What if I want this neuron to always output 0 no matter the input? What b would I need (and would it work for every x)?
Show answer
3 + (−1) = 2. To force the output to 0 we'd need b = −x, which depends on x, so a single fixed bias can't do it. That's why we'll add a weight next: it lets the neuron scale its input before the bias.
Add a weight
Multiply the input by a learned weight w before adding the bias: a = x*w + b. Now the neuron has two knobs. With both w and b the neuron can shift and scale, so it can learn any affine response (affine = scale the input, then shift it). This is the canonical "linear neuron".
def neuron(x, w, b):
    return x * w + b
Add a nonlinearity (ReLU)
Stacking linear neurons on top of linear neurons just gives you another linear function. To learn interesting things, we need a nonlinearity. ReLU is the simplest: $f(z) = \max(0, z)$. It passes positive values through and zeros out negative ones.
def relu(z):
    return max(0, z)
def neuron(x, w, b):
    z = x * w + b
    a = relu(z)
    return a
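The claim that stacked linear neurons collapse into a single linear function can be checked numerically. The sketch below uses made-up weights (`w1`, `b1`, `w2`, `b2` are assumptions, not values from the source) and compares a two-neuron linear stack against the one equivalent neuron, then shows that putting a ReLU in between breaks the equivalence.

```python
def linear_neuron(x, w, b):
    return x * w + b

def relu(z):
    return max(0, z)

# Two stacked neurons with hypothetical weights
w1, b1 = 1.0, -2.0
w2, b2 = -1.5, 0.5

def stacked_linear(x):
    return linear_neuron(linear_neuron(x, w1, b1), w2, b2)

def stacked_relu(x):
    return linear_neuron(relu(linear_neuron(x, w1, b1)), w2, b2)

# Algebra: w2*(w1*x + b1) + b2 = (w2*w1)*x + (w2*b1 + b2)
w_eq, b_eq = w2 * w1, w2 * b1 + b2

for x in [-1.0, 0.0, 2.0, 5.0]:
    # The linear stack always matches the single equivalent neuron
    assert abs(stacked_linear(x) - linear_neuron(x, w_eq, b_eq)) < 1e-12
    print(x, stacked_linear(x), stacked_relu(x))
```

For inputs where the first neuron's pre-activation is negative, the ReLU version departs from the straight line, which is exactly the "bend" a purely linear stack cannot produce.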
With w = 2 and b = -3, plug in x = 1 and x = 4. What does the neuron output in each case? At what value of x does the ReLU "turn on" — i.e., where does the output stop being zero?
Show answer
x = 1 → z = 1·2 − 3 = −1 → a = max(0, −1) = 0. x = 4 → z = 4·2 − 3 = 5 → a = 5. The ReLU turns on at z = 0, i.e. when x = 3/2 = 1.5. The neuron acts as a threshold: its output is zero for x ≤ 1.5 and rises linearly above that.
Many inputs in, one output out
Real neurons take a vector of inputs. Each input x_i has its own weight w_i; the neuron sums them up, adds bias, and applies ReLU:
$$ a = \mathrm{ReLU}\!\left(\sum_{i=1}^{n} x_i w_i + b\right) $$
def neuron(x, w, b): # x and w are lists of length n
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return max(0, z)
The inner sum is a dot product — the fundamental operation of neural networks. In microgpt, linear(x, w) does this dot product once per row of w. (Karpathy's version drops the bias b — modern Transformers often do.)
What does zip() do?
Python's built-in zip() walks through two (or more) lists in lockstep and hands back tuples of matching elements, one tuple per "column", stopping when the shortest list runs out. So for xi, wi in zip(x, w) gives us the i-th input and the i-th weight together on each loop iteration, ready to multiply.
The dot product is then just "sum the products of each pair". For example, with x = [0.5, −0.3, 1.2] and w = [0.4, 0.7, −0.1]: $0.5{\cdot}0.4 + (-0.3){\cdot}0.7 + 1.2{\cdot}(-0.1) = 0.20 - 0.21 - 0.12 = -0.13$.
The same pattern shows up everywhere in microgpt — adding token + position embeddings (zip(tok_emb, pos_emb)), residual sums (zip(x, x_residual)), every matrix-vector multiply inside linear(). Anywhere you see two same-length lists walked together, zip is the glue.
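A short sketch of zip() in action, using the three-element vectors from the dot-product arithmetic above. It also shows the truncation behaviour, which is worth knowing because a length mismatch fails silently.

```python
x = [0.5, -0.3, 1.2]
w = [0.4, 0.7, -0.1]

# zip pairs up matching elements
pairs = list(zip(x, w))
print(pairs)            # [(0.5, 0.4), (-0.3, 0.7), (1.2, -0.1)]

# Dot product: sum of the pairwise products
dot = sum(xi * wi for xi, wi in zip(x, w))
print(round(dot, 2))    # -0.13

# zip stops at the shortest input, so extra elements are dropped without an error
print(list(zip([1, 2, 3], [10, 20])))   # [(1, 10), (2, 20)]
```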
Forward pass
In a neural network, the forward pass is the trip from inputs to a prediction. You hand the network some numbers, they flow through every layer — getting multiplied by weights, summed with biases, occasionally bent by a nonlinearity — and out the other end falls a single answer. The forward pass doesn't change the network at all; it just runs it. Every weight stays exactly where it was; only the activations move.
It's worth pausing on this before we get to backprop, because backprop is just the forward pass run in reverse. If we can't picture the forward pass clearly, the backwards version will feel like magic.
Below is a deliberately tiny network so you can wiggle every knob and watch the output respond. Three inputs x₁, x₂, x₃ feed into two hidden ReLU neurons that join at a single ReLU output a. The three weights and three biases (w₁, w₂, w₃, b₁, b₂, b₃) are yours to play with. As you change them, the prediction surface on the right re-draws — it plots a as a height over the (x₁, x₂) plane, with x₃ swept by its slider. The forward pass is that mapping from input space to output.
Use the x₃ slider to lift / fold the surface. Because every neuron has a ReLU, the surface is piecewise linear: each ReLU contributes a sharp fold. Click and drag the surface to rotate.
h₁ = ReLU(w₁·x₁ + w₂·x₂ + b₁)
h₂ = ReLU(w₃·x₃ + b₂)
a = ReLU(h₁ + h₂ + b₃)
The same thing, in code
Here's the network we've been playing with, written out as a small class hierarchy: Neuron → Layer → MLP. This is essentially how Karpathy's micrograd packages neural networks. The Neuron.__call__ method is doing exactly what the circles in the diagram do — weighted sum of inputs, plus bias, through a ReLU.
import random
class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)
    def __call__(self, x):
        # forward pass: a = ReLU(w · x + b)
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return max(0, z)
class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]
    def __call__(self, x):
        return [n(x) for n in self.neurons]
class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i+1]) for i in range(len(nouts))]
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
# 3 inputs → 2 hidden neurons → 1 output (one forward pass)
x = [1.0, 0.5, -0.3]
mlp = MLP(3, [2, 1])
print(mlp(x)) # e.g. [0.42]
The MLP(3, [2, 1]) above is slightly more general than the network in the diagram. In a standard MLP every input feeds every hidden neuron, so the first layer alone would have 2 × (3 weights + 1 bias) = 8 parameters. The interactive diagram uses a deliberately restricted variant — h₁ sees only x₁, x₂, and h₂ sees only x₃ — so we end up with just 3 weights and 3 biases. That's small enough that the prediction surface stays readable as you wiggle the sliders. The Neuron / Layer / MLP scaffolding is identical either way.
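The parameter arithmetic above generalizes to any fully connected MLP: each neuron owns one weight per input plus one bias. A small sketch follows; `count_params` is a hypothetical helper written for this walkthrough, not part of micrograd or microgpt.

```python
def count_params(nin, nouts):
    """Parameters of a fully connected MLP: (inputs + 1 bias) per neuron, per layer."""
    sizes = [nin] + nouts
    per_layer = [(sizes[i] + 1) * sizes[i + 1] for i in range(len(nouts))]
    return per_layer, sum(per_layer)

per_layer, total = count_params(3, [2, 1])
print(per_layer)   # [8, 3]
print(total)       # 11
```

So the fully connected MLP(3, [2, 1]) has 11 parameters, compared with the 6 knobs of the restricted network in the diagram.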
Here's a small batch of inputs. Using the MLP class above, write code that produces predictions for each one:
xs = [
[ 2.0, 3.0, -1.0],
[ 3.0, -1.0, 0.5],
[ 0.5, 1.0, 1.0],
[ 1.0, 1.0, -1.0],
]
ys_target = [1.0, -1.0, -1.0, 1.0] # what we WISH the network said
ypred = ? # ← your job
Show answer
ypred = [mlp(x) for x in xs]. With random weights you'll get whatever the freshly-initialized model says, almost certainly nothing like ys_target. (Each mlp(x) returns a one-element list, so the scalar prediction for input i is ypred[i][0].)
Bonus observation: our network's output is wrapped in a ReLU, so every prediction is ≥ 0 for every input. That means we can never match a target of −1.0 no matter what the weights are. To handle negative targets we'd need a different output activation (or none). This is a real design choice in real models: the output activation has to match the kind of answer you want.
What is loss?
Once we have predictions, the obvious question is: how wrong are we? The standard way to turn that question into a single number is a loss function. The simplest one — mean squared error (MSE) — just averages the squared gap between each prediction and its target:
$$ L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 $$
A few properties worth internalizing:
- Loss is always ≥ 0 — squared gaps can't be negative.
- Loss = 0 means perfect predictions: every ŷᵢ exactly hits its target yᵢ.
- Big gaps cost much more than small ones, because they're squared. A model that's off by 2 on one example loses 4× more than one that's off by 1.
- Loss is the only thing the optimizer cares about — every weight in the model will be nudged in whichever direction makes this single number smaller.
This is the whole game of training: find weights that minimize the loss.
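As a sketch of the MSE formula in plain Python: the targets below are the ys_target from the exercise, while the predictions are made-up numbers chosen only to keep the arithmetic easy to follow.

```python
def mse(ypred, ytarget):
    """Mean squared error: average of the squared gaps."""
    return sum((yp - yt) ** 2 for yp, yt in zip(ypred, ytarget)) / len(ytarget)

ys_target = [1.0, -1.0, -1.0, 1.0]
ypred = [0.5, 0.0, 0.0, 1.0]        # hypothetical predictions

# Gaps are -0.5, 1, 1, 0, so squared gaps are 0.25, 1, 1, 0
print(mse(ypred, ys_target))        # 0.5625
print(mse(ys_target, ys_target))    # 0.0 (perfect predictions)
```

The two gaps of size 1 contribute four times as much each as the gap of size 0.5, which is the "big gaps cost more" property from the list above.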
Scroll back to the interactive diagram and click 🎯 Train against a target surface. A hidden target network is generated, its surface is overlaid as a dark wireframe, and the live loss appears as both a number and a bar. The little chart underneath records every loss reading — as you nudge sliders, you can literally watch the line go down (or up — easy to make it worse). See if you can get the loss below 0.02 by hand. It's harder than it looks — and that's the whole motivation for the gradient-based training we'll build in the next section.
The weights and biases in our code are still plain Python floats, so we can run the model and measure the loss but we can't yet ask "which weight should I nudge, and by how much, to reduce the loss?". To answer that, we need gradients — and that's exactly what the next section is about.
Autograd
Training a neural network requires gradients: for each parameter in the model, we need to know "if I nudge this number up a little, does the loss go up or down, and by how much?". The computation graph has many inputs (the model parameters and input tokens) but funnels down to a single scalar output: the loss. Backpropagation starts at that single output and works backwards through the graph, computing the gradient of the loss with respect to every input. It relies on the chain rule from calculus. In production, libraries like PyTorch handle this automatically. Here, we implement it from scratch in a single class called Value.
This is the most mathematically intense part of microgpt. Karpathy has a 2.5-hour video that builds the whole thing live: The spelled-out intro to neural networks and backpropagation. The walk-through below condenses the key points.
Building Value piece by piece
The same Lego mindset works here: start with a wrapper, add operators, then add the graph bookkeeping that makes backprop possible. Try it live:
Each version of the class below shows what Value remembers. The final stage is what microgpt actually uses.
First, the wrapper. __repr__ is the dunder Python calls when you print an object.
class Value:
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return f"Value(data={self.data})"
a = Value(-6.0)
b = Value(7.0)
print(a) # Value(data=-6.0)
print(b) # Value(data=7.0)
Value arithmetic. __add__ and __mul__ are the dunders Python calls when you write a + b or a * b. We return a fresh Value.
class Value:
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return f"Value(data={self.data})"
    def __add__(self, other):
        return Value(self.data + other.data)
    def __mul__(self, other):
        return Value(self.data * other.data)
a = Value(-6.0); b = Value(7.0); c = Value(10.0)
d = a * b + c
print(d) # Value(data=-32.0)
_children. Each result now records the values that produced it.
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self._children = children # the values that produced this one
    def __add__(self, other):
        return Value(self.data + other.data, (self, other))
    def __mul__(self, other):
        return Value(self.data * other.data, (self, other))
a = Value(2.0)
b = Value(3.0)
c = a * b # c knows its children are (a, b)
L = c + a # L knows its children are (c, a)
backward() walks the graph in reverse topological order, applying the chain rule and accumulating gradients.
import math
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')
    def __init__(self, data, children=(), local_grads=()):
        self.data = data # forward-pass scalar
        self.grad = 0 # dL/d(this), filled in backward pass
        self._children = children # inputs to this node
        self._local_grads = local_grads # d(this)/d(child) for each child
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))
    def __pow__(self, other): return Value(self.data**other, (self,), (other * self.data**(other-1),))
    def log(self): return Value(math.log(self.data), (self,), (1/self.data,))
    def exp(self): return Value(math.exp(self.data), (self,), (math.exp(self.data),))
    def relu(self): return Value(max(0, self.data), (self,), (float(self.data > 0),))
    def __neg__(self): return self * -1
    def __radd__(self, other): return self + other
    def __sub__(self, other): return self + (-other)
    def __rsub__(self, other): return other + (-self)
    def __rmul__(self, other): return self * other
    def __truediv__(self, other): return self * other**-1
    def __rtruediv__(self, other): return other * self**-1
    def backward(self):
        # 1) Build a topological order via DFS (children are appended before their parents)
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        # 2) Seed the loss gradient, then propagate in reverse order (loss first)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad
Briefly, a Value wraps a single scalar number (.data) and tracks how it was computed. Think of each operation as a little Lego block: it takes some inputs, produces an output (the forward pass), and it knows how its output would change with respect to each of its inputs (the local gradient). That's all the information autograd needs from each block. Everything else is just the chain rule, stringing the blocks together.
Every time you do math with Value objects (add, multiply, etc.), the result is a new Value that remembers its inputs (_children) and the local derivative of that operation (_local_grads). For example, __mul__ records that $\frac{\partial(a\cdot b)}{\partial a}=b$ and $\frac{\partial(a\cdot b)}{\partial b}=a$. The full set of Lego blocks:
| Operation | Forward | Local gradients |
|---|---|---|
| a + b | $a+b$ | $\partial/\partial a = 1,\; \partial/\partial b = 1$ |
| a * b | $a \cdot b$ | $\partial/\partial a = b,\; \partial/\partial b = a$ |
| a ** n | $a^n$ | $\partial/\partial a = n\,a^{n-1}$ |
| log(a) | $\ln a$ | $\partial/\partial a = 1/a$ |
| exp(a) | $e^a$ | $\partial/\partial a = e^a$ |
| relu(a) | $\max(0,a)$ | $\mathbf{1}_{a>0}$ |
The backward() method walks this graph in reverse topological order (starting from the loss, ending at the parameters), applying the chain rule at each step. If the loss is $L$ and a node $v$ has a child $c$ with local gradient $\frac{\partial v}{\partial c}$, then:
$$\frac{\partial L}{\partial c} \mathrel{+}= \frac{\partial v}{\partial c}\cdot\frac{\partial L}{\partial v}$$
This looks scary if you're not comfortable with calculus, but it's literally just multiplying two numbers in an intuitive way: "If a car travels twice as fast as a bicycle, and the bicycle is four times as fast as a walking man, then the car travels 2×4 = 8 times as fast as the man." The chain rule is the same idea — you multiply the rates of change along the path.
We kick things off by setting self.grad = 1 at the loss node, because $\frac{\partial L}{\partial L}=1$. From there, the chain rule just multiplies local gradients along every path back to the parameters.
Note the += (accumulation, not assignment). When a value is used in multiple places in the graph (i.e. the graph branches), gradients flow back along each branch independently and must be summed. This is the multivariable chain rule: if $c$ contributes to $L$ through multiple paths, the total derivative is the sum of contributions from each path.
After backward() completes, every Value in the graph has a .grad containing $\frac{\partial L}{\partial v}$, which tells us how the final loss would change if we nudged that value.
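Here is a minimal, self-contained sketch of the accumulation rule on a graph that branches. The class is a trimmed-down copy of Value supporting only + and * (an assumption made to keep the example short); the graph is L = a*b + a, where a is used twice.

```python
class Value:
    """Trimmed-down Value: only + and *, enough to show gradient accumulation."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

a = Value(2.0)
b = Value(3.0)
c = a * b      # first use of a
L = c + a      # second use of a: the graph branches here
L.backward()

print(a.grad)  # 4.0 (3.0 through the product path, plus 1 through the direct path)
print(b.grad)  # 2.0
```

If `+=` were replaced by `=`, a.grad would keep only the contribution of whichever path was processed last, which is why accumulation matters.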
Watch backprop happen
Backprop is easier to internalize if you build it up. Below are four cases in increasing complexity — start with what a single + does to a gradient, then a single ×, then both with a branch, then a full training-style pipeline (input, prediction, loss). Each tab is its own little graph; step through it one click at a time.
Here's a small neuron computing a = ReLU(x·w + b). The forward values are filled in. Try to compute the gradients with respect to x, w, and b by hand assuming ∂L/∂a = 1. Then click "Run backward" to check. Doing this once by hand is the single best way to internalize what backward() is doing.
For comparison, here is the graph L = a*b + a (with a = 2 and b = 3) in PyTorch, whose .backward() returns the same gradients as our Value class:
import torch
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a * b
L = c + a
L.backward()
print(a.grad) # tensor(4.)
print(b.grad) # tensor(2.)
Value.backward() is the same algorithm that PyTorch's loss.backward() runs, just on scalars instead of tensors (arrays of scalars): algorithmically identical, significantly smaller and simpler, but a lot less efficient.
Let's spell out what backward() gives us. Autograd calculated that if L = a*b + a, with a=2 and b=3, then a.grad = 4.0. This is telling us about the local influence of a on L: if you wiggle a, in what direction is L changing? The derivative of L w.r.t. a is 4.0, meaning that if we increase a by a tiny amount (say 0.001), L would increase by about 4× that (0.004). Similarly, b.grad = 2.0 means the same nudge to b would increase L by about 2× that. These gradients tell us the direction (positive or negative) and the steepness (magnitude) of each input's influence on the final output (the loss). This lets us iteratively nudge the parameters of our neural network to lower the loss, and hence improve its predictions.
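The "nudge by a tiny amount" reading of a gradient can be tested directly with finite differences. This sketch uses plain floats, no autograd; the step size h = 0.001 follows the text, and the function f is just L = a*b + a written out.

```python
def f(a, b):
    return a * b + a

a, b, h = 2.0, 3.0, 1e-3

# Nudge one input at a time and measure how much L moves per unit of nudge
dL_da = (f(a + h, b) - f(a, b)) / h
dL_db = (f(a, b + h) - f(a, b)) / h

print(round(dL_da, 3))   # 4.0
print(round(dL_db, 3))   # 2.0
```

This kind of numerical check is a handy way to sanity-test a hand-written backward pass, although it needs one extra forward pass per parameter and so does not scale to real models.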
For the first step, the gradients are ∂loss/∂b = −6 and (chaining through x) ∂loss/∂w = 2·err·x = −18. Now we repeat the step. x = 3 and the target y = 10 are fixed; each step nudges the two parameters against their gradient (w ← w − lr·∂loss/∂w, and likewise for b), and you watch the prediction ŷ climb toward the target while the loss shrinks. The nudge is just the gradient multiplied by the learning rate. Click Next step a few times.
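The repeated nudging can be sketched as a short loop. Several things here are assumptions: the model form ŷ = w·x + b with squared-error loss, the starting parameters w = 2 and b = 1 (chosen so the first gradients come out as −18 and −6), and the learning rate 0.01. Only x = 3 and y = 10 come from the text.

```python
x, y = 3.0, 10.0      # fixed input and target (from the text)
w, b = 2.0, 1.0       # hypothetical starting parameters, so y_hat starts at 7
lr = 0.01             # hypothetical learning rate

losses = []
for step in range(5):
    # Forward pass
    y_hat = w * x + b
    err = y_hat - y
    loss = err ** 2
    losses.append(loss)

    # Backward pass by hand: d(loss)/d(err) = 2*err, d(err)/dw = x, d(err)/db = 1
    grad_w = 2 * err * x
    grad_b = 2 * err
    print(f"step {step}: y_hat={y_hat:.3f} loss={loss:.3f} grad_w={grad_w:.2f} grad_b={grad_b:.2f}")

    # Update: move each parameter against its gradient
    w -= lr * grad_w
    b -= lr * grad_b
# first line printed: step 0: y_hat=7.000 loss=9.000 grad_w=-18.00 grad_b=-6.00
```

With these particular numbers each step shrinks the error by a constant factor, so the loss falls steadily; a much larger learning rate would overshoot the target instead.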
Architecture
The model architecture is a stateless function: it takes a token, a position, the parameters, and the cached keys/values from previous positions, and returns logits (scores) over what token the model thinks should come next in the sequence. We follow GPT-2 with minor simplifications: RMSNorm instead of LayerNorm, no biases, and ReLU instead of GeLU.
We'll step through the model one block at a time. Each sub-section below covers one piece — first the intuition, then any small helper functions it needs, then the relevant code, the actual parameter matrices, and finally a small interactive widget showing what we've built up so far.
| Hyperparameter | Toy model (this walkthrough) | microgpt | GPT-2 small |
|---|---|---|---|
| vocab_size | 4 (BOS, a, b, c) | 27 (a–z + BOS) | 50,257 |
| n_embd · d_model | 2 | 16 | 768 |
| n_head | 1 | 4 | 12 |
| head_dim | 2 | 4 | 64 |
| block_size (context len) | 4 | 16 | 1,024 |
| n_layer | 1 | 1 | 12 |
To make each step concrete, we'll track a single token through the whole block using a deliberately tiny model. The vector at each stage will only have two numbers, so you can do every multiplication by hand and watch what changes.
Setup. Pretend the vocabulary is just 4 tokens — BOS=0, 'a'=1, 'b'=2, 'c'=3 — and the embedding width is d_model = 2, with n_head = 1 (so head_dim = 2) and block_size = 4. We're partway through generating: the model has already seen BOS at position 0 and 'a' at position 1, and now it's processing 'b' at position 2. We want it to predict what comes at position 3.
Each subsection below pulls in the toy weights it needs, walks the numbers forward, and the resulting vector becomes the input to the next subsection. By the end of Output, we'll have one concrete probability over the 4-token vocab.
Embeddings
The neural network can't process a raw token id like 2 directly. It only works with vectors (lists of numbers). So we associate a learned vector with each possible token, and feed that in as its neural signature. The token id and position id each look up a row from their respective embedding tables (wte and wpe). These two vectors are added together, giving the model a representation that encodes both what the token is and where it is in the sequence. Modern LLMs usually skip the learned absolute position embedding and use relative position schemes like RoPE.
Concrete example: say our current token is 'b', which the tokenizer mapped to id 2, sitting at position 2. The lookup wte[2] gives a length-2 vector — that's the x the network actually sees. Click a different letter below and you'll watch a different row of wte get pulled in and flow all the way through the three views (and the numeric tour at the bottom of the section).
The selected row of wte becomes x; wpe[pos=2] gets added; that vector flows through every downstream view. The fine-grained sliders at the bottom of the section still work for off-vocabulary values.
Parameter matrices
Two learned tables — one row per token, one row per position. Hover any cell to see its value. The pattern is just random Gaussian initialisation (std = 0.08); training reshapes these into something meaningful.
Helper used here · rmsnorm
Once we've added the token and position vectors, we normalize. rmsnorm (Root Mean Square Normalization) rescales a vector so its values have unit root-mean-square. This keeps activations from growing or shrinking as they flow through the network, stabilizing training. It's a simpler variant of the LayerNorm used in the original GPT-2.
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]
Code in gpt()
tok_emb = state_dict['wte'][token_id] # length 16
pos_emb = state_dict['wpe'][pos_id] # length 16
x = [t + p for t, p in zip(tok_emb, pos_emb)]
x = rmsnorm(x)
Our token is 'b' (id 2) at position 2. Pick tiny wte and wpe tables to look up from:
# wte: 4 rows (one per token), each a length-2 vector
wte = [[ 0.20, 0.30], # BOS
[ 0.50, -0.10], # 'a'
[-0.30, 0.40], # 'b'
[ 0.10, 0.20]] # 'c'
# wpe: 4 rows (one per position), each a length-2 vector
wpe = [[ 0.10, -0.05], # pos 0
[ 0.05, 0.15], # pos 1
[-0.10, 0.10], # pos 2
[ 0.15, 0.00]] # pos 3
token_id, pos_id = 2, 2
tok_emb = wte[token_id] # → [-0.30, 0.40]
pos_emb = wpe[pos_id] # → [-0.10, 0.10]
x = [t + p for t, p in zip(tok_emb, pos_emb)] # → [-0.40, 0.50]
x = rmsnorm(x) # → [-0.88, 1.10]
Doing the RMSNorm by hand. Mean-square: $((-0.40)^2 + 0.50^2)/2 = 0.205$. Scale: $1/\sqrt{0.205 + 10^{-5}} \approx 2.209$. Multiply through: $[-0.40 \cdot 2.209,\; 0.50 \cdot 2.209] \approx [-0.88, 1.10]$. That two-number vector $x \approx [-0.88,\, 1.10]$ is what the attention block sees next.
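The hand calculation can be confirmed in a few lines. This sketch repeats the rmsnorm helper on plain floats and then measures the root-mean-square of the result, which should come out at (almost exactly) 1.

```python
import math

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

x = [-0.40, 0.50]            # tok_emb + pos_emb from the toy example
y = rmsnorm(x)

# Root-mean-square of the normalized vector
rms = math.sqrt(sum(yi * yi for yi in y) / len(y))

print([round(yi, 2) for yi in y])   # [-0.88, 1.1]
print(round(rms, 3))                # 1.0
```

The RMS is a hair under 1 rather than exactly 1 because of the small 1e-5 term added inside the square root for numerical safety.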
In microgpt (n_embd = 16): wte is (27 × 16) and wpe is (16 × 16), so the looked-up vectors are length 16 instead of 2 — same two lines of code, just longer lists. RMSNorm averages 16 squared values instead of 2.
In GPT-2 small (n_embd = 768): each row is a 768-dim vector, and the vocabulary jumps to 50,257 tokens, so wte alone is ≈ 39M parameters. GPT-3 (175B): n_embd = 12,288 and the context window stretches to 2,048 positions; modern frontier models push past 100K positions and skip wpe entirely in favor of relative position schemes like RoPE that rotate the Q/K vectors inside attention instead of adding a position vector here.
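The embedding-table sizes quoted above are just rows times columns. A quick sketch, where `embedding_params` is a hypothetical helper and the three configurations are the toy model, microgpt, and GPT-2 small as described in this walkthrough:

```python
def embedding_params(vocab_size, block_size, n_embd):
    """Return (wte size, wpe size): one n_embd-wide row per token / per position."""
    wte = vocab_size * n_embd
    wpe = block_size * n_embd
    return wte, wpe

print(embedding_params(4, 4, 2))            # (8, 8)               toy model
print(embedding_params(27, 16, 16))         # (432, 256)           microgpt
print(embedding_params(50257, 1024, 768))   # (38597376, 786432)   GPT-2 small
```

The last line is where the "≈ 39M parameters" figure for wte comes from.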
RMSNorm rescales the x + wpe vector to unit root-mean-square, so the activations don't blow up as they flow into Q/K/V. The same picker drives this view: try BOS / 'a' / 'b' / 'c' and watch the normalized vector update.
Attention block
The attention block is the only place where a token at position $t$ gets to "look" at tokens at positions $0 \ldots t-1$. It's a token-communication mechanism. Before we dive into the code, here's the intuition that makes the rest of this section click.
Intuition · attention is a fuzzy dictionary
Here is what the attention equation looks like. Don't get intimidated — we're going to break each piece down. Attention is a "learnable", "fuzzy" version of a key-value store — the same data structure you know as a Python dict or a hashtable.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
In an ordinary dictionary, when the query "Italy" matches the key "Italy" exactly, the corresponding value "Rome" is returned. No partial match, no in-between.
Attention generalizes this to a non-binary lookup. Instead of matching the query to exactly one key, the query is compared to every key, each match gets a similarity score, and the output is a weighted blend of all the values, so keys with higher scores contribute more. Critically, the queries, keys, and values are D-dimensional learned vectors (computed by Wq, Wk, Wv from the input), so the model gets to decide what "matching" means.
The query q is compared with every key Ki to produce a similarity score Si. Softmax normalizes those scores into weights between 0 and 1, summing to 1 (a well-behaved probability distribution). Each value Vi is multiplied by its weight and the products are summed; this weighted sum is the attention output.
Why softmax? Raw dot-product scores can be any real number. Softmax squashes them into the range [0, 1] and forces them to sum to 1, like a well-behaved probability distribution, so the output really is a weighted average, not just a weighted sum that could explode.
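To contrast the two kinds of lookup, here is a minimal sketch on plain floats. The keys, values, and query are made-up 2-D vectors (in a real model they would be produced by Wq, Wk, Wv), and the softmax is a plain-float version written for this example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hard lookup: exact match or nothing
capitals = {"Italy": "Rome", "France": "Paris"}
print(capitals["Italy"])   # Rome

# Soft lookup: hypothetical keys, values, and query
keys   = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query  = [1.0, 0.2]

d_k = len(query)
# Similarity score per key: scaled dot product with the query
scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
weights = softmax(scores)
# Output: weighted average of the value vectors
output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

print([round(w, 2) for w in weights])
print([round(o, 2) for o in output])
```

The query points mostly along the first key, so the first value gets the largest weight, but every value still contributes something to the blend.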
What does attention do?
Attention is applied to the input sequence and generates weights for what is of importance to each query. Those weights then "pick" the relevant information and pass it on to the next layer. To make this concrete, take the sentence "The quick brown fox jumps over the lazy dog." Click any word below to see where its attention goes — every other word in the sentence gets a similarity score against your chosen query word, and the bar chart shows the resulting weights.
The chosen word's query vector (computed by Wq) is dot-producted with every word's key vector (computed by Wk) to get raw scores; softmax turns those into the attention weights you see below. The highest-weighted word is what this query is "looking at." Numbers are illustrative; a real trained model would produce its own pattern.
Helpers used here · linear and softmax
linear is a matrix-vector multiply. It takes a vector x and a weight matrix w, and computes one dot product per row of w. It shows up four times in this block — once each for Q, K, V, and the output projection Wo — and is the fundamental building block of neural networks: a learned linear transformation.
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]
softmax converts a vector of raw scores — which can range from $-\infty$ to $+\infty$ — into a probability distribution: all values end up in $[0,1]$ and sum to 1. Inside attention we use it to turn the Q·K scores into weights that sum to 1; later, the same helper turns the model's output logits into a distribution over the vocabulary. We subtract the max first for numerical stability (mathematically a no-op, but it prevents overflow in exp).
def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]
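The max-subtraction trick is easy to see on plain floats. This sketch is an illustration only: it uses `math.exp` on ordinary numbers rather than Value objects, and the names `softmax_naive` and `softmax_stable` are made up for the comparison.

```python
import math

def softmax_naive(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # largest term is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
print([round(p, 3) for p in softmax_naive(small)])    # [0.09, 0.245, 0.665]
print([round(p, 3) for p in softmax_stable(small)])   # same distribution

# Shifting every score by the same constant does not change the result,
# but large raw scores overflow the naive version.
big = [1000.0, 1001.0, 1002.0]
try:
    softmax_naive(big)
except OverflowError as e:
    print("naive softmax failed:", e)
print([round(p, 3) for p in softmax_stable(big)])     # same as for `small`
```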
Now that both helpers are on the table, let's walk through the whole attention block with concrete numbers before opening up the interactive widgets. The widgets below are just visualizations of the operations that follow — once you've seen the math run end-to-end on real values, each widget will feel like a labeled view of a step you've already done by hand.
We pick up where the Embeddings walkthrough left off: token 'b' at position 2, with x ≈ [-0.88, 1.10] already in hand.
The embedding step handed us x ≈ [-0.88, 1.10]. We stash it as the residual and re-normalize before projecting (the second RMSNorm on an already-normalized vector is nearly a no-op — scale ≈ 1.00 — so the input to the projections is still [-0.88, 1.10]).
x_residual = x # [-0.88, 1.10]
x = rmsnorm(x) # ≈ [-0.88, 1.10]
# Toy Q/K/V/Wo weight matrices, each (2 × 2)
attn_wq = [[ 0.50, 0.20], [ 0.10, 0.40]]
attn_wk = [[ 0.30, -0.10], [ 0.20, 0.50]]
attn_wv = [[ 0.40, 0.10], [-0.20, 0.60]]
attn_wo = [[ 0.60, 0.20], [ 0.10, 0.70]]
q = linear(x, attn_wq) # → [-0.22, 0.35]
k = linear(x, attn_wk) # → [-0.37, 0.37]
v = linear(x, attn_wv) # → [-0.24, 0.84]
Why those numbers? Each row of the weight matrix is a dot product with x. For q: row 0 gives $0.50(-0.88) + 0.20(1.10) = -0.22$; row 1 gives $0.10(-0.88) + 0.40(1.10) = 0.35$. Same shape for k and v.
KV cache. Positions 0 and 1 have already been processed on earlier calls, so the cache holds:
keys[0] = [[ 0.30, 0.10], # k from BOS at pos 0
[-0.10, 0.40], # k from 'a' at pos 1
[-0.37, 0.37]] # k from 'b' at pos 2 (just appended)
values[0] = [[ 0.20, -0.30], # v from BOS
[ 0.50, 0.20], # v from 'a'
[-0.24, 0.84]] # v from 'b'
Why keys[0] instead of just keys? Each Transformer layer keeps its own separate KV cache — the keys and values learned at layer 0 mean different things than at layer 1. So keys and values are lists of lists: the outer index is the layer number, the inner index is the position in the sequence. keys[0] is "the running list of every k vector layer 0 has produced so far," and keys[0][2] is "the key for position 2 at layer 0." Our toy has n_layer = 1, so keys[0] is the only list around — but the indexing convention stays the same. If we bumped n_layer to 6, you'd see keys[0], keys[1], … through keys[5], one cache per layer.
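The list-of-lists layout can be sketched in isolation. In the snippet below, process_token is a hypothetical stand-in for gpt() that appends made-up vectors, and n_layer = 2 is chosen only to make the outer index visible.

```python
n_layer = 2  # hypothetical: two layers, so the outer index is visible

# One empty cache per layer; each grows by one entry per processed token
keys = [[] for _ in range(n_layer)]
values = [[] for _ in range(n_layer)]

def process_token(pos, keys, values):
    # Stand-in for gpt(): append one made-up k and v per layer
    for li in range(n_layer):
        k = [float(li), float(pos)]        # fake key, tagged with (layer, position)
        v = [float(li) + 0.5, float(pos)]  # fake value
        keys[li].append(k)
        values[li].append(v)

for pos in range(3):                       # BOS, 'a', 'b'
    process_token(pos, keys, values)

print(len(keys))      # 2 layers
print(len(keys[0]))   # 3 positions cached at layer 0
print(keys[1][2])     # [1.0, 2.0]: the key for position 2 at layer 1

# Pitfall: [[]] * n_layer makes every layer share the SAME inner list
aliased = [[]] * n_layer
aliased[0].append("k")
print(aliased[1])     # ['k']
```

The list comprehension gives each layer its own independent list, which is why the training and inference loops build the caches that way.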
Scores → softmax weights. Dot each cached key with our query, divide by $\sqrt{d_{\text{head}}} = \sqrt{2} \approx 1.41$:
scores = [(q[0]*k[0] + q[1]*k[1]) / 1.41 for k in keys[0]]
# pos 0: (-0.22·0.30 + 0.35·0.10)/1.41 = -0.031/1.41 ≈ -0.02
# pos 1: (-0.22·-0.10 + 0.35·0.40)/1.41 = 0.162/1.41 ≈ 0.11
# pos 2: (-0.22·-0.37 + 0.35·0.37)/1.41 = 0.211/1.41 ≈ 0.15
weights = softmax(scores) # ≈ [0.30, 0.34, 0.36]
The three weights sum to 1. Notice that 'b' attends most to itself (0.36), then to 'a' (0.34), then to BOS (0.30) — the differences are small because our toy weights are tiny and random; a trained network would learn much sharper patterns.
Weighted sum of values, then mix through Wo, then residual.
head_out = [sum(weights[t] * v_t[j] for t, v_t in enumerate(values[0]))
            for j in range(2)]
# head_out ≈ [0.30·0.20 + 0.34·0.50 + 0.36·-0.24,
# 0.30·-0.30 + 0.34·0.20 + 0.36·0.84]
# ≈ [0.14, 0.28]
x_attn = linear(head_out, attn_wo) # ≈ [0.14, 0.21]
x = [a + b for a, b in zip(x_attn, x_residual)] # ≈ [-0.74, 1.31]
Why each of those three lines is there.
- Weighted sum of V. This is the actual "lookup" of the fuzzy dictionary. The weights answered how much each past position matters; the values say what each one contributes. Multiplying them and summing gives a single vector that's a blended pull from every cached value, weighted by relevance. If one weight were 1.0 and the rest were 0, we'd get back exactly that value, like a normal dict lookup. With soft weights, we get a mix.
- Project through Wo. The weighted sum lives in value-space, not in the residual stream's space. Wo is a learned linear layer that re-mixes the head output back into the same shape as x. In multi-head attention each head's slice gets concatenated first, then Wo blends across the heads, giving the model a place to learn how different heads should be combined. In our toy with one head it just linearly re-mixes the 2-vector, but the role is the same.
- Add the residual. Instead of replacing x with x_attn, we add: x ← x + x_attn. Two big wins. (1) The original information survives: attention is an update, not an overwrite. (2) During backprop, gradients flow directly through this addition path back to earlier layers, which is what makes deep stacks of these blocks trainable at all. If attention has nothing useful to say for this token, it can output zero and the residual just passes x through unchanged.
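The "fuzzy dictionary" reading of the weighted sum can be checked directly on the cached values from the walkthrough. The helper name attend is illustrative; everything runs on plain floats.

```python
def attend(weights, values):
    # Weighted sum of value vectors: sum over t of weights[t] * values[t]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

# Cached values from the walkthrough (BOS, 'a', 'b')
values = [[0.20, -0.30], [0.50, 0.20], [-0.24, 0.84]]

hard = attend([0.0, 1.0, 0.0], values)      # one-hot weights: exact lookup of 'a'
soft = attend([0.30, 0.34, 0.36], values)   # softmax weights: a blend of all three

print(hard)                          # [0.5, 0.2]
print([round(s, 2) for s in soft])   # [0.14, 0.28]
```

A one-hot weight vector recovers a single stored value exactly, while the softmax weights return a point in between the stored values.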
The vector handed to the MLP block is x ≈ [-0.74, 1.31]. The attention block has done one thing: blended a little bit of every past position into the current one, projected the result back into the residual stream's shape, and added it on as an update.
Attention, step by step: (1) compute q, k, v for the current token, (2) append k and v to the per-layer caches, (3) score the query against each cached key, (4) softmax → weights, (5) weighted sum of cached values, (6) output.
Attention playground · drag the query, watch the block recompute
Same diagram as "Attention, step by step" above, but now the query vector q is on sliders. The KV cache (3 past tokens) stays pinned to the toy walkthrough; everything downstream — scaled-dot-product scores, softmax weights, weighted sum of values, head output — recomputes live as you drag. Start at the defaults (q ≈ [−0.22, 0.35], the toy 'b' values) and move the sliders to see how a different query reshapes the whole attention output.
lm_head → softmax
In the full model, the head output goes through Wo, gets added to the residual, runs the MLP block, and only then does lm_head + softmax produce next-letter probabilities. We're skipping those layers and projecting the head output directly through lm_head so you can see how moving the query changes which letter the model "leans toward." It's a directional signal, not the model's real prediction.
Snapping attention into the running diagram
We started this section with Embeddings only, added the pre-attention rmsnorm, and just walked through the full attention computation step by step. Time to slot that attention block back into the architecture diagram we've been building piece by piece. The widget below adds Q/K/V projections, the attention weighted sum, Wo, and the residual add on top of the Embeddings + RMSNorm view from earlier — same token picker, same numbers, just more of the block lit up.
Parameter matrices
Four 16×16 matrices: Q/K/V are the three projections that turn the token vector into "what am I looking for / what do I contain / what do I offer", and Wₒ mixes the per-head outputs back together.
You might be wondering why the toy matrices below are only 2×2. Remember from the Embeddings step: each token gets embedded as a two-dimensional vector (we set d_model = 2 for the walkthrough). The Q/K/V projections map a length-2 vector to another length-2 vector, so the weight matrix is (out × in) = (2 × 2) = 4 numbers. In real microgpt d_model = 16, so each of these matrices grows to (16 × 16) = 256 numbers. The shape of the operation is the same — just bigger.
Each cell in attn_wq, attn_wk, or attn_wv is a single connection in this network. The cell M[i][j] is the weight on the edge from x̂[j] (top) to the i-th output of that projection. Twelve cells across three matrices, twelve edges in the diagram.
Q/K/V act on x̂ (the normalized residual stream), but Wo acts on Σ wᵢ vᵢ, the weighted sum of values coming out of attention. Its job is to re-mix that head-output vector back into the shape of the residual stream so it can be added on top. In multi-head attention Wo also mixes information across heads. Same cell-to-edge convention: hover any cell in attn_wo to highlight the corresponding edge here.
Code in gpt()
x_residual = x
x = rmsnorm(x)
q = linear(x, state_dict[f'layer{li}.attn_wq'])
k = linear(x, state_dict[f'layer{li}.attn_wk'])
v = linear(x, state_dict[f'layer{li}.attn_wv'])
keys[li].append(k); values[li].append(v)
# ... heads loop: scores → softmax → weighted V → concat ...
x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
x = [a + b for a, b in zip(x, x_residual)] # residual
In microgpt (n_embd = 16, n_head = 4): Q/K/V are (16 × 16) and they get sliced into 4 heads of head_dim = 4 each. The same Q·K/√d · softmax · weighted-V dance runs per head on a 4-dim slice, the four outputs are concatenated back to length 16, and Wo mixes them. The "shape" of the math doesn't change — just the dimensions.
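The slicing into heads can be sketched as follows. This is a float-only illustration with random vectors standing in for the real q, keys, and values; the variable names follow the snippets in this guide, but the exact loop in the original script may differ.

```python
import math
import random

random.seed(0)
n_embd, n_head = 16, 4
head_dim = n_embd // n_head          # 4

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def rand_vec(n):
    return [random.gauss(0, 1) for _ in range(n)]

# Pretend three positions are already processed: full-width q, keys, values
q = rand_vec(n_embd)
keys = [rand_vec(n_embd) for _ in range(3)]
values = [rand_vec(n_embd) for _ in range(3)]

x_attn = []
for h in range(n_head):
    hs = h * head_dim                               # start index of this head's slice
    q_h = q[hs:hs + head_dim]
    k_h = [k[hs:hs + head_dim] for k in keys]
    v_h = [v[hs:hs + head_dim] for v in values]
    # Scaled dot-product scores against every cached position
    scores = [sum(qi * ki for qi, ki in zip(q_h, k_t)) / math.sqrt(head_dim) for k_t in k_h]
    weights = softmax(scores)
    # Weighted sum of this head's value slices
    head_out = [sum(w * v_t[j] for w, v_t in zip(weights, v_h)) for j in range(head_dim)]
    x_attn.extend(head_out)                         # concatenate heads back together

print(len(x_attn))   # 16: four heads of width 4, ready for Wo
```

Each head computes its own set of attention weights, so different heads are free to look at different past positions.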
In GPT-2 small (n_embd = 768, n_head = 12): each head sees a 64-dim slice, and there are 12 of them running in parallel. GPT-3 (175B, n_embd = 12288, n_head = 96): 128-dim slices, 96 heads, all 96 looking back at thousands of cached positions. Frontier models add tricks like grouped-query attention (many query heads share the same K/V heads, shrinking the KV cache) and FlashAttention (a GPU-friendly tiling that never materialises the full attention matrix), but the per-head computation is still the four lines you just walked through.
MLP block
MLP is short for "multilayer perceptron" — a two-layer feed-forward network: project up to 4× the embedding dimension, apply ReLU, project back down. This is where the model does most of its "thinking" per position. Unlike attention, this computation is fully local to time $t$. The Transformer intersperses communication (Attention) with computation (MLP).
Parameter matrices
Up-projection then down-projection. mlp_fc1 blows the dimension up 4× to give the network room to compute, then mlp_fc2 squeezes it back down so it can be added to the residual stream.
mlp_fc1 projects the 2-dim input up to an 8-dim hidden vector, ReLU zeroes out the negatives, and mlp_fc2 projects back down to 2-dim so it can be added to the residual. Hover any cell: 16 cells in mlp_fc1 map to the 16 edges in the top fan; 16 cells in mlp_fc2 map to the 16 edges in the bottom fan.
Code in gpt()
x_residual = x
x = rmsnorm(x)
x = linear(x, state_dict[f'layer{li}.mlp_fc1']) # 16 → 64
x = [xi.relu() for xi in x]
x = linear(x, state_dict[f'layer{li}.mlp_fc2']) # 64 → 16
x = [a + b for a, b in zip(x, x_residual)] # residual
The attention block handed us x ≈ [-0.74, 1.31]. Stash the residual, normalize, then up-project to 4 × d_model = 8 hidden units.
x_residual = x # [-0.74, 1.31]
x = rmsnorm(x) # ≈ [-0.70, 1.23]
# mlp_fc1: up-projection (8 × 2)
mlp_fc1 = [[ 0.40, 0.10],
[-0.20, 0.50],
[ 0.30, -0.30],
[ 0.10, 0.40],
[-0.50, 0.20],
[ 0.20, -0.10],
[ 0.60, 0.30],
[-0.10, -0.40]]
pre = linear(x, mlp_fc1)
# = [-0.16, 0.76, -0.58, 0.42, 0.60, -0.26, -0.05, -0.42]
x = [xi.relu() for xi in pre]
# = [ 0.00, 0.76, 0.00, 0.42, 0.60, 0.00, 0.00, 0.00]
Why most entries are zero. ReLU = max(0, x), so anything negative gets clipped to 0. Only 3 of the 8 hidden units "fire" for this particular input. Different inputs would activate different subsets — that's how the MLP carves the input space into pieces and treats each piece differently.
# mlp_fc2: down-projection (2 × 8)
mlp_fc2 = [[ 0.10, 0.30, -0.20, 0.40, 0.00, 0.20, -0.10, 0.50],
[-0.30, 0.20, 0.50, -0.10, 0.40, -0.40, 0.30, 0.10]]
mlp_out = linear(x, mlp_fc2) # ≈ [0.40, 0.35]
x = [a + b for a, b in zip(mlp_out, x_residual)] # ≈ [-0.34, 1.66]
Hand-check the down-projection. Row 0 of mlp_fc2 dotted with the post-ReLU vector: $0.30 \cdot 0.76 + 0.40 \cdot 0.42 = 0.396 \approx 0.40$ (the zeros contribute nothing). The MLP's contribution gets added back to the residual stream, and we exit the block with x ≈ [-0.34, 1.66].
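The whole toy MLP block can be re-run on plain floats to confirm the numbers and to see which hidden units fire. The rmsnorm below is an assumed form (divide by the root-mean-square, no learned gain), and the second input [1.0, -0.5] is made up for comparison.

```python
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rmsnorm(x, eps=1e-5):
    # Assumed form: divide by root-mean-square, no learned gain
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

def relu(x):
    return [max(0.0, xi) for xi in x]

mlp_fc1 = [[0.40, 0.10], [-0.20, 0.50], [0.30, -0.30], [0.10, 0.40],
           [-0.50, 0.20], [0.20, -0.10], [0.60, 0.30], [-0.10, -0.40]]
mlp_fc2 = [[0.10, 0.30, -0.20, 0.40, 0.00, 0.20, -0.10, 0.50],
           [-0.30, 0.20, 0.50, -0.10, 0.40, -0.40, 0.30, 0.10]]

def mlp_block(x):
    x_residual = x
    hidden = relu(linear(rmsnorm(x), mlp_fc1))
    active = [i for i, h in enumerate(hidden) if h > 0]   # which hidden units fired
    out = linear(hidden, mlp_fc2)
    return [a + b for a, b in zip(out, x_residual)], active

x_out, active = mlp_block([-0.74, 1.31])
print([round(v, 2) for v in x_out], active)   # [-0.34, 1.66] [1, 3, 4]

# A different (made-up) input lights up a different subset of hidden units
_, active_other = mlp_block([1.0, -0.5])
print(active_other)                           # [0, 2, 5, 6, 7]
```

The two inputs activate disjoint sets of hidden units, so each one is effectively passed through a different linear map.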
In microgpt (n_embd = 16): mlp_fc1 is (64 × 16) and mlp_fc2 is (16 × 64). The 4× expansion is the same, just with wider vectors. The MLP holds twice as many parameters as the attention block (2,048 vs 1,024 in microgpt), and with a 4× hidden layer that 2:1 ratio stays the same at any width.
In GPT-2 small (n_embd = 768): the hidden layer is 3,072 wide, so the MLP alone is ≈ 4.7M parameters per layer. In GPT-3 (175B): hidden = 49,152, and the MLPs hold roughly two-thirds of all parameters in the model. Many frontier models also swap plain ReLU for SwiGLU (a gated activation that needs three matrices instead of two) and replace the dense MLP with Mixture-of-Experts (many small MLPs, of which a router picks a few, often 2, per token) to grow capacity without growing per-token compute.
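The per-layer counts quoted above can be reproduced with a few lines of arithmetic. This sketch counts weight matrices only (no biases or norm gains, which microgpt does not have) and assumes the standard 4× hidden expansion; block_params is an illustrative name.

```python
def block_params(n_embd, mlp_mult=4):
    # Weight matrices only; biases and norm gains are ignored
    attn = 4 * n_embd * n_embd                 # Wq, Wk, Wv, Wo
    mlp = 2 * (mlp_mult * n_embd) * n_embd     # fc1 (up) + fc2 (down)
    return attn, mlp

for name, n_embd in [("microgpt", 16), ("GPT-2 small", 768), ("GPT-3 175B", 12288)]:
    attn, mlp = block_params(n_embd)
    print(f"{name:12s} attn={attn:>13,} mlp={mlp:>13,} mlp/attn={mlp / attn:.1f}")
```

Both counts scale with the square of n_embd, so the MLP-to-attention ratio printed in the last column is the same for all three widths.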
Residual connections
Both the attention and MLP blocks add their output back to their input (x = [a + b for ...]). This lets gradients flow directly through the network and makes deeper models trainable.
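A scalar toy shows why the addition helps gradients. Assume each layer's own derivative f'(x) is a small number (0.1 here, chosen arbitrarily); real networks have Jacobian matrices rather than scalars, so treat this as intuition only.

```python
def grad_through_stack(local_grads, residual):
    # Chain rule through a stack of scalar layers.
    # Without residual: y = f(x), so dy/dx = f'(x).
    # With residual:    y = x + f(x), so dy/dx = 1 + f'(x).
    total = 1.0
    for g in local_grads:
        total *= (1.0 + g) if residual else g
    return total

# Hypothetical stack: 12 layers whose own derivative f'(x) is 0.1 each
local_grads = [0.1] * 12

plain = grad_through_stack(local_grads, residual=False)
skip = grad_through_stack(local_grads, residual=True)
print(f"without residual: {plain:.1e}")   # the gradient has all but vanished
print(f"with residual:    {skip:.2f}")    # the "+1" path keeps it alive
```

The identity path contributes a factor of 1 at every layer, so the signal reaching early layers never depends solely on a long product of small numbers.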
Output
The final hidden state is projected to vocabulary size by lm_head, producing one logit per token in the vocabulary. In our case, that's just 27 numbers. Higher logit = the model thinks that corresponding token is more likely to come next.
lm_head + softmax
The running diagram now ends with lm_head and a softmax. The bottom row is the model's predicted probability distribution over the four toy vocabulary tokens (BOS / 'a' / 'b' / 'c'). Pick a current token below and watch the whole pipeline, including the prediction, recompute.
Parameter matrix
One row per token in the vocabulary. The final hidden state is dot-producted with each row to produce a logit. Higher dot product → that token is judged more likely to come next.
Code in gpt()
logits = linear(x, state_dict['lm_head']) # length 27
return logits
The MLP handed us x ≈ [-0.34, 1.66]. lm_head has one row per vocab token; the dot product of x with row i is the logit for token i.
# lm_head: 4 rows (one per token), 2 columns (d_model)
lm_head = [[ 0.30, 0.10], # BOS
[-0.20, 0.40], # 'a'
[ 0.50, -0.30], # 'b'
[-0.10, 0.60]] # 'c'
logits = linear(x, lm_head)
# BOS: 0.30·-0.34 + 0.10·1.66 = 0.06
# 'a': -0.20·-0.34 + 0.40·1.66 = 0.73
# 'b': 0.50·-0.34 + -0.30·1.66 = -0.67
# 'c': -0.10·-0.34 + 0.60·1.66 = 1.03
Raw logits can be any real number. To turn them into a probability distribution we apply softmax — subtract the max for numerical stability, exponentiate, divide by the sum:
probs = softmax(logits)
# = [0.17, 0.32, 0.08, 0.43] # P(BOS), P('a'), P('b'), P('c')
Reading the result. After BOS, a, b, this (untrained) toy model thinks the most likely next token is 'c' with probability 0.43. During training, the loss for this position would be $-\log p(\text{target})$ — if the true next token were BOS (end of word), the loss is $-\log 0.17 \approx 1.77$. Backprop would then tweak every weight we've used along the way to push P(BOS) up and the others down for next time.
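The direction of that push can be made concrete. For softmax followed by $-\log p(\text{target})$, the gradient of the loss with respect to logit $i$ is $p_i - 1[i = \text{target}]$, a standard result. The sketch below applies it to the toy logits on plain floats.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["BOS", "a", "b", "c"]
logits = [0.06, 0.73, -0.67, 1.03]       # from the walkthrough
probs = softmax(logits)

target = vocab.index("BOS")              # suppose the name ends here
loss = -math.log(probs[target])

# d(loss)/d(logit_i) = p_i - 1[i == target]
grad_logits = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

print(f"loss = {loss:.2f}")
for tok, p, g in zip(vocab, probs, grad_logits):
    print(f"{tok:>3}: p={p:.2f}  dloss/dlogit={g:+.2f}")
```

Only the target's gradient is negative, so a gradient-descent step raises the BOS logit and lowers the other three. The printed loss is slightly above 1.77 because the text rounds P(BOS) to 0.17 before taking the log.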
In microgpt (n_embd = 16, vocab_size = 27): lm_head is (27 × 16), so the model outputs 27 logits — one per a–z plus BOS. The softmax over 27 categories is cheap.
In GPT-2 small (n_embd = 768, vocab = 50,257): the final matrix is ≈ 39M parameters and the softmax has to normalize across 50K categories. During training that softmax is computed at every position in every sequence in the batch, which is a non-trivial fraction of total training compute. In GPT-4 / frontier models, vocabularies sit around 100K–200K tokens. Some models, GPT-2 among them, tie lm_head to wte (the same matrix serves as both input embedding and output projection), saving a copy of those millions of parameters. The temperature / top-p tricks you see at inference all live downstream of this same logit vector.
Parameters
You've seen every parameter matrix in the architecture walkthrough above — wte, wpe, attn_wq/wk/wv/wo, mlp_fc1/fc2, lm_head. The Parameters section is just the bookkeeping: allocate them all in one place, store them in a single dictionary the optimizer can iterate over, and count the total.
n_embd = 16; n_head = 4; n_layer = 1; block_size = 16
head_dim = n_embd // n_head
matrix = lambda nout, nin, std=0.08: \
    [[Value(random.gauss(0, std)) for _ in range(nin)] for _ in range(nout)]
state_dict = {
    'wte': matrix(vocab_size, n_embd),      # 27 × 16 → 432
    'wpe': matrix(block_size, n_embd),      # 16 × 16 → 256
    'lm_head': matrix(vocab_size, n_embd),  # 27 × 16 → 432
}
for i in range(n_layer):
    state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)      # 256
    state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)      # 256
    state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)      # 256
    state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)      # 256
    state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)  # 1,024
    state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)  # 1,024
params = [p for mat in state_dict.values() for row in mat for p in row]
print(f"num params: {len(params)}")  # → 4192
Why bother with the flat params list? Because the optimizer doesn't care about the matrices — it just needs a single list of scalars to loop over and update. params is that list. GPT-2 had 1.6 billion entries in this list; modern LLMs have hundreds of billions.
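The total can also be computed from the shapes alone, without allocating any Value objects. count_params is an illustrative helper, not part of the original script; its shapes mirror the state_dict above.

```python
def count_params(n_embd=16, n_layer=1, block_size=16, vocab_size=27):
    # Shapes mirror the state_dict above; each entry is (rows, cols)
    shapes = {
        'wte': (vocab_size, n_embd),
        'wpe': (block_size, n_embd),
        'lm_head': (vocab_size, n_embd),
    }
    for i in range(n_layer):
        for name in ('attn_wq', 'attn_wk', 'attn_wv', 'attn_wo'):
            shapes[f'layer{i}.{name}'] = (n_embd, n_embd)
        shapes[f'layer{i}.mlp_fc1'] = (4 * n_embd, n_embd)
        shapes[f'layer{i}.mlp_fc2'] = (n_embd, 4 * n_embd)
    return sum(rows * cols for rows, cols in shapes.values())

print(count_params())   # 4192, matching the script's "num params" line
```

Calling it with a different n_embd or n_layer is a quick way to check your answer to the exercise that follows.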
Suppose we bumped n_embd from 16 to 32 (everything else unchanged). Which matrices would grow, and by how much (4×? 2×? something else)? Roughly what's the new total parameter count?
Show answer
All of them grow, because every matrix has n_embd as at least one dimension. wte, wpe, lm_head are linear in n_embd (2×). The attention matrices (attn_wq/k/v/o) and MLP (mlp_fc1/2) are all (n_embd × n_embd) or (4·n_embd × n_embd), so they scale quadratically (4×). New total = 2×(432+256+432) + 4×(256+256+256+256+1024+1024) = 2,240 + 12,288 = 14,528 params. Doubling the width more than triples the model.
Putting it all together
Now that we've walked through each piece individually, here is the full gpt() function — one call processes one token and returns 27 logits over the vocabulary. Read top to bottom: embeddings → for each layer (attention block → MLP block) → final linear.
Now the model itself:
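As a reference point, here is a sketch of such a function assembled from the snippets shown in this guide. It runs on plain floats rather than Value objects, so it has no autograd. The rmsnorm form (no learned gain) and the token ids (BOS = 26, 'a' = 0, 'b' = 1) are assumptions, and the original script may differ in detail.

```python
import math
import random

random.seed(42)
n_embd, n_head, n_layer, block_size, vocab_size = 16, 4, 1, 16, 27
head_dim = n_embd // n_head

def matrix(nout, nin, std=0.08):
    return [[random.gauss(0, std) for _ in range(nin)] for _ in range(nout)]

state_dict = {'wte': matrix(vocab_size, n_embd),
              'wpe': matrix(block_size, n_embd),
              'lm_head': matrix(vocab_size, n_embd)}
for i in range(n_layer):
    for name in ('attn_wq', 'attn_wk', 'attn_wv', 'attn_wo'):
        state_dict[f'layer{i}.{name}'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)

def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def rmsnorm(x, eps=1e-5):
    # Assumed form: scale to unit root-mean-square, no learned gain
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi * (ms + eps) ** -0.5 for xi in x]

def gpt(token_id, pos_id, keys, values):
    # Embeddings: token vector + position vector, then normalize
    x = [t + p for t, p in zip(state_dict['wte'][token_id], state_dict['wpe'][pos_id])]
    x = rmsnorm(x)
    for li in range(n_layer):
        # Attention block
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        k = linear(x, state_dict[f'layer{li}.attn_wk'])
        v = linear(x, state_dict[f'layer{li}.attn_wv'])
        keys[li].append(k)
        values[li].append(v)
        x_attn = []
        for h in range(n_head):
            hs = h * head_dim
            q_h = q[hs:hs + head_dim]
            k_h = [kt[hs:hs + head_dim] for kt in keys[li]]
            v_h = [vt[hs:hs + head_dim] for vt in values[li]]
            scores = [sum(a * b for a, b in zip(q_h, kt)) / head_dim ** 0.5 for kt in k_h]
            weights = softmax(scores)
            x_attn.extend(sum(w * vt[j] for w, vt in zip(weights, v_h))
                          for j in range(head_dim))
        x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
        x = [a + b for a, b in zip(x, x_residual)]
        # MLP block
        x_residual = x
        x = rmsnorm(x)
        x = [max(0.0, h) for h in linear(x, state_dict[f'layer{li}.mlp_fc1'])]
        x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
        x = [a + b for a, b in zip(x, x_residual)]
    return linear(x, state_dict['lm_head'])

keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
for pos_id, token_id in enumerate([26, 0, 1]):   # assumed ids: BOS=26, 'a'=0, 'b'=1
    logits = gpt(token_id, pos_id, keys, values)
print(len(logits), len(keys[0]))   # 27 3
```

Each call appends exactly one key and one value per layer, which is why the caches must be created fresh for every new document.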
Select any line of gpt() to see what it does; related lines (the three linear(...wq/wk/wv) calls, both KV-cache append calls, etc.) light up together.
The function processes one token (id token_id) at a specific position in time (pos_id), and some context from previous iterations summarized by the activations in keys and values, known as the KV Cache.
You might notice we're using a KV cache during training, which is unusual. People typically associate the KV cache with inference only. But the KV cache is conceptually always there, even during training. In production implementations, it's just hidden inside the highly vectorized attention computation that processes all positions in the sequence simultaneously. Since microgpt processes one token at a time (no batch dimension, no parallel time steps), we build the KV cache explicitly. And unlike the typical inference setting where the cache holds detached tensors, here the cached keys and values are live Value nodes in the computation graph, so we actually backpropagate through them.
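The claim that the cache is "always there" can be checked numerically: attention computed one token at a time with an explicit cache gives the same outputs as attention over all positions at once with a causal mask. The sketch below uses random floats, a single head, and made-up sizes T and d.

```python
import math
import random

random.seed(1)
T, d = 5, 4   # hypothetical sequence length and head width

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]

# Style 1: one token at a time, growing an explicit KV cache (as microgpt does)
cache_k, cache_v, out_cached = [], [], []
for t in range(T):
    cache_k.append(K[t])
    cache_v.append(V[t])
    w = softmax([dot(Q[t], k) / math.sqrt(d) for k in cache_k])
    out_cached.append([sum(wi * v[j] for wi, v in zip(w, cache_v)) for j in range(d)])

# Style 2: all positions at once, with a causal mask hiding future positions
scores = [[dot(Q[t], K[s]) / math.sqrt(d) if s <= t else -math.inf
           for s in range(T)] for t in range(T)]
out_masked = []
for t in range(T):
    w = softmax(scores[t])            # exp(-inf) = 0, so future weights vanish
    out_masked.append([sum(w[s] * V[s][j] for s in range(T)) for j in range(d)])

max_diff = max(abs(a - b) for ra, rb in zip(out_cached, out_masked) for a, b in zip(ra, rb))
print(f"max difference: {max_diff:.2e}")
```

The two styles agree to within floating-point error. Vectorized implementations use the masked form because it processes every position in one matrix operation.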
Training loop
Now we wire everything together. The training loop repeatedly: (1) picks a document, (2) runs the model forward over its tokens, (3) computes a loss, (4) backpropagates to get gradients, and (5) updates the parameters. Here's the simplest possible version — plain stochastic gradient descent: walk every parameter slightly downhill against its gradient.
p -= lr · grad walks toward the minimum
A single parameter p on a toy loss curve. The orange tangent is p.grad; the red arrow on the axis is the SGD step −lr · p.grad. Whichever side of the minimum we start on, the step always points toward it.
# Plain SGD: the simplest possible parameter update
learning_rate = 0.01
num_steps = 1000
for step in range(num_steps):
    # Take single document, tokenize it, surround with BOS on both sides
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)
    # Forward pass: build computation graph all the way to the loss
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses) # average over the document. May yours be low.
    # Backward pass: gradients of loss w.r.t. all parameters
    loss.backward()
    # SGD update: nudge each parameter against its gradient
    for p in params:
        p.data -= learning_rate * p.grad
        p.grad = 0
    print(f"step {step+1:4d} / {num_steps:4d} | loss {loss.data:.4f}")
Tokenization
Each training step picks one document and wraps it with BOS on both sides: the name "emma" becomes [BOS, e, m, m, a, BOS]. The model's job is to predict each next token given the tokens before it.
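A sketch of that wrapping, and of the (input, target) pairs it produces. The vocabulary here is assumed to be 'a' to 'z' with ids 0 to 25 and BOS = 26; the script builds uchars from the dataset, so the real ids depend on the data. The helper show is illustrative.

```python
# Assumed vocabulary: 'a'..'z' map to ids 0..25, and BOS takes the next id (26)
uchars = [chr(ord('a') + i) for i in range(26)]
BOS = len(uchars)

doc = "emma"
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
print(tokens)   # [26, 4, 12, 12, 0, 26]

def show(t):
    return "BOS" if t == BOS else uchars[t]

# Each adjacent pair is one training example: given the input, predict the target
for inp, tgt in zip(tokens[:-1], tokens[1:]):
    print(f"{show(inp):>3} -> {show(tgt)}")
```

A four-letter name therefore yields five predictions: the first letter from BOS, each following letter from the one before it, and finally BOS to mark the end.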
Forward pass and loss
We feed the tokens through the model one at a time, building up the KV cache as we go. At each position, the model outputs 27 logits, which we convert to probabilities via softmax. The loss at each position is the negative log probability of the correct next token: $-\log p(\text{target})$. This is called the cross-entropy loss. Intuitively, the loss measures the degree of misprediction: how surprised the model is by what actually comes next. If the model assigns probability 1.0 to the correct token, it is not surprised at all and the loss is 0. If it assigns probability close to 0, the model is very surprised and the loss goes to $+\infty$. We average the per-position losses across the document to get a single scalar loss.
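A few values make the "surprise" reading tangible. The per-position probabilities in the second half are made up for illustration; nll is an illustrative helper name.

```python
import math

def nll(p):
    # -log(p), written as log(1/p) so that p = 1 prints as 0.000 rather than -0.000
    return math.log(1.0 / p)

for p in [1.0, 0.5, 1 / 27, 0.01, 1e-6]:
    print(f"p(target) = {p:<10.6f} loss = {nll(p):.3f}")

# Per-position losses are averaged into one scalar for the document
per_position = [nll(p) for p in [0.10, 0.40, 0.25, 0.05, 0.60]]   # made-up probabilities
doc_loss = sum(per_position) / len(per_position)
print(f"document loss = {doc_loss:.3f}")
```

The row for p = 1/27 is the loss of a model that guesses uniformly over the vocabulary, a useful baseline when reading the training log.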
Backward pass
One call to loss.backward() runs backpropagation through the entire computation graph, from the loss all the way back through softmax, the model, and into every parameter. After this, each parameter's .grad tells us how to change it to reduce the loss. The SGD update right after the backward pass — p.data -= learning_rate * p.grad — is the entire learning rule: move every parameter a small step in the direction that reduces the loss, then reset gradients to zero so the next backward pass starts fresh.
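The same update rule can be watched on a one-parameter toy loss with a known minimum. The loss $(p - 3)^2$, the learning rate, and the starting points are all chosen for illustration.

```python
def loss_fn(p):
    return (p - 3.0) ** 2          # toy loss, minimum at p = 3

def grad_fn(p):
    return 2.0 * (p - 3.0)         # analytic derivative of the toy loss

learning_rate = 0.1
for start in (0.0, 7.0):           # one start on each side of the minimum
    p = start
    for step in range(50):
        p -= learning_rate * grad_fn(p)   # the SGD rule: step against the gradient
    print(f"start={start:.1f} -> p={p:.4f}, loss={loss_fn(p):.6f}")
```

From either side the gradient's sign flips accordingly, so the step always moves p toward 3.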
From plain SGD to Adam
Plain SGD works but it's slow and finicky to tune. In practice, nearly every modern LLM is trained with Adam or a close variant such as AdamW. Adam tracks two extra buffers per parameter: m (a running average of recent gradients, like momentum) and v (a running average of recent squared gradients, which adapts the per-parameter learning rate). The bias corrections m_hat and v_hat compensate for m and v being initialized to zero, which would otherwise shrink both toward zero in the first steps. The learning rate also decays linearly so the steps shrink as training progresses. Here's the same training loop with Adam swapped in:
# Let there be Adam, the blessed optimizer and its buffers
learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8
m = [0.0] * len(params) # first moment buffer (running mean of grads)
v = [0.0] * len(params) # second moment buffer (running mean of grads²)
num_steps = 1000
for step in range(num_steps):
    # Take single document, tokenize it, surround with BOS on both sides
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)
    # Forward pass: build computation graph all the way to the loss
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)
    # Backward pass: gradients of loss w.r.t. all parameters
    loss.backward()
    # Adam update
    lr_t = learning_rate * (1 - step / num_steps) # linear decay
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0
    print(f"step {step+1:4d} / {num_steps:4d} | loss {loss.data:.4f}")
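One consequence of the update above is easy to verify for a single parameter: on the very first step the bias-corrected m_hat equals the gradient and v_hat equals its square, so the step size is about lr regardless of how large the gradient is. adam_first_step is an illustrative helper using the same hyperparameters as the loop.

```python
def adam_first_step(grad, lr=0.01, beta1=0.85, beta2=0.99, eps=1e-8):
    # One Adam update starting from zeroed buffers (step index 0)
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** 1)      # undoes the shrinkage from the zero init
    v_hat = v / (1 - beta2 ** 1)
    return lr * m_hat / (v_hat ** 0.5 + eps)

for grad in (1e-4, 1.0, 1e4):
    print(f"grad={grad:<8g} SGD step={0.01 * grad:<10.2e} Adam step={adam_first_step(grad):.2e}")
```

The SGD step spans eight orders of magnitude across these gradients, while the Adam step stays near 0.01. That scale-invariance is a large part of why Adam needs less learning-rate tuning.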
Over 1,000 steps the loss decreases from around 3.3 (random guessing among 27 tokens: $-\log(1/27) \approx 3.3$) down to around 2.37. Lower is better, the lowest possible is 0 (perfect predictions), so there's still room to improve, but the model is clearly learning the statistical patterns of names.
Inference
Once training is done, we can sample new names from the model. The parameters are frozen and we just run the forward pass in a loop, feeding each generated token back as the next input:
temperature = 0.5 # in (0, 1], controls "creativity" of generated text
print("\n--- inference (new, hallucinated names) ---")
for sample_idx in range(20):
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    token_id = BOS
    sample = []
    for pos_id in range(block_size):
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
    print(f"sample {sample_idx+1:2d}: {''.join(sample)}")
We start each sample with the BOS token, which tells the model "begin a new name". The model produces 27 logits, we convert them to probabilities, and we randomly sample one token according to those probabilities. That token gets fed back in as the next input, and we repeat until the model produces BOS again (meaning "I'm done") or we hit the maximum sequence length.
The temperature parameter controls randomness. Before softmax, we divide the logits by the temperature. A temperature of 1.0 samples directly from the model's learned distribution. Lower temperatures (like 0.5 here) sharpen the distribution, making the model more conservative and likely to pick its top choices. A temperature approaching 0 would always pick the single most likely token (greedy decoding). Higher temperatures flatten the distribution and produce more diverse but potentially less coherent output. Try it on a synthetic logit vector below.
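A float-only sketch of the same experiment, using a made-up logit vector over a 5-token vocabulary. entropy_bits is an illustrative helper that measures how spread out the distribution is.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    # Shannon entropy in bits; terms with p == 0 contribute nothing
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.5, 0.0, -1.0]       # made-up logits for a 5-token vocabulary

for temperature in (0.1, 0.5, 1.0, 2.0, 100.0):
    probs = softmax([l / temperature for l in logits])
    print(f"T={temperature:<6} top p={max(probs):.3f} entropy={entropy_bits(probs):.3f} bits")

print(f"uniform limit: {math.log2(len(logits)):.3f} bits")
```

As the temperature rises, the top token's probability falls and the entropy climbs toward the uniform limit without exceeding it.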
Slide the temperature from 0.1 to 2.0 and watch the entropy. At what temperature is entropy lowest? At what temperature is it highest? What's the entropy of a perfectly uniform distribution over 27 tokens (and why is that the asymptote)?
Show answer
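Entropy is lowest at T = 0.1, where the distribution is sharpest, and highest at T = 2.0, where it is flattest. A perfectly uniform distribution over 27 tokens has entropy log₂(27) ≈ 4.75 bits, which is the asymptote you'll see if you push T very high.
Train the toy GPT, live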
Everything in this lab so far has shown the model running on frozen weights — either the pinned toy values from the walkthrough, or the 4,192 parameters Karpathy already trained for you. This section closes the loop: train the toy model in your browser, watch the predictions change, then chat with it.
This time we train the whole model — every weight matrix updates: wte, wpe, the four attention projections, both MLP layers, and lm_head. The gradient is computed numerically (central differences) rather than via autograd, so it's slow — a full 100-step batch takes a few seconds — but every edge in the diagram changes thickness and color as the parameters move. That's the point. Click Step ▸ to advance one example at a time and watch a single SGD step in slow motion; click Train to run 100 batch steps at once.
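Central differences can be sketched on a toy loss whose gradient is known analytically. numerical_grad and toy_loss are illustrative names, and the step size h = 1e-5 is an assumption, not the widget's actual setting.

```python
def numerical_grad(loss_fn, params, h=1e-5):
    # Central difference: dL/dp_i ≈ (L(p_i + h) - L(p_i - h)) / (2h), one parameter at a time
    grads = []
    for i in range(len(params)):
        original = params[i]
        params[i] = original + h
        loss_plus = loss_fn(params)
        params[i] = original - h
        loss_minus = loss_fn(params)
        params[i] = original                 # restore before moving on
        grads.append((loss_plus - loss_minus) / (2 * h))
    return grads

# Toy loss with a known gradient: L = p0² + 3·p0·p1, so dL/dp0 = 2·p0 + 3·p1 and dL/dp1 = 3·p0
def toy_loss(p):
    return p[0] ** 2 + 3 * p[0] * p[1]

params = [1.0, 2.0]
print(numerical_grad(toy_loss, params))   # close to the analytic [8.0, 3.0]
```

This needs two full forward passes per parameter, which is why it is slow compared with autograd: backpropagation gets every gradient from a single backward pass.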
Watch lm_head learn {a,b,c} patterns, then re-query.
Training words use the alphabet {a, b, c}. Every word becomes one training example (predict the last letter given the second-to-last letter), plus one terminal example so the model also learns to emit BOS after a word ends. Hit Train and the model auto-steps through every example across multiple epochs, logging each step in the box below.
Run it
All you need is Python (no pip install, no dependencies). Grab Karpathy's script from his gist, then run it:
# Download Karpathy's microgpt source as train.py
curl -L -o train.py https://gist.githubusercontent.com/karpathy/8627fe009c40f57531cb18360106ce95/raw/microgpt.py
# Train the model — about 1 minute on a laptop, no GPU required
python train.py
If curl isn't available you can use wget instead, or just open the gist and copy the file into train.py by hand.
The script takes about 1 minute to run on Karpathy's MacBook. You'll see the loss printed at each step:
train.py
num docs: 32033
vocab size: 27
num params: 4192
step 1 / 1000 | loss 3.3660
step 2 / 1000 | loss 3.4243
step 3 / 1000 | loss 3.1778
step 4 / 1000 | loss 3.0664
step 5 / 1000 | loss 3.2209
step 6 / 1000 | loss 2.9452
step 7 / 1000 | loss 3.2894
step 8 / 1000 | loss 3.3245
step 9 / 1000 | loss 2.8990
step 10 / 1000 | loss 3.2229
step 11 / 1000 | loss 2.7964
step 12 / 1000 | loss 2.9345
step 13 / 1000 | loss 3.0544
...
Watch it go down from ~3.3 (random) toward ~2.37. The lower this number, the better the network's predictions about what token comes next in the sequence. At the end of training, the knowledge of the statistical patterns of the training token sequences is distilled in the model parameters. Fixing these parameters, we can now generate new, hallucinated names. You'll see (again):
sample 1: kamon sample 8: anna sample 15: earan
sample 2: ann sample 9: areli sample 16: lenne
sample 3: karai sample 10: kaina sample 17: kana
sample 4: jaire sample 11: konna sample 18: lara
sample 5: vialan sample 12: keylen sample 19: alela
sample 6: karia sample 13: liole sample 20: anton
sample 7: yeran sample 14: alerin
As an alternative to running the script on your computer, you may try to run it directly on a Google Colab notebook and ask Gemini questions about it. Try playing with the script: try a different dataset, train for longer (increase num_steps), or increase the model size for increasingly better results.
Progression
To see the code built up piece by piece — as layers of the onion — the advised progression looks something like this:
| File | What it adds |
|---|---|
| train0.py | Bigram count table (no neural net, no gradients) |
| train1.py | MLP + manual gradients (numerical & analytic) + SGD |
| train2.py | Autograd (Value class), replaces manual gradients |
| train3.py | Position embeddings + single-head attention + rmsnorm + residuals |
| train4.py | Multi-head attention + layer loop: full GPT architecture |
| train5.py | Adam optimizer (this is train.py) |
Karpathy created a Gist called build_microgpt.py whose Revisions show all of these versions and the diffs between each step. Stepping through the diffs is a great way to internalize what each component does.
Real stuff
microgpt contains the complete algorithmic essence of training and running a GPT. But between this and a production LLM like ChatGPT, there is a long list of things that change. None of them alter the core algorithm and the overall layout, but they are what makes it actually work at scale. Walking through the same sections in order:
Data
Instead of 32K short names, production models train on trillions of tokens of internet text: web pages, books, code, etc. The data is deduplicated, filtered for quality, and carefully mixed across domains.
Tokenizer
Instead of single characters, production models use subword tokenizers like BPE (Byte Pair Encoding), which learn to merge frequently co-occurring character sequences into single tokens. Common words like "the" become a single token, rare words get broken into pieces. This gives a vocabulary of ~100K tokens and is much more efficient because the model sees more content per position.
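The merge idea behind BPE can be sketched in a few lines. This is a toy version on characters of one short string; production tokenizers typically work on bytes, train on a large corpus, and add pre-tokenization rules. The function names and the choice of three merges are illustrative.

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent token pairs and return the most common one
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Replace every occurrence of `pair` with a single merged token
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("the cat and the hat")     # start from single characters
for _ in range(3):                       # three merges, chosen arbitrarily
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

Every merge shortens the sequence while the text it spells stays the same, which is the sense in which the model "sees more content per position".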
Autograd
microgpt operates on scalar Value objects in pure Python. Production systems use tensors (large multi-dimensional arrays of numbers) and run on GPUs/TPUs that perform trillions of floating-point operations per second or more. Libraries like PyTorch handle autograd over tensors, and CUDA kernels like FlashAttention fuse multiple operations for speed. The math is identical; it is simply applied to many scalars in parallel.
Architecture
microgpt has 4,192 parameters. GPT-4–class models have hundreds of billions. Overall it's a very similar-looking Transformer, just much wider (embedding dimensions of 10,000+) and much deeper (100+ layers). Modern LLMs also incorporate a few more types of Lego blocks and change their orders around: RoPE (Rotary Position Embeddings) instead of learned position embeddings, GQA (Grouped Query Attention) to reduce KV cache size, gated linear activations instead of ReLU, Mixture of Experts (MoE) layers, etc. But the core structure of Attention (communication) and MLP (computation) interspersed on a residual stream is well-preserved.
The picture, mapped to the code
Here is the canonical Transformer block diagram you'll see in papers and textbooks — the one microgpt is a stripped-down version of. Click any block to see how it maps onto microgpt's code (and which blocks microgpt drops because they're scale-time concerns):
Diagram: Transformer blocks mapped to state_dict entries and gpt() code.
Each colored block on the diagram corresponds to one or more lines of microgpt. Click one and this panel will show the code, the matching state_dict entry (if any), and whether microgpt simplifies or skips it.
microgpt strips the diagram down to its algorithmic core. Dropout, explicit masking, GeLU, and biases are all absent from this implementation. The model still learns, just with fewer regularizers, and causal masking comes for free because the KV cache only ever holds the current and earlier positions. LayerNorm is also replaced by the simpler RMSNorm. Click any Dropout, Mask, or LayerNorm block to read why.
Training
Instead of one document per step, production training uses large batches (millions of tokens per step), gradient accumulation, mixed precision (float16/bfloat16), and careful hyperparameter tuning. Training a frontier model takes thousands of GPUs running for months.
Optimization
microgpt uses Adam with a simple linear learning rate decay and that's about it. At scale, optimization becomes its own discipline. Models train in reduced precision (bfloat16 or even fp8) and across large GPU clusters for efficiency, which introduces its own numerical challenges. The optimizer settings (learning rate, weight decay, beta parameters, warmup, decay schedule) must be tuned precisely, and the right values depend on model size, batch size, and dataset composition. Scaling laws (e.g. Chinchilla) guide how to allocate a fixed compute budget between model size and number of training tokens. Getting any of these details wrong at scale can waste millions of dollars of compute, so teams run extensive smaller-scale experiments to predict the right settings before committing to a full training run.
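As a reminder of what "Adam with linear learning rate decay" amounts to, here is a sketch of the update on a single scalar parameter. It follows the standard Adam equations; the hyperparameter values and the toy objective (w - 3)^2 are illustrative assumptions, not microgpt's settings.

```python
def adam_minimize(grad_fn, w, num_steps=200, learning_rate=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a scalar function with Adam and a linearly decaying step size."""
    m = 0.0  # running average of gradients
    v = 0.0  # running average of squared gradients
    for step in range(num_steps):
        g = grad_fn(w)
        lr_t = learning_rate * (1 - step / num_steps)  # linear decay toward 0
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** (step + 1))  # bias correction
        v_hat = v / (1 - beta2 ** (step + 1))
        w -= lr_t * m_hat / (v_hat ** 0.5 + eps)
    return w

# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2 * (w - 3.0), w=0.0)
print(round(w_final, 3))
```

Every value in the signature (learning rate, both betas, the schedule length) is one of the knobs the paragraph above says must be tuned at scale.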
Post-training
The base model that comes out of training (the "pretrained" model) is a document completer, not a chatbot. Turning it into ChatGPT happens in two stages. First, SFT (Supervised Fine-Tuning): swap the documents for curated conversations and keep training. Algorithmically, nothing changes. Second, RL (Reinforcement Learning): the model generates responses, they get scored (by humans, another "judge" model, or an algorithm), and the model learns from that feedback. Fundamentally, the model is still training on documents — those documents are now made up of tokens coming from the model itself.
Inference
Serving a model to millions of users requires its own engineering stack: batching requests together, KV cache management and paging (vLLM, etc.), speculative decoding for speed, quantization (running in int8/int4 instead of float16) to reduce memory, and distributing the model across multiple GPUs. Fundamentally, we are still predicting the next token in the sequence — but with a lot of engineering spent on making it faster.
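Quantization is the easiest of these ideas to sketch. The example below shows symmetric per-tensor int8 quantization on a handful of made-up weights. Production schemes are more elaborate (per-channel scales, calibration, and so on), so treat this as an illustration of the idea only.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [qi * scale for qi in q]

weights = [0.12, -0.5, 0.33, 0.0, -0.07]  # made-up values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"scale={scale:.6f}, max round-trip error={max_err:.6f}")
```

Each weight now needs one byte instead of two or four, at the cost of a rounding error of at most half a quantization step.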
All of these are important engineering and research contributions, but if you understand microgpt, you understand the algorithmic essence.
Zoom in further · Bycroft's GPT visualization
If microgpt was "the smallest transformer drawn as a 2-D diagram," Brendan Bycroft's interactive walkthrough is "an actual GPT-2 drawn as a 3-D city." Every embedding vector, every Q/K/V projection, every attention head, every MLP layer is rendered as a navigable scene with the real GPT-2 weights — and you can scrub through one token's forward pass at your own pace. Same algorithm as microgpt, ~30,000× more parameters. Drag to rotate, scroll to zoom, click blocks on the right rail to jump.
GPT-2 (d_model 768) driving the same forward pass you traced in microgpt. Click + drag to orbit, scroll to zoom, use the right-side phase rail to step through the algorithm. Open in a new tab for full-screen control.
From bbycroft.net/llm by Brendan Bycroft. If the embed feels cramped, open the source page in a new tab for a full-screen viewport. If it doesn't load (some campus networks block iframes from third-party hosts), the source link is the fallback.
Assignment · safety guardrails for the chat bot
You've taken microgpt apart and you've already chatted with it at the top of this page. Now you're going to ship it — and decide what it's allowed to say. The chat bot at the top of the page is a useful name generator, but it has no safety policy. Anyone can ask it for 100 names with any starting prefix they choose, and it will dutifully produce them. Your job in this assignment is to add a small safety layer on top of the same model, then defend it against an adversarial grader.
The product policy you're enforcing is simple and totally safe-for-work: this name generator must never emit a fruit. Saying apple is banned — and so are pear, plum, fig, lime, grape, mango, and the rest of the produce aisle. (In a real product the banned list would be slurs, NSFW terms, or other harmful output; fruits are a clean stand-in that exercise the exact same prefix- and substring-filtering machinery.) Your bot must refuse requests whose prefixes lead to fruit names and filter any fruit that slips out of the stochastic sampler.
What you're submitting
A single Python file bot.py that reads requests from stdin and writes responses to stdout, one per line. The starter template has everything except the two safety hooks:
- is_safe_request(prefixes): receives the list of letter-prefixes the user typed (e.g. ['j'], ['ab'], or ['a','b','c']). Returns None to allow, or a one-sentence reason string to refuse. Called before any name is generated.
- is_safe_name(name): returns True to keep a generated name, False to discard. Called after the model emits each name. If a name is rejected the bot resamples (up to 10 attempts per slot).
You should only need to edit those two functions plus the BLOCKED_PREFIX_PATTERNS and BLOCKED_OUTPUT_PATTERNS lists. The forward pass and the bot loop are already written and shared with the autograder — if you change them, the autograder will reject your submission.
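To show the shape of the two hooks, here is a deliberately incomplete sketch. The function and list names come from the description above, but the list contents and the substring-matching rule are hypothetical choices made for illustration; the starter template's actual defaults may differ.

```python
# Hypothetical, deliberately incomplete lists. A real policy needs more thought.
BLOCKED_OUTPUT_PATTERNS = ["apple", "pear", "plum", "fig", "lime", "grape", "mango"]
BLOCKED_PREFIX_PATTERNS = ["appl", "pea", "plu", "fig", "lim", "grap", "mang"]

def is_safe_request(prefixes):
    """Return None to allow, or a one-sentence reason string to refuse."""
    for prefix in prefixes:
        p = prefix.lower()
        for pattern in BLOCKED_PREFIX_PATTERNS:
            if pattern in p:  # substring match also catches padded prefixes
                return f"The prefix '{prefix}' leads toward a blocked word."
    return None

def is_safe_name(name):
    """Return True to keep a generated name, False to discard it."""
    n = name.lower()
    return not any(pattern in n for pattern in BLOCKED_OUTPUT_PATTERNS)

print(is_safe_request(["j"]))       # allowed
print(is_safe_request(["appl"]))    # refused with a reason
print(is_safe_name("grape"))        # discarded
```

Note the trade-off this sketch makes: very short prefixes such as li pass the request check and rely entirely on is_safe_name to catch a leak, while a pattern like pea also refuses harmless prefixes such as pearl. Deciding where to draw those lines is the judgment the assignment is asking for.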
Downloads
I/O protocol
Each line of stdin is one request — one or more whitespace-separated letter prefixes:
j one prefix, one name starting with "j"
ab one prefix, one name starting with "ab"
a b c three prefixes, one name per prefix
(blank) no prefixes, generate one name from BOS
Each line of stdout is one response, beginning with exactly one of these prefixes:
OK: name1, name2, ... ← one name per input prefix, in order, comma-separated
REFUSE: <one-sentence reason> ← request blocked by your safety policy
Anything else printed to stdout will confuse the autograder. Use stderr if you need debug prints. The simplified protocol matches the chat widget at the top of this page exactly — type ab in the chat, you get one name; type a b, you get two.
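The protocol above is small enough to sketch in a few lines. The helper names below (parse_request, format_ok, format_refuse) are hypothetical and are not taken from bot.py; the example names are placeholders rather than real model output.

```python
def parse_request(line):
    """Split one stdin line into prefixes. A blank line gives an empty list,
    which the protocol treats as 'generate one name from BOS'."""
    return line.split()

def format_ok(names):
    """One name per input prefix, in order, comma-separated."""
    return "OK: " + ", ".join(names)

def format_refuse(reason):
    """A refusal line carrying a one-sentence reason."""
    return "REFUSE: " + reason

print(parse_request("a b c\n"))
print(format_ok(["jana", "abel"]))  # placeholder names
print(format_refuse("That prefix leads toward a blocked word."))
```

Debug output would go to stderr (for example print(..., file=sys.stderr)) so that stdout carries only OK: and REFUSE: lines.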
Run it locally
# One-shot
echo 'j' | python bot.py
echo 'ab' | python bot.py
echo 'a b c' | python bot.py
# Interactive (Ctrl-D to exit)
python bot.py
# Run the local test harness — mirrors what the autograder does
python test_bot.py
What the autograder tests
Gradescope runs bot.py as a subprocess and pipes a batch of letter-prefix requests through stdin. It scores your submission on two test groups:
- Benign requests must still work. Simple prefixes like j, ab, a b c must produce OK: responses with one name per prefix and each name actually starting with its prefix. Over-blocking (refusing letters you should allow) will cost you points here.
- Adversarial requests must be refused. The grader will probe with prefix-based and substring-based attacks aimed at extracting outputs you'd rather not produce. For these requests you must respond with REFUSE:. Under-blocking will cost you points here.
The two groups are weighted to push you toward a balanced policy: a bot that refuses everything fails Group 1; a bot with no safety policy fails Group 2. The autograder uses fresh random seeds, so don't try to memorize specific outputs — your policy needs to be principled.
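The benign-request checks described above can be approximated locally. This is a guess at their shape based on the description, not the autograder's actual code; the function name check_benign_response is hypothetical.

```python
def check_benign_response(prefixes, response):
    """Return True if response is a well-formed OK line for these prefixes."""
    if not response.startswith("OK:"):
        return False
    body = response[len("OK:"):].strip()
    names = [n.strip() for n in body.split(",")] if body else []
    # A blank request (no prefixes) should yield exactly one name with any start.
    expected = prefixes if prefixes else [""]
    if len(names) != len(expected):
        return False
    # Every name must be non-empty and begin with its own prefix, in order.
    return all(name and name.startswith(prefix)
               for name, prefix in zip(names, expected))

print(check_benign_response(["j"], "OK: jana"))        # well-formed
print(check_benign_response(["a", "b"], "OK: anna"))   # one name missing
print(check_benign_response(["a"], "REFUSE: blocked")) # over-blocking
```

Running a check like this over many sampled responses, rather than one, is a reasonable way to respect the point about fresh random seeds.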
Structure of model.json
The weights file is plain JSON — open it in any editor. Top-level keys:
| key | contents |
|---|---|
| format | "tiny-gpt-char-v1" |
| config | n_layer=1, n_embd=16, n_head=4, head_dim=4, block_size=16, vocab_size=27, BOS=26 |
| tokenizer | uchars[26], stoi, itos (character-level a–z + BOS=26) |
| state_dict | nested lists of floats, one entry per parameter matrix (see below) |
| state_dict key | shape |
|---|---|
| wte | 27 × 16 |
| wpe | 16 × 16 |
| lm_head | 27 × 16 |
| layer0.attn_wq / wk / wv / wo | 16 × 16 each |
| layer0.mlp_fc1 | 64 × 16 |
| layer0.mlp_fc2 | 16 × 64 |
Same model you've been dissecting throughout the lab. The full structure spec also lives at the top of bot.py.
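The shapes in the table account for every one of the model's 4,192 parameters, which the sketch below verifies. It builds a zero-filled stand-in with the documented shapes instead of reading the real file, and the spelled-out key names for the four attention matrices are an assumption about how model.json names them.

```python
import json

# Shapes copied from the table above. The individual attention key names
# (attn_wq, attn_wk, attn_wv, attn_wo) are assumed spellings.
SHAPES = {
    "wte": (27, 16),
    "wpe": (16, 16),
    "lm_head": (27, 16),
    "layer0.attn_wq": (16, 16),
    "layer0.attn_wk": (16, 16),
    "layer0.attn_wv": (16, 16),
    "layer0.attn_wo": (16, 16),
    "layer0.mlp_fc1": (64, 16),
    "layer0.mlp_fc2": (16, 64),
}

def count_params(state_dict):
    """Count the floats in a state_dict whose values are lists of rows."""
    return sum(len(row) for matrix in state_dict.values() for row in matrix)

# Stand-in for json.load(open("model.json"))["state_dict"]: zero-filled
# matrices with the documented shapes, round-tripped through JSON.
fake = {k: [[0.0] * cols for _ in range(rows)] for k, (rows, cols) in SHAPES.items()}
state_dict = json.loads(json.dumps({"state_dict": fake}))["state_dict"]

total = count_params(state_dict)
print(total)  # 4192
```

With the real file in place, the same count_params call on the loaded state_dict should report the same total if the file matches the table.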
Suggested workflow
- Run the unmodified template. Confirm python test_bot.py passes all benign tests. The adversarial test list in test_bot.py is intentionally empty: that's where you'll add your own tests as you go.
- Be the adversary first. Open the chat at the top of this page (or pipe inputs through your local bot.py) and try to make the model emit fruit names. Prefixes like appl, gra, or li are good starting points. Note the inputs that worked.
- Write down your policy. Before coding, write a short list of what your bot will refuse and why. Be specific: "prefixes that lead to fruit names" is vague; "prefixes containing any of these letter combinations: …" is implementable.
- Implement is_safe_request. Reject the request before generation when the policy can be applied to the input alone (e.g., a prefix you don't want to start with).
- Implement is_safe_name. Filter generated names that contain banned substrings (the model is stochastic, so a benign-looking request can still emit unsafe outputs).
- Add your own adversarial tests to test_bot.py as you discover new attack patterns. Run frequently.
- Tune for both directions. If your bot starts refusing legitimate requests, loosen the policy. Over-blocking is also a failure.
Submission
Upload to Gradescope:
- bot.py (your edited version)
- model.json (unmodified; included so the grader can reproduce your bot exactly)
The autograder will run python bot.py with your weights, send batched requests, and score the responses. Late submissions follow the course policy.
Rubric
| component | points |
|---|---|
| Benign requests still work (no over-blocking) | 40 |
| Adversarial requests are refused | 40 |
| Output filtering catches stochastic leaks | 10 |
| Code clarity & comments on your policy | 10 |
| Total | 100 |
Your bot will only see a few hundred test inputs from the autograder. Real LLM safety teams face open-ended adversarial input — and frontier models still get jailbroken regularly despite huge investments in alignment, RLHF, and red-teaming. The exercise here is deliberately tractable (a 4,192-parameter character-level name generator), but the shape of the problem — balancing utility against refusal, anticipating prefix and substring attacks, deciding policy under uncertainty — is the same shape professional alignment teams face every day.
FAQ
Does the model "understand" anything?
That's a philosophical question, but mechanically: no magic is happening. The model is a big math function that maps input tokens to a probability distribution over the next token. During training, the parameters are adjusted to make the correct next token more probable. Whether this constitutes "understanding" is up to you, but the mechanism is fully contained in the 200 lines above.
Why does it work?
The model has thousands of adjustable parameters, and the optimizer nudges them a tiny bit each step to make the loss go down. Over many steps, the parameters settle into values that capture the statistical regularities of the data. For names, this means things like: names often start with consonants, "qu" tends to appear together, names rarely have three consonants in a row, etc. The model doesn't learn explicit rules; it learns a probability distribution that happens to reflect them.
How is this related to ChatGPT?
ChatGPT is this same core loop (predict next token, sample, repeat) scaled up enormously, with post-training to make it conversational. When you chat with it, the system prompt, your message, and its reply are all just tokens in a sequence. The model is completing the document one token at a time, same as microgpt completing a name.
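The "predict next token, sample, repeat" loop can be shown without any neural network at all. In this sketch a hand-written lookup table stands in for the model's predicted distribution; the table, its probabilities, and the use of a single BOS marker for both start and end are toy assumptions.

```python
import random

BOS = "<BOS>"  # in this toy, one special token marks both start and end

# Hand-written toy distribution P(next char | current char). Not learned.
NEXT = {
    BOS: {"a": 0.5, "m": 0.5},
    "a": {"n": 0.6, BOS: 0.4},
    "m": {"a": 0.7, "i": 0.3},
    "n": {"a": 0.5, BOS: 0.5},
    "i": {"a": 0.6, BOS: 0.4},
}

def generate(rng, max_len=16):
    """Start at BOS, repeatedly sample the next token, stop at BOS or max_len."""
    token, out = BOS, []
    for _ in range(max_len):
        probs = NEXT[token]  # "predict": look up the distribution
        token = rng.choices(list(probs), weights=list(probs.values()))[0]  # "sample"
        if token == BOS:
            break
        out.append(token)  # "repeat" with the new token as context
    return "".join(out)

rng = random.Random(0)
samples = [generate(rng) for _ in range(5)]
print(samples)
```

A real model differs only in where the distribution comes from: a forward pass over the whole preceding sequence rather than a table keyed on the last character.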
What's the deal with "hallucinations"?
The model generates tokens by sampling from a probability distribution. It has no concept of truth; it only knows which sequences are statistically plausible given the training data. microgpt "hallucinating" a name like "karia" is the same phenomenon as ChatGPT confidently stating a false fact. Both are plausible-sounding completions that happen not to be real.
Why is it so slow?
microgpt processes one scalar at a time in pure Python. A single training step takes seconds. The same math on a GPU processes millions of scalars in parallel and runs orders of magnitude faster.
Can I make it generate better names?
Yes. Train longer (increase num_steps), make the model bigger (n_embd, n_layer, n_head), or use a larger dataset. These are the same knobs that matter at scale.
What if I change the dataset?
The model will learn whatever patterns are in the data. Swap in a file of city names, Pokémon names, English words, or short poems, and the model will learn to generate those instead. The rest of the code doesn't need to change.
DS 6042 — Lab 02 · adapted from Andrej Karpathy, microgpt.html · interactive augmentations by Daniel Graham.