DS 6042 · Lab 02 · Feb 12, 2026

microgpt — guided walkthrough

A 200-line GPT, taken apart and rebuilt in front of you.

Andrej Karpathy's post, augmented into a lab by Daniel Graham. Interactive visualizations for the parts that are easy to skim but worth understanding.

Before we can begin evaluating and auditing AI systems, we have to understand them from first principles. On Feb 12, 2026, Andrej Karpathy (co-founder at OpenAI; helped build Tesla Autopilot) released a 200-line pure-Python program implementing the fundamental ideas behind GPT. I've taken his post and turned it into a lab with exercises and visuals to help us understand the concepts deeply rather than skim them. Karpathy's post is already well written — the goal is to augment it. The Python here is also rewritten in a slightly less compressed style: ~2XX lines instead of 200, but a bit easier to read. As always, feel free to work with the people at your table. You've got this.

Original post: karpathy.ai/microgpt.html · companion video on autograd: The spelled-out intro to neural networks and backpropagation (2.5 hr)

Try it · generate names from the trained microgpt

microgpt is tiny (just 4,192 parameters) but it's still a real neural language model. Karpathy trained one on 32,033 first names (the makemore dataset). The weights are loaded right here in your browser, and the same forward pass you'll dissect later in the lab runs every time you press Send.

Type a single letter and press Enter.

Temperature 0.70 low = conservative · high = wild

Take it with you: ↓ model.json (weights)

Where to find it

GitHub gist with the full source code: microgpt.py
Also available on this web page: karpathy.ai/microgpt.html
Also available as a Google Colab notebook — you can run it without installing anything

The following is a guide that steps an interested reader through the code.

Dataset

The fuel of large language models is a stream of text data, optionally separated into a set of documents. In production-grade applications, each document would be an internet web page. For microgpt, we use a simpler dataset of 32,033 names, one per line:

import os
import random

# Let there be an input dataset `docs`: list[str] of documents (e.g. a dataset of names)
if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]
random.shuffle(docs)
print(f"num docs: {len(docs)}")

The dataset looks like this. Each name is a document:

emma
olivia
ava
isabella
sophia
charlotte
mia
amelia
harper
... (~32,000 names follow)

The goal of the model is to learn the patterns in the data and then generate similar new documents that share the statistical patterns within. As a preview, by the end of the script our model will generate ("hallucinate"!) new, plausible-sounding names. Skipping ahead, we'll get:

sample  1: kamon         sample  8: anna          sample 15: earan
sample  2: ann           sample  9: areli         sample 16: lenne
sample  3: karai         sample 10: kaina         sample 17: kana
sample  4: jaire         sample 11: konna         sample 18: lara
sample  5: vialan        sample 12: keylen        sample 19: alela
sample  6: karia         sample 13: liole         sample 20: anton
sample  7: yeran         sample 14: alerin

It doesn't look like much, but from the perspective of a model like ChatGPT, your conversation with it is just a funny-looking "document". When you initialize the document with your prompt, the model's response from its perspective is just a statistical document completion.

Tokenizer

Under the hood, neural networks work with numbers, not characters, so we need a way to convert text into a sequence of integer token ids and back. Production tokenizers like tiktoken (used by GPT-4) operate on chunks of characters for efficiency, but the simplest possible tokenizer just assigns one integer to each unique character in the dataset:

# Let there be a Tokenizer to translate strings to discrete symbols and back
uchars = sorted(set(''.join(docs)))   # unique characters become token ids 0..n-1
BOS = len(uchars)                     # token id for Beginning of Sequence
vocab_size = len(uchars) + 1          # total tokens, +1 for BOS
print(f"vocab size: {vocab_size}")

We collect all unique characters across the dataset (which are just the lowercase letters a–z), sort them, and each letter gets an id by its index. The integer values themselves carry no meaning — each token is just a discrete symbol. Instead of 0, 1, 2 they could be different emoji. We also create one special token, BOS (Beginning of Sequence), which acts as a delimiter: it tells the model "a new document starts/ends here". Later during training, each document gets wrapped with BOS on both sides: [BOS, e, m, m, a, BOS]. The model learns that BOS initiates a new name, and that another BOS ends it. So we have a vocabulary of 27 (26 lowercase letters + BOS).
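
To make the id mapping concrete, here is a minimal sketch of both directions. The helper names encode and decode are made up for this illustration (they are not taken from microgpt.py), and the alphabet is hard-coded to a-z on the assumption, stated above, that those are the only characters in the dataset.

```python
import string

# Assumption: the dataset's unique characters are exactly a-z, as described above.
uchars = sorted(string.ascii_lowercase)
BOS = len(uchars)  # 26

def encode(doc):
    """Hypothetical helper: wrap a name in BOS and map each character to its id."""
    return [BOS] + [uchars.index(ch) for ch in doc] + [BOS]

def decode(ids):
    """Hypothetical helper: map ids back to characters, dropping BOS."""
    return ''.join(uchars[i] for i in ids if i != BOS)

print(encode("emma"))          # [26, 4, 12, 12, 0, 26]
print(decode(encode("emma")))  # emma
```

The round trip is lossless for any lowercase name, which is all a tokenizer has to guarantee.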

Tokenizer playground

Type a name. It gets wrapped in BOS and converted to integer ids. This is exactly what the model sees as input.

Name:

Try it

The character "a" is the first letter of the alphabet, so it has id 0. What's the id of "z"? Of BOS? If your full name has 9 letters, how many tokens does the model see when you train on it?

Show answer

"z" is id 25 (last of a–z, indices 0..25). BOS is 26 (length of uchars = 26 alphabet letters). A 9-letter name produces 9 + 2 = 11 tokens: BOS, the 9 letters, then BOS again.

From a neuron to a network

Before we open up gpt() and stare at multi-head attention, let's build up the underlying object — the neuron — and stack neurons into a network. The end goal of this section: by the time we hit the architecture diagram, every box in it will feel like an obvious composition of things we already understand.

Here's roughly where we're going. Don't worry about the details — file the picture mentally, then we'll build to it. (You can already play with this — drag the input sliders and watch the activations propagate.)

Step 1

The simplest "neuron"

One input x, one bias b, and an output a = x + b. That's it — just an adder. No learning yet, no bend in the output. It's a useful starting object because every more complex neuron is just this one with more parts bolted on.

def neuron(x, b):
    return x + b

Try it

If x = 3 and b = -1, what does the neuron output? What if I want this neuron to always output 0 no matter the input? What b would I need (and would it work for every x)?

Show answer

Output is 3 + (−1) = 2. To force the output to 0 we'd need b = −x, which depends on x — a single bias can't do it. That's why we'll add a weight next: it lets the neuron scale its input before the bias.

Step 2

Add a weight

Multiply the input by a learned weight w before adding the bias: a = x*w + b. Now the neuron has two knobs. With both w and b the neuron can scale and shift, so it can represent any affine response (affine = scale the input, then shift it). This is the canonical "linear neuron".

def neuron(x, w, b):
    return x * w + b

Step 3

Add a nonlinearity (ReLU)

Stacking linear neurons on top of linear neurons just gives you another linear function. To learn interesting things, we need a nonlinearity. ReLU is the simplest: $f(z) = \max(0, z)$. It passes positive values through and zeros out negative ones.

def relu(z):
    return max(0, z)

def neuron(x, w, b):
    z = x * w + b
    a = relu(z)
    return a
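
The claim that stacking linear neurons only gives another linear function can be checked numerically. The sketch below uses arbitrary example weights (not values from the lab) and shows that two stacked linear neurons behave exactly like one linear neuron with w = w1·w2 and b = b1·w2 + b2.

```python
def linear_neuron(x, w, b):
    return x * w + b

w1, b1 = 2.0, -3.0   # first neuron (arbitrary example values)
w2, b2 = 0.5, 1.0    # second neuron, fed by the first

def stacked(x):
    return linear_neuron(linear_neuron(x, w1, b1), w2, b2)

# The same function written as ONE linear neuron
w, b = w1 * w2, b1 * w2 + b2

for x in [-2.0, 0.0, 1.5, 4.0]:
    assert abs(stacked(x) - linear_neuron(x, w, b)) < 1e-12

print(w, b)  # 1.0 -0.5
```

Put a ReLU between the two neurons and the collapse no longer works, which is the reason the nonlinearity is there.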

Try it

With w = 2 and b = -3, plug in x = 1 and x = 4. What does the neuron output in each case? At what value of x does the ReLU "turn on" — i.e., where does the output stop being zero?

Show answer

x = 1 → z = 1·2 − 3 = −1 → a = max(0, −1) = 0. x = 4 → z = 4·2 − 3 = 5 → a = 5. The ReLU turns on where z = 0, i.e. 2x − 3 = 0, so x = 3/2 = 1.5. With these settings the neuron acts as a threshold: zero below x = 1.5, rising linearly above it.

Step 4

Many inputs in, one output out

Real neurons take a vector of inputs. Each input x_i has its own weight w_i; the neuron sums them up, adds bias, and applies ReLU:

$$ a = \mathrm{ReLU}\!\left(\sum_{i=1}^{n} x_i w_i + b\right) $$

def neuron(x, w, b):           # x and w are lists of length n
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return max(0, z)

The inner sum is a dot product — the fundamental operation of neural networks. In microgpt, linear(x, w) does this dot product once per row of w. (Karpathy's version drops the bias b — modern Transformers often do.)
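
As a rough sketch of what a helper like linear(x, w) could look like, based only on the description above (one dot product per row of w, no bias); check microgpt.py for the exact definition:

```python
def linear(x, w):
    # One dot product per row of w; no bias term.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

x = [1.0, 2.0]
w = [[1.0, 0.0],   # row 0 picks out x[0]
     [0.0, 1.0],   # row 1 picks out x[1]
     [1.0, 1.0]]   # row 2 sums both inputs
print(linear(x, w))  # [1.0, 2.0, 3.0]
```

Each row of w plays the role of one neuron's weight list, so a matrix with three rows is a layer of three (bias-free, activation-free) neurons.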

Python aside · what does zip() do?

Python's built-in zip() walks through two (or more) lists in lockstep and hands back tuples of matching elements — one tuple per "column" — stopping when the shortest list runs out. So for xi, wi in zip(x, w) gives us the i-th input and the i-th weight together on each loop iteration, ready to multiply.

x 0.5 −0.3 1.2

w 0.4 0.7 −0.1

↓ zip(x, w) ↓

→ (0.5, 0.4) (−0.3, 0.7) (1.2, −0.1)

The dot product is then just "sum the products of each pair": $0.5{\cdot}0.4 + (-0.3){\cdot}0.7 + 1.2{\cdot}(-0.1) = 0.20 - 0.21 - 0.12 = -0.13$.

The same pattern shows up everywhere in microgpt — adding token + position embeddings (zip(tok_emb, pos_emb)), residual sums (zip(x, x_residual)), every matrix-vector multiply inside linear(). Anywhere you see two same-length lists walked together, zip is the glue.
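
The zip() walkthrough above can be reproduced in a few lines, using the same x and w:

```python
x = [0.5, -0.3, 1.2]
w = [0.4, 0.7, -0.1]

pairs = list(zip(x, w))
print(pairs)  # [(0.5, 0.4), (-0.3, 0.7), (1.2, -0.1)]

dot = sum(xi * wi for xi, wi in pairs)
print(round(dot, 2))  # -0.13

# zip stops at the shortest input and silently drops the extra element
print(list(zip([1, 2, 3], "ab")))  # [(1, 'a'), (2, 'b')]
```

The last line is worth remembering: a length mismatch between x and w does not raise an error, it just truncates.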

Forward pass

In a neural network, the forward pass is the trip from inputs to a prediction. You hand the network some numbers, they flow through every layer — getting multiplied by weights, summed with biases, occasionally bent by a nonlinearity — and out the other end falls a single answer. The forward pass doesn't change the network at all; it just runs it. Every weight stays exactly where it was; only the activations move.

It's worth pausing on this before we get to backprop, because backprop retraces the forward pass in reverse, from the output back to the inputs. If we can't picture the forward pass clearly, the backward version will feel like magic.

Below is a deliberately tiny network so you can wiggle every knob and watch the output respond. Three inputs x₁, x₂, x₃ feed into two hidden ReLU neurons that join at a single ReLU output a. The three weights and three biases (w₁, w₂, w₃, b₁, b₂, b₃) are yours to play with. As you change them, the prediction surface on the right re-draws — it plots a as a height over the (x₁, x₂) plane, with x₃ swept by its slider. The forward pass is that mapping from input space to output.

A 2-layer network — and its prediction surface

Drag the weight/bias sliders and watch the surface change shape. Sweep x₃ with its slider to lift / fold the surface. Because every neuron has a ReLU, the surface is piecewise linear — each ReLU contributes a sharp fold. Click and drag the surface to rotate.

Slider values: w₁ = +0.80 · w₂ = −0.60 · w₃ = +0.50 · b₁ = +0.20 · b₂ = +0.00 · b₃ = +0.00 · x₃ = +0.00

Forward pass: h₁ = ReLU(w₁·x₁ + w₂·x₂ + b₁) h₂ = ReLU(w₃·x₃ + b₂) a = ReLU(h₁ + h₂ + b₃)
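
If you'd rather poke at the same forward pass in code, here is a small sketch. The function name forward is made up for this illustration, and the default arguments are the slider values shown above.

```python
def relu(z):
    return max(0.0, z)

def forward(x1, x2, x3, w1=0.8, w2=-0.6, w3=0.5, b1=0.2, b2=0.0, b3=0.0):
    h1 = relu(w1 * x1 + w2 * x2 + b1)   # hidden neuron 1 sees x1 and x2
    h2 = relu(w3 * x3 + b2)             # hidden neuron 2 sees only x3
    return relu(h1 + h2 + b3)           # output neuron sums the two

print(round(forward(1.0, 0.5, 0.0), 2))   # 0.7
print(forward(-1.0, 1.0, 0.0))            # 0.0
```

The second call lands on the flat part of the surface: h₁'s pre-activation is negative, so its ReLU clips it to zero.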

The same thing, in code

Here's the network we've been playing with, written out as a small class hierarchy: Neuron → Layer → MLP. This is essentially how Karpathy's micrograd packages neural networks. The Neuron.__call__ method is doing exactly what the circles in the diagram do — weighted sum of inputs, plus bias, through a ReLU.

import random

class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        # forward pass: a = ReLU(w · x + b)
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return max(0, z)

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        return [n(x) for n in self.neurons]

class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# 3 inputs → 2 hidden neurons → 1 output  (one forward pass)
x   = [1.0, 0.5, -0.3]
mlp = MLP(3, [2, 1])
print(mlp(x))   # e.g.  [0.42]

How this maps to the viz

The MLP(3, [2, 1]) above is slightly more general than the network in the diagram. In a standard MLP every input feeds every hidden neuron, so the first layer alone would have 2 × (3 weights + 1 bias) = 8 parameters. The interactive diagram uses a deliberately restricted variant — h₁ sees only x₁, x₂, and h₂ sees only x₃ — so we end up with just 3 weights and 3 biases. That's small enough that the prediction surface stays readable as you wiggle the sliders. The Neuron / Layer / MLP scaffolding is identical either way.
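
The parameter arithmetic above generalizes to any layer sizes. A small sketch (count_params is a name invented here, not part of the lab code) that counts weights and biases per layer for a fully connected MLP:

```python
def count_params(nin, nouts):
    sizes = [nin] + nouts
    # each neuron in a layer has one weight per input plus one bias
    return [(sizes[i] + 1) * sizes[i + 1] for i in range(len(nouts))]

per_layer = count_params(3, [2, 1])
print(per_layer, sum(per_layer))  # [8, 3] 11
```

So the fully connected MLP(3, [2, 1]) has 11 parameters, compared with the 6 sliders in the restricted diagram.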

Predict the outputs

Here's a small batch of inputs. Using the MLP class above, write code that produces predictions for each one:

xs = [
    [ 2.0,  3.0, -1.0],
    [ 3.0, -1.0,  0.5],
    [ 0.5,  1.0,  1.0],
    [ 1.0,  1.0, -1.0],
]
ys_target = [1.0, -1.0, -1.0, 1.0]   # what we WISH the network said
ypred = ?                             # ← your job

Show answer

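One line: ypred = [mlp(x)[0] for x in xs]. The MLP returns a list with one entry per output neuron, so [0] pulls out the single scalar prediction. With random weights you'll get whatever the freshly initialized model says, almost certainly nothing like ys_target.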

Bonus observation: our network's output is wrapped in a ReLU, so ypred[i] ≥ 0 for every input. That means we can never match a target of −1.0 no matter what the weights are. To handle negative targets we'd need a different output activation (or none). This is a real design choice in real models — the output activation has to match the kind of answer you want.

What is loss?

Once we have predictions, the obvious question is: how wrong are we? The standard way to turn that question into a single number is a loss function. The simplest one — mean squared error (MSE) — just averages the squared gap between each prediction and its target:

$$ L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 $$

A few properties worth internalizing:

Loss is always ≥ 0 — squared gaps can't be negative.
Loss = 0 means perfect predictions — every ŷᵢ exactly hits its target yᵢ.
Big gaps cost much more than small ones, because they're squared. A model that's off by 2 on one example pays 4× the penalty of one that's off by 1.
Loss is the only thing the optimizer cares about — every weight in the model will be nudged in whichever direction makes this single number smaller.

This is the whole game of training: find weights that minimize the loss.
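
The MSE formula above translates almost word for word into Python. The predictions below are made up for illustration; they are not the output of any trained model.

```python
def mse(ypred, ytarget):
    return sum((yp - yt) ** 2 for yp, yt in zip(ypred, ytarget)) / len(ytarget)

ys_target = [1.0, -1.0, -1.0, 1.0]
ypred     = [1.0,  0.0,  0.0, 1.0]   # made-up predictions for illustration

print(mse(ypred, ys_target))      # 0.5
print(mse(ys_target, ys_target))  # 0.0
```

The first result is also the best a ReLU-output network could do on these targets: it can hit the two +1 targets exactly, but the closest it can get to −1 is 0.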

Now try it yourself

Scroll back to the interactive diagram and click 🎯 Train against a target surface. A hidden target network is generated, its surface is overlaid as a dark wireframe, and the live loss appears as both a number and a bar. The little chart underneath records every loss reading — as you nudge sliders, you can literally watch the line go down (or up — easy to make it worse). See if you can get the loss below 0.02 by hand. It's harder than it looks — and that's the whole motivation for the gradient-based training we'll build in the next section.

The weights and biases in our code are still plain Python floats, so we can run the model and measure the loss but we can't yet ask "which weight should I nudge, and by how much, to reduce the loss?". To answer that, we need gradients — and that's exactly what the next section is about.

Autograd

Training a neural network requires gradients: for each parameter in the model, we need to know "if I nudge this number up a little, does the loss go up or down, and by how much?". The computation graph has many inputs (the model parameters and input tokens) but funnels down to a single scalar output: the loss. Backpropagation starts at that single output and works backwards through the graph, computing the gradient of the loss with respect to every input. It relies on the chain rule from calculus. In production, libraries like PyTorch handle this automatically. Here, we implement it from scratch in a single class called Value.

Companion lecture

This is the most mathematically intense part of microgpt. Karpathy has a 2.5-hour video that builds the whole thing live: The spelled-out intro to neural networks and backpropagation. The walk-through below condenses the key points.

Building `Value` piece by piece

The same Lego mindset works here: start with a wrapper, add operators, then add the graph bookkeeping that makes backprop possible. Try it live:

A Value object — built up in three stages

Type numbers and toggle the stages to see what Value remembers at each version of the class. Stage 3 is what microgpt actually uses.

Value, in four steps

Step 1 — A scalar that prints itself. __repr__ is the dunder Python calls when you print an object.

class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

a = Value(-6.0)
b = Value(7.0)
print(a)   # Value(data=-6.0)
print(b)   # Value(data=7.0)

Step 2 — Teach Value arithmetic. __add__ and __mul__ are the dunders Python calls when you write a + b or a * b. We return a fresh Value.

class Value:
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data)
    def __mul__(self, other):
        return Value(self.data * other.data)

a = Value(-6.0); b = Value(7.0); c = Value(10.0)
d = a * b + c
print(d)   # Value(data=-32.0)

Step 3 — Neural networks are computation graphs, so a node needs to remember what produced it. Each operation records its inputs as _children.

class Value:
    def __init__(self, data, children=()):
        self.data = data
        self._children = children       # the values that produced this one

    def __add__(self, other):
        return Value(self.data + other.data, (self, other))
    def __mul__(self, other):
        return Value(self.data * other.data, (self, other))

a = Value(2.0)
b = Value(3.0)
c = a * b                              # c knows its children are (a, b)
L = c + a                              # L knows its children are (c, a)

Step 4 — Each operation also records its local derivative. backward() walks the graph in reverse topological order, applying the chain rule and accumulating gradients.

import math  # used by log() and exp() below

class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                # forward-pass scalar
        self.grad = 0                   # dL/d(this), filled in backward pass
        self._children = children       # inputs to this node
        self._local_grads = local_grads # d(this)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __pow__(self, other):  return Value(self.data**other, (self,), (other * self.data**(other-1),))
    def log(self):             return Value(math.log(self.data), (self,), (1/self.data,))
    def exp(self):             return Value(math.exp(self.data), (self,), (math.exp(self.data),))
    def relu(self):            return Value(max(0, self.data), (self,), (float(self.data > 0),))

    def __neg__(self):           return self * -1
    def __radd__(self, other):   return self + other
    def __sub__(self, other):    return self + (-other)
    def __rsub__(self, other):   return other + (-self)
    def __rmul__(self, other):   return self * other
    def __truediv__(self, other):  return self * other**-1
    def __rtruediv__(self, other): return other * self**-1

    def backward(self):
        # 1) Build a topological order via DFS (children land before their parents)
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        # 2) Seed the loss gradient, then propagate
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

Briefly, a Value wraps a single scalar number (.data) and tracks how it was computed. Think of each operation as a little Lego block: it takes some inputs, produces an output (the forward pass), and it knows how its output would change with respect to each of its inputs (the local gradient). That's all the information autograd needs from each block. Everything else is just the chain rule, stringing the blocks together.

Every time you do math with Value objects (add, multiply, etc.), the result is a new Value that remembers its inputs (_children) and the local derivative of that operation (_local_grads). For example, __mul__ records that $\frac{\partial(a\cdot b)}{\partial a}=b$ and $\frac{\partial(a\cdot b)}{\partial b}=a$. The full set of Lego blocks:

Operation	Forward	Local gradients
`a + b`	$a+b$	$\partial/\partial a = 1,\; \partial/\partial b = 1$
`a * b`	$a \cdot b$	$\partial/\partial a = b,\; \partial/\partial b = a$
`a ** n`	$a^n$	$\partial/\partial a = n\,a^{n-1}$
`log(a)`	$\ln a$	$\partial/\partial a = 1/a$
`exp(a)`	$e^a$	$\partial/\partial a = e^a$
`relu(a)`	$\max(0,a)$	$\mathbf{1}_{a>0}$

The backward() method walks this graph in reverse topological order (starting from the loss, ending at the parameters), applying the chain rule at each step. If the loss is $L$ and a node $v$ has a child $c$ with local gradient $\frac{\partial v}{\partial c}$, then:

$$\frac{\partial L}{\partial c} \mathrel{+}= \frac{\partial v}{\partial c}\cdot\frac{\partial L}{\partial v}$$

This looks scary if you're not comfortable with calculus, but it's literally just multiplying two numbers in an intuitive way: "If a car travels twice as fast as a bicycle, and the bicycle is four times as fast as a walking man, then the car travels 2×4 = 8 times as fast as the man." The chain rule is the same idea — you multiply the rates of change along the path.

We kick things off by setting self.grad = 1 at the loss node, because $\frac{\partial L}{\partial L}=1$. From there, the chain rule just multiplies local gradients along every path back to the parameters.

Note the += (accumulation, not assignment). When a value is used in multiple places in the graph (i.e. the graph branches), gradients flow back along each branch independently and must be summed. This is the multivariable chain rule: if $c$ contributes to $L$ through multiple paths, the total derivative is the sum of contributions from each path.

After backward() completes, every Value in the graph has a .grad containing $\frac{\partial L}{\partial v}$, which tells us how the final loss would change if we nudged that value.
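
To see the accumulation in action, here is a trimmed-down copy of Value that keeps only + and × between Value objects (no scalar coercion, no pow/log/exp). It is a sketch for illustration, not a replacement for the full class. The variable a is used twice, so its gradient is the sum of two paths.

```python
class Value:
    """Trimmed-down Value: only + and * between Values, for illustration."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

a = Value(2.0)
b = Value(3.0)
c = a * b
L = c + a          # a is used twice: once inside c, once directly
L.backward()
print(a.grad)      # 4.0  (3.0 via c, plus 1 via the direct edge)
print(b.grad)      # 2.0
```

If backward() used = instead of +=, a.grad would end up holding only one of the two contributions, depending on visiting order.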

Watch backprop happen

Backprop is easier to internalize if you build it up. Below are four cases in increasing complexity — start with what a single + does to a gradient, then a single ×, then both with a branch, then a full training-style pipeline (input, prediction, loss). Each tab is its own little graph; step through it one click at a time.

Backprop in four shapes

Each case shows: forward construction → topological sort → backward pass propagating gradients. The local derivative on each edge is the only thing each operation has to know.

By hand — same algorithm, different shape

Here's a small neuron computing a = ReLU(x·w + b). The forward values are filled in. Try to compute the gradients with respect to x, w, and b by hand assuming ∂L/∂a = 1. Then click "Run backward" to check. Doing this once by hand is the single best way to internalize what backward() is doing.

This is exactly what PyTorch's .backward() gives you:

import torch
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a * b
L = c + a
L.backward()
print(a.grad)   # tensor(4.)
print(b.grad)   # tensor(2.)

This is the same algorithm that PyTorch's loss.backward() runs, just on scalars instead of tensors (arrays of scalars) — algorithmically identical, significantly smaller and simpler, but a lot less efficient.

Let's spell out what backward() gives us. Autograd calculated that if L = a*b + a, with a=2 and b=3, then a.grad = 4.0. This is telling us about the local influence of a on L: if you wiggle a, in what direction is L changing? The derivative of L w.r.t. a is 4.0, meaning that if we increase a by a tiny amount (say 0.001), L would increase by about 4× that (0.004). Similarly, b.grad = 2.0 means the same nudge to b would increase L by about 2× that. These gradients tell us the direction (positive or negative) and the steepness (magnitude) of each input's influence on the final output (the loss). This lets us iteratively nudge the parameters of our neural network to lower the loss, and hence improve its predictions.
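
The "nudge by 0.001" reading of the gradient can be checked directly, without any autograd, by nudging each input and measuring the change in L. This is a plain finite-difference check, not part of microgpt.

```python
def L(a, b):
    return a * b + a

a, b, h = 2.0, 3.0, 0.001

dL_da = (L(a + h, b) - L(a, b)) / h   # nudge a, hold b fixed
dL_db = (L(a, b + h) - L(a, b)) / h   # nudge b, hold a fixed

print(round(dL_da, 3), round(dL_db, 3))  # 4.0 2.0
```

The numbers agree with a.grad and b.grad. Finite differences need one extra forward pass per parameter, though, while backprop gets every gradient from a single backward pass.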

Gradient descent · w and b chase the target

Case 4 above computed the gradients for one step: ∂loss/∂b = 2·err = −6 and (chaining through x) ∂loss/∂w = 2·err·x = −18. Now we repeat the step. x = 3 and the target y = 10 are fixed; each step nudges the two parameters against their gradients (w ← w − lr·∂loss/∂w and b ← b − lr·∂loss/∂b), and you watch the prediction ŷ climb toward the target while the loss shrinks. The nudge is just the gradient multiplied by the learning rate. Click Next step a few times.

x = 3 target y = 10 learning rate = 0.02

Computation graph — node values and local gradients (∂) recompute every step:

loss = err²
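
The same loop can be written in a few lines of plain Python. The starting values w = 2 and b = 1 are an assumption made for this sketch: they are chosen so that the first prediction is 7 and err = −3, which reproduces the gradients −18 and −6 quoted above. The widget's actual starting point may differ.

```python
x, y, lr = 3.0, 10.0, 0.02
w, b = 2.0, 1.0   # assumed starting point, chosen so that err = -3

losses = []
for step in range(4):
    y_hat = w * x + b            # forward pass
    err = y_hat - y
    loss = err ** 2
    losses.append(loss)
    grad_w = 2 * err * x         # d(loss)/dw, chained through x
    grad_b = 2 * err             # d(loss)/db
    print(f"step {step}: y_hat={y_hat:.3f} loss={loss:.4f} "
          f"grad_w={grad_w:.2f} grad_b={grad_b:.2f}")
    w -= lr * grad_w             # step against the gradient
    b -= lr * grad_b
```

With these numbers each step multiplies the error by 1 − lr·2·(x² + 1) = 0.6, so the loss falls by a factor of 0.36 per step: 9, 3.24, 1.1664, and so on.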

Architecture

The model architecture is a stateless function: it takes a token, a position, the parameters, and the cached keys/values from previous positions, and returns logits (scores) over what token the model thinks should come next in the sequence. We follow GPT-2 with minor simplifications: RMSNorm instead of LayerNorm, no biases, and ReLU instead of GeLU.

We'll step through the model one block at a time. Each sub-section below covers one piece — first the intuition, then any small helper functions it needs, then the relevant code, the actual parameter matrices, and finally a small interactive widget showing what we've built up so far.

Model config at a glance

Hyperparameter	Toy (this walkthrough)	Real microgpt	GPT-2 small (ref)
vocab_size	4 (BOS, a, b, c)	27 (a–z + BOS)	50,257
n_embd · d_model	2	16	768
n_head	1	4	12
head_dim	2	4	64
block_size (context len)	4	16	1,024
n_layer	1	1	12
parameters (rough)	~70	4,192	~124M

All three columns run the same code. The widgets below render the toy column; the "Scaling up" callouts in each subsection map the toy back to the real-microgpt and GPT-2 columns.

A running toy example · d_model = 2

To make each step concrete, we'll track a single token through the whole block using a deliberately tiny model. The vector at each stage will only have two numbers, so you can do every multiplication by hand and watch what changes.

Setup. Pretend the vocabulary is just 4 tokens — BOS=0, 'a'=1, 'b'=2, 'c'=3 — and the embedding width is d_model = 2, with n_head = 1 (so head_dim = 2) and block_size = 4. We're partway through generating: the model has already seen BOS at position 0 and 'a' at position 1, and now it's processing 'b' at position 2. We want it to predict what comes at position 3.

Each subsection below pulls in the toy weights it needs, walks the numbers forward, and the resulting vector becomes the input to the next subsection. By the end of Output, we'll have one concrete probability over the 4-token vocab.

Embeddings

The neural network can't process a raw token id like 2 directly. It only works with vectors (lists of numbers). So we associate a learned vector with each possible token, and feed that in as its neural signature. The token id and position id each look up a row from their respective embedding tables (wte and wpe). These two vectors are added together, giving the model a representation that encodes both what the token is and where it is in the sequence. Many modern LLMs drop the learned position embedding in favor of relative position schemes like RoPE.

Concrete example: say our current token is 'b', which the tokenizer mapped to id 2, sitting at position 2. The lookup wte[2] gives a length-2 vector — that's the x the network actually sees. Click a different letter below and you'll watch a different row of wte get pulled in and flow all the way through the three views (and the numeric tour at the bottom of the section).

The block so far — Embeddings only

Pick the current token. Its row of wte becomes x; wpe[pos=2] gets added; that vector flows through every downstream view. The fine-grained sliders at the bottom of the section still work for off-vocabulary values.

sequence so far → pos 0: BOS · pos 1: 'a' · pos 2: 'b' (current) → wpe[2] = [−0.10, 0.10]

current token → token_id = 2 → x = wte[2] = [−0.30, 0.40]

Same step, drawn as neurons

Parameter matrices

Two learned tables — one row per token, one row per position. Hover any cell to see its value. The pattern is just random Gaussian initialisation (std = 0.08); training reshapes these into something meaningful.

wte · token embedding

toy d = 2 · (4 × 2) — real microgpt is (27 × 16)

wpe · position embedding

toy d = 2 · (4 × 2) — real microgpt is (16 × 16)

Helper used here · `rmsnorm`

Once we've added the token and position vectors, we normalize. rmsnorm (Root Mean Square Normalization) rescales a vector so its values have unit root-mean-square. This keeps activations from growing or shrinking as they flow through the network, stabilizing training. It's a simpler variant of the LayerNorm used in the original GPT-2.

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

Code in `gpt()`

tok_emb = state_dict['wte'][token_id]      # length 16
pos_emb = state_dict['wpe'][pos_id]        # length 16
x = [t + p for t, p in zip(tok_emb, pos_emb)]
x = rmsnorm(x)

Toy walkthrough · embeddings at d = 2

Our token is 'b' (id 2) at position 2. Pick tiny wte and wpe tables to look up from:

# wte: 4 rows (one per token), each a length-2 vector
wte = [[ 0.20,  0.30],   # BOS
       [ 0.50, -0.10],   # 'a'
       [-0.30,  0.40],   # 'b'
       [ 0.10,  0.20]]   # 'c'

# wpe: 4 rows (one per position), each a length-2 vector
wpe = [[ 0.10, -0.05],   # pos 0
       [ 0.05,  0.15],   # pos 1
       [-0.10,  0.10],   # pos 2
       [ 0.15,  0.00]]   # pos 3

token_id, pos_id = 2, 2
tok_emb = wte[token_id]                        # → [-0.30,  0.40]
pos_emb = wpe[pos_id]                          # → [-0.10,  0.10]
x = [t + p for t, p in zip(tok_emb, pos_emb)]  # → [-0.40,  0.50]
x = rmsnorm(x)                                 # → [-0.88,  1.10]

Doing the RMSNorm by hand. Mean-square: $((-0.40)^2 + 0.50^2)/2 = 0.205$. Scale: $1/\sqrt{0.205 + 10^{-5}} \approx 2.209$. Multiply through: $[-0.40 \cdot 2.209,\; 0.50 \cdot 2.209] \approx [-0.88, 1.10]$. That two-number vector $x \approx [-0.88,\, 1.10]$ is what the attention block sees next.
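
The hand calculation can be double-checked by running rmsnorm on the toy vector and then measuring the root-mean-square of the result, which should come out at (almost exactly) 1:

```python
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

out = rmsnorm([-0.40, 0.50])
print([round(v, 2) for v in out])   # [-0.88, 1.1]

rms = (sum(v * v for v in out) / len(out)) ** 0.5
print(round(rms, 3))                # 1.0
```

The RMS is a hair under 1 rather than exactly 1 because of the 1e-5 added inside the square root, which is there to avoid dividing by zero on an all-zero vector.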

Scaling up · real microgpt and bigger

In microgpt (n_embd = 16): wte is (27 × 16) and wpe is (16 × 16), so the looked-up vectors are length 16 instead of 2 — same two lines of code, just longer lists. RMSNorm averages 16 squared values instead of 2.

In GPT-2 small (n_embd = 768): each row is a 768-dim vector, and the vocabulary jumps to 50,257 tokens, so wte alone is ≈ 39M parameters. GPT-3 (175B): n_embd = 12,288 and the context window stretches to 2,048 positions; modern frontier models push past 100K positions and skip wpe entirely in favor of relative position schemes like RoPE that rotate the Q/K vectors inside attention instead of adding a position vector here.
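
The embedding table sizes quoted above are just rows × columns. A quick sketch (embedding_params is a name invented for this illustration) using the three configurations from the model-config table:

```python
def embedding_params(vocab_size, block_size, n_embd):
    wte = vocab_size * n_embd    # one row per token
    wpe = block_size * n_embd    # one row per position
    return wte, wpe

print(embedding_params(4, 4, 2))            # toy:          (8, 8)
print(embedding_params(27, 16, 16))         # microgpt:     (432, 256)
print(embedding_params(50257, 1024, 768))   # GPT-2 small:  (38597376, 786432)
```

The last line is where the "≈ 39M" figure for GPT-2's wte comes from.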

The block so far — Embeddings + RMSNorm

The RMSNorm that follows the embeddings scales the summed wte + wpe vector to unit root-mean-square, so the activations don't blow up as they flow into Q/K/V. The same picker drives this view: try BOS / 'a' / 'b' / 'c' and watch the normalized vector update.

sequence so far → pos 0: BOS · pos 1: 'a' · pos 2: 'b' (current) → wpe[2] = [−0.10, 0.10]

current token → token_id = 2 → x = wte[2] = [−0.30, 0.40]

Block so far, drawn as neurons

Attention block

The attention block is the only place where a token at position $t$ gets to "look" at other tokens: the positions $0 \ldots t-1$ before it, plus itself. It's a token-communication mechanism. Before we dive into the code, here's the intuition that makes the rest of this section click.

The intuition, dictionary metaphor, and matrix-form diagrams in this section are adapted from "Attention, Please!": A Visual Guide To The Attention Mechanism by CodeCompass — recommended reading if you want the same ideas in a different voice.

Intuition · attention is a fuzzy dictionary

Here is what the attention equation looks like. Don't get intimidated — we're going to break each piece down. Attention is a "learnable", "fuzzy" version of a key-value store — the same data structure you know as a Python dict or a hashtable.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

Traditional dictionary

A dictionary takes a key and maps it to a single value. Query "Italy" matches the key "Italy" exactly, so the corresponding value "Rome" is returned. No partial match, no in-between.

Attention generalizes this to a non-binary lookup. Instead of matching the query to exactly one key, the query is compared to every key, each match gets a similarity score, and the output is a weighted blend of all the values — keys with higher scores contribute more. Critically, the queries, keys, and values are D-dimensional learned vectors (computed by Wq, Wk, Wv from the input), so the model gets to decide what "matching" means.

Attention mechanism

The query q is compared with every key K_i to produce a similarity score S_i. Softmax normalizes those scores into weights between 0 and 1 that sum to 1 (a well-behaved probability distribution). Each value V_i is multiplied by its weight and the products are summed; this weighted sum is the attention output.

Why softmax? Raw dot-product scores can be any real number. Softmax squashes them into the range [0, 1] and forces them to sum to 1, like a well-behaved probability distribution — so the output really is a weighted average, not just a weighted sum that could explode.
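
To make the dictionary metaphor concrete, here is a small sketch that contrasts an exact dict lookup with a soft lookup. The key vectors, value vectors, and query are invented for illustration, and `soft_lookup` and its `sharpness` parameter are hypothetical, not part of microgpt.

```python
import math

def softmax_floats(scores):
    """Softmax on plain floats, with the usual max subtraction."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values, sharpness=1.0):
    """Blend all values, weighted by query-key dot-product similarity."""
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax_floats(scores)
    dim = len(values[0])
    blended = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, blended

# Hard lookup: an exact key match or nothing.
capitals = {"Italy": "Rome", "France": "Paris"}
print(capitals["Italy"])                      # Rome

# Soft lookup: made-up 2-d keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.2]

weights, blended = soft_lookup(query, keys, values)
print([round(w, 2) for w in weights])         # every key gets some weight

# Scaling the scores up pushes the weights toward a one-hot, dict-like lookup.
sharp_weights, sharp_blended = soft_lookup(query, keys, values, sharpness=50.0)
print([round(w, 2) for w in sharp_weights])
```

With the default sharpness every value contributes to the blend; with a large sharpness almost all the weight lands on the best-matching key, which is how the soft version approaches an ordinary dictionary.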

What does attention do?

Attention is applied to the input sequence and, for each query, produces weights that say how important every position is to that query. Those weights then "pick" the relevant information and pass it on to the next layer. To make this concrete, take the sentence "The quick brown fox jumps over the lazy dog." Click any word below to see where its attention goes: every other word in the sentence gets a similarity score against your chosen query word, and the bar chart shows the resulting weights.

Attention for one query word at a time

Pick any word as the query. Its query vector (computed by Wq) is dot-producted with every word's key vector (computed by Wk) to get raw scores; softmax turns those into the attention weights you see below. The highest-weighted word is what this query is "looking at." Numbers are illustrative — a real trained model would produce its own pattern.

Helpers used here · `linear` and `softmax`

linear is a matrix-vector multiply. It takes a vector x and a weight matrix w, and computes one dot product per row of w. It shows up four times in this block — once each for Q, K, V, and the output projection Wo — and is the fundamental building block of neural networks: a learned linear transformation.

def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

softmax converts a vector of raw scores — which can range from $-\infty$ to $+\infty$ — into a probability distribution: all values end up in $[0,1]$ and sum to 1. Inside attention we use it to turn the Q·K scores into weights that sum to 1; later, the same helper turns the model's output logits into a distribution over the vocabulary. We subtract the max first for numerical stability (mathematically a no-op, but it prevents overflow in exp).

def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]
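
Why the max subtraction matters is easy to see on plain floats. The sketch below uses `math.exp` instead of the `Value.exp()` method, and `softmax_naive` and `softmax_stable` are names invented here for the comparison.

```python
import math

def softmax_naive(logits):
    exps = [math.exp(v) for v in logits]            # can overflow for large v
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(logits):
    max_val = max(logits)
    exps = [math.exp(v - max_val) for v in logits]  # largest exponent is 0
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1001.0, 1002.0, 1003.0]                    # same gaps, shifted by 1000

# On small inputs the two versions agree.
print([round(p, 4) for p in softmax_naive(small)])
print([round(p, 4) for p in softmax_stable(small)])

# Shifting every logit by a constant leaves the stable result unchanged...
print([round(p, 4) for p in softmax_stable(large)])

# ...but the naive version fails, because exp(1001) exceeds the float range.
try:
    softmax_naive(large)
except OverflowError as err:
    print("naive softmax failed:", err)
```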

Now that both helpers are on the table, let's walk through the whole attention block with concrete numbers before opening up the interactive widgets. The widgets below are just visualizations of the operations that follow — once you've seen the math run end-to-end on real values, each widget will feel like a labeled view of a step you've already done by hand.

We pick up where the Embeddings walkthrough left off: token 'b' at position 2, with x ≈ [-0.88, 1.10] already in hand.

Toy walkthrough · attention at d = 2

The embedding step handed us x ≈ [-0.88, 1.10]. We stash it as the residual and re-normalize before projecting (the second RMSNorm on an already-normalized vector is nearly a no-op — scale ≈ 1.00 — so the input to the projections is still [-0.88, 1.10]).

x_residual = x                                  # [-0.88, 1.10]
x = rmsnorm(x)                                  # ≈ [-0.88, 1.10]

# Toy Q/K/V/Wo weight matrices, each (2 × 2)
attn_wq = [[ 0.50,  0.20], [ 0.10,  0.40]]
attn_wk = [[ 0.30, -0.10], [ 0.20,  0.50]]
attn_wv = [[ 0.40,  0.10], [-0.20,  0.60]]
attn_wo = [[ 0.60,  0.20], [ 0.10,  0.70]]

q = linear(x, attn_wq)   # → [-0.22,  0.35]
k = linear(x, attn_wk)   # → [-0.37,  0.37]
v = linear(x, attn_wv)   # → [-0.24,  0.84]

Why those numbers? Each output entry is the dot product of one row of the weight matrix with x. For q: row 0 gives $0.50(-0.88) + 0.20(1.10) = -0.22$; row 1 gives $0.10(-0.88) + 0.40(1.10) = 0.35$. Same pattern for k and v.

KV cache. Positions 0 and 1 have already been processed on earlier calls, so the cache holds:

keys[0]   = [[ 0.30,  0.10],   # k from BOS at pos 0
             [-0.10,  0.40],   # k from 'a'  at pos 1
             [-0.37,  0.37]]   # k from 'b'  at pos 2 (just appended)

values[0] = [[ 0.20, -0.30],   # v from BOS
             [ 0.50,  0.20],   # v from 'a'
             [-0.24,  0.84]]   # v from 'b'

Why keys[0] instead of just keys? Each Transformer layer keeps its own separate KV cache — the keys and values learned at layer 0 mean different things than at layer 1. So keys and values are lists of lists: the outer index is the layer number, the inner index is the position in the sequence. keys[0] is "the running list of every k vector layer 0 has produced so far," and keys[0][2] is "the key for position 2 at layer 0." Our toy has n_layer = 1, so keys[0] is the only list around — but the indexing convention stays the same. If we bumped n_layer to 6, you'd see keys[0], keys[1], … through keys[5], one cache per layer.
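
The indexing convention can be sketched in a few lines. The vectors appended below are placeholders rather than real activations, and `n_layer = 2` is chosen only so there is more than one cache to look at.

```python
n_layer = 2                                   # pretend we have two layers

# One empty cache per layer. The comprehension gives each layer its own list;
# [[]] * n_layer would alias a single shared list n_layer times.
keys = [[] for _ in range(n_layer)]
values = [[] for _ in range(n_layer)]

# Simulate three positions flowing through both layers.
for pos in range(3):
    for li in range(n_layer):
        keys[li].append([float(li), float(pos)])     # placeholder k vector
        values[li].append([float(pos), float(li)])   # placeholder v vector

print(len(keys))        # 2  (one cache per layer)
print(len(keys[0]))     # 3  (one key per position seen so far)
print(keys[1][2])       # [1.0, 2.0]  (layer 1, position 2)
```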

Scores → softmax weights. Dot each cached key with our query, divide by $\sqrt{d_{\text{head}}} = \sqrt{2} \approx 1.41$:

scores = [(q[0]*k[0] + q[1]*k[1]) / 1.41 for k in keys[0]]
# pos 0: (-0.22·0.30 + 0.35·0.10)/1.41 = -0.031/1.41 ≈ -0.02
# pos 1: (-0.22·-0.10 + 0.35·0.40)/1.41 =  0.162/1.41 ≈  0.11
# pos 2: (-0.22·-0.37 + 0.35·0.37)/1.41 =  0.211/1.41 ≈  0.15
weights = softmax(scores)                       # ≈ [0.30, 0.34, 0.36]

The three weights sum to 1. Notice that 'b' attends most to itself (0.36), then to 'a' (0.34), then to BOS (0.30) — the differences are small because our toy weights are tiny and random; a trained network would learn much sharper patterns.

Weighted sum of values, then mix through Wo, then residual.

head_out = [sum(w * vt[j] for w, vt in zip(weights, values[0]))
            for j in range(2)]
# head_out ≈ [0.30·0.20 + 0.34·0.50 + 0.36·-0.24,
#             0.30·-0.30 + 0.34·0.20 + 0.36·0.84]
#          ≈ [0.14, 0.28]

x_attn = linear(head_out, attn_wo)              # ≈ [0.14, 0.21]
x = [a + b for a, b in zip(x_attn, x_residual)] # ≈ [-0.74, 1.31]

Why each of those three lines is there.

Weighted sum of V. This is the actual "lookup" of the fuzzy dictionary. The weights answered how much each past position matters; the values say what each one contributes. Multiplying them and summing gives a single vector that's a blended pull from every cached value, weighted by relevance. If one weight were 1.0 and the rest were 0, we'd get back exactly that value — like a normal dict lookup. With soft weights, we get a mix.
Project through Wo. The weighted sum lives in value-space, not in the residual stream's space. Wo is a learned linear layer that re-mixes the head output back into the same shape as x. In multi-head attention each head's slice gets concatenated first, then Wo blends across the heads, giving the model a place to learn how different heads should be combined. In our toy with one head it just applies a 2 × 2 linear map to the 2-vector, but the role is the same.
Add the residual. Instead of replacing x with x_attn, we add: x ← x + x_attn. Two big wins. (1) The original information survives — attention is an update, not an overwrite. (2) During backprop, gradients flow directly through this addition path back to earlier layers, which is what makes deep stacks of these blocks trainable at all. If attention has nothing useful to say for this token, it can output zero and the residual just passes x through unchanged.

The vector handed to the MLP block is x ≈ [-0.74, 1.31]. The attention block has done one thing: blended a little bit of every past position into the current one, projected the result back into the residual stream's shape, and added it on as an update.
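
The whole toy attention block can be replayed on plain floats. This is a sketch under a few assumptions: it starts from the rounded input [-0.88, 1.10], skips the second rmsnorm (nearly a no-op on this input, as noted above), and uses a float-only `softmax_floats` helper invented here in place of the Value-based `softmax`.

```python
import math

def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax_floats(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = [-0.88, 1.10]                       # rounded input from the embedding step
x_residual = x

attn_wq = [[0.50, 0.20], [0.10, 0.40]]
attn_wk = [[0.30, -0.10], [0.20, 0.50]]
attn_wv = [[0.40, 0.10], [-0.20, 0.60]]
attn_wo = [[0.60, 0.20], [0.10, 0.70]]

q = linear(x, attn_wq)
k = linear(x, attn_wk)
v = linear(x, attn_wv)

# Layer-0 cache: two earlier positions, then the current token's k and v.
keys0 = [[0.30, 0.10], [-0.10, 0.40], k]
values0 = [[0.20, -0.30], [0.50, 0.20], v]

head_dim = 2
scores = [sum(qj * kj for qj, kj in zip(q, kt)) / math.sqrt(head_dim)
          for kt in keys0]
weights = softmax_floats(scores)
head_out = [sum(w * vt[j] for w, vt in zip(weights, values0))
            for j in range(head_dim)]

x_attn = linear(head_out, attn_wo)                     # project through Wo
x_out = [a + b for a, b in zip(x_attn, x_residual)]    # add the residual

print([round(w, 2) for w in weights])   # [0.3, 0.34, 0.36]
print([round(o, 2) for o in x_out])     # [-0.74, 1.31]
```

Because the script keeps full precision between steps, intermediate values can differ from the hand-rounded ones in the third decimal place, but the rounded results match the walkthrough.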

Attention, step by step · with the KV cache

Same numbers as the toy walkthrough above, drawn as a flow. Use ▶ Play for the full animation or Step to advance one phase at a time. Phases: (1) compute q, k, v for the current token, (2) append k and v to the per-layer caches, (3) score the query against each cached key, (4) softmax → weights, (5) weighted sum of cached values, (6) output.


Attention playground · drag the query, watch the block recompute

Same diagram as "Attention, step by step" above, but now the query vector q is on sliders. The KV cache (3 past tokens) stays pinned to the toy walkthrough; everything downstream — scaled-dot-product scores, softmax weights, weighted sum of values, head output — recomputes live as you drag. Start at the defaults (q ≈ [−0.22, 0.35], the toy 'b' values) and move the sliders to see how a different query reshapes the whole attention output.


Next-letter preference · head output → lm_head → softmax

Shortcut visualization. In the real model the head output goes through Wo, gets added to the residual, runs the MLP block, and only then does lm_head + softmax produce next-letter probabilities. We're skipping those layers and projecting the head output directly through lm_head so you can see how moving the query changes which letter the model "leans toward." It's a directional signal, not the model's real prediction.

Snapping attention into the running diagram

We started this section with Embeddings only, added the pre-attention rmsnorm, and just walked through the full attention computation step by step. Time to slot that attention block back into the architecture diagram we've been building piece by piece. The widget below adds Q/K/V projections, the attention weighted sum, W_o, and the residual add on top of the Embeddings + RMSNorm view from earlier — same token picker, same numbers, just more of the block lit up.

The block so far — through the attention residual

Embeddings + RMSNorm + Q/K/V + attention + Wₒ + residual. The MLP comes next.


Block so far, drawn as neurons

Parameter matrices

Four 16×16 matrices: Q/K/V are the three projections that turn the token vector into "what am I looking for / what do I contain / what do I offer", and Wₒ mixes the per-head outputs back together.

You might be wondering why the toy matrices below are only 2×2. Remember from the Embeddings step: each token gets embedded as a two-dimensional vector (we set d_model = 2 for the walkthrough). The Q/K/V projections map a length-2 vector to another length-2 vector, so the weight matrix is (out × in) = (2 × 2) = 4 numbers. In real microgpt d_model = 16, so each of these matrices grows to (16 × 16) = 256 numbers. The shape of the operation is the same — just bigger.

Q / K / V projections as one network · hover any cell below to see which edge it is

Each cell in attn_wq, attn_wk, or attn_wv is a single connection in this network. The cell M[i][j] is the weight on the edge from x̂[j] (top) to the i-th output of that projection. Twelve cells across three matrices, twelve edges in the diagram.

attn_wq

toy d = 2 · (2 × 2) — real microgpt is (16 × 16)


attn_wk

toy d = 2 · (2 × 2) — real microgpt is (16 × 16)


attn_wv

toy d = 2 · (2 × 2) — real microgpt is (16 × 16)


attn_wo

toy d = 2 · (2 × 2) — real microgpt is (16 × 16)


W_o · head output → residual update

Wo is different from Q/K/V. The three projections above act on x̂ (the normalized residual stream), but Wo acts on Σ wᵢ vᵢ — the weighted sum of values coming out of attention. Its job is to re-mix that head-output vector back into the shape of the residual stream so it can be added on top. In multi-head attention Wo also mixes information across heads. Same cell-to-edge convention: hover any cell in attn_wo to highlight the corresponding edge here.

Code in `gpt()`

x_residual = x
x = rmsnorm(x)
q = linear(x, state_dict[f'layer{li}.attn_wq'])
k = linear(x, state_dict[f'layer{li}.attn_wk'])
v = linear(x, state_dict[f'layer{li}.attn_wv'])
keys[li].append(k); values[li].append(v)
# ... heads loop: scores → softmax → weighted V → concat ...
x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
x = [a + b for a, b in zip(x, x_residual)]            # residual

Scaling up · real microgpt and bigger

In microgpt (n_embd = 16, n_head = 4): Q/K/V are (16 × 16) and they get sliced into 4 heads of head_dim = 4 each. The same Q·K/√d · softmax · weighted-V dance runs per head on a 4-dim slice, the four outputs are concatenated back to length 16, and Wo mixes them. The "shape" of the math doesn't change — just the dimensions.
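
The slicing and concatenation can be sketched without any attention math. The query vector below is a stand-in (just the numbers 0 to 15), and each "head" passes its slice through unchanged as a placeholder, so only the bookkeeping is shown.

```python
n_embd, n_head = 16, 4
head_dim = n_embd // n_head              # 4

q = [float(i) for i in range(n_embd)]    # stand-in for a real query vector

x_attn = []
for h in range(n_head):
    hs = h * head_dim                    # start index of this head's slice
    q_h = q[hs:hs + head_dim]            # this head's 4 numbers
    print(h, q_h)
    # A real head would run scores -> softmax -> weighted sum of values here.
    # As a placeholder we pass the slice through unchanged.
    head_out = q_h
    x_attn.extend(head_out)              # concatenate the head outputs

print(len(x_attn))                       # 16
```

Each head only ever sees its own 4 of the 16 numbers; the concatenated length-16 vector is what Wo then mixes across heads.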

In GPT-2 small (n_embd = 768, n_head = 12): each head sees a 64-dim slice, and there are 12 of them running in parallel. GPT-3 (175B, n_embd = 12288, n_head = 96): 128-dim slices, 96 heads, all 96 looking back at thousands of cached positions. Frontier models add tricks like grouped-query attention (many query heads share the same K/V heads, shrinking the KV cache) and FlashAttention (a GPU-friendly tiling that never materialises the full attention matrix), but the per-head computation is still the four lines you just walked through.

MLP block

MLP is short for "multilayer perceptron" — a two-layer feed-forward network: project up to 4× the embedding dimension, apply ReLU, project back down. This is where the model does most of its "thinking" per position. Unlike attention, this computation is fully local to time $t$. The Transformer intersperses communication (Attention) with computation (MLP).

The full Transformer block

Now the whole thing. The transformer block on the left and the neural network on the right are the same thing — same weights, same data flowing in the same order. Hover any block, layer, or row and the matching parts in all three views (including the numeric tour below) light up. Click to pin the code panel.


Full block, drawn as neurons

Code in microgpt · hover or click any block

Hover an architecture block (any of the three widgets above), a layer in the neural-network drawing, or a row in the numeric tour below — the matching microgpt lines will appear here. Click to pin.

Parameter matrices

Up-projection then down-projection. mlp_fc1 blows the dimension up 4× to give the network room to compute, then mlp_fc2 squeezes it back down so it can be added to the residual stream.

MLP · 2 → 8 → 2 as one network · hover any cell below to see its edge

The MLP is a two-layer feed-forward network. mlp_fc1 projects the 2-dim input up to an 8-dim hidden vector, ReLU zeroes out the negatives, and mlp_fc2 projects back down to 2-dim so it can be added to the residual. Hover any cell: 16 cells in mlp_fc1 map to the 16 edges in the top fan; 16 cells in mlp_fc2 map to the 16 edges in the bottom fan.

mlp_fc1

toy d = 2 · (8 × 2) — real microgpt is (64 × 16)


mlp_fc2

toy d = 2 · (2 × 8) — real microgpt is (16 × 64)


Code in `gpt()`

x_residual = x
x = rmsnorm(x)
x = linear(x, state_dict[f'layer{li}.mlp_fc1'])    # 16 → 64
x = [xi.relu() for xi in x]
x = linear(x, state_dict[f'layer{li}.mlp_fc2'])    # 64 → 16
x = [a + b for a, b in zip(x, x_residual)]          # residual

Toy walkthrough · MLP at d = 2 (hidden = 8)

The attention block handed us x ≈ [-0.74, 1.31]. Stash the residual, normalize, then up-project to 4 × d_model = 8 hidden units.

x_residual = x                                   # [-0.74, 1.31]
x = rmsnorm(x)                                   # ≈ [-0.70, 1.23]

# mlp_fc1: up-projection (8 × 2)
mlp_fc1 = [[ 0.40,  0.10],
           [-0.20,  0.50],
           [ 0.30, -0.30],
           [ 0.10,  0.40],
           [-0.50,  0.20],
           [ 0.20, -0.10],
           [ 0.60,  0.30],
           [-0.10, -0.40]]

pre = linear(x, mlp_fc1)
# = [-0.16,  0.76, -0.58,  0.42,  0.60, -0.26, -0.05, -0.42]

x = [xi.relu() for xi in pre]
# = [ 0.00,  0.76,  0.00,  0.42,  0.60,  0.00,  0.00,  0.00]

Why most entries are zero. ReLU = max(0, x), so anything negative gets clipped to 0. Only 3 of the 8 hidden units "fire" for this particular input. Different inputs would activate different subsets — that's how the MLP carves the input space into pieces and treats each piece differently.

# mlp_fc2: down-projection (2 × 8)
mlp_fc2 = [[ 0.10,  0.30, -0.20,  0.40,  0.00,  0.20, -0.10,  0.50],
           [-0.30,  0.20,  0.50, -0.10,  0.40, -0.40,  0.30,  0.10]]

mlp_out = linear(x, mlp_fc2)                     # ≈ [0.40, 0.35]
x = [a + b for a, b in zip(mlp_out, x_residual)] # ≈ [-0.34, 1.66]

Hand-check the down-projection. Row 0 of mlp_fc2 dotted with the post-ReLU vector: $0.30 \cdot 0.76 + 0.40 \cdot 0.42 = 0.396 \approx 0.40$ (the zeros contribute nothing). The MLP's contribution gets added back to the residual stream, and we exit the block with x ≈ [-0.34, 1.66].
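
The MLP walkthrough can be checked the same way on plain floats. This is a sketch, assuming the rounded input [-0.74, 1.31]; `rmsnorm_floats` is a float-only helper named here for illustration, and `max(0.0, p)` stands in for the `Value.relu()` method.

```python
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rmsnorm_floats(x, eps=1e-5):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

mlp_fc1 = [[0.40, 0.10], [-0.20, 0.50], [0.30, -0.30], [0.10, 0.40],
           [-0.50, 0.20], [0.20, -0.10], [0.60, 0.30], [-0.10, -0.40]]
mlp_fc2 = [[0.10, 0.30, -0.20, 0.40, 0.00, 0.20, -0.10, 0.50],
           [-0.30, 0.20, 0.50, -0.10, 0.40, -0.40, 0.30, 0.10]]

x = [-0.74, 1.31]                        # rounded output of the attention block
x_residual = x
h = rmsnorm_floats(x)                    # pre-MLP normalization
pre = linear(h, mlp_fc1)                 # up-project 2 -> 8
act = [max(0.0, p) for p in pre]         # ReLU on floats
mlp_out = linear(act, mlp_fc2)           # down-project 8 -> 2
x_out = [a + b for a, b in zip(mlp_out, x_residual)]

fired = sum(1 for a in act if a > 0)
print(fired)                             # 3
print([round(o, 2) for o in mlp_out])    # [0.4, 0.35]
print([round(o, 2) for o in x_out])      # [-0.34, 1.66]
```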

Scaling up · real microgpt and bigger

In microgpt (n_embd = 16): mlp_fc1 is (64 × 16) and mlp_fc2 is (16 × 64). The 4× expansion is the same; the vectors are just wider. The MLP holds more parameters than the attention block (2,048 vs 1,024 in microgpt), and because both counts grow with n_embd², that 2:1 ratio stays the same at any width as long as the expansion factor is 4.

In GPT-2 small (n_embd = 768): the hidden layer is 3,072 wide, so the MLP alone is ≈ 4.7M parameters per layer. In GPT-3 (175B): hidden = 49,152, and the MLPs hold roughly two-thirds of all parameters in the model. Frontier models also swap plain ReLU for gated activations such as SwiGLU (which needs three matrices instead of two) and often replace the dense MLP with Mixture-of-Experts (many smaller MLPs, of which a router picks only a few per token, for example 2) to grow capacity without growing per-token compute.

Residual connections

Both the attention and MLP blocks add their output back to their input (x = [a + b for ...]). This lets gradients flow directly through the network and makes deeper models trainable.

Output

The final hidden state is projected to vocabulary size by lm_head, producing one logit per token in the vocabulary. In our case, that's just 27 numbers. Higher logit = the model thinks that corresponding token is more likely to come next.

The whole pipeline — block + lm_head + softmax

Everything in the Transformer block followed by lm_head and a softmax. The bottom row is the model's predicted probability distribution over the four toy vocabulary tokens (BOS / 'a' / 'b' / 'c'). Pick a current token below and watch the whole pipeline — including the prediction — recompute.


Full pipeline + output, drawn as neurons

Parameter matrix

One row per token in the vocabulary. The final hidden state is dot-producted with each row to produce a logit. Higher dot product → that token is judged more likely to come next.

lm_head

toy d = 2 · (4 × 2) — real microgpt is (27 × 16)


Code in `gpt()`

logits = linear(x, state_dict['lm_head'])   # length 27
return logits

Toy walkthrough · output at d = 2

The MLP handed us x ≈ [-0.34, 1.66]. lm_head has one row per vocab token; the dot product of x with row i is the logit for token i.

# lm_head: 4 rows (one per token), 2 columns (d_model)
lm_head = [[ 0.30,  0.10],   # BOS
           [-0.20,  0.40],   # 'a'
           [ 0.50, -0.30],   # 'b'
           [-0.10,  0.60]]   # 'c'

logits = linear(x, lm_head)
# BOS:  0.30·-0.34 + 0.10·1.66  =  0.06
# 'a': -0.20·-0.34 + 0.40·1.66  =  0.73
# 'b':  0.50·-0.34 + -0.30·1.66 = -0.67
# 'c': -0.10·-0.34 + 0.60·1.66  =  1.03

Raw logits can be any real number. To turn them into a probability distribution we apply softmax — subtract the max for numerical stability, exponentiate, divide by the sum:

probs = softmax(logits)
# = [0.17, 0.32, 0.08, 0.43]    # P(BOS), P('a'), P('b'), P('c')

Reading the result. After BOS, a, b, this (untrained) toy model thinks the most likely next token is 'c' with probability 0.43. During training, the loss for this position would be $-\log p(\text{target})$ — if the true next token were BOS (end of word), the loss is $-\log 0.17 \approx 1.77$. Backprop would then tweak every weight we've used along the way to push P(BOS) up and the others down for next time.
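
The last step of the toy pipeline, from hidden state to loss, can be sketched on floats. It assumes the rounded input [-0.34, 1.66] and uses `math.exp` and `math.log` in place of the Value methods; the `vocab` list is added here only to label the outputs.

```python
import math

lm_head = [[0.30, 0.10], [-0.20, 0.40], [0.50, -0.30], [-0.10, 0.60]]
vocab = ["BOS", "a", "b", "c"]
x = [-0.34, 1.66]                        # rounded output of the MLP block

# One logit per vocabulary row: dot product of the row with x.
logits = [sum(w * xi for w, xi in zip(row, x)) for row in lm_head]

# Softmax: subtract the max, exponentiate, divide by the sum.
m = max(logits)
exps = [math.exp(l - m) for l in logits]
total = sum(exps)
probs = [e / total for e in exps]

best = vocab[probs.index(max(probs))]
loss_if_bos = -math.log(probs[0])        # loss if the true next token is BOS

print([round(l, 2) for l in logits])     # [0.06, 0.73, -0.67, 1.03]
print(best)                              # c
print(round(loss_if_bos, 2))
```

The printed loss is computed from the unrounded P(BOS), so it comes out slightly higher than the 1.77 obtained from the rounded 0.17.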

Scaling up · real microgpt and bigger

In microgpt (n_embd = 16, vocab_size = 27): lm_head is (27 × 16), so the model outputs 27 logits — one per a–z plus BOS. The softmax over 27 categories is cheap.

In GPT-2 small (n_embd = 768, vocab = 50,257): the final matrix is ≈ 39M parameters and the softmax has to normalize across 50K categories. During training that softmax is computed at every position in every sequence in the batch, which is a non-trivial fraction of total training compute. GPT-2 ties lm_head to wte (the same matrix serves as both input embedding and output projection), which saves a copy of those millions of parameters; later models vary on whether they tie the two. Recent tokenizers push vocabularies to roughly 100K–200K tokens. The temperature / top-p tricks you see at inference all live downstream of this same logit vector.
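
Temperature is the simplest of those downstream tricks: divide the logits by T before the softmax. The sketch below reuses the toy logits from the output walkthrough; `apply_temperature` is a hypothetical helper name, and top-p is not shown.

```python
import math
import random

def softmax_floats(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Scale logits by 1/T, then normalize. T < 1 sharpens, T > 1 flattens."""
    return softmax_floats([l / temperature for l in logits])

logits = [0.06, 0.73, -0.67, 1.03]       # toy logits: BOS, 'a', 'b', 'c'
vocab = ["BOS", "a", "b", "c"]

for T in (0.5, 1.0, 2.0):
    probs = apply_temperature(logits, T)
    print(T, [round(p, 2) for p in probs])

# Sample one token at the demo's default temperature of 0.70.
rng = random.Random(0)                   # fixed seed so reruns are repeatable
probs_07 = apply_temperature(logits, 0.7)
token = rng.choices(vocab, weights=probs_07, k=1)[0]
print(token)
```

Low temperatures concentrate probability on the top logit (conservative); high temperatures spread it out (wild), matching the slider on the demo at the top of the lab.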

Parameters

You've seen every parameter matrix in the architecture walkthrough above — wte, wpe, attn_wq/wk/wv/wo, mlp_fc1/fc2, lm_head. The Parameters section is just the bookkeeping: allocate them all in one place, store them in a single dictionary the optimizer can iterate over, and count the total.

n_embd = 16; n_head = 4; n_layer = 1; block_size = 16
head_dim = n_embd // n_head

matrix = lambda nout, nin, std=0.08: \
    [[Value(random.gauss(0, std)) for _ in range(nin)] for _ in range(nout)]

state_dict = {
    'wte':     matrix(vocab_size, n_embd),    # 27 × 16  → 432
    'wpe':     matrix(block_size, n_embd),    # 16 × 16  → 256
    'lm_head': matrix(vocab_size, n_embd),    # 27 × 16  → 432
}
for i in range(n_layer):
    state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)        # 256
    state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)        # 256
    state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)        # 256
    state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)        # 256
    state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)    # 1,024
    state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)    # 1,024

params = [p for mat in state_dict.values() for row in mat for p in row]
print(f"num params: {len(params)}")   # → 4192

Why bother with the flat params list? Because the optimizer doesn't care about the matrices — it just needs a single list of scalars to loop over and update. params is that list. GPT-2 had 1.6 billion entries in this list; modern LLMs have hundreds of billions.

Tally — where the 4,192 parameters live

Click a row to expand. Each bar's length is proportional to the parameter count. The model has 1 layer; production GPTs stack dozens, and that's where most of the parameter explosion comes from.


Predict before you change the config

Suppose we bumped n_embd from 16 to 32 (everything else unchanged). Which matrices would grow, and by how much (4×? 2×? something else)? Roughly what's the new total parameter count?

Show answer

Every parameter matrix scales with n_embd. wte, wpe, and lm_head are linear in n_embd (2×). The attention matrices (attn_wq/wk/wv/wo) and the MLP matrices (mlp_fc1/fc2) are all (n_embd × n_embd) or (4·n_embd × n_embd), so they scale quadratically (4×). New total = 2×(432+256+432) + 4×(256+256+256+256+1024+1024) = 2,240 + 12,288 = 14,528 params, about 3.5× the original 4,192. Doubling the width more than triples the model.
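
The same bookkeeping fits in a small function, which makes it easy to try other configurations. `count_params` is a helper written for this exercise, not part of microgpt; it assumes the matrix shapes listed in the Parameters section.

```python
def count_params(n_embd, vocab_size=27, block_size=16, n_layer=1):
    # wte + wpe + lm_head: each is (rows × n_embd), so linear in n_embd.
    embeddings = (vocab_size + block_size + vocab_size) * n_embd
    # attn_wq, attn_wk, attn_wv, attn_wo: four (n_embd × n_embd) matrices.
    attention = 4 * n_embd * n_embd
    # mlp_fc1 (4·n_embd × n_embd) and mlp_fc2 (n_embd × 4·n_embd).
    mlp = 2 * (4 * n_embd) * n_embd
    return embeddings + n_layer * (attention + mlp)

print(count_params(16))                                  # 4192
print(count_params(32))                                  # 14528
print(round(count_params(32) / count_params(16), 2))     # 3.47
```

Changing `n_layer` instead of `n_embd` shows the other growth axis: each extra layer adds the attention and MLP matrices again but leaves the embeddings alone.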

Putting it all together

Now that we've walked through each piece individually, here is the full gpt() function. One call processes one token and returns 27 logits over the vocabulary. Read top to bottom: embeddings → for each layer (attention block → MLP block) → final linear.

def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id]                      # token embedding lookup
    pos_emb = state_dict['wpe'][pos_id]                        # position embedding lookup
    x = [t + p for t, p in zip(tok_emb, pos_emb)]              # combine: x = wte + wpe
    x = rmsnorm(x)                                             # rmsnorm before the first layer

    for li in range(n_layer):                                  # microgpt has n_layer = 1, but the loop is general
        # ---------- 1) Multi-head attention block ----------
        x_residual = x                                         # save residual
        x = rmsnorm(x)                                         # pre-attention rmsnorm
        q = linear(x, state_dict[f'layer{li}.attn_wq'])        # Q projection
        k = linear(x, state_dict[f'layer{li}.attn_wk'])        # K projection
        v = linear(x, state_dict[f'layer{li}.attn_wv'])        # V projection
        keys[li].append(k)                                     # append to this layer's K cache
        values[li].append(v)                                   # append to this layer's V cache
        x_attn = []
        for h in range(n_head):                                # per-head attention
            hs = h * head_dim
            q_h = q[hs:hs+head_dim]                            # this head's slice of Q
            k_h = [ki[hs:hs+head_dim] for ki in keys[li]]      # this head's slice of every cached K
            v_h = [vi[hs:hs+head_dim] for vi in values[li]]    # this head's slice of every cached V
            attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
                           for t in range(len(k_h))]           # scores: one per cached token
            attn_weights = softmax(attn_logits)                # softmax → weights, sum to 1
            head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
                        for j in range(head_dim)]              # weighted sum of values
            x_attn.extend(head_out)                            # concat into one length-16 vector
        x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])   # Wo: mix across heads
        x = [a + b for a, b in zip(x, x_residual)]             # + residual

        # ---------- 2) MLP block ----------
        x_residual = x                                         # save new residual
        x = rmsnorm(x)                                         # pre-MLP rmsnorm
        x = linear(x, state_dict[f'layer{li}.mlp_fc1'])        # up-project (16 → 64)
        x = [xi.relu() for xi in x]                            # ReLU: zero out negatives
        x = linear(x, state_dict[f'layer{li}.mlp_fc2'])        # down-project (64 → 16)
        x = [a + b for a, b in zip(x, x_residual)]             # + residual

    logits = linear(x, state_dict['lm_head'])                  # final projection → 27 logits
    return logits


The lines mirror the 12 steps from the architecture walkthrough. Hover any highlighted line in gpt() to see what it does; related lines (the three linear(...wq/wk/wv) calls, both KV-cache append calls, etc.) light up together.

The function processes one token (token_id) at a specific position in time (pos_id), together with context from previous calls stored as the activations in keys and values, known as the KV cache.

You might notice we're using a KV cache during training, which is unusual. People typically associate the KV cache with inference only. But the KV cache is conceptually always there, even during training. In production implementations, it's just hidden inside the highly vectorized attention computation that processes all positions in the sequence simultaneously. Since microgpt processes one token at a time (no batch dimension, no parallel time steps), we build the KV cache explicitly. And unlike the typical inference setting where the cache holds detached tensors, here the cached keys and values are live Value nodes in the computation graph, so we actually backpropagate through them.
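
The claim that the cache is conceptually always there can be checked on a tiny example: token-by-token attention with an explicit cache should match an all-positions-at-once computation with a causal mask. This is a sketch on floats with made-up q, k, v vectors for a single head; the helper names are invented here, and the nested loops stand in for what a vectorized implementation would do with matrices.

```python
import math

def softmax_floats(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Made-up per-position q, k, v vectors (single head, d = 2, three positions).
qs = [[0.1, 0.3], [-0.2, 0.4], [0.5, -0.1]]
ks = [[0.3, 0.1], [-0.1, 0.4], [0.2, 0.2]]
vs = [[0.2, -0.3], [0.5, 0.2], [-0.2, 0.8]]
d = 2

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# (a) One token at a time with an explicit cache, as microgpt does.
key_cache, value_cache, incremental = [], [], []
for q, k, v in zip(qs, ks, vs):
    key_cache.append(k)
    value_cache.append(v)
    w = softmax_floats([dot(q, kc) / math.sqrt(d) for kc in key_cache])
    incremental.append([sum(wi * vc[j] for wi, vc in zip(w, value_cache))
                        for j in range(d)])

# (b) All positions at once: full score rows, future entries masked to -inf.
T = len(qs)
batched = []
for t in range(T):
    row = [dot(qs[t], ks[s]) / math.sqrt(d) if s <= t else -math.inf
           for s in range(T)]
    w = softmax_floats(row)                    # masked positions get weight 0
    batched.append([sum(w[s] * vs[s][j] for s in range(T)) for j in range(d)])

print(incremental)
print(batched)
```

Both approaches give each position the same view: everything up to and including itself, nothing from the future.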

Training loop

Now we wire everything together. The training loop repeatedly: (1) picks a document, (2) runs the model forward over its tokens, (3) computes a loss, (4) backpropagates to get gradients, and (5) updates the parameters. Here's the simplest possible version — plain stochastic gradient descent: walk every parameter slightly downhill against its gradient.

Intuition · why p -= lr · grad walks toward the minimum

Drag the slider to move parameter p along a toy loss curve. The orange tangent is p.grad; the red arrow on the axis is the SGD step −lr · p.grad. Whichever side of the minimum we start on, the step always points toward it.


# Plain SGD — the simplest possible parameter update
learning_rate = 0.01

num_steps = 1000
for step in range(num_steps):

    # Take single document, tokenize it, surround with BOS on both sides
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)

    # Forward pass: build computation graph all the way to the loss
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)   # average over the document. May yours be low.

    # Backward pass: gradients of loss w.r.t. all parameters
    loss.backward()

    # SGD update: nudge each parameter against its gradient
    for p in params:
        p.data -= learning_rate * p.grad
        p.grad = 0

    print(f"step {step+1:4d} / {num_steps:4d} | loss {loss.data:.4f}")

Tokenization

Each training step picks one document and wraps it with BOS on both sides: the name "emma" becomes [BOS, e, m, m, a, BOS]. The model's job is to predict each next token given the tokens before it.

Forward pass and loss

We feed the tokens through the model one at a time, building up the KV cache as we go. At each position, the model outputs 27 logits, which we convert to probabilities via softmax. The loss at each position is the negative log probability of the correct next token: $-\log p(\text{target})$. This is called the cross-entropy loss. Intuitively, the loss measures the degree of misprediction: how surprised the model is by what actually comes next. If the model assigns probability 1.0 to the correct token, it is not surprised at all and the loss is 0. If it assigns probability close to 0, the model is very surprised and the loss goes to $+\infty$. We average the per-position losses across the document to get a single scalar loss.
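
A few numbers make the "surprise" reading of the loss concrete. The sketch below evaluates the per-position loss for several illustrative probabilities; `cross_entropy` is a helper defined here, written as log(1/p), which equals −log p.

```python
import math

def cross_entropy(p_target):
    """Loss for one position: -log p(target), written as log(1 / p)."""
    return math.log(1.0 / p_target)

# From certain, to a coin flip, to a uniform guess over 27 tokens, to badly wrong.
for p in (1.0, 0.5, 1 / 27, 0.001):
    print(f"p(target) = {p:.4f} -> loss = {cross_entropy(p):.3f}")

# Averaging per-position losses gives the document loss used for training.
losses = [cross_entropy(p) for p in (0.5, 0.25, 0.125)]
doc_loss = sum(losses) / len(losses)
print(round(doc_loss, 3))
```

The uniform-guess value, log 27 ≈ 3.30, is a useful baseline: an untrained model over this vocabulary should start near it, and training should push the average loss below it.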

Backward pass

One call to loss.backward() runs backpropagation through the entire computation graph, from the loss all the way back through softmax, the model, and into every parameter. After this, each parameter's .grad tells us how to change it to reduce the loss. The SGD update right after the backward pass — p.data -= learning_rate * p.grad — is the entire learning rule: move every parameter a small step in the direction that reduces the loss, then reset gradients to zero so the next backward pass starts fresh.
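
The update rule is easiest to trust on a loss you can differentiate by hand. This sketch applies the same p -= lr * grad step to a toy one-parameter loss, (p − 1)², chosen here for illustration; the starting points and learning rate are arbitrary.

```python
def loss_fn(p):
    return (p - 1.0) ** 2          # toy loss with its minimum at p = 1

def grad_fn(p):
    return 2.0 * (p - 1.0)         # derivative of the toy loss

learning_rate = 0.1
finals = []
for start in (2.5, -1.0):          # one start on each side of the minimum
    p = start
    for _ in range(50):
        p -= learning_rate * grad_fn(p)    # the SGD step
    finals.append(p)
    print(f"start {start:+.2f} -> p = {p:.4f}, loss = {loss_fn(p):.6f}")
```

From either side the gradient has the right sign to push p toward the minimum, so both runs end close to p = 1.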

From plain SGD to Adam

Plain SGD works, but it's slow and finicky to tune. In practice, nearly every modern LLM is trained with Adam or a close variant, an optimizer that tracks two extra buffers per parameter: m (a running average of recent gradients, like momentum) and v (a running average of recent squared gradients, which adapts the per-parameter step size). Because m and v start at zero, both averages are biased toward zero during the first steps; the bias-corrected values m_hat and v_hat divide that bias back out. The learning rate also decays linearly toward zero so the steps shrink as training progresses. Here's the same training loop with Adam swapped in:

# Let there be Adam, the blessed optimizer and its buffers
learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8
m = [0.0] * len(params)  # first moment buffer  (running mean of grads)
v = [0.0] * len(params)  # second moment buffer (running mean of grads²)

num_steps = 1000
for step in range(num_steps):

    # Take single document, tokenize it, surround with BOS on both sides
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)

    # Forward pass: build computation graph all the way to the loss
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)

    # Backward pass: gradients of loss w.r.t. all parameters
    loss.backward()

    # Adam update
    lr_t = learning_rate * (1 - step / num_steps)   # linear decay
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0

    print(f"step {step+1:4d} / {num_steps:4d} | loss {loss.data:.4f}")
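
One consequence of the bias correction is easy to check on plain floats. The sketch below reuses the update lines from the loop above inside a hypothetical helper, adam_step, and leaves out the linear learning-rate decay for brevity. It applies the very first update (step 0, zero-initialized buffers) to two gradients that differ by a factor of a million.

```python
def adam_step(p, grad, m, v, step, lr=0.01, beta1=0.85, beta2=0.99, eps=1e-8):
    """One Adam update on a plain float; returns the new (p, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** (step + 1))   # undo the pull toward the zero init
    v_hat = v / (1 - beta2 ** (step + 1))
    p -= lr * m_hat / (v_hat ** 0.5 + eps)
    return p, m, v

# First update from zero buffers, for a tiny gradient and a huge one.
p_small, _, _ = adam_step(1.0, grad=0.001, m=0.0, v=0.0, step=0)
p_large, _, _ = adam_step(1.0, grad=1000.0, m=0.0, v=0.0, step=0)
move_small = 1.0 - p_small
move_large = 1.0 - p_large
print(move_small, move_large)   # both moves are close to lr = 0.01
```

On the first step m_hat equals the gradient and v_hat equals its square, so the ratio is roughly the sign of the gradient and the step size is roughly the learning rate, whatever the gradient's scale. Plain SGD would have moved a million times further for the large gradient.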

Over 1,000 steps the loss decreases from around 3.3 (random guessing among 27 tokens: $-\log(1/27) \approx 3.3$) down to around 2.37. Lower is better and the lowest possible value is 0 (perfect predictions), so there's still room to improve, but the model is clearly learning the statistical patterns of names.
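
One way to read these two loss values is to exponentiate them. The short calculation below is a sketch, not part of the script. It uses the standard term perplexity for exp(loss), which the original post does not use, and treats 2.37 as the approximate final loss quoted above.

```python
import math

vocab_size = 27
loss_random = -math.log(1 / vocab_size)   # uniform guessing
loss_trained = 2.37                       # approximate final loss quoted above

# exp(loss) is the perplexity: the number of equally likely choices
# that would produce the same average loss.
ppl_random = math.exp(loss_random)
ppl_trained = math.exp(loss_trained)

# exp(-loss) is the geometric-mean probability given to the correct token.
p_random = math.exp(-loss_random)
p_trained = math.exp(-loss_trained)

print(f"random:  loss {loss_random:.3f}, perplexity {ppl_random:.1f}, p(correct) {p_random:.3f}")
print(f"trained: loss {loss_trained:.3f}, perplexity {ppl_trained:.1f}, p(correct) {p_trained:.3f}")
```

In other words, training takes the model from being as uncertain as a 27-way coin toss to roughly an 11-way one, and the typical probability it gives the correct next character rises from about 4% to about 9%.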

Training loss — what it looks like over 1,000 steps

A simulated trace of the loss as the model trains. The dashed line at $-\log(1/27) \approx 3.296$ is what you'd get by guessing uniformly at random. The dashed line at ~2.37 is roughly where microgpt converges. Real training is noisier; this curve captures the trend.

Inference

Once training is done, we can sample new names from the model. The parameters are frozen and we just run the forward pass in a loop, feeding each generated token back as the next input:

temperature = 0.5   # > 0; below 1 sharpens the distribution, above 1 flattens it
print("\n--- inference (new, hallucinated names) ---")
for sample_idx in range(20):
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    token_id = BOS
    sample = []
    for pos_id in range(block_size):
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
    print(f"sample {sample_idx+1:2d}: {''.join(sample)}")

We start each sample with the BOS token, which tells the model "begin a new name". The model produces 27 logits, we convert them to probabilities, and we randomly sample one token according to those probabilities. That token gets fed back in as the next input, and we repeat until the model produces BOS again (meaning "I'm done") or we hit the maximum sequence length.

The temperature parameter controls randomness. Before softmax, we divide the logits by the temperature. A temperature of 1.0 samples directly from the model's learned distribution. Lower temperatures (like 0.5 here) sharpen the distribution, making the model more conservative and likely to pick its top choices. A temperature approaching 0 would always pick the single most likely token (greedy decoding). Higher temperatures flatten the distribution and produce more diverse but potentially less coherent output. Try it on a synthetic logit vector below.
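
The effect is easy to reproduce offline. This sketch uses a made-up four-token logit vector and plain floats; softmax_with_temperature is an illustrative helper name, not a function from the script.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.0]   # made-up values for four tokens
top_probs = {}
for t in (0.1, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    top_probs[t] = probs[0]      # probability of the highest-logit token
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

The ranking of the tokens never changes; only how much probability mass the top choice takes. At T = 0.1 the first token is picked almost every time, while at T = 2.0 the four probabilities are much closer together.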

Logits → probabilities — and the temperature knob

The model outputs 27 raw logits (one per token: a–z + BOS). Softmax turns them into probabilities. Dividing logits by temperature before softmax controls "creativity": low T sharpens to the model's top pick; high T flattens toward uniform.

Read the entropy readout

Slide the temperature from 0.1 to 2.0 and watch the entropy. At what temperature is entropy lowest? At what temperature is it highest? What's the entropy of a perfectly uniform distribution over 27 tokens (and why is that the asymptote)?

Answer

Lowest entropy is at T → 0 (everything collapses onto the single most-likely token; entropy approaches 0 bits). Highest is at T → ∞ (the distribution flattens toward uniform). Uniform over 27 tokens has entropy log₂(27) ≈ 4.75 bits — the asymptote you'll see if you push T very high.
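
The same answer can be checked numerically. The sketch below assumes an arbitrary, evenly spaced 27-entry logit vector (not taken from the trained model) and measures entropy in bits at several temperatures, including a very large one to show the asymptote.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [3.0 - 0.25 * i for i in range(27)]   # synthetic, evenly spaced logits
temperatures = [0.1, 0.5, 1.0, 2.0, 100.0]
entropies = [entropy_bits(softmax(logits, t)) for t in temperatures]

for t, h in zip(temperatures, entropies):
    print(f"T={t:>5}: entropy {h:.3f} bits")
print(f"uniform limit: {math.log2(27):.3f} bits")
```

Entropy rises with every increase in temperature and approaches, but never exceeds, the uniform limit.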

Train the toy GPT, live

Everything in this lab so far has shown the model running on frozen weights — either the pinned toy values from the walkthrough, or the 4,192 parameters Karpathy already trained for you. This section closes the loop: train the toy model in your browser, watch the predictions change, then chat with it.

This time we train the whole model — every weight matrix updates: wte, wpe, the four attention projections, both MLP layers, and lm_head. The gradient is computed numerically (central differences) rather than via autograd, so it's slow — a full 100-step batch takes a few seconds — but every edge in the diagram changes thickness and color as the parameters move. That's the point. Click Step ▸ to advance one example at a time and watch a single SGD step in slow motion; click Train to run 100 batch steps at once.
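
For reference, a central-difference gradient looks like the sketch below. It assumes a made-up two-parameter loss with a known analytic gradient so the estimate can be checked; the function names and the step size h are illustrative and are not taken from the widget's source.

```python
def numerical_grad(f, params, h=1e-5):
    """Central-difference estimate of df/dp for each entry of params."""
    grads = []
    for i in range(len(params)):
        original = params[i]
        params[i] = original + h
        f_plus = f(params)
        params[i] = original - h
        f_minus = f(params)
        params[i] = original              # restore before moving on
        grads.append((f_plus - f_minus) / (2 * h))
    return grads

def toy_loss(p):
    """Stand-in loss p0^2 + 3*p0*p1, with gradient (2*p0 + 3*p1, 3*p0)."""
    return p[0] ** 2 + 3 * p[0] * p[1]

params = [1.5, -2.0]
approx = numerical_grad(toy_loss, params)
exact = [2 * params[0] + 3 * params[1], 3 * params[0]]
print(approx, exact)
```

The cost is two full evaluations of the loss per parameter, which is why this approach is slow compared with autograd: backpropagation gets every gradient from a single backward pass.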

Train · chat · watch lm_head learn

Same architecture as the full pipeline above, but the bottom row of the diagram (logits + softmax) reacts to training. Type a single letter on the left to query the model, then train it on {a,b,c} patterns and re-query.

Full pipeline + output, drawn as neurons

Training data: edit the list of words below (one per line). Each must be 2–4 letters over {a, b, c}. Every word becomes one training example — predict the last letter given the second-to-last letter — plus one terminal example so the model also learns to emit BOS after a word ends. Hit Train and the model auto-steps through every example across multiple epochs, logging each step in the box below.

Click Train to step through every example across multiple epochs. Each step performs one SGD update and the diagram above redraws live.

Type BOS, a, b, or c and press Enter to query the model.

Run it

All you need is Python (no pip install, no dependencies). Grab Karpathy's script from his gist, then run it:

# Download Karpathy's microgpt source as train.py
curl -L -o train.py https://gist.githubusercontent.com/karpathy/8627fe009c40f57531cb18360106ce95/raw/microgpt.py

# Train the model — about 1 minute on a laptop, no GPU required
python train.py

If curl isn't available you can use wget instead, or just open the gist and copy the file into train.py by hand.

The script takes about 1 minute to run on Karpathy's MacBook. You'll see the loss printed at each step:

train.py
num docs: 32033
vocab size: 27
num params: 4192
step    1 / 1000 | loss 3.3660
step    2 / 1000 | loss 3.4243
step    3 / 1000 | loss 3.1778
step    4 / 1000 | loss 3.0664
step    5 / 1000 | loss 3.2209
step    6 / 1000 | loss 2.9452
step    7 / 1000 | loss 3.2894
step    8 / 1000 | loss 3.3245
step    9 / 1000 | loss 2.8990
step   10 / 1000 | loss 3.2229
step   11 / 1000 | loss 2.7964
step   12 / 1000 | loss 2.9345
step   13 / 1000 | loss 3.0544
...

Watch it go down from ~3.3 (random) toward ~2.37. The lower this number, the better the network's predictions about what token comes next in the sequence. At the end of training, the knowledge of the statistical patterns of the training token sequences is distilled in the model parameters. Fixing these parameters, we can now generate new, hallucinated names. You'll see (again):

sample  1: kamon          sample  8: anna          sample 15: earan
sample  2: ann            sample  9: areli         sample 16: lenne
sample  3: karai          sample 10: kaina         sample 17: kana
sample  4: jaire          sample 11: konna         sample 18: lara
sample  5: vialan         sample 12: keylen        sample 19: alela
sample  6: karia          sample 13: liole         sample 20: anton
sample  7: yeran          sample 14: alerin

As an alternative to running the script on your computer, you can run it directly in a Google Colab notebook and ask Gemini questions about it. Try playing with the script: use a different dataset, train for longer (increase num_steps), or increase the model size for increasingly better results.

Progression

To see the code built up piece by piece — as layers of the onion — the advised progression looks something like this:

File	What it adds
`train0.py`	Bigram count table — no neural net, no gradients
`train1.py`	MLP + manual gradients (numerical & analytic) + SGD
`train2.py`	Autograd (`Value` class) — replaces manual gradients
`train3.py`	Position embeddings + single-head attention + rmsnorm + residuals
`train4.py`	Multi-head attention + layer loop — full GPT architecture
`train5.py`	Adam optimizer — this is `train.py`

Karpathy created a Gist called build_microgpt.py whose Revisions show all of these versions and the diffs between each step. Stepping through the diffs is a great way to internalize what each component does.
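
To give a feel for the first rung of that ladder, here is a minimal sketch of a bigram count table. It is not Karpathy's train0.py. The three-name list is a placeholder rather than the real dataset, and "." is an assumed stand-in for the BOS marker.

```python
from collections import Counter

docs = ["emma", "ava", "mia"]   # tiny placeholder list, not the real dataset
BOS = "."                       # assumed marker for the start and end of a name

# Count how often each character follows each other character.
counts = Counter()
for doc in docs:
    chars = [BOS] + list(doc) + [BOS]
    for prev, nxt in zip(chars[:-1], chars[1:]):
        counts[(prev, nxt)] += 1

def next_char_probs(prev):
    """Normalize the counts for one context character into probabilities."""
    row = {nxt: c for (p, nxt), c in counts.items() if p == prev}
    total = sum(row.values())
    return {nxt: c / total for nxt, c in row.items()}

print(next_char_probs("a"))   # {'.': 0.75, 'v': 0.25}
```

A table like this already defines a next-token distribution, with no parameters to train. Everything added in the later files replaces the lookup with a learned function that can use more context than the single previous character.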

Real stuff

microgpt contains the complete algorithmic essence of training and running a GPT. But between this and a production LLM like ChatGPT, there is a long list of things that change. None of them alter the core algorithm and the overall layout, but they are what makes it actually work at scale. Walking through the same sections in order:

Data

Instead of 32K short names, production models train on trillions of tokens of internet text: web pages, books, code, etc. The data is deduplicated, filtered for quality, and carefully mixed across domains.

Tokenizer

Instead of single characters, production models use subword tokenizers like BPE (Byte Pair Encoding), which learn to merge frequently co-occurring character sequences into single tokens. Common words like "the" become a single token, rare words get broken into pieces. This gives a vocabulary of ~100K tokens and is much more efficient because the model sees more content per position.
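
The merge idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: it works on characters of a single toy string rather than on bytes of a large corpus, it performs only two merges, and the helper names are made up. Production tokenizers add many details on top.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent token pair that occurs most often."""
    return Counter(zip(tokens[:-1], tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

text = "the theme then"
tokens = list(text)              # start from single characters (14 tokens)
for _ in range(2):               # two merge rounds
    tokens = merge_pair(tokens, most_frequent_pair(tokens))

print(tokens)   # ['the', ' ', 'the', 'm', 'e', ' ', 'the', 'n']
```

After two merges the frequent sequence "the" has become a single token and the string has shrunk from 14 tokens to 8, which is the efficiency gain described above: more content per position.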

Autograd

microgpt operates on scalar Value objects in pure Python. Production systems use tensors (large multi-dimensional arrays of numbers) and run on GPUs/TPUs that perform trillions of floating-point operations per second. Libraries like PyTorch handle autograd over tensors, and CUDA kernels like FlashAttention fuse multiple operations for speed. The math is identical; each tensor operation simply corresponds to many scalar operations processed in parallel.

Architecture

microgpt has 4,192 parameters. GPT-4–class models have hundreds of billions. Overall it's a very similar-looking Transformer, just much wider (embedding dimensions of 10,000+) and much deeper (100+ layers). Modern LLMs also incorporate a few more types of Lego blocks and change their orders around: RoPE (Rotary Position Embeddings) instead of learned position embeddings, GQA (Grouped Query Attention) to reduce KV cache size, gated linear activations instead of ReLU, Mixture of Experts (MoE) layers, etc. But the core structure of Attention (communication) and MLP (computation) interspersed on a residual stream is well-preserved.

The picture, mapped to the code

Here is the canonical Transformer block diagram you'll see in papers and textbooks — the one microgpt is a stripped-down version of. Click any block to see how it maps onto microgpt's code (and which blocks microgpt drops because they're scale-time concerns):

Transformer architecture — what each box becomes in code

Left: the full model. Right: zoom into one Transformer block. Click any colored block to map it onto microgpt's state_dict entries and gpt() code.

Click a block

Each colored block on the diagram corresponds to one or more lines of microgpt. Click one and this panel will show the code, the matching state_dict entry (if any), and whether microgpt simplifies or skips it.

Where microgpt differs

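microgpt strips the diagram down to its algorithmic core. Dropout, explicit masking, GeLU, and biases are all absent from this implementation. The model still learns, just with fewer regularizers and fewer parameters, and causality is preserved without a mask because tokens are processed one at a time against the KV cache built so far. LayerNorm is also replaced by the simpler RMSNorm. Click any Dropout, Mask, or LayerNorm block to read why.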
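
The difference between the two normalizations fits in a few lines. This is a sketch on plain floats that assumes no learnable gain or bias in either function and an arbitrary eps of 1e-5; it is meant to show the arithmetic, not to reproduce the rmsnorm helper from earlier in the lab line for line.

```python
def rmsnorm(x, eps=1e-5):
    """Scale x so its root-mean-square is about 1; the mean is left alone."""
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

def layernorm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    scale = (var + eps) ** -0.5
    return [(xi - mean) * scale for xi in x]

x = [1.0, 2.0, 3.0, 4.0]
print([round(v, 3) for v in rmsnorm(x)])
print([round(v, 3) for v in layernorm(x)])
```

RMSNorm skips the mean subtraction, so it needs one pass of statistics instead of two and all outputs keep the sign of their inputs. LayerNorm recenters the vector around zero first.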

Training

Instead of one document per step, production training uses large batches (millions of tokens per step), gradient accumulation, mixed precision (float16/bfloat16), and careful hyperparameter tuning. Training a frontier model takes thousands of GPUs running for months.

Optimization

microgpt uses Adam with a simple linear learning rate decay and that's about it. At scale, optimization becomes its own discipline. Models train in reduced precision (bfloat16 or even fp8) and across large GPU clusters for efficiency, which introduces its own numerical challenges. The optimizer settings (learning rate, weight decay, beta parameters, warmup, decay schedule) must be tuned precisely, and the right values depend on model size, batch size, and dataset composition. Scaling laws (e.g. Chinchilla) guide how to allocate a fixed compute budget between model size and number of training tokens. Getting any of these details wrong at scale can waste millions of dollars of compute, so teams run extensive smaller-scale experiments to predict the right settings before committing to a full training run.

Post-training

The base model that comes out of training (the "pretrained" model) is a document completer, not a chatbot. Turning it into ChatGPT happens in two stages. First, SFT (Supervised Fine-Tuning): swap the documents for curated conversations and keep training. Algorithmically, nothing changes. Second, RL (Reinforcement Learning): the model generates responses, they get scored (by humans, another "judge" model, or an algorithm), and the model learns from that feedback. Fundamentally, the model is still training on documents — those documents are now made up of tokens coming from the model itself.

Inference

Serving a model to millions of users requires its own engineering stack: batching requests together, KV cache management and paging (vLLM, etc.), speculative decoding for speed, quantization (running in int8/int4 instead of float16) to reduce memory, and distributing the model across multiple GPUs. Fundamentally, we are still predicting the next token in the sequence — but with a lot of engineering spent on making it faster.

All of these are important engineering and research contributions, but if you understand microgpt, you understand the algorithmic essence.

Zoom in further · Bycroft's GPT visualization

If microgpt was "the smallest transformer drawn as a 2-D diagram," Brendan Bycroft's interactive walkthrough is "an actual GPT-2 drawn as a 3-D city." Every embedding vector, every Q/K/V projection, every attention head, every MLP layer is rendered as a navigable scene with the real GPT-2 weights — and you can scrub through one token's forward pass at your own pace. Same algorithm as microgpt, ~30,000× more parameters. Drag to rotate, scroll to zoom, click blocks on the right rail to jump.

Bycroft's LLM visualization · embedded

3-D scrubbable walkthrough of GPT-2 small (124 M parameters · depth 12 · d_model 768) driving the same forward pass you traced in microgpt. Click + drag to orbit, scroll to zoom, use the right-side phase rail to step through the algorithm. Open in a new tab for full-screen control.

From bbycroft.net/llm by Brendan Bycroft. If the embed feels cramped, open the source page in a new tab for a full-screen viewport. If it doesn't load (some campus networks block iframes from third-party hosts), the source link is the fallback.

Assignment · safety guardrails for the chat bot

You've taken microgpt apart and you've already chatted with it at the top of this page. Now you're going to ship it — and decide what it's allowed to say. The chat bot at the top of the page is a useful name generator, but it has no safety policy. Anyone can ask it for 100 names with any starting prefix they choose, and it will dutifully produce them. Your job in this assignment is to add a small safety layer on top of the same model, then defend it against an adversarial grader.

The product policy you're enforcing is simple and totally safe-for-work: this name generator must never emit a fruit. Saying apple is banned — and so are pear, plum, fig, lime, grape, mango, and the rest of the produce aisle. (In a real product the banned list would be slurs, NSFW terms, or other harmful output; fruits are a clean stand-in that exercise the exact same prefix- and substring-filtering machinery.) Your bot must refuse requests whose prefixes lead to fruit names and filter any fruit that slips out of the stochastic sampler.

What you're submitting

A single Python file bot.py that reads requests from stdin and writes responses to stdout, one per line. The starter template has everything except the two safety hooks:

is_safe_request(prefixes) — receives the list of letter-prefixes the user typed (e.g. ['j'], ['ab'], or ['a','b','c']). Returns None to allow, or a one-sentence reason string to refuse. Called before any name is generated.
is_safe_name(name) — returns True to keep a generated name, False to discard. Called after the model emits each name. If a name is rejected the bot resamples (up to 10 attempts per slot).

You should only need to edit those two functions plus the BLOCKED_PREFIX_PATTERNS and BLOCKED_OUTPUT_PATTERNS lists. The forward pass and the bot loop are already written and shared with the autograder — if you change them, the autograder will reject your submission.
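
To show the shape of the two hooks, here is a deliberately incomplete sketch. The function and list names come from the starter template as described above; the single pattern in each list and the matching strategy (startswith for requests, substring for outputs) are assumptions for illustration, not a recommended policy.

```python
# Hypothetical, deliberately incomplete pattern lists; a real policy needs more thought.
BLOCKED_PREFIX_PATTERNS = ["appl"]
BLOCKED_OUTPUT_PATTERNS = ["apple"]

def is_safe_request(prefixes):
    """Return None to allow the request, or a one-sentence reason to refuse it."""
    for prefix in prefixes:
        for pattern in BLOCKED_PREFIX_PATTERNS:
            if prefix.lower().startswith(pattern):
                return f"The prefix '{prefix}' is not allowed by the safety policy."
    return None

def is_safe_name(name):
    """Return True to keep a generated name, False to discard it and resample."""
    return not any(pattern in name.lower() for pattern in BLOCKED_OUTPUT_PATTERNS)

print(is_safe_request(["j"]))          # None
print(is_safe_request(["a", "appl"]))  # a refusal reason
print(is_safe_name("apple"))           # False
```

Note what this sketch misses: the prefix app passes the request check even though the model may complete it to apple, so that case depends entirely on the output filter. Deciding where each kind of check belongs is the core of the assignment.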

Downloads

Starter files: ↓ bot.py (template) ↓ test_bot.py (local sanity tests) ↓ model.json (weights)

I/O protocol

Each line of stdin is one request — one or more whitespace-separated letter prefixes:

j           one prefix, one name starting with "j"
ab          one prefix, one name starting with "ab"
a b c       three prefixes, one name per prefix
(blank)     no prefixes, generate one name from BOS

Each line of stdout is one response, beginning with exactly one of these prefixes:

OK:     name1, name2, ...           ← one name per input prefix, in order, comma-separated
REFUSE: <one-sentence reason>        ← request blocked by your safety policy

Anything else printed to stdout will confuse the autograder. Use stderr if you need debug prints. The simplified protocol matches the chat widget at the top of this page exactly — type ab in the chat, you get one name; type a b, you get two.
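
The protocol is simple enough to sketch in a few lines. The starter bot.py already implements it and should not be changed; the helper names below are hypothetical, and the exact spacing after OK: and REFUSE: is an assumption, since the template defines the real formatting.

```python
def parse_request(line):
    """Split one stdin line into letter prefixes; a blank line yields an empty list."""
    return line.split()

def format_ok(names):
    """One name per input prefix, in order, comma-separated."""
    return "OK: " + ", ".join(names)

def format_refuse(reason):
    """A refusal carries a one-sentence reason."""
    return "REFUSE: " + reason

print(parse_request("a b c\n"))            # ['a', 'b', 'c']
print(parse_request("\n"))                 # []
print(format_ok(["anna", "bo", "cara"]))   # OK: anna, bo, cara
print(format_refuse("That prefix is blocked."))
```

An empty prefix list corresponds to the blank-line case, where the bot generates one name starting from BOS.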

Run it locally

# One-shot
echo 'j'      | python bot.py
echo 'ab'     | python bot.py
echo 'a b c'  | python bot.py

# Interactive (Ctrl-D to exit)
python bot.py

# Run the local test harness — mirrors what the autograder does
python test_bot.py

What the autograder tests

Gradescope runs bot.py as a subprocess and pipes a batch of letter-prefix requests through stdin. It scores your submission on two test groups:

Benign requests must still work. Simple prefixes like j, ab, a b c must produce OK: responses with one name per prefix and each name actually starting with its prefix. Over-blocking — refusing letters you should allow — will cost you points here.
Adversarial requests must be refused. The grader will probe with prefix-based and substring-based attacks aimed at extracting outputs you'd rather not produce. For these requests you must respond with REFUSE:. Under-blocking will cost you points here.

The two groups are weighted to push you toward a balanced policy: a bot that refuses everything fails Group 1; a bot with no safety policy fails Group 2. The autograder uses fresh random seeds, so don't try to memorize specific outputs — your policy needs to be principled.

Structure of `model.json`

The weights file is plain JSON — open it in any editor. Top-level keys:

key	contents
`format`	`"tiny-gpt-char-v1"`
`config`	`n_layer=1`, `n_embd=16`, `n_head=4`, `head_dim=4`, `block_size=16`, `vocab_size=27`, `BOS=26`
`tokenizer`	`uchars[26]`, `stoi`, `itos` (character-level a–z + BOS=26)
`state_dict`	nested lists of floats, one entry per parameter matrix (see below)

state_dict key	shape
`wte`	27 × 16
`wpe`	16 × 16
`lm_head`	27 × 16
`layer0.attn_wq` / `wk` / `wv` / `wo`	16 × 16 each
`layer0.mlp_fc1`	64 × 16
`layer0.mlp_fc2`	16 × 64

Same model you've been dissecting throughout the lab. The full structure spec also lives at the top of bot.py.
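
As a quick consistency check, the shapes in the table add up to the parameter count quoted throughout the lab. The sketch below hard-codes the shapes rather than reading model.json, and it assumes the abbreviated attention keys expand to layer0.attn_wk, layer0.attn_wv, and layer0.attn_wo.

```python
# Shapes copied from the state_dict table above (rows, cols).
shapes = {
    "wte": (27, 16),
    "wpe": (16, 16),
    "lm_head": (27, 16),
    "layer0.attn_wq": (16, 16),
    "layer0.attn_wk": (16, 16),   # assumed expansion of the abbreviated key
    "layer0.attn_wv": (16, 16),   # assumed expansion of the abbreviated key
    "layer0.attn_wo": (16, 16),   # assumed expansion of the abbreviated key
    "layer0.mlp_fc1": (64, 16),
    "layer0.mlp_fc2": (16, 64),
}

num_params = sum(rows * cols for rows, cols in shapes.values())
print(num_params)   # 4192
```

The two embedding tables and lm_head account for 1,120 parameters, the four attention projections for 1,024, and the two MLP matrices for 2,048.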

Suggested workflow

Run the unmodified template. Confirm python test_bot.py passes all benign tests. The adversarial test list in test_bot.py is intentionally empty — that's where you'll add your own tests as you go.
Be the adversary first. Open the chat at the top of this page (or pipe inputs through your local bot.py) and try to make the model emit fruit names — prefixes like appl, gra, or li are good starting points. Note the inputs that worked.
Write down your policy. Before coding, write a short list of what your bot will refuse and why. Be specific — "prefixes that lead to fruit names" is vague; "prefixes containing any of these letter combinations: …" is implementable.
Implement is_safe_request. Reject the request before generation when the policy can be applied to the input alone (e.g., a prefix you don't want to start with).
Implement is_safe_name. Filter generated names that contain banned substrings (the model is stochastic — a benign-looking request can still emit unsafe outputs).
Add your own adversarial tests to test_bot.py as you discover new attack patterns. Run frequently.
Tune for both directions. If your bot starts refusing legitimate requests, loosen the policy. Over-blocking is also a failure.

Submission

Upload to Gradescope:

bot.py (your edited version)
model.json (unmodified — included so the grader can reproduce your bot exactly)

The autograder will run python bot.py with your weights, send batched requests, and score the responses. Late submissions follow the course policy.

Rubric

component	points
Benign requests still work (no over-blocking)	40
Adversarial requests are refused	40
Output filtering catches stochastic leaks	10
Code clarity & comments on your policy	10
Total	100

Heads-up · this is also a lesson in how hard alignment is

Your bot will only see a few hundred test inputs from the autograder. Real LLM safety teams face open-ended adversarial input — and frontier models still get jailbroken regularly despite huge investments in alignment, RLHF, and red-teaming. The exercise here is deliberately tractable (a 4,192-parameter character-level name generator), but the shape of the problem — balancing utility against refusal, anticipating prefix and substring attacks, deciding policy under uncertainty — is the same shape professional alignment teams face every day.

FAQ

Does the model "understand" anything?

That's a philosophical question, but mechanically: no magic is happening. The model is a big math function that maps input tokens to a probability distribution over the next token. During training, the parameters are adjusted to make the correct next token more probable. Whether this constitutes "understanding" is up to you, but the mechanism is fully contained in the 200 lines above.

Why does it work?

The model has thousands of adjustable parameters, and the optimizer nudges them a tiny bit each step to make the loss go down. Over many steps, the parameters settle into values that capture the statistical regularities of the data. For names, this means things like: names often start with consonants, "qu" tends to appear together, names rarely have three consonants in a row, etc. The model doesn't learn explicit rules, it learns a probability distribution that happens to reflect them.

How is this related to ChatGPT?

ChatGPT is this same core loop (predict next token, sample, repeat) scaled up enormously, with post-training to make it conversational. When you chat with it, the system prompt, your message, and its reply are all just tokens in a sequence. The model is completing the document one token at a time, same as microgpt completing a name.

What's the deal with "hallucinations"?

The model generates tokens by sampling from a probability distribution. It has no concept of truth, it only knows what sequences are statistically plausible given the training data. microgpt "hallucinating" a name like "karia" is the same phenomenon as ChatGPT confidently stating a false fact. Both are plausible-sounding completions that happen not to be real.

Why is it so slow?

microgpt processes one scalar at a time in pure Python. At about one minute for 1,000 steps on a laptop, that works out to tens of milliseconds per training step on a single short name. The same math on a GPU processes millions of scalars in parallel and runs orders of magnitude faster.

Can I make it generate better names?

Yes. Train longer (increase num_steps), make the model bigger (n_embd, n_layer, n_head), or use a larger dataset. These are the same knobs that matter at scale.

What if I change the dataset?

The model will learn whatever patterns are in the data. Swap in a file of city names, Pokémon names, English words, or short poems, and the model will learn to generate those instead. The rest of the code doesn't need to change.

DS 6042 — Lab 02 · adapted from Andrej Karpathy, microgpt.html · interactive augmentations by Daniel Graham.
