Build Your First LLM from Scratch · Part 4 · Section 2 of 7

Intuition - What is Attention?

The Meeting Room Analogy

Imagine a meeting room where tokens are people:

"two", "plus", and "three" are sitting in a meeting.

"plus" asks: "Who should I pay attention to?"

"two" raises hand:  "I'm a number, sitting before you!"
"three" raises hand: "I'm also a number, sitting after you!"

"plus" decides: "I'll pay 50% attention to 'two' and 50% to 'three'"

Now "plus" knows: "I'm adding the number before me to the number after me"

What Attention Computes

For each token, attention answers: "How relevant is every other token to me?"

Let's see exactly what happens when our calculator processes "two plus three" during training:

Input: "two plus three"
Tokens: ["two", "plus", "three"]

STEP 1: Each token asks "Who should I pay attention to?"
---------------------------------------------------------

Token "two" computes attention weights:
  "two"   → 0.70  (I need to know what number I am)
  "plus"  → 0.20  (The operation affects my role)
  "three" → 0.10  (Less relevant to understanding myself)

Token "plus" computes attention weights:
  "two"   → 0.45  (I need the first operand!)
  "plus"  → 0.10  (I already know I'm an addition)
  "three" → 0.45  (I need the second operand!)

Token "three" computes attention weights:
  "two"   → 0.15  (The other number in the equation)
  "plus"  → 0.25  (The operation I'm part of)
  "three" → 0.60  (I need to know what number I am)

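
Here is a minimal sketch of Step 1 in PyTorch. It uses the simplest form of self-attention, where each token's embedding serves directly as its query and key; the learned projections that real models use come later in this part. The embedding values are made up:

import torch

# Made-up 4-dimensional embeddings for ["two", "plus", "three"].
x = torch.tensor([[0.8, 0.1, 0.3, 0.2],   # "two"
                  [0.1, 0.9, 0.4, 0.1],   # "plus"
                  [0.7, 0.2, 0.5, 0.3]])  # "three"

# Each token scores every token (itself included) by dot product,
# scaled by sqrt(d) to keep the values in a stable range.
d = x.shape[-1]
scores = x @ x.T / d ** 0.5               # shape (3, 3)

# Softmax over each row: row i holds token i's attention weights.
weights = torch.softmax(scores, dim=-1)
print(weights.sum(dim=-1))                # every row sums to 1.0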

STEP 2: Weighted combination updates each embedding
---------------------------------------------------

Before attention:
  "two"   = [0.8, 0.1, ...]  (just knows "I'm the number 2")
  "plus"  = [0.1, 0.9, ...]  (just knows "I'm addition")
  "three" = [0.7, 0.2, ...]  (just knows "I'm the number 3")

After attention:
  "two"   = [0.75, 0.3, ...]  (knows "I'm 2, being added to something")
  "plus"  = [0.5, 0.5, ...]   (knows "I'm adding 2 and 3")  ← KEY!
  "three" = [0.65, 0.4, ...]  (knows "I'm 3, being added to something")
Why "plus" matters most: Notice how "plus" gathers information from both numbers equally (0.45 each). After attention, the "plus" embedding now encodes the complete operation "2 + 3". This is why the final prediction often comes from the operation token—it has collected all the information needed to compute the answer.

Another Example: "five minus one"

Token "minus" computes attention weights:
  "five"  → 0.50  (I need the number I'm subtracting FROM)
  "minus" → 0.05  (I know I'm subtraction)
  "one"   → 0.45  (I need the number I'm subtracting)

After attention, "minus" embedding contains:
  - Information about 5 (the minuend)
  - Information about 1 (the subtrahend)
  - Its own subtraction semantics
  → Ready to predict "four"!
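
The same combination step in code, again with made-up numbers: the single weighted sum below is how the "minus" embedding ends up holding both operands at once.

import torch

# The "minus" row of attention weights from above.
weights = torch.tensor([0.50, 0.05, 0.45])

# Made-up 2-dimensional embeddings for ["five", "minus", "one"].
x = torch.tensor([[0.9, 0.1],   # "five"
                  [0.1, 0.8],   # "minus"
                  [0.2, 0.1]])  # "one"

# 0.50 * five + 0.05 * minus + 0.45 * one
minus_updated = weights @ x
print(minus_updated)  # carries information about 5, about 1, and about subtraction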

The key insight: attention lets each token gather exactly the information it needs from the other tokens. The model learns these attention patterns during training—we don't program them!
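
To make "the model learns them" concrete, here is a minimal single-head self-attention module. It's a sketch rather than the exact module we'll build (multi-head attention, masking, and batching come later); the point is that the attention patterns fall out of three learned weight matrices:

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention with learned query/key/value projections."""
    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                        # x: (seq_len, d_model)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.T / x.shape[-1] ** 0.5    # (seq_len, seq_len)
        weights = torch.softmax(scores, dim=-1)  # Step 1: who to attend to
        return weights @ v                       # Step 2: weighted combination

attn = SelfAttention(d_model=16)
x = torch.randn(3, 16)    # three token embeddings, e.g. "two plus three"
print(attn(x).shape)      # torch.Size([3, 16])

Training nudges W_q, W_k, and W_v until patterns like "operators attend to their operands" emerge on their own.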

Our Model vs. GPT-4 Scale

Aspect            | Our Calculator             | GPT-4 Scale
------------------|----------------------------|-------------------------------------------
Tokens per input  | 3-5 tokens                 | Thousands of tokens
Patterns learned  | Operations look at numbers | Complex: pronouns→nouns, verbs→subjects, questions→context
Mechanism         | Same                       | Same, just more tokens and dimensions

Key insight: We don't program these patterns—the model learns them during training!