How Tokens Decide Who to Attend To
In the previous section, we saw that "plus" attends to "two" and "three" with weights like 0.45 each. But how does "plus" know to look at those tokens? That's where Q, K, V come in—they're the mechanism that computes those attention weights.
The Core Idea
Here's the mechanism: each token's embedding gets transformed into three different vectors—Query, Key, and Value. Then we use simple math (dot products) to figure out which tokens should share information.
For "two plus three", each token's embedding is transformed by 3 weight matrices:
Token      Embedding           ×W_q          ×W_k          ×W_v
─────      ─────────           ────          ────          ────
"two"      [0.8, 0.1, ...]  →  Query_two     Key_two       Value_two
"plus"     [0.1, 0.9, ...]  →  Query_plus    Key_plus      Value_plus
"three"    [0.7, 0.2, ...]  →  Query_three   Key_three     Value_three
Same W_q, W_k, W_v matrices for all tokens.
Different embeddings in → different Q, K, V out.

Now the "matching": we compare each Query against all Keys using a dot product:
Query_plus · Key_two = 2.1 (high similarity → attend!)
Query_plus · Key_plus = 0.3 (low similarity → ignore)
Query_plus · Key_three = 2.0 (high similarity → attend!)

That's it—no magic "querying" system. Just: if Query and Key vectors point in similar directions, the dot product is high, so attention is high.
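To make that concrete, here's a tiny sketch in Python. The vectors below are made up (real Queries and Keys have hundreds of dimensions and come from learned matrices), but the dot-product logic is exactly this:

```python
import numpy as np

# Hypothetical 4-dimensional Query and Key vectors (made-up numbers).
query_plus = np.array([0.9, 0.1, 0.8, 0.2])   # what "plus" is looking for
key_two    = np.array([1.0, 0.0, 0.9, 0.1])   # what "two" offers
key_plus   = np.array([0.0, 1.0, 0.1, 0.2])   # what "plus" offers
key_three  = np.array([0.9, 0.1, 1.0, 0.0])   # what "three" offers

# Vectors pointing in similar directions give a large dot product.
print(np.dot(query_plus, key_two))     # ~1.64 → high: "plus" attends to "two"
print(np.dot(query_plus, key_plus))    # ~0.22 → low:  "plus" mostly ignores itself
print(np.dot(query_plus, key_three))   # ~1.62 → high: "plus" attends to "three"
```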
What Do Q, K, V Represent?
| Vector | What it encodes | For "plus" in our calculator |
|---|---|---|
| Query (Q) | What kind of information this token needs | "I need number tokens to add" |
| Key (K) | What kind of information this token has | "I'm an addition operation" |
| Value (V) | The actual content to share | "Here's my addition semantics" |
Each embedding is transformed into these three roles:
"plus" embedding: [0.67, 0.12, ...]
|
×W_q ×W_k ×W_v
| | |
v v v
Query Key Value
[needs numbers] [I'm an op] [add stuff]But How Does "plus" Know What to Ask?
It doesn't—at least not at first! The model learns what questions to ask through training. Here's how:
Each token's Query and Key come from learned weight matrices:
Q = embedding × W_q (W_q transforms embeddings into queries)
K = embedding × W_k (W_k transforms embeddings into keys)
V = embedding × W_v (W_v transforms embeddings into values)
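Those three lines are just matrix multiplications. Here's a minimal sketch, assuming a toy 4-dimensional embedding and random (i.e. untrained) matrices:

```python
import numpy as np

d = 4                                     # toy embedding size (real models use hundreds+)
rng = np.random.default_rng(0)

embedding_plus = rng.standard_normal(d)   # stand-in embedding for "plus"
W_q = rng.standard_normal((d, d))         # initially random: "plus" asks random questions
W_k = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d))

Q_plus = embedding_plus @ W_q             # Query: what "plus" is looking for
K_plus = embedding_plus @ W_k             # Key:   what "plus" offers to other tokens
V_plus = embedding_plus @ W_v             # Value: the content "plus" will share
```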
Initially, these matrices are random—"plus" asks random questions!
But during training:
- Model sees "two plus three" → predicts "seven" (wrong!)
- Correct answer is "five"
- Model adjusts W_q so "plus" learns to ask for numbers
- Model adjusts W_k so numbers learn to advertise "I'm a number"
- After thousands of examples, the matrices encode useful patterns (see the sketch below)
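Here's a hypothetical sketch of that loop in PyTorch. Everything in it is a stand-in for illustration: real models predict the next token over a vocabulary rather than regressing the number 5, and the readout head and dimensions are invented here. But the shape of the loop (forward pass → loss → gradients → update W_q, W_k, W_v) is the same.

```python
import torch

# Toy setup: 3 tokens ("two", "plus", "three") with 4-dimensional embeddings.
torch.manual_seed(0)
d = 4
embeddings = torch.randn(3, d)        # stand-in embeddings for "two plus three"
target = torch.tensor([5.0])          # the answer we want the model to produce

# The projection matrices start out random: "plus" asks random questions.
W_q = torch.randn(d, d, requires_grad=True)
W_k = torch.randn(d, d, requires_grad=True)
W_v = torch.randn(d, d, requires_grad=True)
readout = torch.randn(d, 1, requires_grad=True)   # hypothetical output head

optimizer = torch.optim.Adam([W_q, W_k, W_v, readout], lr=0.01)

for step in range(500):
    Q, K, V = embeddings @ W_q, embeddings @ W_k, embeddings @ W_v
    scores = Q @ K.T / d ** 0.5                 # compare every Query to every Key
    weights = torch.softmax(scores, dim=-1)     # each row sums to 1
    output = weights @ V                        # weighted sums of Values
    prediction = output[1] @ readout            # read the answer off the "plus" position
    loss = (prediction - target).pow(2).mean()  # wrong answer → large loss

    optimizer.zero_grad()
    loss.backward()        # gradients flow back into W_q, W_k, W_v
    optimizer.step()       # nudge the matrices toward more useful patterns
```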
The Complete Flow

Once trained, here's how attention computes those weights we saw earlier:
Step 1: Transform embeddings → Q, K, V (using learned W_q, W_k, W_v)
Q_plus = embedding_plus × W_q (learns to ask for operands)
K_two = embedding_two × W_k (learns to advertise "I'm a number")
V_two = embedding_two × W_v (carries the actual number info)
Step 2: Compare "plus" Query to all Keys (dot product):
score("plus" → "two") = Q_plus · K_two = high (match!)
score("plus" → "plus") = Q_plus · K_plus = low (no match)
score("plus" → "three") = Q_plus · K_three = high (match!)
Step 3: Convert scores to weights (softmax):
weight("two") = 0.45
weight("plus") = 0.10
weight("three") = 0.45
Step 4: Gather weighted sum of Values:
output = 0.45 × V_two + 0.10 × V_plus + 0.45 × V_three
→ "plus" now knows it's adding 2 and 3!This is exactly what we experimented with in the Attention Game—but now you know how those weights are computed!
Visualizing Attention: The Attention Matrix
When we compute attention for all tokens at once, we get a matrix showing who attends to whom:
                   Keys (what each token offers)
                     "two"    "plus"    "three"
                  +---------+---------+---------+
         "two"    |   0.8   |   0.1   |   0.1   |  ← "two" mostly looks at itself
                  +---------+---------+---------+
Queries  "plus"   |  0.45   |  0.10   |  0.45   |  ← "plus" looks at both numbers!
                  +---------+---------+---------+
         "three"  |   0.1   |   0.1   |   0.8   |  ← "three" mostly looks at itself
                  +---------+---------+---------+

Each row sums to 1.0 (it's a probability distribution).

The middle row is the key insight: "plus" distributes its attention equally between the two numbers. This is what enables it to compute the sum.
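If you want to poke at this yourself, here's the same matrix as an array, with the numbers copied from the diagram above:

```python
import numpy as np

# The attention matrix from the diagram (rows = Queries, columns = Keys).
attention = np.array([
    [0.80, 0.10, 0.10],   # "two"   → mostly itself
    [0.45, 0.10, 0.45],   # "plus"  → both numbers
    [0.10, 0.10, 0.80],   # "three" → mostly itself
])

print(attention.sum(axis=1))   # each row sums to 1.0: a probability distribution
print(attention[1])            # the middle row: "plus" splits attention across the numbers
```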