How Tokens Decide Who to Attend To
In the previous section, we saw that "plus" attends to "two" and "three" with weights like 0.45 each. But how does "plus" know to look at those tokens? That's where Q, K, V come in—they're the mechanism that computes those attention weights.
The Core Idea
Here's the mechanism: each token's embedding gets transformed into three different vectors—Query, Key, and Value. Then we use simple math (dot products) to figure out which tokens should share information.
For "two plus three", each token's embedding is transformed by 3 weight matrices:
Token      Embedding           ×W_q          ×W_k          ×W_v
─────      ─────────           ────          ────          ────
"two"      [0.8, 0.1, ...]  →  Query_two     Key_two       Value_two
"plus"     [0.1, 0.9, ...]  →  Query_plus    Key_plus      Value_plus
"three"    [0.7, 0.2, ...]  →  Query_three   Key_three     Value_three
Same W_q, W_k, W_v matrices for all tokens.
Different embeddings in → different Q, K, V out.

Now the "matching": we compare each Query against all Keys using a dot product:
Query_plus · Key_two = 2.1 (high similarity → attend!)
Query_plus · Key_plus = 0.3 (low similarity → ignore)
Query_plus · Key_three = 2.0 (high similarity → attend!)

That's it—no magic "querying" system. Just: if Query and Key vectors point in similar directions, the dot product is high, so attention is high.
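To make that concrete, here's a tiny sketch in Python. The vectors below are made up (real Queries and Keys have hundreds of dimensions and come from learned matrices), but the dot-product logic is exactly this:

```python
import numpy as np

# Hypothetical 4-dimensional Query and Key vectors (made-up numbers).
query_plus = np.array([0.9, 0.1, 0.8, 0.2])   # what "plus" is looking for
key_two    = np.array([1.0, 0.0, 0.9, 0.1])   # what "two" offers
key_plus   = np.array([0.0, 1.0, 0.1, 0.2])   # what "plus" offers
key_three  = np.array([0.9, 0.1, 1.0, 0.0])   # what "three" offers

# Vectors pointing in similar directions give a large dot product.
print(np.dot(query_plus, key_two))     # ~1.64 → high: "plus" attends to "two"
print(np.dot(query_plus, key_plus))    # ~0.22 → low:  "plus" mostly ignores itself
print(np.dot(query_plus, key_three))   # ~1.62 → high: "plus" attends to "three"
```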
What Do Q, K, V Represent?
| Vector | What it encodes | For "plus" in our calculator |
|---|---|---|
| Query (Q) | What kind of information this token needs | "I need number tokens to add" |
| Key (K) | What kind of information this token has | "I'm an addition operation" |
| Value (V) | The actual content to share | "Here's my addition semantics" |
Each embedding is transformed into these three roles:
"plus" embedding: [0.67, 0.12, ...]
|
×W_q ×W_k ×W_v
| | |
v v v
Query Key Value
[needs numbers] [I'm an op] [add stuff]But How Does "plus" Know What to Ask?
It doesn't—at least not at first! The model learns what questions to ask through training. Here's how:
Each token's Query and Key come from learned weight matrices:
Q = embedding × W_q (W_q transforms embeddings into queries)
K = embedding × W_k (W_k transforms embeddings into keys)
V = embedding × W_v (W_v transforms embeddings into values)
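Those three lines are just matrix multiplications. Here's a minimal sketch, assuming a toy 4-dimensional embedding and random (i.e. untrained) matrices:

```python
import numpy as np

d = 4                                     # toy embedding size (real models use hundreds+)
rng = np.random.default_rng(0)

embedding_plus = rng.standard_normal(d)   # stand-in embedding for "plus"
W_q = rng.standard_normal((d, d))         # initially random: "plus" asks random questions
W_k = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d))

Q_plus = embedding_plus @ W_q             # Query: what "plus" is looking for
K_plus = embedding_plus @ W_k             # Key:   what "plus" offers to other tokens
V_plus = embedding_plus @ W_v             # Value: the content "plus" will share
```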
Initially, these matrices are random—"plus" asks random questions!
But during training:
- Model sees "two plus three" → predicts "seven" (wrong!)
- Correct answer is "five"
- Model adjusts W_q so "plus" learns to ask for numbers
- Model adjusts W_k so numbers learn to advertise "I'm a number"
- After thousands of examples, the matrices encode useful patterns (see the sketch below)
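Here's a hypothetical sketch of that loop in PyTorch. Everything in it is a stand-in for illustration: real models predict the next token over a vocabulary rather than regressing the number 5, and the readout head and dimensions are invented here. But the shape of the loop (forward pass → loss → gradients → update W_q, W_k, W_v) is the same.

```python
import torch

# Toy setup: 3 tokens ("two", "plus", "three") with 4-dimensional embeddings.
torch.manual_seed(0)
d = 4
embeddings = torch.randn(3, d)        # stand-in embeddings for "two plus three"
target = torch.tensor([5.0])          # the answer we want the model to produce

# The projection matrices start out random: "plus" asks random questions.
W_q = torch.randn(d, d, requires_grad=True)
W_k = torch.randn(d, d, requires_grad=True)
W_v = torch.randn(d, d, requires_grad=True)
readout = torch.randn(d, 1, requires_grad=True)   # hypothetical output head

optimizer = torch.optim.Adam([W_q, W_k, W_v, readout], lr=0.01)

for step in range(500):
    Q, K, V = embeddings @ W_q, embeddings @ W_k, embeddings @ W_v
    scores = Q @ K.T / d ** 0.5                 # compare every Query to every Key
    weights = torch.softmax(scores, dim=-1)     # each row sums to 1
    output = weights @ V                        # weighted sums of Values
    prediction = output[1] @ readout            # read the answer off the "plus" position
    loss = (prediction - target).pow(2).mean()  # wrong answer → large loss

    optimizer.zero_grad()
    loss.backward()        # gradients flow back into W_q, W_k, W_v
    optimizer.step()       # nudge the matrices toward more useful patterns
```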
The Complete Flow

Once trained, here's how attention computes those weights we saw earlier:
Step 1: Transform embeddings → Q, K, V (using learned W_q, W_k, W_v)
Q_plus = embedding_plus × W_q (learns to ask for operands)
K_two = embedding_two × W_k (learns to advertise "I'm a number")
V_two = embedding_two × W_v (carries the actual number info)
Step 2: Compare "plus" Query to all Keys (dot product):
score("plus" → "two") = Q_plus · K_two = high (match!)
score("plus" → "plus") = Q_plus · K_plus = low (no match)
score("plus" → "three") = Q_plus · K_three = high (match!)
Step 3: Convert scores to weights (softmax):
weight("two") = 0.45
weight("plus") = 0.10
weight("three") = 0.45
Step 4: Gather weighted sum of Values:
output = 0.45 × V_two + 0.10 × V_plus + 0.45 × V_three
→ "plus" now knows it's adding 2 and 3!This is exactly what we experimented with in the Attention Game—but now you know how those weights are computed!
Visualizing Attention: The Attention Matrix
When we compute attention for all tokens at once, we get a matrix showing who attends to whom:
                   Keys (what each token offers)
                     "two"    "plus"    "three"
                  +---------+---------+---------+
         "two"    |   0.8   |   0.1   |   0.1   |  ← "two" mostly looks at itself
                  +---------+---------+---------+
Queries  "plus"   |  0.45   |  0.10   |  0.45   |  ← "plus" looks at both numbers!
                  +---------+---------+---------+
         "three"  |   0.1   |   0.1   |   0.8   |  ← "three" mostly looks at itself
                  +---------+---------+---------+

Each row sums to 1.0 (it's a probability distribution).

The middle row is the key insight: "plus" distributes its attention equally between the two numbers. This is what enables it to compute the sum.
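If you want to poke at this yourself, here's the same matrix as an array, with the numbers copied from the diagram above:

```python
import numpy as np

# The attention matrix from the diagram (rows = Queries, columns = Keys).
attention = np.array([
    [0.80, 0.10, 0.10],   # "two"   → mostly itself
    [0.45, 0.10, 0.45],   # "plus"  → both numbers
    [0.10, 0.10, 0.80],   # "three" → mostly itself
])

print(attention.sum(axis=1))   # each row sums to 1.0: a probability distribution
print(attention[1])            # the middle row: "plus" splits attention across the numbers
```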