Build Your First LLM from Scratch
Part 4 · Section 1 of 7

What We'll Build

In Part 3, we created embeddings—vectors that represent each token. But there's a problem: each token is isolated. The word "plus" doesn't know it sits between "two" and "three".

Attention solves this by letting each token "look at" every other token and gather relevant information.

Embeddings from Part 3
       ↓
   [Self-Attention]
       ↓
   Each token now "knows about" other tokens
       ↓
   [Multi-Head Attention]
       ↓
   Multiple perspectives combined
       ↓
   Ready for Transformer Block (Part 5)
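
As a rough preview of where this part ends up, the sketch below runs placeholder embeddings through PyTorch's built-in torch.nn.MultiheadAttention as a stand-in for the module we will build by hand in Sections 4.2 through 4.7. The random embedding values, the batch size of 1, and the choice of 4 heads are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn

# Stand-in for the embeddings from Part 3: 1 sequence, 3 tokens, 64 numbers each.
embeddings = torch.randn(1, 3, 64)

# PyTorch's built-in module, used here only as a preview of the module we will
# write ourselves; 4 heads is an arbitrary choice for this sketch.
attention = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

# Self-attention: the same tensor serves as query, key, and value.
output, weights = attention(embeddings, embeddings, embeddings)

print(output.shape)   # torch.Size([1, 3, 64]) -- same shape as the input, but each
                      # token now mixes in information from the other tokens
print(weights.shape)  # torch.Size([1, 3, 3])  -- one attention weight per
                      # (token, token) pair, averaged over the heads
```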

The Problem Attention Solves

After Part 3, we have embeddings for "two plus three":

"two"   → [0.23, -0.45, ...]  (64 numbers)
"plus"  → [0.67, 0.12, ...]   (64 numbers)
"three" → [0.89, -0.34, ...]  (64 numbers)

These vectors are computed independently: nothing in the vector for "plus" reflects that it sits between "two" and "three". Attention lets each token gather information from the other tokens in the sequence.
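
To make "gather information" concrete, here is a minimal sketch of the core idea using placeholder random embeddings; the real mechanism, with learned Query, Key, and Value projections, arrives in Sections 4.2 and 4.3. We score "plus" against every token, turn the scores into weights with softmax, and take a weighted average of the embedding vectors.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings for "two plus three": 3 tokens, 64 numbers each.
embeddings = torch.randn(3, 64)
plus = embeddings[1]  # the isolated vector for "plus"

# Score "plus" against every token (including itself) by dot product,
# then normalize the scores into weights that sum to 1.
scores = embeddings @ plus          # shape: (3,)
weights = F.softmax(scores, dim=0)  # three weights that sum to 1

# The new "plus" vector is a weighted average of all three embeddings,
# so it now carries information about "two" and "three" as well.
new_plus = weights @ embeddings     # shape: (64,)
```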

Sections Overview

Section  What We Build                  At Scale
4.1      Intuition: What is attention?  Same concept
4.2      Query, Key, Value              Same, larger matrices
4.3      Attention scores               Scaled dot-product
4.4      Single-head attention          Same pattern
4.5      Multi-head attention           96+ heads in GPT-4
4.6      Masked attention               Causal masking
4.7      Complete attention module      Same pattern