Build Your First LLM from Scratch
Part 4 · Section 1 of 7

What We'll Build

In Part 3, we created embeddings—vectors that represent each token. But there's a problem: each token is isolated. The word "plus" doesn't know it sits between "two" and "three".

Attention solves this by letting each token "look at" every other token and gather relevant information.

Embeddings from Part 3
       ↓
   [Self-Attention]
       ↓
   Each token now "knows about" other tokens
       ↓
   [Multi-Head Attention]
       ↓
   Multiple perspectives combined
       ↓
   Ready for Transformer Block (Part 5)
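
As a rough preview of where this part ends up, the sketch below runs placeholder embeddings through PyTorch's built-in torch.nn.MultiheadAttention as a stand-in for the module we will build by hand in Sections 4.2 through 4.7. The random embedding values, the batch size of 1, and the choice of 4 heads are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn

# Stand-in for the embeddings from Part 3: 1 sequence, 3 tokens, 64 numbers each.
embeddings = torch.randn(1, 3, 64)

# PyTorch's built-in module, used here only as a preview of the module we will
# write ourselves; 4 heads is an arbitrary choice for this sketch.
attention = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

# Self-attention: the same tensor serves as query, key, and value.
output, weights = attention(embeddings, embeddings, embeddings)

print(output.shape)   # torch.Size([1, 3, 64]) -- same shape as the input, but each
                      # token now mixes in information from the other tokens
print(weights.shape)  # torch.Size([1, 3, 3])  -- one attention weight per
                      # (token, token) pair, averaged over the heads
```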

The Problem Attention Solves

After Part 3, we have embeddings for "two plus three":

"two"   → [0.23, -0.45, ...]  (64 numbers)
"plus"  → [0.67, 0.12, ...]   (64 numbers)
"three" → [0.89, -0.34, ...]  (64 numbers)

These vectors are computed independently: nothing in the vector for "plus" reflects that it sits between "two" and "three". Attention lets each token gather information from the other tokens in the sequence.
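
To make "gather information" concrete, here is a minimal sketch of the core idea using placeholder random embeddings; the real mechanism, with learned Query, Key, and Value projections, arrives in Sections 4.2 and 4.3. We score "plus" against every token, turn the scores into weights with softmax, and take a weighted average of the embedding vectors.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings for "two plus three": 3 tokens, 64 numbers each.
embeddings = torch.randn(3, 64)
plus = embeddings[1]  # the isolated vector for "plus"

# Score "plus" against every token (including itself) by dot product,
# then normalize the scores into weights that sum to 1.
scores = embeddings @ plus          # shape: (3,)
weights = F.softmax(scores, dim=0)  # three weights that sum to 1

# The new "plus" vector is a weighted average of all three embeddings,
# so it now carries information about "two" and "three" as well.
new_plus = weights @ embeddings     # shape: (64,)
```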

Sections Overview

Section  What We Build                  At Scale
4.1      Intuition: What is attention?  Same concept
4.2      Query, Key, Value              Same, larger matrices
4.3      Attention scores               Scaled dot-product
4.4      Single-head attention          Same pattern
4.5      Multi-head attention           96+ heads in GPT-4
4.6      Masked attention               Causal masking
4.7      Complete attention module      Same pattern