Step 4: Transformer Layers (Attention)

Imagine you're reading a sentence and someone asks you what it means. You don't look at each word separately—you naturally consider how words relate to each other.
Right now, our model has three separate vectors for "two", "plus", and "three". But these vectors are like three people in separate rooms—they can't talk to each other. To solve "two plus three", the model needs to understand the relationship between these words.
This is what "attention" does. Think of it like a group discussion:

Imagine the words sitting in a meeting room:
"plus": Hey everyone, I'm an operation. Who am I working with?
"two": I'm a number! I'm sitting before you.
"three": I'm also a number! I'm sitting after you.
"plus": Got it—I need to ADD "two" and "three" together.In technical terms, each word asks a question ("what's relevant to me?") and every other word offers an answer ("here's my information"). The model assigns an attention weight to each pair—a score from 0 to 1 indicating importance.
When "plus" looks at the other words, it assigns high weights (0.8) to "two" and "three" because they're relevant, and low weights (0.1) to irrelevant words. These weights determine how much each word influences the final understanding.
After this "discussion", each word's vector gets updated with information from the others:
| Word | Before attention | After attention |
|---|---|---|
| "two" | I'm the number 2 | I'm the number 2, AND I'm being added to something |
| "plus" | I'm an addition operation | I'm adding the number before me to the number after me |
| "three" | I'm the number 3 | I'm the number 3, AND I'm being added to something |
The transformer repeats this "discussion" through multiple layers (we'll use 2-4). Each round of discussion refines the understanding, as the sketch after this list shows:
- Layer 1: Basic relationships ("plus connects two numbers")
- Layer 2: Deeper understanding ("we're computing 2 + 3")
- Layer 3: Final insight ("the answer should be 5")
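If you're building this in PyTorch, stacking those rounds can be as simple as the snippet below. The specific sizes (32-dimensional vectors, 4 attention heads, 3 layers) are just illustrative choices for a tiny model, not the only ones that work:

```python
import torch
import torch.nn as nn

# One "discussion round" = one transformer encoder layer.
layer = nn.TransformerEncoderLayer(
    d_model=32,          # size of each word's vector (illustrative)
    nhead=4,             # 4 parallel "discussions" (attention heads)
    dim_feedforward=64,  # size of the per-word processing step
    batch_first=True,    # input shape: (batch, words, vector size)
)

# Stack 3 of them: 3 rounds of refinement.
encoder = nn.TransformerEncoder(layer, num_layers=3)

# Fake input: 1 sentence, 3 words ("two plus three"), 32-dim vectors.
x = torch.randn(1, 3, 32)
out = encoder(x)
print(out.shape)  # torch.Size([1, 3, 32]) -- same shape, more context
```

Each layer runs the "group discussion" (attention) plus a small feed-forward step, and the output has the same shape as the input: one vector per word, now enriched with context.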
Here's a simplified view of how learning works:

Training example #1:
Input: "one plus two" → Model guesses: "seven" (wrong!)
Correct answer: "three"
Model adjusts: "Hmm, I should pay more attention to 'one' and 'two' when I see 'plus'"
Training example #2:
Input: "four plus three" → Model guesses: "six" (closer!)
Correct answer: "seven"
Model adjusts: "I'm getting better at addition, but need more practice"
...after 1000 examples...
Training example #1000:
Input: "five plus one" → Model guesses: "six" (correct!)
Model has learned: when "plus" appears, add the numbers around it

Each wrong answer nudges the model's internal numbers (weights) slightly. After thousands of nudges, the model has "learned" that 'plus' means addition—without us ever explicitly programming that rule.
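In code, that nudge-after-every-mistake loop has a very regular shape. The sketch below uses PyTorch and a deliberately oversimplified stand-in model (a hypothetical one, not our actual transformer) just to show where the nudging happens: measure how wrong the guess was (the loss), work out which direction to move each weight (backward), and take a tiny step (the optimizer):

```python
import torch
import torch.nn as nn

vocab = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "plus": 8}

model = nn.Sequential(               # stand-in model, for illustration only
    nn.Embedding(len(vocab), 16),    # word -> vector
    nn.Flatten(),                    # 3 word vectors -> one long vector
    nn.Linear(3 * 16, len(vocab)),   # -> a score for every word in the vocab
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

examples = [(["one", "plus", "two"], "three"),
            (["four", "plus", "three"], "seven")]

for epoch in range(1000):                      # thousands of small nudges
    for tokens, answer in examples:
        x = torch.tensor([[vocab[t] for t in tokens]])
        target = torch.tensor([vocab[answer]])
        loss = loss_fn(model(x), target)       # how wrong was the guess?
        optimizer.zero_grad()
        loss.backward()                        # which way to nudge each weight
        optimizer.step()                       # nudge the weights slightly
```

Every pass through that loop is one of the "Hmm, I should pay more attention to..." adjustments from the story above, just expressed as arithmetic on the weights.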
It's like teaching a child math: you don't explain the neural pathways in their brain—you just show them "2 + 3 = 5" enough times, and their brain figures out the patterns.