Step 4: Transformer Layers (Attention)

Imagine you're reading a sentence and someone asks you what it means. You don't look at each word separately—you naturally consider how words relate to each other.
Right now, our model has three separate vectors for "two", "plus", and "three". But these vectors are like three people in separate rooms—they can't talk to each other. To solve "two plus three", the model needs to understand the relationship between these words.
This is what "attention" does. Think of it like a group discussion:

Imagine the words sitting in a meeting room:
"plus": Hey everyone, I'm an operation. Who am I working with?
"two": I'm a number! I'm sitting before you.
"three": I'm also a number! I'm sitting after you.
"plus": Got it—I need to ADD "two" and "three" together.In technical terms, each word asks a question ("what's relevant to me?") and every other word offers an answer ("here's my information"). The model assigns an attention weight to each pair—a score from 0 to 1 indicating importance.
When "plus" looks at the other words, it assigns high weights (0.8) to "two" and "three" because they're relevant, and low weights (0.1) to irrelevant words. These weights determine how much each word influences the final understanding.
After this "discussion", each word's vector gets updated with information from the others:
| Word | Before attention | After attention |
|---|---|---|
| "two" | I'm the number 2 | I'm the number 2, AND I'm being added to something |
| "plus" | I'm an addition operation | I'm adding the number before me to the number after me |
| "three" | I'm the number 3 | I'm the number 3, AND I'm being added to something |
The transformer repeats this "discussion" through multiple layers (we'll use 2-4). Each round of discussion refines the understanding, as the sketch after this list shows:
- Layer 1: Basic relationships ("plus connects two numbers")
- Layer 2: Deeper understanding ("we're computing 2 + 3")
- Layer 3: Final insight ("the answer should be 5")
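If you're building this in PyTorch, stacking those rounds can be as simple as the snippet below. The specific sizes (32-dimensional vectors, 4 attention heads, 3 layers) are just illustrative choices for a tiny model, not the only ones that work:

```python
import torch
import torch.nn as nn

# One "discussion round" = one transformer encoder layer.
layer = nn.TransformerEncoderLayer(
    d_model=32,          # size of each word's vector (illustrative)
    nhead=4,             # 4 parallel "discussions" (attention heads)
    dim_feedforward=64,  # size of the per-word processing step
    batch_first=True,    # input shape: (batch, words, vector size)
)

# Stack 3 of them: 3 rounds of refinement.
encoder = nn.TransformerEncoder(layer, num_layers=3)

# Fake input: 1 sentence, 3 words ("two plus three"), 32-dim vectors.
x = torch.randn(1, 3, 32)
out = encoder(x)
print(out.shape)  # torch.Size([1, 3, 32]) -- same shape, more context
```

Each layer runs the "group discussion" (attention) plus a small feed-forward step, and the output has the same shape as the input: one vector per word, now enriched with context.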
Here's a simplified view of how learning works:

Training example #1:
Input: "one plus two" → Model guesses: "seven" (wrong!)
Correct answer: "three"
Model adjusts: "Hmm, I should pay more attention to 'one' and 'two' when I see 'plus'"
Training example #2:
Input: "four plus three" → Model guesses: "six" (closer!)
Correct answer: "seven"
Model adjusts: "I'm getting better at addition, but need more practice"
...after 1000 examples...
Training example #1000:
Input: "five plus one" → Model guesses: "six" (correct!)
Model has learned: when "plus" appears, add the numbers around it

Each wrong answer nudges the model's internal numbers (weights) slightly. After thousands of nudges, the model has "learned" that 'plus' means addition—without us ever explicitly programming that rule.
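In code, that nudge-after-every-mistake loop has a very regular shape. The sketch below uses PyTorch and a deliberately oversimplified stand-in model (a hypothetical one, not our actual transformer) just to show where the nudging happens: measure how wrong the guess was (the loss), work out which direction to move each weight (backward), and take a tiny step (the optimizer):

```python
import torch
import torch.nn as nn

vocab = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "plus": 8}

model = nn.Sequential(               # stand-in model, for illustration only
    nn.Embedding(len(vocab), 16),    # word -> vector
    nn.Flatten(),                    # 3 word vectors -> one long vector
    nn.Linear(3 * 16, len(vocab)),   # -> a score for every word in the vocab
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

examples = [(["one", "plus", "two"], "three"),
            (["four", "plus", "three"], "seven")]

for epoch in range(1000):                      # thousands of small nudges
    for tokens, answer in examples:
        x = torch.tensor([[vocab[t] for t in tokens]])
        target = torch.tensor([vocab[answer]])
        loss = loss_fn(model(x), target)       # how wrong was the guess?
        optimizer.zero_grad()
        loss.backward()                        # which way to nudge each weight
        optimizer.step()                       # nudge the weights slightly
```

Every pass through that loop is one of the "Hmm, I should pay more attention to..." adjustments from the story above, just expressed as arithmetic on the weights.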
It's like teaching a child math: you don't explain the neural pathways in their brain—you just show them "2 + 3 = 5" enough times, and their brain figures out the patterns.