Step 5: Output Layer

After the transformer layers, the model has a rich understanding of "two plus three". But it's still just vectors—we need an actual answer.
The output layer asks: "Given everything I've learned about this input, which word in my vocabulary is most likely to come next?"
It scores every word in the vocabulary. For the input "two plus three", the scores for each possible answer look like this:
"zero" → 0.1%
"one" → 0.2%
"two" → 0.3%
"three" → 0.5%
"four" → 2.1%
"five" → 94.2% ← highest!
"six" → 1.8%
"seven" → 0.4%
...
"plus" → 0.01%
"minus" → 0.01%These percentages are called probabilities. They add up to 100% across all words in the vocabulary. The model is saying: "I'm 94.2% confident the answer is 'five'."
The Math Behind It
How does the model convert vectors into probabilities? Two steps:
1. Linear layer: Multiply the final vector by a weight matrix to get a "score" for each word. Higher score = model thinks this word is more likely.
final_vector (64 numbers) × weight_matrix (64 × 30) = scores (30 numbers)
scores = [-2.1, -1.8, -1.5, -0.9, 0.3, 4.2, 0.1, -0.5, ...]
          zero  one   two   three four five six  seven

Where does the weight matrix come from? We create it, and the model learns its values during training.
- 64 = our embedding dimension (the size of each word vector)
- 30 = our vocabulary size (how many words the model knows)
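As a rough illustration of step 1, the whole thing is a single matrix multiplication. The values below are random placeholders, not the real trained numbers:

```python
import numpy as np

embedding_dim = 64  # size of each word vector
vocab_size = 30     # number of words the model knows

# Placeholder values: in the real model, both come out of training.
final_vector = np.random.randn(embedding_dim)               # output of the last transformer layer
weight_matrix = np.random.randn(embedding_dim, vocab_size)  # learned output projection

# One matrix multiplication produces a raw score (a "logit") for every vocabulary word.
scores = final_vector @ weight_matrix
print(scores.shape)  # (30,)
```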
2. Softmax: Convert raw scores into probabilities (0-100%) that sum to 100%. The formula is:
probability(word) = e^(score for word) / sum of e^(all scores)
For "five" with score 4.2:
e^4.2 ≈ 66.7
sum of all e^scores ≈ 70.8
probability = 66.7 / 70.8 ≈ 94.2%

Softmax has a useful property: it exaggerates differences between scores, so the highest-scoring word grabs most of the probability while low-scoring words shrink toward zero. A score of 4.2 vs 0.3 becomes 94.2% vs 2.1%. This makes the model "confident" in its best guess.
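Here is a small sketch of the softmax step, using only the eight example scores above (the other vocabulary words are omitted, so the exact percentages come out slightly different from the rounded values listed earlier):

```python
import numpy as np

words = ["zero", "one", "two", "three", "four", "five", "six", "seven"]
scores = np.array([-2.1, -1.8, -1.5, -0.9, 0.3, 4.2, 0.1, -0.5])

# Softmax: exponentiate each score, then divide by the total so the results
# are all positive and sum to 1 (i.e. 100%).
exp_scores = np.exp(scores)
probs = exp_scores / exp_scores.sum()

for word, p in zip(words, probs):
    print(f"{word}: {p:.1%}")
# "five" lands around 94%; including the rest of the 30-word vocabulary
# in the sum pulls it down slightly to the 94.2% shown above.
```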