Step 5: Output Layer

After the transformer layers, the model has a rich understanding of "two plus three". But it's still just vectors—we need an actual answer.
The output layer asks: "Given everything I've learned about this input, which word in my vocabulary is most likely to come next?"
It scores every word in the vocabulary. For the input "two plus three", the scores for each possible answer look like this:
"zero" → 0.1%
"one" → 0.2%
"two" → 0.3%
"three" → 0.5%
"four" → 2.1%
"five" → 94.2% ← highest!
"six" → 1.8%
"seven" → 0.4%
...
"plus" → 0.01%
"minus" → 0.01%These percentages are called probabilities. They add up to 100% across all words in the vocabulary. The model is saying: "I'm 94.2% confident the answer is 'five'."
The Math Behind It
How does the model convert vectors into probabilities? Two steps:
1. Linear layer: Multiply the final vector by a weight matrix to get a "score" for each word. Higher score = model thinks this word is more likely.
final_vector (64 numbers) × weight_matrix (64 × 30) = scores (30 numbers)
scores = [-2.1, -1.8, -1.5, -0.9, 0.3, 4.2, 0.1, -0.5, ...]
          zero  one   two   three four five six  seven

Where does the weight matrix come from? We create it, and the model learns its values during training.
- 64 = our embedding dimension (the size of each word vector)
- 30 = our vocabulary size (how many words the model knows)
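As a rough illustration of step 1, the whole thing is a single matrix multiplication. The values below are random placeholders, not the real trained numbers:

```python
import numpy as np

embedding_dim = 64  # size of each word vector
vocab_size = 30     # number of words the model knows

# Placeholder values: in the real model, both come out of training.
final_vector = np.random.randn(embedding_dim)               # output of the last transformer layer
weight_matrix = np.random.randn(embedding_dim, vocab_size)  # learned output projection

# One matrix multiplication produces a raw score (a "logit") for every vocabulary word.
scores = final_vector @ weight_matrix
print(scores.shape)  # (30,)
```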
2. Softmax: Convert raw scores into probabilities (0-100%) that sum to 100%. The formula is:
probability(word) = e^(score for word) / sum of e^(all scores)
For "five" with score 4.2:
e^4.2 ≈ 66.7
sum of all e^scores ≈ 70.8
probability = 66.7 / 70.8 ≈ 94.2%

Softmax has a useful property: it exaggerates differences between scores, so the highest-scoring word grabs most of the probability while low-scoring words shrink toward zero. A score of 4.2 vs 0.3 becomes 94.2% vs 2.1%. This makes the model "confident" in its best guess.
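Here is a small sketch of the softmax step, using only the eight example scores above (the other vocabulary words are omitted, so the exact percentages come out slightly different from the rounded values listed earlier):

```python
import numpy as np

words = ["zero", "one", "two", "three", "four", "five", "six", "seven"]
scores = np.array([-2.1, -1.8, -1.5, -0.9, 0.3, 4.2, 0.1, -0.5])

# Softmax: exponentiate each score, then divide by the total so the results
# are all positive and sum to 1 (i.e. 100%).
exp_scores = np.exp(scores)
probs = exp_scores / exp_scores.sum()

for word, p in zip(words, probs):
    print(f"{word}: {p:.1%}")
# "five" lands around 94%; including the rest of the 30-word vocabulary
# in the sum pulls it down slightly to the 94.2% shown above.
```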