Build Your First LLM from Scratch · Part 3 · Section 6 of 13

How PyTorch Populates Parameters

So you've chosen embed_dim=64. PyTorch creates a 36×64 table of random numbers: one 64-number row for each of the 36 tokens in the vocabulary. But how do those random numbers become meaningful embeddings? Here's the process, step by step:

import torch.nn as nn

# Step 1: Initialize with random numbers
embedding = nn.Embedding(36, 64)
# PyTorch creates a 36×64 table of random values like:
# "two":   [0.23, -0.45, 0.12, ...]  ← random, meaningless
# "three": [0.89, 0.34, -0.56, ...]  ← random, meaningless
# "five":  [0.12, 0.78, 0.45, ...]   ← random, meaningless

# Step 2: Training loop (this is where the magic happens)
for epoch in range(1000):
    for tokens, expected_answer in training_data:
        # Reset gradients left over from the previous step so they don't accumulate
        optimizer.zero_grad()

        # Forward pass: model makes a prediction
        prediction = model(tokens)          # "two plus three" → the model's guess

        # Compute loss: how wrong was the prediction?
        loss = loss_fn(prediction, expected_answer)  # Compare to "five"

        # Backward pass: compute gradients
        loss.backward()
        # PyTorch asks: "Which numbers in the embedding table
        # caused the most error? How should each one change?"

        # Update: nudge each number in the direction that reduces error
        optimizer.step()
        # "two" embedding: [0.23, -0.45, ...] → [0.25, -0.44, ...]
        # Small nudge based on gradient

# After 1000s of examples:
# "two":   [0.82, 0.45, -0.33, ...]  ← now meaningful!
# "three": [0.79, 0.51, -0.28, ...]  ← similar to "two" (both are numbers)
# "five":  [0.85, 0.48, -0.31, ...]  ← close to two+three's result

The Key Insight

Every time the model gets "two plus three" wrong, PyTorch traces back through all the numbers that contributed to that wrong answer—including the 64 numbers representing "two", the 64 for "plus", and the 64 for "three". It then nudges each number slightly in the direction that would have made the answer more correct.
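
You can watch this happen directly. The sketch below is standalone, with made-up token ids and a stand-in linear head rather than the full model: it runs one forward and backward pass, then inspects the gradient of the embedding table. Only the rows for tokens that actually appeared in the input come back nonzero.

import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(36, 64)
head = nn.Linear(64, 36)                 # stand-in output layer, just for illustration

tokens = torch.tensor([2, 10, 3])        # hypothetical ids for "two", "plus", "three"
target = torch.tensor([5])               # hypothetical id for "five"

pooled = embedding(tokens).mean(dim=0, keepdim=True)      # combine the three vectors
loss = nn.functional.cross_entropy(head(pooled), target)  # how wrong was the guess?
loss.backward()

# The gradient is a 36×64 table too; rows for tokens that never appeared stay all-zero
rows_with_gradient = embedding.weight.grad.norm(dim=1).nonzero().squeeze()
print(rows_with_gradient)   # tensor([ 2,  3, 10]) → only "two", "three", "plus"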

After seeing thousands of examples:

  • Numbers like "two" and "three" end up with similar embeddings (they behave similarly in math)
  • Operators like "plus" and "minus" become distinct from numbers
  • The relationship "two + three = five" gets encoded in how these vectors interact
This is why we call it "learning"—the model discovers that certain numbers should be similar based purely on how they're used, not because we told it "two and three are both numbers."
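
One concrete way to see the first point is to compare rows of the trained table directly. This sketch assumes the trained embedding from the loop above is still in scope, and reuses the same made-up token ids; after training, the cosine similarity between "two" and "three" should come out noticeably higher than between "two" and "plus".

import torch.nn.functional as F

# 'embedding' is the trained nn.Embedding from above; token ids are illustrative
two, three, plus = embedding.weight[2], embedding.weight[3], embedding.weight[10]

print(F.cosine_similarity(two, three, dim=0))   # should be relatively high: both behave like numbers
print(F.cosine_similarity(two, plus, dim=0))    # should be lower: an operator is used differently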