Build Your First LLM from Scratch (Part 2 · Section 3 of 7)

The Dataset

Unlike real LLMs that train on internet text (Wikipedia, books, websites), we'll generate our own dataset programmatically. This is one of the beauties of our calculator project—we don't need to find or download any corpus.

# We write code to generate training examples:
import random

def generate_example():
    a = random.randint(0, 99)  # e.g., 23
    b = random.randint(0, 99)  # e.g., 15
    op = "plus"
    result = a + b             # 38

    # Convert to words (we'll write this helper function):
    # to_words(23) → "twenty three"
    # to_words(38) → "thirty eight"
    return to_words(a) + " " + op + " " + to_words(b), to_words(result)

# Generate 1000 examples in seconds!
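
The code above leans on a to_words helper that we haven't written yet. Here is one minimal sketch covering the 0-99 operand range (the list names ONES and TENS are just illustrative; a sum above 99 would also need a word like "hundred", which we leave aside for now):

# A minimal sketch of the to_words helper for 0-99.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def to_words(n):
    if n < 20:
        return ONES[n]                    # 0-19 are single words
    tens, ones = divmod(n, 10)            # e.g., 38 -> (3, 8)
    if ones == 0:
        return TENS[tens]                 # 40 -> "forty"
    return TENS[tens] + " " + ONES[ones]  # 38 -> "thirty eight"

print(to_words(23))   # "twenty three"
print(to_words(38))   # "thirty eight"
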
Why this matters: Real LLM training requires terabytes of carefully curated text data. Our approach lets us create unlimited, perfectly labeled training examples instantly.

Bonus insight: Our random.randint(0, 99) creates a uniform distribution—every number appears equally. Real language follows a Zipfian (power law) distribution: "1+1" appears millions of times while "97+6" appears once. Our uniform data actually makes training easier and prevents the model from memorizing common cases.
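
If you want to see that uniformity for yourself, a quick illustrative check (not part of the project code) is to count how often each operand value is drawn:

# Illustrative check: operand values come out roughly uniform.
import random
from collections import Counter

counts = Counter(random.randint(0, 99) for _ in range(10_000))

# With 10,000 draws over 100 possible values, each count hovers around 100.
print(min(counts.values()), max(counts.values()))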

Our Vocabulary

Category        Tokens
Numbers 0-19    zero, one, two, ... nineteen
Tens            twenty, thirty, forty, ... ninety
Operations      plus, minus, times, divided, by
Special         [PAD], [START], [END]

That's about 36 tokens in total. Compare this to GPT-4's vocabulary of roughly 100,000 tokens!
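
To see where that count comes from, here is a quick sketch that assembles the vocabulary in code (the list names are just illustrative):

# Assemble the vocabulary from the table above: 3 + 20 + 8 + 5 = 36 tokens.
SPECIAL_TOKENS = ["[PAD]", "[START]", "[END]"]
NUMBER_WORDS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
                "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
                "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS_WORDS = ["twenty", "thirty", "forty", "fifty",
              "sixty", "seventy", "eighty", "ninety"]
OPERATION_WORDS = ["plus", "minus", "times", "divided", "by"]

VOCAB = SPECIAL_TOKENS + NUMBER_WORDS + TENS_WORDS + OPERATION_WORDS
print(len(VOCAB))   # 36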

Tokenization choice: We'll use whole-word tokenization (splitting by spaces). Real LLMs use sub-word tokenization (BPE/WordPiece), where "ninety" might split into "nine" + "##ty". Our approach is simpler and works perfectly for our fixed vocabulary.
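
Here is a rough sketch of what that whole-word tokenizer could look like. The encode/decode names are placeholders, and the vocabulary is rebuilt inline so the snippet runs on its own:

# Whole-word tokenization: split on spaces, map each word to an integer ID.
VOCAB = (
    ["[PAD]", "[START]", "[END]"]
    + ("zero one two three four five six seven eight nine ten eleven "
       "twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
    + "twenty thirty forty fifty sixty seventy eighty ninety".split()
    + "plus minus times divided by".split()
)

WORD_TO_ID = {word: i for i, word in enumerate(VOCAB)}
ID_TO_WORD = {i: word for word, i in WORD_TO_ID.items()}

def encode(text):
    return [WORD_TO_ID[word] for word in text.split()]

def decode(ids):
    return " ".join(ID_TO_WORD[i] for i in ids)

ids = encode("twenty three plus fifteen")
print(ids)           # four integer IDs
print(decode(ids))   # "twenty three plus fifteen"

Because every word our generator can produce is in this fixed vocabulary, splitting on spaces never hits an unknown token, which is exactly why whole-word tokenization is enough here.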

We'll generate 500-1000 examples covering addition, subtraction, multiplication, and division with numbers 0-99.
