Build Your First LLM from Scratch · Part 3 · Section 2 of 13
Building the Vocabulary
A vocabulary is simply a mapping from words to numbers. Every word the model knows gets a unique ID.
Our Vocabulary
For our calculator, we manually list all 36 words:
```python
vocabulary = {
    # Special tokens
    "[PAD]": 0,    # Padding for batch processing
    "[START]": 1,  # Start of sequence
    "[END]": 2,    # End of sequence
    # Numbers 0-19
    "zero": 3, "one": 4, "two": 5, "three": 6, "four": 7,
    "five": 8, "six": 9, "seven": 10, "eight": 11, "nine": 12,
    "ten": 13, "eleven": 14, "twelve": 15, "thirteen": 16,
    "fourteen": 17, "fifteen": 18, "sixteen": 19, "seventeen": 20,
    "eighteen": 21, "nineteen": 22,
    # Tens
    "twenty": 23, "thirty": 24, "forty": 25, "fifty": 26,
    "sixty": 27, "seventy": 28, "eighty": 29, "ninety": 30,
    # Operations
    "plus": 31, "minus": 32, "times": 33, "divided": 34, "by": 35,
}
```

Each word becomes its ID: "two" → 5, "plus" → 31, "three" → 6.
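To make the mapping concrete, here is a minimal sketch of encoding a prompt by plain dictionary lookup. The reverse map `id_to_word` is our own helper name for illustration, not something defined elsewhere in the series:

```python
# Minimal sketch: encode a prompt by direct dictionary lookup.
# id_to_word is a hypothetical helper for decoding, not part of the series' code.
id_to_word = {i: w for w, i in vocabulary.items()}

words = "two plus three".split()
ids = [vocabulary[w] for w in words]
print(ids)                           # [5, 31, 6]
print([id_to_word[i] for i in ids])  # ['two', 'plus', 'three']
print(len(vocabulary))               # 36 entries in total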
Special tokens: [START] and [END] mark sequence boundaries—our tokenizer adds these automatically. The model learns that [END] means "stop generating." [PAD] fills shorter sequences when batching.
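As a rough sketch of what "adds these automatically" means, the hypothetical `encode_example` below wraps a prompt in [START]/[END] and pads it to a fixed length; the tokenizer built later in this part may differ in its details:

```python
# Hypothetical helper, for illustration only: wrap a prompt in special tokens and pad it.
def encode_example(text: str, max_len: int = 8) -> list[int]:
    ids = [vocabulary["[START]"]]
    ids += [vocabulary[w] for w in text.split()]
    ids.append(vocabulary["[END]"])
    # [PAD] (ID 0) fills the remaining slots so sequences in a batch share one length.
    ids += [vocabulary["[PAD]"]] * (max_len - len(ids))
    return ids

print(encode_example("two plus three"))
# [1, 5, 31, 6, 2, 0, 0, 0]
```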
At Scale: BPE Tokenization
Real models like GPT-4 and LLaMA use Byte Pair Encoding (BPE) to automatically build vocabularies of ~100,000 tokens:
- Words split into subwords: "unhappiness" → ["un", "happi", "ness"]
- Handles any word, any language, even misspellings
- Vocabulary learned from training corpus, not manually created
- Tools: `sentencepiece`, `tiktoken`, Hugging Face `tokenizers`
Why subwords? With word-level tokens, "running" and "runs" are completely different. With subwords, both share "run" and the model learns they're related.
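If you want to see BPE splits in practice, a quick experiment with `tiktoken` (assuming it is installed via `pip install tiktoken`) looks like the sketch below; the exact pieces you get depend on the merges learned from the training corpus:

```python
# Assumes: pip install tiktoken. cl100k_base is the encoding used by GPT-4-era models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("unhappiness")
print(ids)                             # a short list of token IDs
print([enc.decode([i]) for i in ids])  # the subword pieces for this word
```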