# Build Your First LLM from Scratch
Part 3 · Section 1 of 13
## What We'll Build
In this part, we'll build the complete input pipeline that converts text into vectors the transformer can process:
| Stage | Output |
|---|---|
| Input | "two plus three" |
| ↓ Tokenizer | [5, 31, 6] |
| ↓ Embedding | 3 vectors of 64 numbers each |
| ↓ + Positions | 3 position-aware vectors |
| Ready for Part 4 | Transformer input |
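To make the table concrete, here's a minimal sketch of that pipeline in PyTorch. The toy vocabulary, the specific token ids, and the sequence-length cap are illustrative assumptions for this preview, not the exact values we'll settle on in the sections ahead:

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary: word -> token id (ids chosen to match the table)
vocab = {"two": 5, "plus": 31, "three": 6}

def tokenize(text: str) -> list[int]:
    # Stage 1: word split + lookup
    return [vocab[word] for word in text.split()]

token_ids = tokenize("two plus three")            # [5, 31, 6]

d_model = 64                                      # embedding width from the table
token_emb = nn.Embedding(num_embeddings=36, embedding_dim=d_model)
pos_emb = nn.Embedding(num_embeddings=16, embedding_dim=d_model)  # assumed max sequence length

ids = torch.tensor(token_ids)                     # shape: (3,)
positions = torch.arange(len(token_ids))          # [0, 1, 2]

# Stages 2 + 3: look up token vectors, then add position vectors
x = token_emb(ids) + pos_emb(positions)           # shape: (3, 64)
print(x.shape)                                    # torch.Size([3, 64]) -- ready for the transformer
```

We'll build each of these stages by hand over the course of this part.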
Here's what we'll build and how it compares to production models:
| Section | What We Do | At Scale (GPT-4, LLaMA) |
|---|---|---|
| Vocabulary | ~36 tokens (manual) | ~100K tokens using BPE |
| Tokenizer | Word split + lookup | Subword tokenization |
| Embeddings | 64 dimensions | 12,288 dimensions |
| Positions | Learned embeddings | RoPE |
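If you're curious about the at-scale column, OpenAI's `tiktoken` library exposes the `cl100k_base` BPE vocabulary used by GPT-4-era models. This snippet is purely for comparison and assumes `tiktoken` is installed; it's not part of what we build:

```python
# Optional comparison: subword BPE tokenization at production scale.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # ~100K-token BPE vocabulary
print(enc.n_vocab)                          # vocabulary size (~100K)
print(enc.encode("two plus three"))         # subword token ids for the same input
```

Notice the contrast: our tokenizer maps each whole word to one id, while BPE splits text into learned subword pieces so it can represent any string with a fixed vocabulary.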