Build Your First LLM from Scratch
Part 3 · Section 1 of 13

What We'll Build

In this part, we'll build the complete input pipeline that converts text into vectors the transformer can process:

| Stage | Output |
| --- | --- |
| Input | "two plus three" |
| ↓ Tokenizer | [5, 31, 6] |
| ↓ Embedding | 3 vectors of 64 numbers each |
| ↓ + Positions | 3 position-aware vectors |
| Ready for Part 4 | Transformer input |
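
The code below walks that table top to bottom. It's a minimal sketch assuming PyTorch; the token ids, the 64-dimension width, and the ~36-token vocabulary size echo the numbers above, while the specific id assignments and variable names are placeholders we'll replace with real implementations in the coming sections.

```python
import torch
import torch.nn as nn

# Hypothetical word-level vocabulary; the ids match the table's example.
vocab = {"two": 5, "plus": 31, "three": 6}
vocab_size = 36   # ~36 tokens total, per the comparison below
d_model = 64      # embedding width from the table
max_len = 16      # longest sequence this sketch supports

tok_emb = nn.Embedding(vocab_size, d_model)  # token id -> 64-dim vector
pos_emb = nn.Embedding(max_len, d_model)     # position -> 64-dim vector

text = "two plus three"
ids = torch.tensor([vocab[w] for w in text.split()])  # tensor([5, 31, 6])
positions = torch.arange(len(ids))                    # tensor([0, 1, 2])

x = tok_emb(ids) + pos_emb(positions)  # three position-aware vectors
print(x.shape)                         # torch.Size([3, 64])
```

That final `x` is the "Transformer input" row of the table: one 64-dimensional, position-aware vector per token.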

Here's what we'll build and how it compares to production models:

| Section | What We Do | At Scale (GPT-4, LLaMA) |
| --- | --- | --- |
| Vocabulary | ~36 tokens (manual) | ~100K tokens using BPE |
| Tokenizer | Word split + lookup | Subword tokenization |
| Embeddings | 64 dimensions | 12,288 dimensions |
| Positions | Learned embeddings | RoPE |
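
To make the tokenizer row concrete, here's what the production column looks like in practice. This snippet uses the tiktoken library purely for contrast (our choice of library for illustration; it isn't part of what we build here):

```python
import tiktoken

# GPT-4's BPE vocabulary; encode() returns subword ids, not word ids.
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("two plus three"))  # a handful of subword token ids
print(enc.n_vocab)                   # vocabulary size, roughly 100K
```

A subword tokenizer can encode any string, including words it has never seen, which is why production models prefer it over a fixed word list.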