Build Your First LLM from Scratch
Part 3 · Section 1 of 13

What We'll Build

In this part, we'll build the complete input pipeline that converts text into vectors the transformer can process:

| Stage | Output |
| --- | --- |
| Input | "two plus three" |
| ↓ Tokenizer | [5, 31, 6] |
| ↓ Embedding | 3 vectors of 64 numbers each |
| ↓ + Positions | 3 position-aware vectors |
| Ready for Part 4 | Transformer input |
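
The code below walks that table top to bottom. It's a minimal sketch assuming PyTorch; the token ids, the 64-dimension width, and the ~36-token vocabulary size echo the numbers above, while the specific id assignments and variable names are placeholders we'll replace with real implementations in the coming sections.

```python
import torch
import torch.nn as nn

# Hypothetical word-level vocabulary; the ids match the table's example.
vocab = {"two": 5, "plus": 31, "three": 6}
vocab_size = 36   # ~36 tokens total, per the comparison below
d_model = 64      # embedding width from the table
max_len = 16      # longest sequence this sketch supports

tok_emb = nn.Embedding(vocab_size, d_model)  # token id -> 64-dim vector
pos_emb = nn.Embedding(max_len, d_model)     # position -> 64-dim vector

text = "two plus three"
ids = torch.tensor([vocab[w] for w in text.split()])  # tensor([5, 31, 6])
positions = torch.arange(len(ids))                    # tensor([0, 1, 2])

x = tok_emb(ids) + pos_emb(positions)  # three position-aware vectors
print(x.shape)                         # torch.Size([3, 64])
```

That final `x` is the "Transformer input" row of the table: one 64-dimensional, position-aware vector per token.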

Here's what we'll build and how it compares to production models:

| Section | What We Do | At Scale (GPT-4, LLaMA) |
| --- | --- | --- |
| Vocabulary | ~36 tokens (manual) | ~100K tokens using BPE |
| Tokenizer | Word split + lookup | Subword tokenization |
| Embeddings | 64 dimensions | 12,288 dimensions |
| Positions | Learned embeddings | RoPE |
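
To make the tokenizer row concrete, here's what the production column looks like in practice. This snippet uses the tiktoken library purely for contrast (our choice of library for illustration; it isn't part of what we build here):

```python
import tiktoken

# GPT-4's BPE vocabulary; encode() returns subword ids, not word ids.
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("two plus three"))  # a handful of subword token ids
print(enc.n_vocab)                   # vocabulary size, roughly 100K
```

A subword tokenizer can encode any string, including words it has never seen, which is why production models prefer it over a fixed word list.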