Build Your First LLM from Scratch
Part 1 · Section 1 of 9

What is an LLM?

An LLM (Large Language Model) does one thing: it predicts the next word.

Large because it has billions of parameters (the numbers the model adjusts during training to get better at predictions). Language because it works with text. Model because it's a mathematical function that learns patterns.

That's it. Everything else—chat, code generation, reasoning, translation—emerges from this single capability.

Important: The model doesn't "eat" the input and "spit out" an answer. It appends its prediction to the input. So "two plus three" becomes "two plus three five". This is called autoregressive generation—each new word is added to the sequence, then used to predict the next one.
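This loop can be sketched in a few lines of Python. A toy lookup table stands in for a trained model here — the table and `predict_next_word` are invented for illustration; a real model scores every word in its vocabulary instead:

```python
# Toy "learned patterns": a hand-made table instead of a trained model.
PATTERNS = {
    ("two", "plus", "three"): "five",
}

def predict_next_word(sequence):
    # Fall back to a period when no pattern matches.
    return PATTERNS.get(tuple(sequence), ".")

def generate(prompt, max_new_words=5):
    sequence = prompt.split()
    for _ in range(max_new_words):
        next_word = predict_next_word(sequence)
        sequence.append(next_word)  # append the prediction to the input...
        if next_word == ".":
            break                   # ...then feed the longer sequence back in
    return " ".join(sequence)

print(generate("two plus three"))  # prints "two plus three five ."
```

Note that `generate` never returns just the new words: the output is always the input plus whatever got appended, exactly as described above.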

The Core Insight

When you ask an LLM "What is 2+2?", it doesn't "think" or "calculate". It predicts that after the sequence of words "What is 2+2?", the most likely next words are "2+2 equals 4" or simply "4".

It learned this by reading billions of examples where questions were followed by answers.
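"Most likely" is literal: the model assigns a probability to every possible next word and the highest one wins (in the simplest decoding strategy). A tiny sketch with made-up probabilities:

```python
# Made-up probabilities for the next word after "What is 2+2?".
# A real model assigns a probability to every word in its vocabulary.
next_word_probs = {"4": 0.62, "2+2": 0.21, "four": 0.09, "5": 0.02}

most_likely = max(next_word_probs, key=next_word_probs.get)
print(most_likely)  # prints "4"
```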

Common Misconceptions

Misconception              Reality
"It understands"           It predicts patterns
"It thinks"                It does matrix math
"It knows things"          It learned statistical relationships
"It remembers our chat"    Each input is processed fresh*

*Chat history is included in each new prompt, giving the illusion of memory.
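That footnote is worth seeing in code. Below is a sketch of how a chat client creates the memory illusion; `call_model` is a hypothetical stand-in for a real model or API call:

```python
# The model is stateless: the client resends the *entire* conversation
# on every turn, which is what makes it seem to "remember".
def call_model(messages):
    # Hypothetical stand-in for a real model/API call.
    return f"(a reply based on all {len(messages)} messages)"

history = []
for user_turn in ["My name is Ada.", "What is my name?"]:
    history.append({"role": "user", "content": user_turn})
    reply = call_model(history)  # full history sent every time
    history.append({"role": "assistant", "content": reply})

print(len(history))  # prints 4: two user turns, two replies
```

Delete `history` and the "memory" is gone — nothing about the previous turns lives inside the model itself.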

The Analogy

Think of your phone's autocomplete, but trained on the entire internet and scaled up a million times. When you type "How are", your phone suggests "you" because that pattern appears frequently. An LLM does the same thing, just with much longer contexts and much more sophisticated pattern matching.

A Deeper Analogy: The Highway Observer

[Image: a person on a balcony observing a highway, with vehicles approaching a speed breaker in various weather conditions. Caption: "Five years of observation, compressed into patterns."]

Imagine someone watching a highway from their balcony for 5 years — observing vehicles approach a speed breaker in day, night, rain, fog, sparse traffic, dense traffic. They see vehicles slow down, skid, have accidents, trigger chain reactions. After 5 years, they can predict what any new vehicle will do.

How? They haven't stored 5 years of video in their brain. Instead:

  • Tokens: They learned to see scenes as combinations of meaningful units — truck, sedan, fast, slow, rain, fog, dense traffic.
  • Embeddings: Each unit carries rich associations — "truck" means heavy, slow to brake, professional driver, creates blind spots.
  • Attention: For each new vehicle, they focus on what matters — a motorcycle in fog? Watch speed and visibility. A truck in clear weather? Focus on braking distance.
  • Weights: They've compressed millions of observations into patterns like "trucks + rain + late braking → high skid risk."
  • Generation: They combine these patterns to predict: "This sedan will slow gradually, 85% smooth passage."

The observer couldn't explain why they know a bus in fog is dangerous — the knowledge is distributed across learned patterns, activated through attention, expressed as predictions. This is exactly how LLMs work.
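The five bullets above map onto a pipeline that can be sketched with toy numbers. Every vector and weight below is invented for illustration (real models learn them from billions of examples), but the flow — tokens, embeddings, attention, learned patterns, prediction — is the same:

```python
import math

# Tokens: the scene broken into meaningful units.
tokens = ["truck", "rain", "late-braking"]

# Embeddings: each unit as a small vector of (made-up) associations.
embeddings = {
    "truck":        [0.9, 0.1, 0.8],
    "rain":         [0.2, 0.9, 0.4],
    "late-braking": [0.7, 0.6, 0.9],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Attention: score how relevant each unit is to the latest observation,
# then normalize the scores into weights that sum to 1 (a softmax).
query = embeddings["late-braking"]
scores = [dot(embeddings[t], query) for t in tokens]
exps = [math.exp(s) for s in scores]
attn = [e / sum(exps) for e in exps]

# Combine the attended units into one context vector.
context = [sum(w * embeddings[t][i] for w, t in zip(attn, tokens))
           for i in range(3)]

# Weights + generation: score candidate outcomes (the observer's
# compressed patterns, also made up here) against the context.
outcome_patterns = {
    "high skid risk": [0.8, 0.7, 0.9],
    "smooth passage": [0.1, 0.2, 0.1],
}
prediction = max(outcome_patterns, key=lambda k: dot(outcome_patterns[k], context))
print(prediction)  # prints "high skid risk"
```

With "truck", "rain", and "late-braking" all present, the context vector sits closest to the skid-risk pattern — the prediction falls out of the numbers, with no rule anywhere that says "buses in fog are dangerous".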

Key takeaway: An LLM is a very sophisticated pattern-matching machine that predicts what text should come next.
