Part 2 of 8

Part 2: The Project

Understand why we're building a calculator and what the final result looks like

Why a Calculator?

We could teach LLM concepts with any task—chatbot, code generation, translation. But a calculator is perfect for learning. Here's why:

| Criteria | Calculator | Text-to-SQL | Chatbot |
| --- | --- | --- | --- |
| Vocabulary | ~30 words | ~10,000 | ~50,000 |
| Training time | 5-10 min | 2-3 hours | Days |
| Data generation | Trivial | Need dataset | Complex |
| Verify correctness | Easy | Medium | Hard |
Key insight: Same concepts, 100x faster iteration. You'll learn tokenization, attention, and transformers—just on a smaller scale.

You might want to jump straight to code generation or chat. But:

  1. Complexity hides understanding — With a complex task, you can't tell if issues are from your model or your data
  2. Training time kills iteration — Real LLMs take days/weeks to train. Our calculator trains in minutes.
  3. The concepts are identical — Tokenization, attention, transformers—it's all the same, just smaller

Once you understand the calculator, scaling up is straightforward.

The Task

Our model will convert English math phrases into answers written out as English number words:

"two plus three"         → "five"
"seven minus four"       → "three"
"six times eight"        → "forty eight"
"twenty divided by five" → "four"

Notice: both input and output are words, not digits. The model never sees "2 + 3 = 5"—it only sees "two plus three" and learns to predict "five".

The Dataset

Unlike real LLMs that train on internet text (Wikipedia, books, websites), we'll generate our own dataset programmatically. This is one of the beauties of our calculator project—we don't need to find or download any corpus.

# We write code to generate training examples:
import random

def generate_example():
    a = random.randint(0, 99)  # e.g., 23
    b = random.randint(0, 99)  # e.g., 15
    result = a + b             # e.g., 38

    # Convert to words (we'll write this helper function):
    # to_words(23) → "twenty three"
    # to_words(38) → "thirty eight"
    return to_words(a) + " plus " + to_words(b), to_words(result)

# Generate 1000 examples in seconds:
# examples = [generate_example() for _ in range(1000)]
Why this matters: Real LLM training requires terabytes of carefully curated text data. Our approach lets us create unlimited, perfectly labeled training examples instantly.
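
To make that sketch concrete, here is one possible version of the to_words helper for numbers 0-99. It follows the comments in the code above, but it's only an illustration; the real implementation arrives later in the series.

# One possible to_words helper for 0-99 (illustrative sketch).
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def to_words(n):
    """Convert an integer 0-99 into space-separated English words."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]

print(to_words(23))  # "twenty three"
print(to_words(40))  # "forty"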

Bonus insight: Our random.randint(0, 99) creates a uniform distribution—every number appears equally. Real language follows a Zipfian (power law) distribution: "1+1" appears millions of times while "97+6" appears once. Our uniform data actually makes training easier and prevents the model from memorizing common cases.
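
For intuition, here is a small illustrative comparison of the two sampling schemes. The 1/(n+1) weights are just a crude stand-in for a Zipfian distribution, not part of our dataset code.

# Uniform sampling (what we use) vs. a rough Zipfian-style sampler.
import random

numbers = list(range(100))

# Uniform: every operand is equally likely.
uniform_sample = [random.choice(numbers) for _ in range(5)]

# Zipfian-ish: weight each number n by 1/(n+1), so small numbers dominate,
# a crude stand-in for how often they appear in real text.
weights = [1 / (n + 1) for n in numbers]
zipf_sample = random.choices(numbers, weights=weights, k=5)

print("uniform:", uniform_sample)   # spread across 0-99
print("zipf-ish:", zipf_sample)     # mostly small numbers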

Our Vocabulary

| Category | Tokens |
| --- | --- |
| Numbers 0-19 | zero, one, two, ... nineteen |
| Tens | twenty, thirty, forty, ... ninety |
| Operations | plus, minus, times, divided, by |
| Special | [PAD], [START], [END] |

That's roughly 30 tokens total. Compare this to GPT-4's ~100,000 tokens!

Tokenization choice: We'll use whole-word tokenization (splitting by spaces). Real LLMs use sub-word tokenization (BPE/WordPiece), where "ninety" might split into "nine" + "##ty". Our approach is simpler and works perfectly for our fixed vocabulary.
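
As a rough sketch of what whole-word tokenization looks like in code (the vocabulary here is trimmed for brevity, and the real tokenizer.py comes later in the series):

# Illustrative whole-word tokenizer: split on spaces, look up IDs in a fixed vocab.
VOCAB = ["[PAD]", "[START]", "[END]", "zero", "one", "two", "three", "four",
         "five", "plus", "minus"]  # trimmed for brevity
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
ID_TO_TOKEN = {i: tok for tok, i in TOKEN_TO_ID.items()}

def encode(text):
    return [TOKEN_TO_ID[word] for word in text.split()]

def decode(ids):
    return " ".join(ID_TO_TOKEN[i] for i in ids)

print(encode("two plus three"))           # [5, 9, 6]
print(decode(encode("two plus three")))   # "two plus three"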

We'll generate 500-1000 examples covering addition, subtraction, multiplication, and division with numbers 0-99.

Model Specifications

Here's how our tiny model compares to GPT-4:

| Spec | Our Model | GPT-4 |
| --- | --- | --- |
| Parameters | ~1-2 million | ~1.7 trillion |
| Embedding dim | 64-128 | 12,288 |
| Layers | 2-4 | ~96 |
| Vocabulary | ~30 | ~100,000 |
| Max sequence length | ~10-20 tokens | 32k-128k tokens |
| Training time | 5-10 min | Months |

Our model is about a million times smaller—but it uses the exact same architecture!
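
As a sanity check on that parameter figure, here is a rough back-of-the-envelope count. It assumes d_model=128, 4 layers, a 4x feed-forward expansion, and ignores biases and layer norms; the real count depends on the final hyperparameters.

# Rough parameter count for the tiny model (illustrative assumptions).
d_model, n_layers, vocab, seq_len = 128, 4, 30, 20

embeddings = vocab * d_model + seq_len * d_model   # token + positional embeddings
attention  = 4 * d_model * d_model                 # Q, K, V, and output projections
ffn        = 2 * d_model * (4 * d_model)           # two linear layers, 4x expansion
per_layer  = attention + ffn
output     = d_model * vocab                       # final projection to the vocabulary

total = embeddings + n_layers * per_layer + output
print(f"~{total:,} parameters")   # ~796,672: the same order of magnitude as ~1-2M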

What It Can & Cannot Do

Can Do

  • Basic arithmetic with numbers 0-99
  • Single operations (one plus, minus, times, or divided by)
  • Output results as English words

Cannot Do

  • Numbers above 99
  • Chained operations ("two plus three minus one")
  • Decimals or fractions
  • Parentheses or order of operations

These limitations are by design. We're building a learning tool, not a production calculator. The concepts transfer directly to larger models.

What You'll Create

By the end of this series, you'll have built these files from scratch:

| File | Description |
| --- | --- |
| tokenizer.py | Converts text to token IDs |
| embeddings.py | Converts token IDs to vectors |
| attention.py | The attention mechanism |
| transformer.py | Transformer blocks |
| model.py | Complete model architecture |
| dataset.py | Calculator dataset generator |
| train.py | Training loop |
| generate.py | Text generation |
| app.py | Gradio demo for Hugging Face |

Source Code: The complete code for this tutorial is available at github.com/slahiri/small_calculator_model

End-to-End Preview

Here's what the final result looks like:

from model import CalculatorLLM

model = CalculatorLLM.load("calculator-llm.pt")

model.calculate("two plus three")      # → "five"
model.calculate("nine times nine")     # → "eighty one"
model.calculate("fifty minus twelve")  # → "thirty eight"

The .calculate() method is a convenience wrapper. Under the hood, it's still doing autoregressive generation: starting from "two plus three", appending one predicted token at a time, stopping at [END], and returning just the answer portion.
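
A hedged sketch of what such a wrapper could look like is below, reusing the toy encode/decode helpers from the tokenizer sketch above. The predict_next callable stands in for the trained model, and the real generate.py may differ.

# Illustrative autoregressive loop behind a calculate()-style wrapper.
def autoregressive_calculate(prompt, predict_next, max_new_tokens=10):
    ids = encode("[START] " + prompt)
    answer_ids = []
    for _ in range(max_new_tokens):
        next_id = predict_next(ids)       # model's best guess for the next token
        if decode([next_id]) == "[END]":  # stop at the end marker
            break
        ids.append(next_id)               # feed the prediction back in
        answer_ids.append(next_id)
    return decode(answer_ids)             # return only the answer portion

# Toy stand-in for the trained model: always answers "five", then [END].
canned = iter([8, 2])                     # IDs of "five" and "[END]" in the toy vocab
print(autoregressive_calculate("two plus three", lambda ids: next(canned)))  # "five"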

Try It Live

At the end of this series, we'll deploy our model to Hugging Face Spaces—a free platform to host ML demos. You'll be able to share your working calculator LLM with anyone via a simple URL.

Your friends can test your model in their browser—no Python or setup required. Just type "seven times eight" and watch your model respond "fifty six".
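
As a preview, a minimal app.py for such a demo could look roughly like this. The model hookup is stubbed out so the sketch runs on its own; the real app.py comes at the end of the series.

# Hypothetical minimal Gradio app for the calculator LLM (model call stubbed out).
import gradio as gr

def answer(prompt):
    # In the real app.py we'd load the trained model once and call its
    # calculate() method; a canned reply keeps this sketch self-contained.
    return "fifty six" if prompt.strip() == "seven times eight" else "(model not loaded in this sketch)"

demo = gr.Interface(fn=answer, inputs="text", outputs="text",
                    title="Calculator LLM",
                    description='Try a phrase like "seven times eight".')

if __name__ == "__main__":
    demo.launch()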