Build Your First LLM from Scratch · Part 3 · Section 3 of 13
Building the Tokenizer
The tokenizer converts text to token IDs and back. It has two main methods, encode and decode, plus a normalize helper that cleans up the input:
class Tokenizer:
    def __init__(self, vocabulary: dict[str, int]):
        self.word_to_id = vocabulary
        self.id_to_word = {id: word for word, id in vocabulary.items()}

    def normalize(self, text: str) -> str:
        """Handle variations like 'thirtysix' or '+'."""
        # Lowercase first so the replacements below also catch capitalized input
        text = text.lower()
        # Split compound numbers: "thirtysix" → "thirty six"
        tens = ["twenty", "thirty", "forty", "fifty",
                "sixty", "seventy", "eighty", "ninety"]
        units = ["one", "two", "three", "four", "five",
                 "six", "seven", "eight", "nine"]
        for ten in tens:
            for unit in units:
                text = text.replace(ten + unit, ten + " " + unit)
        # Replace symbols with words
        text = text.replace("+", " plus ").replace("-", " minus ")
        text = text.replace("*", " times ").replace("/", " divided by ")
        # Strip punctuation the vocabulary doesn't cover
        text = text.replace(",", "").replace(".", "").replace("?", "")
        return text

    def encode(self, text: str) -> list[int]:
        """Convert text to token IDs with [START] and [END]."""
        text = self.normalize(text)
        words = text.split()
        ids = [self.word_to_id["[START]"]]
        ids += [self.word_to_id[word] for word in words]
        ids += [self.word_to_id["[END]"]]
        return ids

    def decode(self, ids: list[int]) -> str:
        """Convert token IDs back to text."""
        words = [self.id_to_word[id] for id in ids]
        return " ".join(words)

Usage:
tokenizer = Tokenizer(vocabulary)
# Standard input
ids = tokenizer.encode("two plus three")
print(ids) # [1, 5, 31, 6, 2]
# [START] "two" "plus" "three" [END]
# Also handles variations!
ids = tokenizer.encode("thirtysix + seventytwo")
print(ids) # [1, 24, 9, 31, 28, 5, 2]
# [START] thirty six plus seventy two [END]
# Decode: IDs → text
text = tokenizer.decode([1, 5, 31, 6, 2])
print(text) # "[START] two plus three [END]"

Why normalize? Users might type "thirtysix" (no space) or use symbols like "+". The normalize method splits compound words and converts symbols to our vocabulary words. This makes the demo robust without complicating the core tokenization logic.
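For instance, a capitalized question with a symbol and trailing punctuation round-trips cleanly through the tokenizer built above (this reuses the tokenizer and vocabulary from the usage example; only the input string is new):

ids = tokenizer.encode("Thirtysix + seventytwo?")
print(tokenizer.decode(ids))
# "[START] thirty six plus seventy two [END]"

Normalization lowercases, splits the compound numbers, swaps "+" for "plus", and drops the "?", so encode only ever sees words that exist in the vocabulary.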
At Scale
Production tokenizers use subword algorithms such as Byte Pair Encoding (BPE), but they expose the same interface:
# OpenAI's tokenizer
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Hello, world!") # [9906, 11, 1917, 0]
# Hugging Face tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
ids = tokenizer.encode("Hello, world!") # [15496, 11, 995, 0]

Same concept:
text → token IDs → text. The only difference is how the vocabulary is built and how text is split into tokens.
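The round trip works the same way, too. Here is a short sketch using tiktoken's decode method; the subword pieces in the last comment are illustrative only, since the exact split depends on the learned vocabulary:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Same round trip as our toy Tokenizer: text → token IDs → text
ids = enc.encode("Hello, world!")
print(enc.decode(ids))  # "Hello, world!"

# A word our small vocabulary would reject is simply split into
# smaller pieces that the production vocabulary does contain
pieces = [enc.decode([i]) for i in enc.encode("thirtysix")]
print(pieces)  # e.g. something like ["thirty", "six"], depending on the learned merges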