Workshop 3: Tokenization
How LLMs split text into pieces
What is tokenization?
LLMs don't read letters or words — they read tokens. A token might be a word, part of a word, or even a single character. The way text gets split affects how the model "sees" your input.
The Tool
Go to platform.openai.com/tokenizer
Type any text and see how it gets split into tokens. Each color is a different token.
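If you'd rather experiment in code, you can reproduce the same kind of splits locally. The sketch below is a minimal example assuming the open-source tiktoken library (pip install tiktoken) and the "cl100k_base" encoding; the splits you see on the website may differ depending on which model you select there.

```python
# Minimal sketch: split text into tokens locally, assuming the `tiktoken`
# library and the "cl100k_base" encoding (used by several OpenAI models).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization turns text into pieces."
token_ids = enc.encode(text)

print(f"{len(token_ids)} tokens")
for tid in token_ids:
    # decode_single_token_bytes shows the raw bytes behind each token ID
    piece = enc.decode_single_token_bytes(tid)
    print(tid, repr(piece.decode("utf-8", errors="replace")))
```

Each printed line corresponds to one colored chunk on the tokenizer page: an integer ID and the piece of text it stands for.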
Things to Try
- Type a simple sentence. How many tokens?
- Try a long word vs. multiple short words
- Try the same word in lowercase, UPPERCASE, and Capitalized
- Type a word with a typo — what happens?
- Try numbers: 100 vs. 1000 vs. 10000
- Try different languages (French, English, Chinese, Arabic...)
- Try code: function hello() { return "world"; }
- Try emojis 🎉🤖🔥 or 👨🏿‍🔬 (a code sketch for running these comparisons follows this list)
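One convenient way to work through the list is to run the comparisons in a loop. This is a rough sketch under the same assumptions as above (tiktoken, "cl100k_base"); swap in any inputs you like.

```python
# Run several of the "Things to Try" experiments in one go, assuming
# the `tiktoken` library and the "cl100k_base" encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = [
    "hello", "HELLO", "Hello",               # casing
    "strawberry", "strawbery",               # typo
    "100", "1000", "10000",                  # numbers
    'function hello() { return "world"; }',  # code
    "🎉🤖🔥",                                 # emojis
]

for s in samples:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{len(ids):2d} tokens  {s!r:40s} -> {pieces}")
```

Notice how a single typo or a change of case can turn one token into several.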
Questions to Consider
- Why do common words get their own token but rare words get split?
- Why does whitespace matter?
- Why might non-English languages use more tokens for the same meaning?
- How could tokenization affect the cost of using an API (priced per token)?
- If the model sees "un", "believ", "able" as separate tokens, does it know they form one word?
- One big LLM limitation is counting the letters in a word (e.g., how many "r"s are in "strawberry"?). Can you guess why? The sketch after this list makes the reason visible.
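For the multilingual, cost, and letter-counting questions in particular, a quick script makes the differences concrete. The sketch below again assumes tiktoken with the "cl100k_base" encoding; the example sentences are my own, and the per-token price is a hypothetical figure used only for the arithmetic, not an actual rate.

```python
# Sketch touching on a few of the questions above, assuming `tiktoken`
# and the "cl100k_base" encoding. The price below is illustrative only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Same (approximate) meaning, different languages: compare token counts.
sentences = {
    "English": "The weather is nice today.",
    "French":  "Il fait beau aujourd'hui.",
    "Chinese": "今天天气很好。",
}
for lang, s in sentences.items():
    print(f"{lang:8s} {len(enc.encode(s)):2d} tokens  {s}")

# Why letter counting is hard: the model sees token IDs, not characters,
# so the "r"s in "strawberry" are hidden inside larger pieces.
word = " strawberry"  # leading space, as it would appear mid-sentence
print([enc.decode([i]) for i in enc.encode(word)])

# Cost intuition: APIs bill per token, not per character or per word.
price_per_million_tokens = 1.00  # hypothetical price in dollars
prompt = "Summarize this report in three bullet points. " * 100
cost = len(enc.encode(prompt)) / 1_000_000 * price_per_million_tokens
print(f"~${cost:.4f} for this prompt at the hypothetical rate")
```

If a language needs more tokens to say the same thing, the same request costs more and fills the context window faster.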