Workshop 3: Tokenization
How LLMs split text into pieces
What is tokenization?
LLMs don't read letters or words — they read tokens. A token might be a word, part of a word, or even a single character. The way text gets split affects how the model "sees" your input.
The Tool
Go to platform.openai.com/tokenizer
Type any text and see how it gets split into tokens. Each color is a different token.
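If you'd rather experiment in code, you can reproduce the same kind of splits locally. The sketch below is a minimal example assuming the open-source tiktoken library (pip install tiktoken) and the "cl100k_base" encoding; the splits you see on the website may differ depending on which model you select there.

```python
# Minimal sketch: split text into tokens locally, assuming the `tiktoken`
# library and the "cl100k_base" encoding (used by several OpenAI models).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization turns text into pieces."
token_ids = enc.encode(text)

print(f"{len(token_ids)} tokens")
for tid in token_ids:
    # decode_single_token_bytes shows the raw bytes behind each token ID
    piece = enc.decode_single_token_bytes(tid)
    print(tid, repr(piece.decode("utf-8", errors="replace")))
```

Each printed line corresponds to one colored chunk on the tokenizer page: an integer ID and the piece of text it stands for.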
Things to Try
- Type a simple sentence. How many tokens?
- Try a long word vs. multiple short words
- Try the same word in lowercase, UPPERCASE, and Capitalized
- Type a word with a typo — what happens?
- Try numbers: 100 vs. 1000 vs. 10000
- Try different languages (French, English, Chinese, Arabic...)
- Try code: function hello() { return "world"; }
- Try emojis 🎉🤖🔥 or 👨🏿‍🔬 (a code sketch for running these comparisons follows this list)
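One convenient way to work through the list is to run the comparisons in a loop. This is a rough sketch under the same assumptions as above (tiktoken, "cl100k_base"); swap in any inputs you like.

```python
# Run several of the "Things to Try" experiments in one go, assuming
# the `tiktoken` library and the "cl100k_base" encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = [
    "hello", "HELLO", "Hello",               # casing
    "strawberry", "strawbery",               # typo
    "100", "1000", "10000",                  # numbers
    'function hello() { return "world"; }',  # code
    "🎉🤖🔥",                                 # emojis
]

for s in samples:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{len(ids):2d} tokens  {s!r:40s} -> {pieces}")
```

Notice how a single typo or a change of case can turn one token into several.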
Questions to Consider
- Why do common words get their own token but rare words get split?
- Why does whitespace matter?
- Why might non-English languages use more tokens for the same meaning?
- How could tokenization affect the cost of using an API (priced per token)?
- If the model sees "un", "believ", "able" as separate tokens, does it know they form one word?
- One big LLM limitation is counting the letters in a word (e.g., how many "r"s are in "strawberry"?). Can you guess why? The sketch after this list makes the reason visible.
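For the multilingual, cost, and letter-counting questions in particular, a quick script makes the differences concrete. The sketch below again assumes tiktoken with the "cl100k_base" encoding; the example sentences are my own, and the per-token price is a hypothetical figure used only for the arithmetic, not an actual rate.

```python
# Sketch touching on a few of the questions above, assuming `tiktoken`
# and the "cl100k_base" encoding. The price below is illustrative only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Same (approximate) meaning, different languages: compare token counts.
sentences = {
    "English": "The weather is nice today.",
    "French":  "Il fait beau aujourd'hui.",
    "Chinese": "今天天气很好。",
}
for lang, s in sentences.items():
    print(f"{lang:8s} {len(enc.encode(s)):2d} tokens  {s}")

# Why letter counting is hard: the model sees token IDs, not characters,
# so the "r"s in "strawberry" are hidden inside larger pieces.
word = " strawberry"  # leading space, as it would appear mid-sentence
print([enc.decode([i]) for i in enc.encode(word)])

# Cost intuition: APIs bill per token, not per character or per word.
price_per_million_tokens = 1.00  # hypothetical price in dollars
prompt = "Summarize this report in three bullet points. " * 100
cost = len(enc.encode(prompt)) / 1_000_000 * price_per_million_tokens
print(f"~${cost:.4f} for this prompt at the hypothetical rate")
```

If a language needs more tokens to say the same thing, the same request costs more and fills the context window faster.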