Workshop 3: Tokenization

How LLMs split text into pieces

What is tokenization?

LLMs don't read letters or words — they read tokens. A token might be a word, part of a word, or even a single character. The way text gets split affects how the model "sees" your input.
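As a rough sketch, you can picture a tokenizer greedily matching the longest piece it knows against the text. The tiny vocabulary below is made up for illustration; real tokenizers use vocabularies of tens of thousands of pieces learned from data, so the actual splits you see in the tool will differ.

```python
# Toy greedy longest-match tokenizer (illustration only -- not how
# OpenAI's tokenizer actually splits text; the vocabulary is invented).
VOCAB = {"token", "iza", "tion", "t", "i", "z", "a", "o", "n"}

def tokenize(text):
    """Split text by repeatedly taking the longest piece found in VOCAB."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character becomes its own token
            i += 1
    return tokens

print(tokenize("tokenization"))  # ['token', 'iza', 'tion']
print(tokenize("tokenizaton"))   # typo -> ['token', 'iza', 't', 'o', 'n']
```

Notice that the misspelled word falls apart into many more pieces: the tokenizer only has entries for patterns it has seen, so rare or misspelled strings get chopped into smaller fragments.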

The Tool

Go to platform.openai.com/tokenizer

Type any text and see how it gets split into tokens. Each color is a different token.

Things to Try

  1. Type a simple sentence. How many tokens?
  2. Try a long word vs. multiple short words
  3. Try the same word in lowercase, UPPERCASE, and Capitalized
  4. Type a word with a typo — what happens?
  5. Try numbers: 100 vs. 1000 vs. 10000
  6. Try different languages (French, English, Chinese, Arabic...)
  7. Try code: function hello() { return "world"; }
  8. Try emojis 🎉🤖🔥 or 👨🏿‍🔬
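Much of the behavior in these experiments traces back to how the vocabulary was built. Tokenizers like OpenAI's learn their pieces with byte-pair encoding (BPE): start from single characters and repeatedly merge the most frequent adjacent pair. A minimal sketch of that training loop (the corpus and frequencies are invented for illustration):

```python
# Minimal BPE training sketch. Words are stored as space-separated
# symbols with a frequency; each round merges the most common pair.
from collections import Counter

def most_common_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by frequency."""
    pairs = Counter()
    for word, freq in words.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    a, b = pair
    out = {}
    for word, freq in words.items():
        syms = word.split()
        merged, i = [], 0
        while i < len(syms):
            if i + 1 < len(syms) and (syms[i], syms[i + 1]) == pair:
                merged.append(a + b)
                i += 2
            else:
                merged.append(syms[i])
                i += 1
        out[" ".join(merged)] = freq
    return out

# Toy corpus: frequent words get merged into whole tokens first.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(10):
    pair = most_common_pair(words)
    if pair is None:
        break
    words = merge(words, pair)
print(words)  # frequent words collapse to single tokens: 'low', 'newest', 'widest'
```

This is why common words become one token while rare words, typos, and less common languages split into many: the merges that would cover them never became frequent enough to make it into the vocabulary.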

Questions to Consider