Workshop 4: Train Your Own Embeddings
See how Word2Vec learns from text
What is this?
In previous workshops, we used pre-trained embedding models to analyze and compare word similarities. Now you'll train your own embeddings from scratch with a small, browser-based implementation of a classic algorithm: Word2Vec.
The Tool
Go to remykarem.github.io/word2vec-demo
This is a browser-based implementation of the Word2Vec algorithm. It is limited in scope, since a browser is not meant for training models at scale, but it demonstrates the core concepts.
Step by Step
- Paste some text: song lyrics, a paragraph, anything short enough not to crash your browser
- Choose a model:
- Skip-gram: predict context from a word
- CBOW: predict a word from its context
- Set window size: how many words around each word to consider
- Click Generate dataset to see the training examples (the first sketch after this list shows how such pairs can be built)
- Set the embedding size, learning rate, and epochs (or leave them as is)
- Click Train model and wait for training to complete
- Click Run t-SNE to visualize the learned embeddings in 2D (the second sketch below mirrors this train-and-project workflow in Python)
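To make the Generate dataset step concrete, here is a minimal sketch of how skip-gram and CBOW training pairs can be built from raw text with a sliding window. The tokenization, function name, and example lyric are illustrative assumptions, not the demo's actual code.

```python
# Sketch of the "Generate dataset" step: slide a window over the tokens and
# emit (input, target) training pairs. Illustrative only, not the demo's code.

def build_pairs(text, window=2, mode="skipgram"):
    tokens = text.lower().split()
    pairs = []
    for i, center in enumerate(tokens):
        # words inside the window on either side of the center word
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "skipgram":
            # Skip-gram: the center word predicts each context word
            pairs.extend((center, c) for c in context)
        else:
            # CBOW: the whole context predicts the center word
            pairs.append((tuple(context), center))
    return pairs

lyric = "we will we will rock you"
print(build_pairs(lyric, window=1, mode="skipgram"))
print(build_pairs(lyric, window=1, mode="cbow"))
```

With window=1, skip-gram pairs each word with its immediate neighbours, while CBOW inverts the direction: the neighbours together predict the center word.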
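If you want to reproduce the train-and-visualize steps outside the browser, the sketch below uses the gensim library's Word2Vec and scikit-learn's t-SNE. The toy corpus and parameter values are illustrative assumptions, not the demo's defaults.

```python
# Train Word2Vec on a toy corpus, then project the vectors to 2D with t-SNE.
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# toy corpus: each "sentence" is a list of tokens (illustrative)
corpus = [
    "we will we will rock you".split(),
    "we are the champions my friends".split(),
    "another one bites the dust".split(),
]

model = Word2Vec(
    corpus,
    vector_size=16,   # embedding size
    window=2,         # context window
    min_count=1,      # keep even words that appear only once
    sg=1,             # 1 = skip-gram, 0 = CBOW
    epochs=200,
)

words = model.wv.index_to_key
vectors = model.wv[words]

# project the learned vectors to 2D; perplexity must stay below the vocabulary size
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```

Note how min_count=1 keeps words that appear only once; they still get an embedding, but it is learned from very few training pairs, which is worth keeping in mind for the first question below.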
Questions to Consider
- What happens if a word only appears once?
- Why might a larger window size capture different relationships than a smaller one?
- How much text would you need to get useful embeddings?
- The real Word2Vec was trained on ~100 billion words. What would change?