Workshop 4: Train Your Own Embeddings
See how Word2Vec learns from text
What is this?
In previous workshops, we used pre-trained embedding models to analyze and compare word similarities. Now you'll train your own embeddings from scratch with a small, browser-based implementation of a classic algorithm: Word2Vec.
The Tool
Go to remykarem.github.io/word2vec-demo
This is a browser-based implementation of the Word2Vec algorithm. It is limited in scope, since a browser is not meant for training models at scale, but it demonstrates the core concepts.
Step by Step
- Paste some text: song lyrics, a paragraph, anything short enough not to crash your browser
- Choose a model:
- Skip-gram: predict context from a word
- CBOW: predict a word from its context
- Set window size: how many words around each word to consider
- Click Generate dataset to see the training examples (the first sketch after this list shows how such pairs can be built)
- Set the embedding size, learning rate, and epochs (or leave them as is)
- Click Train model and wait for training to complete
- Click Run t-SNE to visualize the learned embeddings in 2D (the second sketch below mirrors this train-and-project workflow in Python)
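To make the Generate dataset step concrete, here is a minimal sketch of how skip-gram and CBOW training pairs can be built from raw text with a sliding window. The tokenization, function name, and example lyric are illustrative assumptions, not the demo's actual code.

```python
# Sketch of the "Generate dataset" step: slide a window over the tokens and
# emit (input, target) training pairs. Illustrative only, not the demo's code.

def build_pairs(text, window=2, mode="skipgram"):
    tokens = text.lower().split()
    pairs = []
    for i, center in enumerate(tokens):
        # words inside the window on either side of the center word
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "skipgram":
            # Skip-gram: the center word predicts each context word
            pairs.extend((center, c) for c in context)
        else:
            # CBOW: the whole context predicts the center word
            pairs.append((tuple(context), center))
    return pairs

lyric = "we will we will rock you"
print(build_pairs(lyric, window=1, mode="skipgram"))
print(build_pairs(lyric, window=1, mode="cbow"))
```

With window=1, skip-gram pairs each word with its immediate neighbours, while CBOW inverts the direction: the neighbours together predict the center word.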
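If you want to reproduce the train-and-visualize steps outside the browser, the sketch below uses the gensim library's Word2Vec and scikit-learn's t-SNE. The toy corpus and parameter values are illustrative assumptions, not the demo's defaults.

```python
# Train Word2Vec on a toy corpus, then project the vectors to 2D with t-SNE.
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# toy corpus: each "sentence" is a list of tokens (illustrative)
corpus = [
    "we will we will rock you".split(),
    "we are the champions my friends".split(),
    "another one bites the dust".split(),
]

model = Word2Vec(
    corpus,
    vector_size=16,   # embedding size
    window=2,         # context window
    min_count=1,      # keep even words that appear only once
    sg=1,             # 1 = skip-gram, 0 = CBOW
    epochs=200,
)

words = model.wv.index_to_key
vectors = model.wv[words]

# project the learned vectors to 2D; perplexity must stay below the vocabulary size
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```

Note how min_count=1 keeps words that appear only once; they still get an embedding, but it is learned from very few training pairs, which is worth keeping in mind for the first question below.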
Questions to Consider
- What happens if a word only appears once?
- Why might a larger window size capture different relationships than a smaller one?
- How much text would you need to get useful embeddings?
- The real Word2Vec was trained on ~100 billion words. What would change?