Computational Linguistics

Word2Vec

Word2Vec is a family of neural network models that learn dense vector representations of words by predicting context words, capturing semantic and syntactic regularities in an efficient low-dimensional embedding space.

P(w_O | w_I) = exp(v'_{w_O}^T v_{w_I}) / Σ_{w=1}^{W} exp(v'_w^T v_{w_I})

Word2Vec, introduced by Tomas Mikolov and colleagues at Google in 2013, revolutionized distributional semantics by demonstrating that simple neural network architectures trained on large corpora produce word embeddings with remarkable algebraic properties. The model comes in two variants: Continuous Bag-of-Words (CBOW), which predicts a target word from its context, and Skip-gram, which predicts context words from a target word. Both produce dense, low-dimensional vectors that capture semantic relationships through vector arithmetic.
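The two variants differ in how training examples are extracted from a sentence. The following sketch (with an illustrative window size and toy sentence, not the original C tool's pipeline) shows the contrast: Skip-gram emits one example per target-context pair, while CBOW groups the whole window into a single example.

```python
# Sketch: extracting training examples for the two Word2Vec variants.
# The sentence, tokenization, and window size are illustrative assumptions.

def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (target, context) pair is a separate training example."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the full context window jointly predicts the target word."""
    examples = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, target))
    return examples

sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence)[:4])
print(cbow_examples(sentence)[1])
```

Note that for the same window, Skip-gram produces roughly 2c times as many examples as CBOW, which is one reason it trains more slowly but tends to represent rare words better.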

Skip-gram and CBOW Architectures

Skip-gram objective (maximize): (1/T) Σ_{t=1}^{T} Σ_{-c≤j≤c, j≠0} log P(w_{t+j} | w_t)

Softmax: P(w_O | w_I) = exp(v'_{w_O}^T v_{w_I}) / Σ_{w=1}^{W} exp(v'_w^T v_{w_I})

Negative sampling approximation:
log σ(v'_{w_O}^T v_{w_I}) + Σ_{i=1}^{k} E_{w_i ~ P_n(w)} [log σ(−v'_{w_i}^T v_{w_I})]

The Skip-gram model uses a shallow neural network with one hidden layer. Given a target word, it maximizes the probability of observing nearby context words within a window of size c. Computing the full softmax over the entire vocabulary is expensive, so two efficient approximations are used: hierarchical softmax, which uses a binary tree over the vocabulary, and negative sampling, which approximates the softmax by contrasting the observed target-context pair against randomly sampled negative pairs. Negative sampling with 5–20 negative samples per positive example works well for small datasets; for large corpora, as few as 2–5 suffice.
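The negative sampling loss above can be sketched directly. In this minimal illustration the embedding matrices are small random arrays and negatives are drawn uniformly; in the published model, negatives are drawn from the unigram distribution raised to the 3/4 power.

```python
import numpy as np

# Sketch of the negative-sampling objective for a single (input, output)
# word pair. Vocabulary size, embedding dimension, and the uniform noise
# distribution used here are simplifying assumptions for illustration.

rng = np.random.default_rng(0)
V, d, k = 1000, 100, 5                       # vocab size, dimension, negatives
W_in = rng.normal(scale=0.1, size=(V, d))    # input vectors v_w
W_out = rng.normal(scale=0.1, size=(V, d))   # output vectors v'_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(w_input, w_output, negatives):
    """Negated objective: log sigma(v'_O . v_I) + sum_i log sigma(-v'_i . v_I)."""
    v_in = W_in[w_input]
    pos = np.log(sigmoid(W_out[w_output] @ v_in))
    neg = np.sum(np.log(sigmoid(-W_out[negatives] @ v_in)))
    return -(pos + neg)  # minimized by SGD during training

negatives = rng.integers(0, V, size=k)
loss = neg_sampling_loss(42, 7, negatives)
```

Each update touches only k + 1 output vectors instead of all W of them, which is what makes training on billion-word corpora feasible.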

Algebraic Properties

The most celebrated property of Word2Vec embeddings is their capacity to capture semantic relationships through vector arithmetic. The relation "king - man + woman ≈ queen" demonstrates that the vector offset between "man" and "woman" encodes a gender relation that transfers across word pairs. Similar regularities hold for syntactic relations (e.g., "walking - walk + swim ≈ swimming") and other semantic relations (country-capital, adjective-comparative). Levy and Goldberg (2014) showed that Skip-gram with negative sampling implicitly factorizes a shifted PMI matrix, connecting the neural approach to traditional count-based distributional semantics.
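The analogy mechanism is simple to demonstrate. The 3-dimensional vectors below are hand-made assumptions, not trained embeddings, but they share the offset structure real Word2Vec vectors exhibit: the analogy is answered by finding the vocabulary word closest (by cosine similarity) to king − man + woman, excluding the query words themselves.

```python
import numpy as np

# Toy embeddings built so that the "gender" offset is shared between
# (king, queen) and (man, woman); values are illustrative assumptions.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.8, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
    "apple": np.array([0.1, 0.2, 0.3]),  # distractor
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["king"] - emb["man"] + emb["woman"]
# Rank candidates by cosine similarity, excluding the three query words.
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, emb[w]))
```

Excluding the query words matters in practice: with real embeddings, the raw nearest neighbor of king − man + woman is often "king" itself.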

Training Considerations

Word2Vec's performance depends critically on hyperparameters. Larger context windows capture broader topical similarity, while smaller windows emphasize syntactic and functional similarity. Subsampling of frequent words (discarding occurrences of very common words with a probability that increases with their corpus frequency) improves both training speed and the quality of representations for rare words. The dimensionality of the embedding space (typically 100–300) trades off between expressiveness and generalization.
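The subsampling rule from Mikolov et al. (2013) discards each occurrence of word w with probability P(w) = 1 − sqrt(t / f(w)), where f(w) is the word's relative corpus frequency and t is a threshold, typically around 10⁻⁵. A short sketch (with illustrative frequencies):

```python
import math

# Subsampling probability from Mikolov et al. (2013):
#   P(discard w) = 1 - sqrt(t / f(w))
# where f(w) is the word's relative frequency and t a small threshold.
# The example frequencies below are illustrative assumptions.

def discard_prob(freq, t=1e-5):
    return max(0.0, 1.0 - math.sqrt(t / freq))

print(discard_prob(0.05))   # a very frequent word like "the": mostly dropped
print(discard_prob(1e-6))   # a rare word: never dropped (probability 0)
```

The square root makes the penalty grow slowly, so moderately frequent words are thinned rather than eliminated, while words rarer than the threshold are always kept.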

Impact and Legacy

Word2Vec's impact on NLP was transformative. Pre-trained word embeddings became a standard feature in virtually every NLP system, replacing sparse one-hot or bag-of-words representations. The model demonstrated that unsupervised learning on raw text could capture substantial linguistic knowledge, presaging the pre-training revolution that would later produce ELMo, BERT, and GPT. Word2Vec also stimulated research on bias in embeddings, as Bolukbasi et al. (2016) showed that word vectors encode societal stereotypes present in training corpora.

Numerous extensions followed Word2Vec. FastText extended the model to use subword information, GloVe combined global co-occurrence statistics with local context prediction, and various retrofitting methods incorporated knowledge from lexical resources. While contextualized models have since surpassed static embeddings on most benchmarks, Word2Vec remains widely used for its simplicity, efficiency, and interpretability.



References

  1. Mikolov, T., Chen, K., Corrado, G. S., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of ICLR Workshop.
  2. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (pp. 3111–3119).
  3. Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27 (pp. 2177–2185).
  4. Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems 29 (pp. 4349–4357).
