Computational Linguistics
GloVe

GloVe (Global Vectors for Word Representation) learns word embeddings by factorizing a log-bilinear model of the global word-word co-occurrence matrix, combining the advantages of count-based and prediction-based methods.

J = Σ_{i,j=1}^{V} f(X_{ij}) · (w_i^T w̃_j + b_i + b̃_j − log X_{ij})²

GloVe, introduced by Pennington, Socher, and Manning at Stanford in 2014, was designed to address a perceived gap between count-based distributional models (like LSA) and prediction-based models (like Word2Vec). The authors argued that count-based methods efficiently leverage global statistical information but perform poorly on analogy tasks, while prediction-based methods capture fine-grained semantic patterns but only use local context windows. GloVe bridges these approaches by training a log-bilinear regression model on the global word-word co-occurrence matrix.

The GloVe Objective

Cost function:
J = Σ_{i,j=1}^{V} f(X_{ij}) · (w_i^T w̃_j + b_i + b̃_j − log X_{ij})²

Weighting function:
f(x) = (x / x_max)^α if x < x_max
f(x) = 1 otherwise

Typically α = 3/4, x_max = 100
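The weighting function above is straightforward to implement directly. A minimal sketch in NumPy, using the typical α = 3/4 and x_max = 100 (function name is illustrative):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): (x / x_max)^alpha below x_max, capped at 1 above."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)
```

With the defaults, a pair seen once gets weight (1/100)^0.75 ≈ 0.032, while any pair seen 100 or more times gets weight 1, so very frequent pairs cannot dominate the loss.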

GloVe starts from the insight that word vector differences should encode ratios of co-occurrence probabilities. If "ice" co-occurs frequently with "solid" but not with "gas," while "steam" co-occurs frequently with "gas" but not with "solid," then the ratio P(solid | ice) / P(solid | steam) should be large, and this ratio should be captured by the vector representations. The cost function trains word vectors w_i and context vectors w̃_j such that their dot product approximates the logarithm of their co-occurrence count, weighted by a function f that downweights very frequent pairs.
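The ice/steam intuition can be checked on a toy co-occurrence table. The counts below are invented for illustration, not taken from any corpus:

```python
import numpy as np

# Rows: target words (ice, steam); columns: context words (solid, gas, water).
# Counts are invented to mirror the ice/steam example.
X = np.array([[80.0,  5.0, 100.0],   # ice
              [ 6.0, 90.0, 100.0]])  # steam

P = X / X.sum(axis=1, keepdims=True)  # P(context | word)
ratio = P[0] / P[1]                   # P(k | ice) / P(k | steam)
# ratio is large for "solid", small for "gas", and near 1 for "water" --
# exactly the pattern GloVe asks the vector differences to encode.
```

Only the discriminative contexts ("solid", "gas") produce ratios far from 1; a context like "water" that relates equally to both words gives a ratio near 1 and carries little signal.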

Training and Properties

GloVe training involves first constructing the full co-occurrence matrix X from the corpus, then minimizing the weighted least squares objective by stochastic optimization (AdaGrad in the reference implementation) over the nonzero entries of X. The final word vectors are typically the sum of the word and context vectors (w + w̃), exploiting the symmetry of the model. GloVe vectors demonstrate strong performance on word analogy tasks, word similarity benchmarks, and named entity recognition, rivaling or exceeding Word2Vec on many evaluations.
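The training loop described above can be sketched end to end. This is a simplification for exposition, assuming X is already built: it uses plain SGD rather than AdaGrad, a dense matrix rather than sharded files, and no learning-rate schedule:

```python
import numpy as np

def train_glove(X, d=10, epochs=30, lr=0.05, x_max=100.0, alpha=0.75, seed=0):
    """Minimal GloVe trainer iterating over nonzero entries of the
    co-occurrence matrix X. A sketch only: the reference implementation
    uses AdaGrad; plain SGD is used here for brevity."""
    rng = np.random.default_rng(seed)
    V = X.shape[0]
    W = rng.normal(scale=0.1, size=(V, d))   # word vectors w_i
    Wt = rng.normal(scale=0.1, size=(V, d))  # context vectors w~_j
    b, bt = np.zeros(V), np.zeros(V)         # biases b_i, b~_j
    ii, jj = np.nonzero(X)
    for _ in range(epochs):
        for k in rng.permutation(len(ii)):   # shuffle nonzero entries
            i, j = ii[k], jj[k]
            f = min((X[i, j] / x_max) ** alpha, 1.0)
            err = W[i] @ Wt[j] + b[i] + bt[j] - np.log(X[i, j])
            g = 2.0 * f * err
            gw, gwt = g * Wt[j], g * W[i]    # cache grads before updating
            W[i] -= lr * gw
            Wt[j] -= lr * gwt
            b[i] -= lr * g
            bt[j] -= lr * g
    return W + Wt  # final vectors: sum of word and context vectors
```

Note the gradients for W[i] and Wt[j] are cached before either update, since each depends on the other's pre-update value.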

Count-Based vs. Prediction-Based: A Unified View

Levy, Goldberg, and Dagan (2015) conducted a systematic comparison of count-based and prediction-based methods, showing that much of Word2Vec's advantage over traditional count-based methods is attributable to specific hyperparameter choices and preprocessing steps rather than the fundamental algorithmic distinction. When count-based methods use the same hyperparameters (PPMI weighting, context distribution smoothing, SVD dimensionality), they achieve comparable performance. This finding suggests that GloVe, Word2Vec, and SVD-based PPMI models are different approaches to the same underlying objective.
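The count-based side of this comparison is compact enough to sketch: PPMI weighting with context distribution smoothing (raising context counts to the 0.75 power, one of the hyperparameters Levy et al. carried over from Word2Vec), followed by truncated SVD. Function and variable names here are illustrative:

```python
import numpy as np

def ppmi_svd_embeddings(X, d=2, cds=0.75):
    """Count-based embeddings: PPMI with context distribution smoothing,
    then truncated SVD. A sketch of the pipeline compared by
    Levy, Goldberg, and Dagan (2015), not their code."""
    total = X.sum()
    p_w = X.sum(axis=1) / total          # P(word)
    ctx = X.sum(axis=0) ** cds           # smoothed context counts
    p_c = ctx / ctx.sum()                # smoothed P(context)
    p_wc = X / total                     # joint P(word, context)
    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w[:, None] * p_c[None, :]))
    ppmi = np.maximum(pmi, 0.0)          # clip negatives (and -inf) to 0
    U, S, _ = np.linalg.svd(ppmi)
    return U[:, :d] * S[:d]              # rank-d embedding
```

Unseen pairs produce log(0) = −inf, which the clipping step maps to 0, so the PPMI matrix stays sparse-friendly without special-casing zeros.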

Applications and Extensions

Pre-trained GloVe vectors (trained on Common Crawl, Wikipedia, and Twitter corpora in dimensions of 25 to 300) have been widely adopted as initialization for neural NLP models. GloVe embeddings serve as the input layer for text classification, sequence labeling, machine translation, and many other tasks. The availability of high-quality pre-trained vectors dramatically reduced the amount of task-specific training data needed for these applications.
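The pre-trained files are plain text, one token per line: the word followed by its vector components, space-separated. A minimal loader that builds an embedding matrix for a given vocabulary (the function name and fallback initialization are this sketch's choices, not part of the GloVe release):

```python
import numpy as np

def load_glove(path, vocab, dim=300):
    """Build an embedding matrix from a GloVe text file. `vocab` maps
    word -> row index; words absent from the file keep a small random
    initialization (an arbitrary but common fallback)."""
    rng = np.random.default_rng(0)
    emb = rng.normal(scale=0.1, size=(len(vocab), dim))
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], parts[1:]
            if word in vocab and len(vec) == dim:
                emb[vocab[word]] = np.asarray(vec, dtype=np.float32)
    return emb
```

The resulting matrix is typically used to initialize an embedding layer, either frozen or fine-tuned along with the downstream task.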

GloVe has been extended in several directions. Mittens (Dingwall and Potts, 2018) adapts GloVe for domain-specific embeddings by using general-purpose GloVe vectors as a regularization target. Dynamic GloVe tracks meaning change over time by training on temporally sliced corpora. The GloVe framework has also been adapted for learning embeddings of other discrete objects, including nodes in graphs and items in recommendation systems.
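The Mittens idea of using pre-trained vectors as a regularization target amounts to adding a penalty term to the GloVe cost. A sketch of that term only (names and the default μ are illustrative, not the authors' code):

```python
import numpy as np

def mittens_penalty(W, R, mu=0.1):
    """Mittens-style regularizer: an L2 penalty added to the GloVe cost
    that pulls each domain-specific vector W[i] toward its general-purpose
    pre-trained counterpart R[i]. mu controls the strength."""
    return mu * np.sum((W - R) ** 2)
```

The full objective is then the standard GloVe cost on the domain corpus plus this penalty, so vectors for words with little domain-specific evidence stay close to their general-purpose values.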

References

  1. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). doi:10.3115/v1/D14-1162
  2. Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225. doi:10.1162/tacl_a_00134
  3. Dingwall, N., & Potts, C. (2018). Mittens: An extension of GloVe for learning domain-specialized representations. In Proceedings of NAACL-HLT (pp. 212–217). doi:10.18653/v1/N18-2034