Document representation is the process of converting text into a structured mathematical form suitable for computation. Every text analysis algorithm operates not on raw text but on some representation of it, making the choice of representation one of the most consequential decisions in any NLP pipeline. The history of document representation traces a trajectory from sparse, discrete representations based on word counts to dense, continuous representations learned by neural networks, each successive paradigm capturing increasingly rich linguistic information.
Bag-of-Words and TF-IDF

The bag-of-words (BoW) model represents a document as a vector of word counts, discarding word order entirely. Despite this drastic simplification, BoW representations are effective for many classification and retrieval tasks because the distribution of words in a document carries substantial information about its topic. TF-IDF weighting refines BoW by upweighting terms that are frequent within a document but rare across the corpus, capturing term specificity:

tf(t, d) = count of term t in document d
idf(t) = log(N / df(t))
tf-idf(t, d) = tf(t, d) × idf(t)

where N is the total number of documents and df(t) is the number of documents containing term t. Karen Spärck Jones introduced IDF in 1972, and TF-IDF remains one of the most widely used weighting schemes in information retrieval and text classification.
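The definitions above translate directly into code. The following is a minimal sketch using raw counts for tf and the unsmoothed log(N / df) form of idf; production libraries (e.g. scikit-learn) apply smoothing and normalisation on top of this.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenised documents.

    tf(t, d) = raw count of term t in document d
    idf(t)   = log(N / df(t)), where N is the number of documents
               and df(t) is the number of documents containing t.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "home"],
]
w = tf_idf(docs)
# "the" appears in every document, so idf = log(3/3) = 0 and its weight vanishes
print(w[0]["the"])            # 0.0
print(round(w[0]["cat"], 3))  # tf = 1, idf = log(3/2) -> 0.405
```

Note how the weighting zeroes out "the": a term present in every document carries no specificity, which is exactly the behaviour the prose describes.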
Distributed and Contextual Representations
Distributed representations address the fundamental limitations of sparse BoW vectors: their inability to capture semantic similarity (the vectors for "car" and "automobile" are orthogonal) and their high dimensionality (equal to the vocabulary size). Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) learn dense vector representations where semantically similar words occupy nearby points in a continuous vector space. Document representations can be constructed by averaging word embeddings, using Doc2Vec, or applying more sophisticated composition functions.
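The simplest composition function mentioned above, averaging word embeddings, can be sketched in a few lines. The toy 3-dimensional vectors below are invented for illustration; real embeddings from Word2Vec or GloVe typically have 100 to 300 dimensions.

```python
# Toy embeddings (hypothetical values); note "car" and "automobile" are close,
# which a sparse BoW representation cannot express.
embeddings = {
    "car":        [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana":     [0.0, 0.9, 0.8],
}

def doc_vector(tokens, embeddings):
    """Represent a document as the mean of its word vectors,
    skipping out-of-vocabulary tokens."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

vec = doc_vector(["car", "automobile"], embeddings)
print([round(x, 2) for x in vec])  # [0.85, 0.15, 0.05]
```

Averaging loses word order, like BoW, but inherits the semantic geometry of the embedding space, so documents about similar topics land near each other even with no shared vocabulary.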
All distributed representations rest on the distributional hypothesis articulated by Zellig Harris (1954) and popularised by J. R. Firth's dictum that "you shall know a word by the company it keeps." Words that appear in similar contexts tend to have similar meanings, and this statistical regularity provides the signal that embedding algorithms exploit. The success of word embeddings validated decades of theoretical work in distributional semantics and established vector space models as the dominant paradigm in computational semantics.
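The distributional hypothesis can be made concrete with a toy experiment: count each word's context words in a tiny invented corpus and compare the resulting count vectors. "car" and "automobile" never co-occur, yet they share contexts, so their vectors are similar; this shared-context signal is what embedding algorithms compress into dense vectors.

```python
import math
from collections import Counter

# Tiny invented corpus: "car" and "automobile" appear in similar contexts.
corpus = [
    "the car sped down the road",
    "the automobile sped down the street",
    "she peeled the banana slowly",
]

def context_vector(word, sentences, window=2):
    """Count the words occurring within `window` positions of `word`."""
    counts = Counter()
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t == word:
                lo = max(0, i - window)
                counts.update(toks[lo:i] + toks[i + 1:i + window + 1])
    return counts

def cosine(a, b):
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

car, auto, banana = (context_vector(w, corpus)
                     for w in ("car", "automobile", "banana"))
# Shared contexts ("the", "sped", "down") give a high similarity.
print(cosine(car, auto) > cosine(car, banana))  # True
```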
Contextual embeddings from pretrained language models such as ELMo (built on bidirectional LSTMs) and the transformer-based BERT and GPT represent the current frontier of document representation. Unlike static embeddings, which assign a single vector to each word type, contextual embeddings produce a different representation for each word token depending on its surrounding context, naturally handling polysemy and context-dependent meaning. A document can be represented by the special [CLS] token embedding in BERT, by averaging token embeddings, or by pooling strategies that capture different aspects of the document's content. These representations have set state-of-the-art results across virtually all text analysis benchmarks.
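The two pooling strategies just mentioned can be sketched without a model: given per-token hidden states, [CLS] pooling takes the first token's vector, while mean pooling averages over real tokens using the attention mask to exclude padding. The hidden states and mask below are dummy values (3 dimensions instead of BERT's 768), standing in for a transformer's output.

```python
# Dummy hidden states for a 4-position sequence: [CLS], two word tokens,
# and one padding position. The mask marks real tokens (1) vs padding (0).
hidden = [
    [0.2, 0.4, 0.1],   # [CLS]
    [1.0, 0.0, 0.5],   # token 1
    [0.0, 1.0, 0.5],   # token 2
    [9.9, 9.9, 9.9],   # padding; must be excluded from pooling
]
mask = [1, 1, 1, 0]

def cls_pool(hidden):
    """Use the first ([CLS]) token's embedding as the document vector."""
    return hidden[0]

def mean_pool(hidden, mask):
    """Average token embeddings, ignoring padding positions."""
    n = sum(mask)
    dim = len(hidden[0])
    return [sum(h[i] * m for h, m in zip(hidden, mask)) / n
            for i in range(dim)]

print(cls_pool(hidden))                                  # [0.2, 0.4, 0.1]
print([round(x, 4) for x in mean_pool(hidden, mask)])
```

Masking matters: averaging naively over all positions would let the padding vector dominate the document representation, which is why mean pooling is always combined with the attention mask in practice.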