Computational Linguistics
Computational linguistics is the scientific discipline that uses formal, mathematical, and computational methods to model, analyze, and generate human language — from parsing sentences and translating between languages to building the large language models that power modern AI.
Where traditional linguistics asks “what are the rules of language?”, computational linguistics asks “how can we formalize those rules so that machines can process language automatically?” It transforms linguistic theories into algorithms, enabling machines to understand, generate, and translate text and speech.
This reference covers the full landscape — from formal foundations in automata theory and grammar formalisms through parsing, semantics, and language modeling, to machine translation, text analysis, speech processing, and discourse.
Language = f(grammar, probability, computation, data)

Key Concepts
Core mathematical and computational constructs used in computational linguistics.
Perplexity
A measure of how well a probability model predicts a sample, defined as the exponentiated average negative log-likelihood per token. Lower perplexity indicates that the model assigns higher probability to the test data, reflecting better language modeling performance.
PP(W) = P(w_1, w_2, ..., w_N)^(-1/N) = 2^(H(W))

A trigram language model achieving perplexity of 80 on a news corpus is, on average, as uncertain as if it had to choose uniformly among 80 words at each position.
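The definition above can be sketched in a few lines of Python. This is a minimal illustration that takes the per-token probabilities the model assigned as given; a real evaluation would obtain them from a trained model:

```python
import math

def perplexity(token_probs):
    """Perplexity = 2 to the average negative log2-probability per token."""
    n = len(token_probs)
    avg_neg_ll = -sum(math.log2(p) for p in token_probs) / n
    return 2 ** avg_neg_ll

# A model that assigns probability 1/80 to every token has perplexity 80,
# matching the "uniform choice among 80 words" interpretation.
print(perplexity([1 / 80] * 10))  # -> 80.0 (up to floating-point error)
```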
TF-IDF
Term Frequency-Inverse Document Frequency is a numerical statistic reflecting how important a word is to a document within a corpus. It increases proportionally with the number of times a word appears in a document but is offset by the frequency of the word across all documents, down-weighting common terms.
tf-idf(t, d, D) = tf(t, d) * log(|D| / |{d in D : t in d}|)

In a corpus of medical papers, the word 'the' has high term frequency but very low IDF (it appears in nearly every document), yielding a low TF-IDF score. The term 'carcinoma' appears frequently in oncology papers but rarely elsewhere, earning a high TF-IDF weight.
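The formula translates directly into code. The sketch below uses raw counts for tf and the natural log (one common variant; implementations differ in smoothing and log base), with a tiny hypothetical corpus of tokenized documents:

```python
import math

def tf_idf(term, doc, corpus):
    """tf(t, d) * log(|D| / df(t)); raw term counts, no smoothing."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)  # document frequency
    return tf * math.log(len(corpus) / df) if df else 0.0

docs = [["the", "carcinoma", "study"],
        ["the", "clinical", "trial"],
        ["the", "patient", "cohort"]]

print(tf_idf("the", docs[0], docs))        # appears in all docs: log(3/3) = 0.0
print(tf_idf("carcinoma", docs[0], docs))  # appears in 1 of 3 docs: log(3) ~ 1.10
```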
BLEU Score
Bilingual Evaluation Understudy is an automatic metric for evaluating the quality of machine-translated text by measuring modified n-gram precision against one or more reference translations. A brevity penalty prevents degenerate short translations from scoring artificially high.
BLEU = BP * exp(Sum_{n=1}^{N} w_n * log(p_n)); BP = min(1, exp(1 - r/c))

A machine translation system producing 'The cat sat on the mat' against reference 'The cat is sitting on the mat' would receive high unigram and bigram precision but lower 4-gram precision, yielding a BLEU-4 score around 0.45.
Attention Mechanism
A neural network component that computes a weighted sum of value vectors, where the weights are derived from the compatibility between a query and a set of keys. Scaled dot-product attention divides by the square root of the key dimension to prevent vanishing gradients from softmax saturation in high-dimensional spaces.
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

When translating 'The animal didn't cross the street because it was too tired,' the attention mechanism allows the pronoun 'it' to attend strongly to 'animal' rather than 'street,' resolving the coreference ambiguity.
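For tiny matrices, scaled dot-product attention can be written out in pure Python. This is a didactic sketch with made-up 2-dimensional vectors; real implementations operate on batched tensors in a library like PyTorch or JAX:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Compatibility of the query with each key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                     # one query
K = [[1.0, 0.0], [0.0, 1.0]]         # two keys
V = [[1.0, 2.0], [3.0, 4.0]]         # two values
print(attention(Q, K, V))            # output lies between the two value rows,
                                     # weighted toward the first (matching key)
```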
Pointwise Mutual Information
A measure of association between two events that quantifies how much more (or less) likely they are to co-occur than expected under independence. Positive PMI indicates that the co-occurrence is more frequent than chance, making it a foundational tool for discovering collocations and semantic associations in text.
PMI(x, y) = log2(P(x, y) / (P(x) * P(y)))

In a large corpus, PMI('New', 'York') is high because these words co-occur far more often than their individual frequencies would predict. In contrast, PMI('the', 'of') is low despite both being frequent, because their co-occurrence is close to what independence would predict.
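A minimal sketch of the computation from raw counts. The counts here are hypothetical, and all probabilities are estimated over the same bigram total for simplicity:

```python
import math

def pmi(bigram_count, count_x, count_y, total):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), from corpus counts."""
    p_xy = bigram_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: 'New York' co-occurs far more often than chance predicts
print(pmi(bigram_count=500, count_x=1000, count_y=600, total=1_000_000))
```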
Cross-Entropy
The average number of bits needed to encode data drawn from distribution p using a coding scheme optimized for distribution q. In language modeling, cross-entropy measures how well the model distribution q approximates the true distribution p of the language, serving as the standard training objective.
H(p, q) = -Sum_x p(x) * log2(q(x))

If a language model assigns probability 0.1 to the correct next word on average, the cross-entropy is approximately 3.32 bits per word. A perfect model matching the true distribution would achieve cross-entropy equal to the language's true entropy.
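The worked example above can be verified directly. In the sketch below, the true distribution p is one-hot on the correct word (the usual training setup), and q is a hypothetical model distribution:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2(q(x)), measured in bits."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p = [1.0, 0.0, 0.0]    # true distribution: one-hot on the correct word
q = [0.1, 0.45, 0.45]  # model puts only 0.1 on the correct word

print(round(cross_entropy(p, q), 2))  # -> 3.32, matching -log2(0.1)
```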
Word Embeddings
Dense, low-dimensional vector representations of words learned from co-occurrence patterns in large corpora. Words with similar distributional contexts receive nearby vectors in the embedding space, capturing semantic and syntactic regularities as geometric relationships.
e_w in R^d, where d is typically 100-300; similarity(w_i, w_j) = cos(e_i, e_j)

The famous relationship vec('king') - vec('man') + vec('woman') approximately equals vec('queen') demonstrates that word embeddings encode analogical relationships as linear translations in vector space.
Edit Distance
The minimum number of single-character operations (insertions, deletions, and substitutions) required to transform one string into another, also known as Levenshtein distance. It provides a fundamental metric for measuring string similarity used in spelling correction, DNA sequence alignment, and approximate string matching.
d(i,j) = min(d(i-1,j)+1, d(i,j-1)+1, d(i-1,j-1)+[s_i != t_j])

The edit distance between 'kitten' and 'sitting' is 3: substitute k->s, substitute e->i, insert g. This distance underlies spelling correction systems that suggest 'sitting' as a correction for 'siting'.
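The recurrence above is the standard dynamic-programming solution; a straightforward table-filling implementation:

```python
def edit_distance(s, t):
    """Levenshtein distance via the classic dynamic-programming table."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # -> 3
```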
N-gram Probability
The conditional probability of a word given its preceding context of n-1 words, estimated from corpus frequency counts. N-gram models apply the Markov assumption to make language modeling tractable, approximating the full joint probability of a word sequence as a product of local conditional probabilities.
P(w_n | w_1^{n-1}) approx P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^n) / C(w_{n-N+1}^{n-1})

In a bigram model, P('learning' | 'machine') = Count('machine learning') / Count('machine'). If 'machine learning' appears 5000 times and 'machine' appears 20000 times in a corpus, the bigram probability is 0.25.
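A maximum-likelihood bigram estimate is just a ratio of counts; a minimal sketch over a toy tokenized corpus (real models would add smoothing to handle unseen bigrams):

```python
from collections import Counter

def bigram_prob(tokens, w_prev, w):
    """MLE estimate: Count(w_prev, w) / Count(w_prev)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return bigrams[(w_prev, w)] / unigrams[w_prev]

tokens = ["machine", "learning", "is", "machine", "learning",
          "and", "machine", "translation"]

# 'machine' occurs 3 times; 'machine learning' occurs 2 times
print(bigram_prob(tokens, "machine", "learning"))  # -> 2/3
```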
Precision, Recall & F1
Precision measures the fraction of predicted positive instances that are truly positive, while recall measures the fraction of actual positives that are correctly predicted. The F1 score is their harmonic mean, providing a single metric that balances both concerns and is especially useful when class distributions are imbalanced.
P = TP/(TP+FP); R = TP/(TP+FN); F1 = 2PR/(P+R)

A named entity recognition system that identifies 80 out of 100 actual entities (R=0.80) and of its 90 predictions, 80 are correct (P=0.89), achieves an F1 score of 2*0.89*0.80/(0.89+0.80) = 0.84.
Conditional Random Field
A discriminative undirected graphical model that defines a conditional probability distribution over label sequences given an observation sequence. Unlike generative models such as HMMs, CRFs can incorporate arbitrary overlapping features of the input without making independence assumptions about the observations.
P(y|x) = (1/Z(x)) * exp(Sum_t Sum_k lambda_k * f_k(y_t, y_{t-1}, x, t))

In part-of-speech tagging, a CRF can simultaneously consider that the current word ends in '-ing' (suggesting a verb), the previous tag was a determiner (favoring a noun or adjective), and the word appears in a gazetteer of proper nouns, all as features in one model.
Softmax Temperature
A scalar parameter that controls the sharpness of the probability distribution produced by the softmax function. Lower temperatures sharpen the distribution toward a one-hot encoding (greedy selection), while higher temperatures flatten it toward uniform (more random sampling), providing a tunable exploration-exploitation tradeoff in text generation.
P(w_i) = exp(z_i / tau) / Sum_j exp(z_j / tau)

When generating text with a language model, temperature tau=0.2 produces highly deterministic output favoring the most likely next token, while tau=1.5 produces more creative and diverse text by distributing probability more uniformly across candidates.
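The sharpening and flattening effect is easy to demonstrate on a few made-up logits:

```python
import math

def softmax_with_temperature(logits, tau):
    """P(w_i) = exp(z_i / tau) / sum_j exp(z_j / tau), numerically stable."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.2))  # sharp: near one-hot on top logit
print(softmax_with_temperature(logits, 1.5))  # flat: closer to uniform
```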
Self-Information
The information content of a single event, measured in bits when using log base 2. It quantifies the 'surprise' of observing an outcome: rare events carry more information than common ones. Self-information is the building block from which entropy and cross-entropy are derived.
I(x) = -log2(P(x))

If a language model assigns P('the')=0.07 and P('serendipity')=0.0001, then I('the')=3.84 bits and I('serendipity')=13.29 bits, reflecting that encountering 'serendipity' is far more surprising and informative.
Cosine Similarity
A measure of similarity between two non-zero vectors defined as the cosine of the angle between them, ranging from -1 (opposite) through 0 (orthogonal) to 1 (identical direction). In NLP, it is the standard metric for comparing word embeddings and document vectors because it is invariant to vector magnitude, focusing purely on directional similarity.
cos(A, B) = (A . B) / (||A|| * ||B||) = Sum_i A_i B_i / (sqrt(Sum_i A_i^2) * sqrt(Sum_i B_i^2))

The cosine similarity between the word vectors for 'dog' and 'puppy' might be 0.85, while the similarity between 'dog' and 'algebra' might be 0.12, reflecting the semantic distance between these word pairs.
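A minimal implementation over plain Python lists, exercised on simple unit vectors rather than real word embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))   # same direction  -> 1.0
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal      -> 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite        -> -1.0
```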