Computational Linguistics

Maximum Entropy Tagger

Maximum entropy (MaxEnt) POS taggers use log-linear models that can incorporate arbitrary overlapping features of the input, overcoming the independence limitations of generative HMM taggers.

P(t_i | w, t_{i-1}) = exp(∑_k λ_k f_k(t_i, w, t_{i-1})) / Z(w, t_{i-1})

Maximum entropy models, also known as log-linear or multinomial logistic regression models, provide a discriminative framework for POS tagging that overcomes the key limitations of generative HMM taggers. Rather than modeling the joint probability of words and tags, MaxEnt models directly estimate the conditional probability of each tag given the observed features. This allows the use of arbitrary, overlapping features of the input context without worrying about feature independence.

Log-Linear Model

Maximum entropy model:

P(t | w, context) = (1/Z) exp(∑_k λ_k f_k(t, w, context))

Z(w, context) = ∑_{t′} exp(∑_k λ_k f_k(t′, w, context))

Features f_k can include:
• Current word w_i, previous word w_{i−1}, next word w_{i+1}
• Previous tag t_{i−1}, previous two tags t_{i−2}, t_{i−1}
• Suffixes, prefixes, capitalization, word shape
• Any conjunction of the above
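As a concrete sketch, templates like those above can be realized as string-valued indicator features. The function and template names here are illustrative, not taken from any particular tagger:

```python
def extract_features(words, i, prev_tag, prev2_tag):
    """Binary indicator features for position i (illustrative templates)."""
    w = words[i]
    return [
        f"word={w}",                                                # current word w_i
        f"prev_word={words[i-1] if i > 0 else '<s>'}",              # w_{i-1}
        f"next_word={words[i+1] if i + 1 < len(words) else '</s>'}",  # w_{i+1}
        f"prev_tag={prev_tag}",                                     # t_{i-1}
        f"prev_two_tags={prev2_tag}+{prev_tag}",                    # t_{i-2}, t_{i-1}
        f"suffix3={w[-3:]}",
        f"capitalized={w[0].isupper()}",
        f"word+prev_tag={w}+{prev_tag}",                            # a conjunction
    ]

feats = extract_features(["The", "dog", "runs"], 1, "DT", "<s>")
```

Each returned string names a feature f_k that fires (takes value 1) in this context; the model then stores one weight λ_k per (feature, tag) pair.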

The model assigns a weight λk to each feature function fk. Feature functions are typically binary indicators that fire when a specific condition holds (e.g., "current word is 'run' and tag is VB"). The maximum entropy principle selects the distribution with the highest entropy (least assumptions) among all distributions that match the empirical feature expectations from the training data. Parameter estimation uses iterative algorithms such as Generalized Iterative Scaling (GIS), Improved Iterative Scaling (IIS), or gradient-based methods like L-BFGS.
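The scoring step can be sketched minimally as follows, assuming weights are stored per (feature, tag) pair; the toy weights below are invented for illustration:

```python
import math

def tag_probabilities(weights, features, tagset):
    """P(t | context) = exp(sum_k lambda_k f_k) / Z, with binary features."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in features) for t in tagset}
    z = sum(math.exp(s) for s in scores.values())      # partition function Z
    return {t: math.exp(s) / z for t, s in scores.items()}

# Toy weights: 'run' after the tag TO should strongly prefer VB over NN.
weights = {("word=run", "VB"): 2.0, ("prev_tag=TO", "VB"): 1.5,
           ("word=run", "NN"): 0.5}
probs = tag_probabilities(weights, ["word=run", "prev_tag=TO"], ["VB", "NN"])
```

Training adjusts each λ_k so that the model's expected feature counts match the empirical counts from the training data; since the log-likelihood is convex, gradient-based optimizers such as L-BFGS converge to the global optimum.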

MEMM and CRF Extensions

The Maximum Entropy Markov Model (MEMM) extends MaxEnt tagging by conditioning each local distribution on the previous tag in a left-to-right sequence, using Viterbi decoding to find the best tag sequence under the locally normalized model. However, MEMMs suffer from the label bias problem: because each state's outgoing probabilities must sum to one, states with few outgoing transitions effectively ignore the input. Conditional Random Fields (CRFs) address this by normalizing globally over the entire sequence, making them the preferred discriminative sequence model for POS tagging until the advent of neural approaches.
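Viterbi decoding over an MEMM's locally normalized distributions can be sketched as follows; `local_prob` stands in for a trained MaxEnt model P(t | w, t_prev) and is an assumed interface, not a real library call:

```python
import math

def viterbi_memm(words, tagset, local_prob):
    """Best tag sequence under prod_i P(t_i | w_i, t_{i-1}), by dynamic programming."""
    # delta[t] = best log-probability of any tag path ending in tag t
    delta = {t: math.log(local_prob(words[0], "<s>", t)) for t in tagset}
    backptrs = []
    for w in words[1:]:
        new_delta, ptrs = {}, {}
        for t in tagset:
            # best previous tag to transition from, given this word
            best = max(tagset, key=lambda p: delta[p] + math.log(local_prob(w, p, t)))
            new_delta[t] = delta[best] + math.log(local_prob(w, best, t))
            ptrs[t] = best
        delta = new_delta
        backptrs.append(ptrs)
    # recover the argmax path by following back-pointers from the best final tag
    tag = max(delta, key=delta.get)
    seq = [tag]
    for ptrs in reversed(backptrs):
        tag = ptrs[tag]
        seq.append(tag)
    return list(reversed(seq))
```

Note that each local distribution is normalized independently, which is exactly what gives rise to the label bias problem a CRF avoids.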

Feature Engineering

The power of MaxEnt taggers lies in feature engineering. Ratnaparkhi's (1996) influential tagger used features including the current word, surrounding words, previous tags, prefixes and suffixes up to length 4, and whether the word contains a number, hyphen, or uppercase letter. Toutanova et al. (2003) achieved 97.2% accuracy by adding bidirectional dependency features via a cyclic dependency network.
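The orthographic and affix templates described above can be sketched roughly as follows; this is a loose paraphrase of the kinds of templates Ratnaparkhi used, not his exact feature set:

```python
def orthographic_features(word):
    """Orthographic and affix indicators in the spirit of Ratnaparkhi (1996);
    the exact template names here are illustrative."""
    feats = []
    if any(c.isdigit() for c in word):
        feats.append("has_digit")
    if "-" in word:
        feats.append("has_hyphen")
    if any(c.isupper() for c in word):
        feats.append("has_upper")
    # prefixes and suffixes up to length 4
    for k in range(1, min(4, len(word)) + 1):
        feats.append(f"prefix{k}={word[:k]}")
        feats.append(f"suffix{k}={word[-k:]}")
    return feats
```

Affix features are what let the tagger generalize to unknown words: an unseen word ending in "-ing" still fires suffix features whose weights favor VBG.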

Legacy and Modern Context

MaxEnt taggers represented a major advance over HMM taggers by enabling richer feature representations, pushing accuracy from ~96% to ~97% on the Penn Treebank. The framework directly influenced the development of CRFs and structured prediction methods more broadly. While modern neural taggers (using BiLSTMs or Transformers) have further improved accuracy and eliminated the need for manual feature engineering, the MaxEnt framework remains conceptually important and is still used as a component in larger systems.

References

  1. Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 133–142. https://aclanthology.org/W96-0213
  2. Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. Proceedings of HLT-NAACL 2003, 173–180. https://doi.org/10.3115/1073445.1073478
  3. Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 39–71. https://aclanthology.org/J96-1002
  4. McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. Proceedings of ICML 2000, 591–598.