Computational Linguistics

Maximum Entropy Tagger

Maximum entropy (MaxEnt) POS taggers use log-linear models that can incorporate arbitrary overlapping features of the input, overcoming the independence limitations of generative HMM taggers.

P(t_i | w, t_{i-1}) = exp(∑_k λ_k f_k(t_i, w, t_{i-1})) / Z(w, t_{i-1})

Maximum entropy models, also known as log-linear or multinomial logistic regression models, provide a discriminative framework for POS tagging that overcomes the key limitations of generative HMM taggers. Rather than modeling the joint probability of words and tags, MaxEnt models directly estimate the conditional probability of each tag given the observed features. This allows the use of arbitrary, overlapping features of the input context without worrying about feature independence.

Log-Linear Model

Maximum entropy model:

P(t | w, context) = (1/Z) exp(∑_k λ_k f_k(t, w, context))

Z(w, context) = ∑_{t′} exp(∑_k λ_k f_k(t′, w, context))

Features f_k can include:
• Current word w_i, previous word w_{i−1}, next word w_{i+1}
• Previous tag t_{i−1}, previous two tags t_{i−2}, t_{i−1}
• Suffixes, prefixes, capitalization, word shape
• Any conjunction of the above
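As a concrete sketch, templates like those above can be realized as string-valued indicator features. The function and template names here are illustrative, not taken from any particular tagger:

```python
def extract_features(words, i, prev_tag, prev2_tag):
    """Binary indicator features for position i (illustrative templates)."""
    w = words[i]
    return [
        f"word={w}",                                                # current word w_i
        f"prev_word={words[i-1] if i > 0 else '<s>'}",              # w_{i-1}
        f"next_word={words[i+1] if i + 1 < len(words) else '</s>'}",  # w_{i+1}
        f"prev_tag={prev_tag}",                                     # t_{i-1}
        f"prev_two_tags={prev2_tag}+{prev_tag}",                    # t_{i-2}, t_{i-1}
        f"suffix3={w[-3:]}",
        f"capitalized={w[0].isupper()}",
        f"word+prev_tag={w}+{prev_tag}",                            # a conjunction
    ]

feats = extract_features(["The", "dog", "runs"], 1, "DT", "<s>")
```

Each returned string names a feature f_k that fires (takes value 1) in this context; the model then stores one weight λ_k per (feature, tag) pair.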

The model assigns a weight λk to each feature function fk. Feature functions are typically binary indicators that fire when a specific condition holds (e.g., "current word is 'run' and tag is VB"). The maximum entropy principle selects the distribution with the highest entropy (least assumptions) among all distributions that match the empirical feature expectations from the training data. Parameter estimation uses iterative algorithms such as Generalized Iterative Scaling (GIS), Improved Iterative Scaling (IIS), or gradient-based methods like L-BFGS.
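The scoring step can be sketched minimally as follows, assuming weights are stored per (feature, tag) pair; the toy weights below are invented for illustration:

```python
import math

def tag_probabilities(weights, features, tagset):
    """P(t | context) = exp(sum_k lambda_k f_k) / Z, with binary features."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in features) for t in tagset}
    z = sum(math.exp(s) for s in scores.values())      # partition function Z
    return {t: math.exp(s) / z for t, s in scores.items()}

# Toy weights: 'run' after the tag TO should strongly prefer VB over NN.
weights = {("word=run", "VB"): 2.0, ("prev_tag=TO", "VB"): 1.5,
           ("word=run", "NN"): 0.5}
probs = tag_probabilities(weights, ["word=run", "prev_tag=TO"], ["VB", "NN"])
```

Training adjusts each λ_k so that the model's expected feature counts match the empirical counts from the training data; since the log-likelihood is convex, gradient-based optimizers such as L-BFGS converge to the global optimum.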

MEMM and CRF Extensions

The Maximum Entropy Markov Model (MEMM) extends MaxEnt tagging by conditioning each local distribution on the previous tag in a left-to-right sequence, using Viterbi decoding to find the best tag sequence under the locally normalized model. However, MEMMs suffer from the label bias problem: because each state's outgoing probabilities must sum to one, states with few outgoing transitions effectively ignore the input. Conditional Random Fields (CRFs) address this by normalizing globally over the entire sequence, making them the preferred discriminative sequence model for POS tagging until the advent of neural approaches.
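Viterbi decoding over an MEMM's locally normalized distributions can be sketched as follows; `local_prob` stands in for a trained MaxEnt model P(t | w, t_prev) and is an assumed interface, not a real library call:

```python
import math

def viterbi_memm(words, tagset, local_prob):
    """Best tag sequence under prod_i P(t_i | w_i, t_{i-1}), by dynamic programming."""
    # delta[t] = best log-probability of any tag path ending in tag t
    delta = {t: math.log(local_prob(words[0], "<s>", t)) for t in tagset}
    backptrs = []
    for w in words[1:]:
        new_delta, ptrs = {}, {}
        for t in tagset:
            # best previous tag to transition from, given this word
            best = max(tagset, key=lambda p: delta[p] + math.log(local_prob(w, p, t)))
            new_delta[t] = delta[best] + math.log(local_prob(w, best, t))
            ptrs[t] = best
        delta = new_delta
        backptrs.append(ptrs)
    # recover the argmax path by following back-pointers from the best final tag
    tag = max(delta, key=delta.get)
    seq = [tag]
    for ptrs in reversed(backptrs):
        tag = ptrs[tag]
        seq.append(tag)
    return list(reversed(seq))
```

Note that each local distribution is normalized independently, which is exactly what gives rise to the label bias problem a CRF avoids.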

Feature Engineering

The power of MaxEnt taggers lies in feature engineering. Ratnaparkhi's (1996) influential tagger used features including the current word, surrounding words, previous tags, prefixes and suffixes up to length 4, and whether the word contains a number, hyphen, or uppercase letter. Toutanova et al. (2003) achieved 97.2% accuracy by adding bidirectional dependency features via a cyclic dependency network.
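The orthographic and affix templates described above can be sketched roughly as follows; this is a loose paraphrase of the kinds of templates Ratnaparkhi used, not his exact feature set:

```python
def orthographic_features(word):
    """Orthographic and affix indicators in the spirit of Ratnaparkhi (1996);
    the exact template names here are illustrative."""
    feats = []
    if any(c.isdigit() for c in word):
        feats.append("has_digit")
    if "-" in word:
        feats.append("has_hyphen")
    if any(c.isupper() for c in word):
        feats.append("has_upper")
    # prefixes and suffixes up to length 4
    for k in range(1, min(4, len(word)) + 1):
        feats.append(f"prefix{k}={word[:k]}")
        feats.append(f"suffix{k}={word[-k:]}")
    return feats
```

Affix features are what let the tagger generalize to unknown words: an unseen word ending in "-ing" still fires suffix features whose weights favor VBG.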

Legacy and Modern Context

MaxEnt taggers represented a major advance over HMM taggers by enabling richer feature representations, pushing accuracy from ~96% to ~97% on the Penn Treebank. The framework directly influenced the development of CRFs and structured prediction methods more broadly. While modern neural taggers (using BiLSTMs or Transformers) have further improved accuracy and eliminated the need for manual feature engineering, the MaxEnt framework remains conceptually important and is still used as a component in larger systems.

References

  1. Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 133–142. https://aclanthology.org/W96-0213
  2. Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. Proceedings of HLT-NAACL 2003, 173–180. https://doi.org/10.3115/1073445.1073478
  3. Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 39–71. https://aclanthology.org/J96-1002
  4. McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. Proceedings of ICML 2000, 591–598.