Maximum entropy (MaxEnt) models, also known as log-linear or multinomial logistic regression models, provide a discriminative framework for POS tagging that addresses a key limitation of generative HMM taggers: the difficulty of incorporating rich, overlapping features. Rather than modeling the joint probability of words and tags, MaxEnt models directly estimate the conditional probability of each tag given features of the observed context. Because the observations are conditioned on rather than generated, the features can be arbitrary and overlapping, and no independence assumptions among them are required.
Log-Linear Model
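P(t | w, context) = exp(∑k λk fk(t, w, context)) / Z
Z = ∑t′ exp(∑k λk fk(t′, w, context))
Here Z is the normalizer, summed over all candidate tags t′, which makes the probabilities sum to one.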
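As a rough illustration of the normalized exponential form, the sketch below computes a tag distribution from a weight table. The function name, the feature names, and the weight values are all hypothetical, chosen only to show how the sum over λk fk and the normalizer Z fit together when the features are binary indicators.

```python
import math

def tag_distribution(weights, active_features, tags):
    """Return P(t | context) for every candidate tag t.

    weights: dict mapping (feature_name, tag) -> lambda_k; missing keys count as 0.
    active_features: names of the context predicates that fire at this position.
    """
    # Unnormalized log-score per tag: sum_k lambda_k * f_k(t, w, context).
    scores = {t: sum(weights.get((f, t), 0.0) for f in active_features) for t in tags}
    # Subtract the maximum score before exponentiating, for numerical stability.
    m = max(scores.values())
    exp_scores = {t: math.exp(s - m) for t, s in scores.items()}
    z = sum(exp_scores.values())  # the normalizer Z, up to the constant factor exp(m)
    return {t: e / z for t, e in exp_scores.items()}

# Hypothetical weights, for illustration only.
weights = {
    ("word=run", "VB"): 1.2,
    ("word=run", "NN"): 0.4,
    ("prev_tag=TO", "VB"): 2.0,
}
probs = tag_distribution(weights, ["word=run", "prev_tag=TO"], ["VB", "NN", "JJ"])
```

With these made-up weights, VB receives the highest probability because both active features vote for it, while JJ, which has no supporting weights, receives the lowest.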
Features fk can include:
• Current word wi, previous word wi−1, next word wi+1
• Previous tag ti−1, previous two tags ti−2 ti−1
• Suffixes, prefixes, capitalization, word shape
• Any conjunction of the above
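The feature templates listed above could be realized along the following lines. This is a minimal sketch: the function names, the feature-name strings, the boundary symbols, and the particular conjunction are assumptions made for illustration. Each returned name is a context predicate; pairing it with a candidate tag yields one binary feature function fk.

```python
def word_shape(word):
    """Collapse a word into a coarse shape, e.g. 'McDonald' -> 'XxXx'."""
    shape = []
    for ch in word:
        if ch.isupper():
            c = "X"
        elif ch.islower():
            c = "x"
        elif ch.isdigit():
            c = "d"
        else:
            c = ch
        # Merge runs of the same character class into one symbol.
        if not shape or shape[-1] != c:
            shape.append(c)
    return "".join(shape)

def extract_features(words, i, prev_tag, prev2_tag):
    """Return the names of the context predicates active at position i.

    Assumes every token in `words` is a non-empty string.
    """
    w = words[i]
    prev_word = words[i - 1].lower() if i > 0 else "<s>"
    next_word = words[i + 1].lower() if i + 1 < len(words) else "</s>"
    suffix = w[-3:].lower()
    return [
        f"word={w.lower()}",
        f"prev_word={prev_word}",
        f"next_word={next_word}",
        f"prev_tag={prev_tag}",
        f"prev_two_tags={prev2_tag}+{prev_tag}",
        f"suffix3={suffix}",
        f"prefix2={w[:2].lower()}",
        f"is_capitalized={w[0].isupper()}",
        f"shape={word_shape(w)}",
        # One example of a conjunction of two simpler predicates.
        f"prev_tag={prev_tag}&suffix3={suffix}",
    ]

feats = extract_features(["She", "likes", "Running"], 2, "VBZ", "PRP")
```

Because the features overlap freely (the suffix, the shape, and the word identity all describe the same token), a generative model would need to account for their dependencies, whereas the conditional model simply learns one weight per feature and tag.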
The model assigns a weight λk to each feature function fk. Feature functions are typically binary indicators that fire when a specific condition holds (e.g., "current word is 'run' and tag is VB"). The maximum entropy principle selects, among all distributions whose expected feature values match the empirical expectations in the training data, the one with the highest entropy, that is, the one that makes the fewest additional assumptions. This distribution has exactly the log-linear form above, and finding it is equivalent to maximizing the conditional likelihood of the training data. Parameter estimation uses iterative algorithms such as Generalized Iterative Scaling (GIS), Improved Iterative Scaling (IIS), or gradient-based methods such as L-BFGS.
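To make the expectation-matching idea concrete, the sketch below trains a tiny model with plain batch gradient ascent, used here for brevity in place of GIS, IIS, or L-BFGS. The gradient for each weight is the empirical count of its feature minus the count expected under the current model, so the gradient vanishes exactly when the two expectations match. The toy data, learning rate, and epoch count are assumptions, and no regularization is applied.

```python
import math
from collections import defaultdict

def predict(weights, feats, tags):
    """Return P(t | feats) for each tag in `tags`, in order."""
    scores = [sum(weights[(f, t)] for f in feats) for t in tags]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def train(data, tags, epochs=200, lr=0.5):
    """data: list of (active_feature_names, gold_tag) pairs."""
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, gold in data:
            probs = predict(weights, feats, tags)
            for f in feats:
                grad[(f, gold)] += 1.0       # empirical feature count
                for t, p in zip(tags, probs):
                    grad[(f, t)] -= p        # expected count under the model
        # Step along the averaged gradient of the conditional log-likelihood.
        for key, g in grad.items():
            weights[key] += lr * g / len(data)
    return weights

# Toy training set (hypothetical): the previous tag disambiguates noun from verb.
tags = ["VB", "NN"]
data = [
    (["word=run", "prev_tag=TO"], "VB"),
    (["word=run", "prev_tag=DT"], "NN"),
    (["word=walk", "prev_tag=TO"], "VB"),
    (["word=walk", "prev_tag=DT"], "NN"),
]
weights = train(data, tags)
probs = predict(weights, ["word=run", "prev_tag=TO"], tags)
```

On this toy set the word features are uninformative, since each word appears equally often with both tags, so the model places its weight on the previous-tag features. On separable data like this the weights would keep growing without regularization, which is one reason practical taggers usually add a Gaussian (L2) prior.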
MEMM and CRF Extensions
The Maximum Entropy Markov Model (MEMM) extends MaxEnt tagging to sequences: each tag is predicted by a MaxEnt classifier conditioned on the previous tags, moving left to right, and Viterbi decoding finds the highest-probability tag sequence under the model. However, MEMMs suffer from the label bias problem: because the distribution at each step is normalized locally, states with few outgoing transitions pass nearly all of their probability mass to their successors regardless of the observation, effectively ignoring the input. Conditional Random Fields (CRFs) address this by normalizing once over the entire sequence rather than at each position, which made them the preferred discriminative sequence model for POS tagging until the advent of neural approaches.
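Viterbi decoding for a first-order MEMM can be sketched as follows. The local model is passed in as a function returning a locally normalized distribution over tags given the previous tag and the position. The `toy_local_probs` table is a hand-set, hypothetical stand-in for a trained MaxEnt classifier, and all probabilities are assumed to be strictly positive so that their logarithms are defined.

```python
import math

def viterbi(words, tags, local_probs, start="<s>"):
    """Return the most probable tag sequence under a first-order MEMM.

    local_probs(prev_tag, words, i) -> dict of P(tag | prev_tag, words, i).
    """
    if not words:
        return []
    # best[i][t] = (log-prob of the best path ending in tag t at position i, backpointer)
    first = local_probs(start, words, 0)
    best = [{t: (math.log(first[t]), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for prev in tags:
            dist = local_probs(prev, words, i)
            prev_score = best[i - 1][prev][0]
            for t in tags:
                score = prev_score + math.log(dist[t])
                if t not in column or score > column[t][0]:
                    column[t] = (score, prev)
        best.append(column)
    # Pick the best final tag, then follow the backpointers to recover the path.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]

def toy_local_probs(prev_tag, words, i):
    """Hypothetical hand-set local distributions, for illustration only."""
    w = words[i]
    if w == "to":
        return {"TO": 0.98, "VB": 0.01, "NN": 0.01}
    if w == "run" and prev_tag == "TO":
        return {"TO": 0.01, "VB": 0.90, "NN": 0.09}
    if w == "run":
        return {"TO": 0.01, "VB": 0.30, "NN": 0.69}
    return {"TO": 0.01, "VB": 0.495, "NN": 0.495}

path = viterbi(["to", "run"], ["TO", "VB", "NN"], toy_local_probs)
```

Each call to `local_probs` returns a distribution that sums to one for a given previous tag, which is the per-state normalization behind label bias. A linear-chain CRF can use the same dynamic program over unnormalized scores, with a single normalizer computed over all tag sequences.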
Legacy and Modern Context
MaxEnt taggers represented a major advance over HMM taggers by enabling richer feature representations, pushing accuracy from ~96% to ~97% on the Penn Treebank. The framework directly influenced the development of CRFs and structured prediction methods more broadly. While modern neural taggers (using BiLSTMs or Transformers) have further improved accuracy and eliminated the need for manual feature engineering, the MaxEnt framework remains conceptually important and is still used as a component in larger systems.