Computational Linguistics

Part-of-Speech Tagging

Part-of-speech tagging assigns a grammatical category (noun, verb, adjective, etc.) to each word in a sentence based on both its definition and its context within the sentence.

The/DT cat/NN sat/VBD on/IN the/DT mat/NN

Part-of-speech (POS) tagging is the task of assigning a grammatical category label to each word (or token) in a sentence. These labels indicate both the broad syntactic class (noun, verb, adjective, adverb, etc.) and morphosyntactic features (tense, number, case). POS tagging is one of the oldest and most fundamental tasks in NLP, serving as a prerequisite for nearly all downstream syntactic and semantic analysis. The task is challenging because many words are ambiguous: "bank" can be a noun or verb, "that" can be a determiner, pronoun, or complementizer.
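The slash-delimited `word/TAG` notation shown above can be split into (token, tag) pairs with a small helper. This is only a sketch of the convention; real treebank files use richer formats.

```python
# Parse a slash-delimited tagged sentence ("word/TAG word/TAG ...") into
# (token, tag) pairs. rpartition splits on the LAST slash, so tokens that
# themselves contain "/" are handled correctly.
def parse_tagged(sentence):
    pairs = []
    for item in sentence.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

print(parse_tagged("The/DT cat/NN sat/VBD on/IN the/DT mat/NN"))
# [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
```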

Tagsets

Common POS Tagsets

Penn Treebank (PTB): 45 tags
  NN (noun, singular), NNS (noun, plural), NNP (proper noun)
  VB (verb, base form), VBD (past tense), VBG (gerund/present participle), VBN (past participle)
  JJ (adjective), RB (adverb), DT (determiner), IN (preposition)

Universal POS (UPOS): 17 tags
  NOUN, VERB, ADJ, ADV, ADP, DET, PRON, NUM, ...

Most frequent baseline: ~90% accuracy
State-of-the-art: ~97.5% accuracy (PTB)

The two most widely used tagsets are the Penn Treebank tagset (45 tags for English) and the Universal POS tagset (17 tags across all languages). The PTB tagset makes finer distinctions (e.g., six verb forms) that are important for English syntax, while the UPOS tagset prioritizes cross-linguistic consistency. A simple baseline that assigns each word its most frequent tag achieves about 90% accuracy; closing the remaining gap requires modeling context, since the residual errors are dominated by genuinely ambiguous words and by words unseen in training.
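The most-frequent-tag baseline can be sketched in a few lines. The training sentences below are toy data for illustration only; a real evaluation would train on a treebank.

```python
from collections import Counter, defaultdict

# Most-frequent-tag baseline: each word receives the tag it bore most often
# in training; unseen words fall back to the corpus-wide most common tag.
def train_baseline(tagged_sents):
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lexicon, default

def tag_baseline(words, lexicon, default):
    return [(w, lexicon.get(w, default)) for w in words]

# Toy training data (note "can" occurs as MD twice but NN once, so the
# baseline will always tag it MD, even where NN would be correct).
train = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")],
    [("this", "DT"), ("can", "MD"), ("work", "VB")],
    [("he", "PRP"), ("kicked", "VBD"), ("the", "DT"), ("can", "NN")],
    [("you", "PRP"), ("can", "MD"), ("run", "VB")],
]
lexicon, default = train_baseline(train)
print(tag_baseline(["the", "can", "zebra"], lexicon, default))
# [('the', 'DT'), ('can', 'MD'), ('zebra', 'DT')]  -- "zebra" is unseen, gets the default
```

Errors like "can"/NN being tagged MD are exactly the ambiguous 10% that context-sensitive models are needed for.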

Methods

POS tagging methods have evolved from rule-based systems (using hand-written disambiguation rules) through statistical models (HMMs, MEMMs, CRFs) to neural approaches. HMM taggers use the Viterbi algorithm to find the most likely tag sequence given the observed words. CRF taggers model the conditional probability of the tag sequence directly, avoiding the independence assumptions of HMMs. Modern taggers use BiLSTM or Transformer encoders, often as part of multi-task models that jointly predict POS tags, morphological features, and syntactic structure.
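As a sketch of the HMM approach, the Viterbi decoder below finds the highest-scoring tag sequence under hand-set transition and emission log-probabilities. The probabilities are illustrative assumptions, not estimates from any corpus.

```python
import math

# Viterbi decoding for an HMM tagger: states are tags, and we maximize
# log P(tags) + log P(words | tags) over all tag sequences.
def viterbi(words, tags, log_start, log_trans, log_emit):
    # best[t] = best log-prob of any path ending in tag t at the current word
    best = {t: log_start[t] + log_emit[t].get(words[0], -math.inf) for t in tags}
    backptrs = []
    for w in words[1:]:
        back, new = {}, {}
        for t in tags:
            prev, score = max(
                ((p, best[p] + log_trans[p][t]) for p in tags),
                key=lambda x: x[1],
            )
            new[t] = score + log_emit[t].get(w, -math.inf)
            back[t] = prev
        best = new
        backptrs.append(back)
    # Trace back pointers from the best final tag to recover the sequence
    path = [max(best, key=best.get)]
    for back in reversed(backptrs):
        path.append(back[path[-1]])
    return list(reversed(path))

lp = math.log
tags = ["DT", "NN", "VB"]
log_start = {"DT": lp(0.8), "NN": lp(0.1), "VB": lp(0.1)}
log_trans = {
    "DT": {"DT": lp(0.05), "NN": lp(0.9), "VB": lp(0.05)},
    "NN": {"DT": lp(0.1), "NN": lp(0.2), "VB": lp(0.7)},
    "VB": {"DT": lp(0.6), "NN": lp(0.2), "VB": lp(0.2)},
}
log_emit = {
    "DT": {"the": lp(1.0)},
    "NN": {"dog": lp(0.5), "barks": lp(0.5)},
    "VB": {"barks": lp(1.0)},
}
print(viterbi(["the", "dog", "barks"], tags, log_start, log_trans, log_emit))
# ['DT', 'NN', 'VB']
```

Note that "barks" can be emitted by both NN and VB; the transition model (NN is likely after DT, VB is likely after NN) resolves the ambiguity, which is exactly what the baseline above cannot do.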

Unknown Words
A significant challenge in POS tagging is handling unknown words not seen in training. Techniques include suffix/prefix features (words ending in "-tion" are likely nouns), word shape features (capitalization patterns), subword embeddings (character-level CNNs or LSTMs), and pre-trained contextual embeddings that generalize to unseen words through their subword representations.
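A minimal illustration of the suffix and word-shape heuristics: the rules below are hypothetical examples chosen for this sketch, not taken from any particular tagger.

```python
# Illustrative suffix -> tag rules (checked in order). Real taggers learn
# such associations from data rather than hard-coding them.
SUFFIX_RULES = [
    ("tion", "NN"), ("ness", "NN"), ("ment", "NN"),
    ("ing", "VBG"), ("ed", "VBD"),
    ("ous", "JJ"), ("able", "JJ"), ("ly", "RB"),
]

def guess_tag(word, default="NN"):
    if word[:1].isupper():
        return "NNP"  # word-shape feature: capitalized -> likely proper noun
    for suffix, tag in SUFFIX_RULES:
        if word.lower().endswith(suffix):
            return tag
    return default  # back off to the most common open-class tag

print(guess_tag("globalization"))  # NN  (suffix -tion)
print(guess_tag("Zendaya"))        # NNP (capitalized)
print(guess_tag("quickly"))        # RB  (suffix -ly)
```

Modern neural taggers get the same effect implicitly: character-level or subword representations let the model generalize from "-tion" words it has seen to ones it has not.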

Role in NLP Pipelines

POS tags serve as features for nearly every higher-level NLP task. Parsers use POS tags to constrain the space of possible syntactic analyses. NER systems use POS patterns to identify entity boundaries. Information retrieval systems use POS tags to weight content words more heavily. Even in the era of end-to-end neural models, POS tagging remains relevant as an auxiliary training objective that provides useful inductive bias, and as an interpretability tool for understanding model behavior.

References

  1. Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330. https://doi.org/10.5555/972470.972475
  2. Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. Proceedings of HLT-NAACL 2003, 173–180. https://doi.org/10.3115/1073445.1073478
  3. Petrov, S., Das, D., & McDonald, R. (2012). A universal part-of-speech tagset. Proceedings of LREC 2012, 2089–2096. https://aclanthology.org/L12-1115
  4. Manning, C. D. (2011). Part-of-speech tagging from 97% to 100%: Is it time for some linguistics? Proceedings of CICLing 2011, 171–189. https://doi.org/10.1007/978-3-642-19400-9_14
