Morphological Parsing

Morphological parsing decomposes a word into its constituent morphemes and assigns grammatical labels to each part, producing structured analyses that expose the lemma, part of speech, and inflectional features encoded in the word form.

parse(w) → ⟨lemma, POS, features⟩

Morphological parsing is the process of assigning a complete structural analysis to a word form: identifying its component morphemes, the lemma (dictionary form) from which it derives, its part of speech, and the morphosyntactic features it carries. For example, parsing the English word "unhappiness" should yield the segmentation un+happy+ness with the analysis: lemma "happy", prefix "un-" (negation), suffix "-ness" (nominalization), category Noun. For morphologically rich languages, the parse encodes a wealth of information (case, number, gender, tense, aspect, mood, person) that is crucial for syntactic and semantic processing.

Approaches to Morphological Parsing

Morphological Parse Representation

parse("running") = run+ing [Verb, Gerund]
parse("dogs") = dog+s [Noun, Plural]
parse("unkindness") = un+kind+ness [Noun, Neg+Adj→Noun]

Full feature structure:
parse("corrían") = corr+ían
[Verb: lemma=correr, tense=Imperfect, mood=Indicative,
person=3, number=Plural]
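
The feature structure above can be held in a small record type. The following is a minimal sketch, not a standard format: the class name `Parse` and its field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Parse:
    """One morphological analysis of a surface word form (illustrative type)."""
    surface: str      # the word as written, e.g. "corrían"
    morphemes: tuple  # segmentation into morphs, e.g. ("corr", "ían")
    lemma: str        # dictionary form, e.g. "correr"
    pos: str          # part of speech, e.g. "Verb"
    features: dict = field(default_factory=dict)  # morphosyntactic features

    def segmented(self) -> str:
        """Render the segmentation in the stem+affix notation used above."""
        return "+".join(self.morphemes)


corrian = Parse(
    surface="corrían",
    morphemes=("corr", "ían"),
    lemma="correr",
    pos="Verb",
    features={
        "tense": "Imperfect",
        "mood": "Indicative",
        "person": 3,
        "number": "Plural",
    },
)

print(corrian.segmented(), corrian.lemma, corrian.features)
```

Keeping the segmentation, lemma, and features in separate fields lets a downstream component read only what it needs, for example the lemma for indexing or the features for agreement checks.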

Rule-based morphological parsers typically combine a finite-state transducer encoding the morphotactic and phonological grammar with a lexicon of stems and affixes. The transducer maps surface forms to lexical representations, and a subsequent lookup assigns morphosyntactic tags. Systems like XFST, HFST, and Apertium's lttoolbox implement this approach for many languages. These parsers can achieve high accuracy and coverage on the vocabulary their lexicons and rules describe, but they require substantial linguistic expertise to develop.
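
The interplay of lexicon, affix table, and spelling rule can be illustrated in a few lines. This sketch is not a finite-state transducer and does not use XFST, HFST, or lttoolbox. It is a plain dictionary lookup standing in for what those tools compile into a transducer, and the tiny lexicon, suffix table, and function names are all hypothetical.

```python
# Hypothetical toy lexicon: stem -> part of speech.
STEMS = {"run": "Verb", "walk": "Verb", "dog": "Noun", "kind": "Adj"}

# Hypothetical morphotactics: suffix -> {POS it may attach to: feature label}.
SUFFIXES = {
    "": {"Verb": "Base", "Noun": "Singular", "Adj": "Positive"},
    "s": {"Noun": "Plural", "Verb": "3sgPres"},
    "ing": {"Verb": "Gerund"},
    "ed": {"Verb": "Past"},
}

VOWELS = "aeiou"


def candidate_stems(base, suffix):
    """Yield possible lexical stems for a surface base.

    Implements a single spelling rule: undo consonant doubling before a
    vowel-initial suffix ("runn" + "ing" -> "run").
    """
    yield base
    vowel_initial = bool(suffix) and suffix[0] in VOWELS
    if vowel_initial and len(base) >= 2 and base[-1] == base[-2]:
        yield base[:-1]


def analyze(word):
    """Return every (stem, suffix, POS, feature) analysis licensed by the tables."""
    analyses = []
    for suffix, by_pos in SUFFIXES.items():
        if not word.endswith(suffix):
            continue
        base = word[: len(word) - len(suffix)]
        for stem in candidate_stems(base, suffix):
            pos = STEMS.get(stem)
            if pos in by_pos:
                analyses.append((stem, suffix, pos, by_pos[pos]))
    return analyses


print(analyze("running"))
print(analyze("dogs"))
print(analyze("blorps"))  # unknown stem: no analysis
```

A word whose stem is missing from the lexicon receives no analysis at all, which is the coverage limitation that hand-built lexicons have to address.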

Data-Driven and Neural Approaches

Supervised machine learning approaches to morphological parsing train classifiers on annotated corpora. The CoNLL-SIGMORPHON shared tasks have evaluated systems on morphological analysis and inflection generation across dozens of typologically diverse languages. Neural encoder-decoder models, particularly character-level sequence-to-sequence architectures with attention, have achieved competitive results. These models learn to map from characters to morphological feature bundles without requiring explicit morpheme boundaries, though they may struggle with rare forms and long-distance dependencies.
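
One way to see what "mapping from characters to feature bundles" means is to look at how a training example might be framed for a sequence-to-sequence model. The sketch below covers only data preparation, not a model, and the boundary symbols, tag spelling, and function names are assumptions rather than the format of any particular shared task.

```python
def encode_example(word, lemma, tags):
    """Turn one annotated word into (source, target) symbol sequences.

    The source is the surface form split into characters. The target is the
    lemma's characters followed by one symbol per morphological tag, so no
    morpheme boundary ever needs to be marked.
    """
    source = ["<s>"] + list(word) + ["</s>"]
    target = ["<s>"] + list(lemma) + [f"<{t}>" for t in tags] + ["</s>"]
    return source, target


def build_vocab(sequences):
    """Assign an integer id to every distinct symbol, in first-seen order."""
    vocab = {}
    for seq in sequences:
        for symbol in seq:
            vocab.setdefault(symbol, len(vocab))
    return vocab


src, tgt = encode_example("dogs", "dog", ["Noun", "Plural"])
vocab = build_vocab([src, tgt])
src_ids = [vocab[s] for s in src]

print(src)
print(tgt)
print(src_ids)
```

Because the tags are ordinary output symbols, the same decoder that spells out the lemma also emits the feature bundle.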

Morphological Parsing for Arabic

Arabic presents a particularly challenging morphological parsing problem because of its root-and-pattern (templatic) morphology, where consonantal roots are interleaved with vocalic patterns. The word "kataba" (he wrote) combines the root k-t-b with the pattern _a_a_a, while "kutub" (books) combines the same root with the pattern _u_u_. Systems like BAMA (Buckwalter Arabic Morphological Analyzer) and MADAMIRA handle this nonconcatenative morphology through specialized lookup tables and compatibility constraints, generating full analyses including diacritics, POS, and features.
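
The interleaving of root and pattern in the two examples above can be written out directly. This is a minimal sketch over the romanized forms, assuming each root consonant is a single character and each "_" in the pattern is one consonant slot. The function names are illustrative and do not reflect how BAMA or MADAMIRA are implemented.

```python
def interdigitate(root, pattern, slot="_"):
    """Fill each slot in the pattern with the next consonant of the root."""
    consonants = root.split("-")
    if pattern.count(slot) != len(consonants):
        raise ValueError("pattern slots and root consonants differ in number")
    radicals = iter(consonants)
    return "".join(next(radicals) if ch == slot else ch for ch in pattern)


def extract_root(word, pattern, slot="_"):
    """Return the root if the word fits the pattern, otherwise None."""
    if len(word) != len(pattern):
        return None
    radicals = []
    for w, p in zip(word, pattern):
        if p == slot:
            radicals.append(w)   # slot position: collect the consonant
        elif w != p:
            return None          # fixed vowel does not match
    return "-".join(radicals)


print(interdigitate("k-t-b", "_a_a_a"))   # kataba
print(interdigitate("k-t-b", "_u_u_"))    # kutub
print(extract_root("kutub", "_u_u_"))     # k-t-b
```

Running `extract_root` against every known pattern is a naive form of analysis. It overgenerates freely, which is one reason practical systems add compatibility constraints.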

Morphological Disambiguation

Because many word forms are morphologically ambiguous (a single surface form may admit multiple parses), morphological parsing systems must include a disambiguation component. In Turkish, for instance, the form "adam" can be parsed as the noun "man" in the nominative, as a proper name, or as ada+m, "my island" (the noun "ada" plus the first-person singular possessive suffix). Disambiguation can be performed using hidden Markov models, conditional random fields, or neural sequence labelers that consider the sentential context. The UniMorph project provides cross-linguistic morphological annotations that facilitate the training of multilingual disambiguation systems.
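
Context-based disambiguation can be sketched as a search for the best sequence of analyses. The code below is a simplified Viterbi search that uses tag-to-tag transition scores only. A full hidden Markov model would also score how likely each word is under each tag. All scores are made-up log-values for illustration, and the function name is an assumption.

```python
def viterbi(candidates, transition, default=-5.0):
    """Choose one tag per token so that the summed transition scores are maximal.

    candidates: one non-empty list of possible tags per token, as produced
                by a morphological analyzer.
    transition: dict mapping (previous_tag, tag) to a log-score; pairs that
                are missing receive the `default` score.
    """
    # best[tag] = (score of the best path ending in tag, that path)
    best = {"<s>": (0.0, [])}
    for options in candidates:
        new_best = {}
        for tag in options:
            score, path = max(
                (prev_score + transition.get((prev, tag), default), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values())[1]


# "the saw": the analyzer offers two parses for "saw".
candidates = [["Det"], ["Noun", "Verb"]]
transition = {
    ("<s>", "Det"): -0.5,
    ("Det", "Noun"): -0.3,   # hypothetical: nouns often follow determiners
    ("Det", "Verb"): -4.0,   # hypothetical: verbs rarely do
}

print(viterbi(candidates, transition))
```

The analyzer supplies the candidate set for each token and the sequence model only ranks them, so an analysis the analyzer never proposed cannot be selected.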

Morphological parsing feeds directly into many downstream tasks. In machine translation, morphological analysis helps align words across languages with different morphological complexity. In information retrieval, reducing inflected forms to lemmas improves recall. In syntactic parsing, morphological features constrain the set of possible syntactic analyses. The quality of morphological parsing thus has cascading effects throughout the NLP pipeline.
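
The recall effect in information retrieval is easy to demonstrate with a toy inverted index. This sketch assumes a hypothetical lookup table in place of a real lemmatizer, and the documents and function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical lemma table standing in for a real morphological parser.
LEMMAS = {"dogs": "dog", "running": "run", "ran": "run", "runs": "run"}


def lemmatize(token):
    """Map a token to its lemma, leaving unknown tokens unchanged."""
    return LEMMAS.get(token, token)


def build_index(docs, normalize):
    """Build an inverted index from normalized token to document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


docs = {1: "the dog ran", 2: "dogs like running", 3: "she runs daily"}

surface_index = build_index(docs, lambda token: token)
lemma_index = build_index(docs, lemmatize)

query = "run"
print(surface_index.get(query, set()))            # no document contains "run"
print(lemma_index.get(lemmatize(query), set()))   # all three documents match
```

Indexing by lemma retrieves documents that contain any inflected form of the query term, at the cost of merging forms that a user may occasionally want to keep apart.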

References

  1. Habash, N., Rambow, O., & Roth, R. (2009). MADA+TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, and lemmatization. Proceedings of the 2nd International Conference on Arabic Language Resources and Tools.
  2. Cotterell, R., Kirov, C., Sylak-Glassman, J., Walther, G., Vylomova, E., Xia, P., ... & Hulden, M. (2018). The CoNLL-SIGMORPHON 2018 shared task: Universal morphological reinflection. Proceedings of CoNLL-SIGMORPHON, 1–27. doi:10.18653/v1/K18-3001
  3. Buckwalter, T. (2004). Buckwalter Arabic morphological analyzer version 2.0. Linguistic Data Consortium, University of Pennsylvania.
