
Universal Dependencies

Universal Dependencies (UD) is a cross-linguistically consistent framework for morphosyntactic annotation, providing treebanks for more than 140 languages with a shared inventory of dependency relations and part-of-speech tags.

37 universal dependency relations; 17 universal POS tags (UPOS)

Universal Dependencies (UD) is an open community effort to create consistently annotated treebanks across languages, enabling meaningful cross-linguistic comparison and the development of multilingual parsing systems. The project defines a universal set of part-of-speech tags (UPOS), morphological features, and dependency relations that are intended to capture cross-linguistic regularities while accommodating language-specific phenomena through subtype extensions.
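
The subtype mechanism mentioned above can be illustrated with a small sketch. It assumes the usual UD labeling convention in which a language-specific subtype is appended to a universal relation after a colon (for example `nsubj:pass`). The helper name `split_relation` is hypothetical and not part of any UD tool.

```python
def split_relation(deprel):
    """Split a relation label into (universal relation, subtype or None)."""
    # Everything before the first colon is the universal relation;
    # anything after it is the optional language-specific subtype.
    universal, _, subtype = deprel.partition(":")
    return universal, subtype or None


for label in ("nsubj", "nsubj:pass", "acl:relcl"):
    print(label, "->", split_relation(label))
```

Collapsing labels to their universal part in this way is one simple option for comparing treebanks that use different subtype inventories.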

Annotation Scheme

UD Core Principles

17 universal POS tags: NOUN, VERB, ADJ, ADV, ADP, AUX, ...
37 universal dependency relations: nsubj, obj, iobj, obl, nmod, amod, ...

Content words as heads (not function words):
• Adpositions depend on nouns (case)
• Auxiliaries depend on lexical verbs (aux)
• Complementizers depend on clausal heads (mark)
• Determiners depend on nouns (det)

A central design decision in UD is that content words (nouns, verbs, adjectives, adverbs) serve as heads, while function words (adpositions, auxiliaries, determiners, complementizers) attach to them as dependents. This differs from many traditional annotation schemes in which function words head phrases (e.g., prepositions heading PPs). The rationale is that content-word-headed trees are more parallel across typologically diverse languages: the inventory and use of function words vary widely, whereas predicate-argument structure is more stable.
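
As a minimal sketch of the content-head principle, the following encodes one hand-annotated English sentence and lists where each function word attaches. The sentence, its annotation, and the names `Token` and `function_word_attachments` are illustrative assumptions, not material from a UD treebank or tool.

```python
from collections import namedtuple

Token = namedtuple("Token", "id form upos head deprel")

# Hand-annotated illustration: "She has slept in the garden"
# (head = 0 marks the root of the sentence).
SENTENCE = [
    Token(1, "She", "PRON", 3, "nsubj"),
    Token(2, "has", "AUX", 3, "aux"),
    Token(3, "slept", "VERB", 0, "root"),
    Token(4, "in", "ADP", 6, "case"),
    Token(5, "the", "DET", 6, "det"),
    Token(6, "garden", "NOUN", 3, "obl"),
]

# Relations that attach function words to content words, as listed above.
FUNCTION_RELATIONS = {"case", "aux", "mark", "det"}


def function_word_attachments(tokens):
    """Return (function word, relation, content-word head) triples."""
    by_id = {t.id: t for t in tokens}
    return [
        (t.form, t.deprel, by_id[t.head].form)
        for t in tokens
        if t.deprel in FUNCTION_RELATIONS
    ]


for form, rel, head in function_word_attachments(SENTENCE):
    print(f"{form} --{rel}--> {head}")
```

In this analysis the preposition "in" depends on "garden" rather than heading a prepositional phrase, and the auxiliary "has" depends on the lexical verb "slept".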

Multilingual Scope

As of version 2.14, UD includes over 240 treebanks covering more than 140 languages from diverse language families including Indo-European, Sino-Tibetan, Afro-Asiatic, Uralic, Turkic, Dravidian, Austronesian, and many others. Treebank sizes range from a few hundred sentences for under-resourced languages to over 100,000 sentences for well-studied languages like Czech and Russian. The project stores and distributes its data in the standardized CoNLL-U format.
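
A minimal reader for the CoNLL-U format might look like the sketch below. It relies on the format's basic layout: ten tab-separated fields per token line, comment lines starting with `#`, and a blank line after each sentence. The sample sentence is a simplified, invented example (features are left as `_`), and `read_conllu` is a hypothetical helper rather than an official UD function.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Invented, simplified sample; real treebanks also fill in FEATS and XPOS.
SAMPLE = (
    "# text = She slept.\n"
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tslept\tsleep\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment / metadata
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 fields, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


for token in read_conllu(SAMPLE)[0]:
    print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

All values are kept as strings here; converting `id` and `head` to integers is a natural next step when building trees.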

CoNLL Shared Tasks

The CoNLL 2017 and 2018 shared tasks on multilingual dependency parsing used UD treebanks, spurring the development of multilingual and cross-lingual parsing systems. These tasks demonstrated that transfer learning, multilingual embeddings, and delexicalized parsing can achieve reasonable accuracy even for languages with no training data.
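
Delexicalization can be sketched as a simple transformation of CoNLL-U token lines: the word form and lemma are blanked out so that a parser has to rely on the shared UPOS tags and features instead of language-specific vocabulary. This is one simple way to do it, not the procedure of any particular shared-task system, and `delexicalize` is a hypothetical helper name.

```python
def delexicalize(conllu_line):
    """Blank out FORM and LEMMA in a CoNLL-U token line.

    Comment lines and blank lines are returned unchanged.
    """
    cols = conllu_line.split("\t")
    if len(cols) != 10:
        return conllu_line  # not a token line
    cols[1] = "_"  # FORM
    cols[2] = "_"  # LEMMA
    return "\t".join(cols)


line = "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_"
print(delexicalize(line))
```

Because the remaining columns use the same tag and relation inventories in every UD treebank, data prepared this way from one language can in principle be used to train a parser that is then applied to another.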

Impact on Multilingual NLP

UD has become the de facto standard framework for multilingual syntactic analysis. It enables zero-shot and few-shot cross-lingual transfer, where a parser trained on one language is applied to another. It also facilitates typological studies of word order, case marking, and agreement patterns across languages. The project has also revealed systematic challenges in cross-linguistic annotation consistency, leading to ongoing refinements of the guidelines and automated validation tools.
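
A word-order study of the kind mentioned above can be sketched by counting whether each direct object (`obj`) precedes or follows its head. The two toy sentences below are invented stand-ins for parsed treebank data, and `object_order_counts` is a hypothetical helper. A real study would read full treebanks and control for factors such as clause type.

```python
from collections import Counter


def object_order_counts(sentences):
    """Count objects that precede ("OV") or follow ("VO") their head."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence:
            if token["deprel"] == "obj":
                order = "OV" if token["id"] < token["head"] else "VO"
                counts[order] += 1
    return counts


# Toy data: only the fields needed for the count are included.
VERB_MEDIAL = [  # subject, verb, object
    {"id": 1, "head": 2, "deprel": "nsubj"},
    {"id": 2, "head": 0, "deprel": "root"},
    {"id": 3, "head": 2, "deprel": "obj"},
]
VERB_FINAL = [  # subject, object, verb
    {"id": 1, "head": 3, "deprel": "nsubj"},
    {"id": 2, "head": 3, "deprel": "obj"},
    {"id": 3, "head": 0, "deprel": "root"},
]

print(object_order_counts([VERB_MEDIAL]))
print(object_order_counts([VERB_FINAL]))
```

Because `obj` is defined the same way in every treebank, the same few lines can be run unchanged on data from different languages, which is what makes this kind of comparison practical.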


References

  1. Nivre, J., de Marneffe, M.-C., Ginter, F., et al. (2020). Universal Dependencies v2: An evergrowing multilingual treebank collection. Proceedings of LREC 2020, 4034–4043. https://aclanthology.org/2020.lrec-1.497
  2. de Marneffe, M.-C., Manning, C. D., Nivre, J., & Zeman, D. (2021). Universal Dependencies. Computational Linguistics, 47(2), 255–308. https://doi.org/10.1162/coli_a_00402
  3. Zeman, D., et al. (2018). CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. Proceedings of the CoNLL 2018 Shared Task, 1–21. https://doi.org/10.18653/v1/K18-2001
  4. McDonald, R., Nivre, J., Quirmbach-Brundage, Y., et al. (2013). Universal dependency annotation for multilingual parsing. Proceedings of ACL 2013, 92–97. https://aclanthology.org/P13-2017
