Universal Dependencies

Universal Dependencies (UD) is an open community effort to create consistently annotated treebanks across languages, enabling meaningful cross-linguistic comparison and the development of multilingual parsing systems. The project defines a universal set of part-of-speech tags (UPOS), morphological features, and dependency relations that are intended to capture cross-linguistic regularities while accommodating language-specific phenomena through subtype extensions.

Annotation Scheme

UD core principles:

17 universal POS tags (UPOS): NOUN, VERB, ADJ, ADV, ADP, AUX, ...
37 universal dependency relations: nsubj, obj, iobj, obl, nmod, amod, ...

Content words as heads (not function words):
• Adpositions depend on nouns (case)
• Auxiliaries depend on lexical verbs (aux)
• Complementizers depend on clausal heads (mark)
• Determiners depend on nouns (det)

A central design decision in UD is that content words (nouns, verbs, adjectives, adverbs) are heads, while function words (adpositions, auxiliaries, determiners, complementizers) are their dependents. This differs from many traditional annotation schemes in which function words head phrases (for example, a preposition heading a prepositional phrase). The rationale is that content-word-headed trees are more parallel across typologically diverse languages: the inventory and use of function words vary widely, whereas predicate-argument structure is more stable.
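The content-head principle can be illustrated with a small hand-annotated sentence. The following is a minimal sketch, not part of any UD tooling: the `Token` type, the helper names, and the `FUNCTION_UPOS` set are assumptions made for this example, and the annotation of "The cat has slept on the mat" was written by hand following the relations listed above.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    id: int      # 1-based position in the sentence
    form: str    # surface word
    upos: str    # universal POS tag
    head: int    # id of the head token, 0 for the root
    deprel: str  # dependency relation to the head


# Hand-written annotation (illustrative, not taken from a treebank).
SENTENCE = [
    Token(1, "The", "DET", 2, "det"),
    Token(2, "cat", "NOUN", 4, "nsubj"),
    Token(3, "has", "AUX", 4, "aux"),
    Token(4, "slept", "VERB", 0, "root"),
    Token(5, "on", "ADP", 7, "case"),
    Token(6, "the", "DET", 7, "det"),
    Token(7, "mat", "NOUN", 4, "obl"),
]

# Assumed set of function-word tags for this check.
FUNCTION_UPOS = {"ADP", "AUX", "DET", "SCONJ"}


def dependents_of(tokens: List[Token], head_id: int) -> List[Token]:
    """Return the tokens whose head is the token with the given id."""
    return [t for t in tokens if t.head == head_id]


def function_word_heads(tokens: List[Token]) -> List[Token]:
    """Return function words that govern at least one dependent."""
    return [
        t for t in tokens
        if t.upos in FUNCTION_UPOS and dependents_of(tokens, t.id)
    ]


if __name__ == "__main__":
    # The noun "mat" governs both its preposition and its determiner.
    print([t.form for t in dependents_of(SENTENCE, 7)])
    # No function word heads anything in this analysis.
    print(function_word_heads(SENTENCE))
```

In a scheme where prepositions head prepositional phrases, "on" would instead govern "mat", and the second check would report it.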

Multilingual Scope

As of version 2.14, UD includes over 240 treebanks covering more than 140 languages from diverse language families including Indo-European, Sino-Tibetan, Afro-Asiatic, Uralic, Turkic, Dravidian, Austronesian, and many others. Treebank sizes range from a few hundred sentences for under-resourced languages to over 100,000 sentences for well-studied languages like Czech and Russian. Treebanks are stored and distributed in CoNLL-U, a plain-text format with one word per line and ten tab-separated fields.
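To make the data format concrete, here is a minimal sketch of a CoNLL-U reader. It assumes the standard ten-column layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), comment lines starting with `#`, and blank lines between sentences. The function name `parse_conllu` and the sample sentence are written for this illustration; the sketch skips multiword-token ranges and empty nodes rather than modelling them, and it is not a substitute for a full CoNLL-U library.

```python
from typing import Dict, List

FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():          # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append(dict(zip(FIELDS, cols)))
    if current:                       # last sentence without trailing blank line
        sentences.append(current)
    return sentences


def _row(*cols: str) -> str:
    """Join columns with tabs so the sample does not rely on literal tabs."""
    return "\t".join(cols)


# Hand-written sample for illustration only.
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    _row("1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"),
    _row("2", "bark", "bark", "VERB", "_", "Mood=Ind|Tense=Pres|VerbForm=Fin",
         "0", "root", "_", "SpaceAfter=No"),
    _row("3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
    "",
])

if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

All values are kept as strings here; a fuller reader would convert `id` and `head` to integers and split `feats` into key-value pairs.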

CoNLL Shared Tasks

The CoNLL 2017 and 2018 shared tasks on multilingual dependency parsing used UD treebanks, spurring the development of multilingual and cross-lingual parsing systems. These tasks demonstrated that transfer learning, multilingual embeddings, and delexicalized parsing can achieve reasonable accuracy even for languages with no training data.
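As an illustration of the delexicalized setting, the sketch below shows one possible preprocessing step: blanking the FORM and LEMMA columns of CoNLL-U data so that a parser trained on it sees only language-independent annotation such as UPOS tags and features. The function name `delexicalize` and the choice of which columns to blank are assumptions for this example; actual shared-task systems differed in their details, and this is not a description of any particular one.

```python
def delexicalize(conllu_text: str) -> str:
    """Replace FORM and LEMMA with "_" in every token line of CoNLL-U text.

    Comment lines are dropped because "# text = ..." comments contain the raw
    sentence. Blank lines are kept so sentence boundaries survive. The MISC
    column is left untouched, although it can also carry lexical material.
    """
    out = []
    for line in conllu_text.splitlines():
        if line.startswith("#"):
            continue
        if not line.strip():
            out.append("")
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        cols[1] = "_"  # FORM
        cols[2] = "_"  # LEMMA
        out.append("\t".join(cols))
    return "\n".join(out)


if __name__ == "__main__":
    row = "\t".join(["1", "Dogs", "dog", "NOUN", "_", "Number=Plur",
                     "2", "nsubj", "_", "_"])
    print(delexicalize("# text = Dogs bark.\n" + row))
```

Because the remaining columns use the shared UD inventories, data prepared this way from one language can in principle be used to train a parser that is then applied to another language.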

Impact on Multilingual NLP

UD has become the standard framework for multilingual syntactic analysis. It enables zero-shot and few-shot cross-lingual transfer, where a parser trained on one language is applied to another. It facilitates typological studies of word order, case marking, and agreement patterns across languages. The project also revealed systematic challenges in cross-linguistic annotation consistency, leading to ongoing refinements of the guidelines and automated validation tools.
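A simple word-order statistic shows how such typological studies can use UD annotation. The sketch below computes, for a given relation, the share of dependents that precede their head; applied to `obj`, this separates object-verb from verb-object orders. The function name `dependent_first_ratio`, the `(id, head, deprel)` tuple layout, and the two toy sentences are assumptions made for this example and do not come from any treebank.

```python
from typing import List, Optional, Tuple

# One token is (id, head, deprel); a sentence is a list of tokens.
TokenTriple = Tuple[int, int, str]


def dependent_first_ratio(sentences: List[List[TokenTriple]],
                          relation: str) -> Optional[float]:
    """Fraction of `relation` dependents that appear before their head.

    Language-specific subtypes such as "obl:tmod" are counted under their
    universal relation. Returns None if the relation never occurs.
    """
    before = total = 0
    for sentence in sentences:
        for tok_id, head, deprel in sentence:
            if head == 0 or deprel.split(":")[0] != relation:
                continue
            total += 1
            if tok_id < head:
                before += 1
    return before / total if total else None


# Toy data: subject-verb-object order versus subject-object-verb order.
VO_TOY = [[(1, 2, "nsubj"), (2, 0, "root"), (3, 2, "obj")]]
OV_TOY = [[(1, 3, "nsubj"), (2, 3, "obj"), (3, 0, "root")]]

if __name__ == "__main__":
    print(dependent_first_ratio(VO_TOY, "obj"))  # objects follow the verb
    print(dependent_first_ratio(OV_TOY, "obj"))  # objects precede the verb
```

Because every UD treebank uses the same relation labels, the same function can be run unchanged over treebanks of different languages and the resulting ratios compared directly.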
