Universal Dependencies (UD) is an open community effort to create consistently annotated treebanks across languages, enabling meaningful cross-linguistic comparison and the development of multilingual parsing systems. The project defines a universal set of part-of-speech tags (UPOS), morphological features, and dependency relations that are intended to capture cross-linguistic regularities while accommodating language-specific phenomena through subtype extensions.
Annotation Scheme
37 universal dependency relations, including nsubj, obj, iobj, obl, nmod, amod, ...
Content words as heads (not function words):
• Adpositions depend on nouns (case)
• Auxiliaries depend on lexical verbs (aux)
• Complementizers depend on clausal heads (mark)
• Determiners depend on nouns (det)
A central design decision in UD is that content words (nouns, verbs, adjectives, adverbs) are heads, while function words (adpositions, auxiliaries, determiners, complementizers) attach to them as dependents. This content-head approach differs from many traditional annotation schemes in which function words head phrases (e.g., prepositions heading PPs). The rationale is that content-word-headed trees are more parallel across typologically diverse languages: the inventory and use of function words vary widely from language to language, whereas predicate-argument structure is more stable.
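The content-head principle can be illustrated with a small sketch. The sentence, its hand-written annotation, and the helper names (Token, dependents_of, function_words_with_dependents) are assumptions made for illustration and are not taken from any UD treebank or tool. The sketch checks that the lexical verb is the root and that tokens attached with case, aux, mark, or det have no dependents of their own in this example.

```python
from collections import namedtuple

# Hypothetical minimal token record: 1-based id, word form, UPOS tag,
# id of the head token (0 = root), and dependency relation.
Token = namedtuple("Token", "id form upos head deprel")

# Hand-annotated toy sentence: "The cat has sat on the mat"
SENTENCE = [
    Token(1, "The", "DET", 2, "det"),
    Token(2, "cat", "NOUN", 4, "nsubj"),
    Token(3, "has", "AUX", 4, "aux"),
    Token(4, "sat", "VERB", 0, "root"),
    Token(5, "on", "ADP", 7, "case"),
    Token(6, "the", "DET", 7, "det"),
    Token(7, "mat", "NOUN", 4, "obl"),
]

# Relations that attach function words to content words.
FUNCTION_RELATIONS = {"case", "aux", "mark", "det"}


def dependents_of(sentence, head_id):
    """Return the tokens whose head is the token with the given id."""
    return [t for t in sentence if t.head == head_id]


def function_words_with_dependents(sentence):
    """Return forms of function-word tokens that have dependents of their own."""
    return [
        t.form
        for t in sentence
        if t.deprel in FUNCTION_RELATIONS and dependents_of(sentence, t.id)
    ]


root = next(t for t in SENTENCE if t.head == 0)
print(root.form)                                           # sat
print([t.form for t in dependents_of(SENTENCE, root.id)])  # ['cat', 'has', 'mat']
print(function_words_with_dependents(SENTENCE))            # []
```

Note that the preposition "on" attaches to "mat" via case, and "mat" attaches directly to the verb via obl. In a scheme where prepositions head PPs, "on" would instead sit between the verb and the noun.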
Multilingual Scope
As of version 2.14, UD includes over 240 treebanks covering more than 140 languages from diverse language families including Indo-European, Sino-Tibetan, Afro-Asiatic, Uralic, Turkic, Dravidian, Austronesian, and many others. Treebank sizes range from a few hundred sentences for under-resourced languages to over 100,000 sentences for well-studied languages like Czech and Russian. Data is stored and distributed in the standardized CoNLL-U format, a tab-separated plain-text format with one token per line and ten fields per token.
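As a rough illustration of the format, the following sketch reads CoNLL-U text into Python dictionaries. The ten column names follow the CoNLL-U specification (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); the sample sentence and the function name parse_conllu are made up for this example, and the reader deliberately skips comment lines, multiword-token ranges, and empty nodes rather than handling them. Established libraries exist for real use; this is only a minimal sketch.

```python
# The ten tab-separated columns of a CoNLL-U token line, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]

# Invented sample sentence; "_" marks an unspecified value.
SAMPLE = (
    "# text = The cat sat.\n"
    "1\tThe\tthe\tDET\t_\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n"
    "2\tcat\tcat\tNOUN\t_\tNumber=Sing\t3\tnsubj\t_\t_\n"
    "3\tsat\tsit\tVERB\t_\tMood=Ind|Tense=Past|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line terminates the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif line.startswith("#"):
            # Sentence-level comment/metadata lines are ignored in this sketch.
            continue
        else:
            cols = line.split("\t")
            if len(cols) != len(CONLLU_FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            token = dict(zip(CONLLU_FIELDS, cols))
            # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
            if not token["id"].isdigit():
                continue
            token["id"] = int(token["id"])
            token["head"] = int(token["head"])  # 0 marks the root
            tokens.append(token)
    if tokens:
        sentences.append(tokens)
    return sentences


for tok in parse_conllu(SAMPLE)[0]:
    print(tok["id"], tok["form"], tok["upos"], tok["head"], tok["deprel"])
```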
Impact on Multilingual NLP
UD has become the de facto standard framework for multilingual syntactic analysis. It enables zero-shot and few-shot cross-lingual transfer, in which a parser trained on one language is applied to another with little or no target-language training data. It also facilitates typological studies of word order, case marking, and agreement patterns across languages. The project has also revealed systematic challenges in cross-linguistic annotation consistency, leading to ongoing refinements of the guidelines and of the automated validation tools.
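Because relations are shared across treebanks, simple word-order statistics can be computed in the same way for any language. The sketch below, written under the assumption that each sentence is available as a list of (token id, head id, relation) triples, measures how often a dependent with a given relation precedes its head. The function name share_preceding_head and both toy fragments are hypothetical and stand in for real treebank data.

```python
def share_preceding_head(sentences, deprel):
    """Fraction of dependents with relation `deprel` that precede their head.

    `sentences` is a list of sentences, each a list of
    (token_id, head_id, relation) triples. Returns None if the
    relation never occurs.
    """
    before = total = 0
    for sentence in sentences:
        for token_id, head_id, rel in sentence:
            if rel == deprel:
                total += 1
                before += token_id < head_id  # True counts as 1
    return before / total if total else None


# Toy prepositional fragment: adposition, determiner, noun ("on the mat").
PREPOSITIONAL = [[(1, 3, "case"), (2, 3, "det"), (3, 0, "root")]]

# Toy postpositional fragment: noun followed by its case marker.
POSTPOSITIONAL = [[(1, 0, "root"), (2, 1, "case")]]

print(share_preceding_head(PREPOSITIONAL, "case"))   # 1.0
print(share_preceding_head(POSTPOSITIONAL, "case"))  # 0.0
```

On a real treebank, a value near 1.0 for case would suggest a predominantly prepositional language and a value near 0.0 a predominantly postpositional one.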