Neural dependency parsing applies deep neural networks to predict dependency trees, replacing the hand-crafted feature templates of classical statistical parsers with learned distributed representations. The shift began with Chen and Manning (2014), who showed that a simple feed-forward neural network over dense word and tag embeddings could match the accuracy of heavily feature-engineered parsers while being significantly faster. Subsequent work with recurrent and attention-based architectures has raised accuracy further still.
Biaffine Attention Parser
Head representations: $\mathbf{h}_i^{\text{arc-head}} = \mathrm{MLP}^{\text{arc-head}}(\mathbf{h}_i)$
Dependent representations: $\mathbf{h}_j^{\text{arc-dep}} = \mathrm{MLP}^{\text{arc-dep}}(\mathbf{h}_j)$
Arc score: $s_{\text{arc}}(i, j) = (\mathbf{h}_i^{\text{arc-head}})^\top \mathbf{U}\, \mathbf{h}_j^{\text{arc-dep}} + \mathbf{W}\,(\mathbf{h}_i^{\text{arc-head}} \oplus \mathbf{h}_j^{\text{arc-dep}}) + b$
Tree decoding: Chu-Liu/Edmonds (non-projective) or Eisner (projective) algorithm
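As a toy illustration of the arc-scoring equations, the sketch below computes the bilinear term, the linear term over the concatenation, and the bias for every head-dependent pair at once in NumPy. Random weights stand in for trained parameters, and all sizes and names are illustrative, not from any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_arc = 5, 8, 6          # toy sizes: sentence length, encoder dim, arc-MLP dim

# Stand-in for BiLSTM output: one contextual vector per token
H = rng.standard_normal((n, d_model))

# Single-layer "MLPs" projecting each token into head and dependent roles
W_head = rng.standard_normal((d_model, d_arc))
W_dep = rng.standard_normal((d_model, d_arc))
H_head = np.tanh(H @ W_head)         # rows are h_i^{arc-head}
H_dep = np.tanh(H @ W_dep)           # rows are h_j^{arc-dep}

# Biaffine score s_arc(i, j) = h_i^T U h_j + W (h_i (+) h_j) + b, for all pairs
U = rng.standard_normal((d_arc, d_arc))
w = rng.standard_normal(2 * d_arc)   # the linear weight W, split over the concatenation
b = 0.1
S = H_head @ U @ H_dep.T             # bilinear term: S[i, j] pairs head i with dependent j
S = S + (H_head @ w[:d_arc])[:, None] + (H_dep @ w[d_arc:])[None, :] + b

# Greedy prediction (ignoring the tree constraint): best head for each dependent
pred_heads = S.argmax(axis=0)
```

Because the concatenation $\mathbf{h}_i \oplus \mathbf{h}_j$ is multiplied by a single weight vector, the linear term decomposes into a per-head and a per-dependent contribution, which is what the two broadcasted additions compute.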
The biaffine attention parser of Dozat and Manning (2017) has become the dominant architecture for neural dependency parsing. It uses a multi-layer BiLSTM to produce contextual word representations, then applies separate MLPs to produce distinct head and dependent representations for each word. Arc scores are computed with a biaffine function over each candidate head-dependent pair; label scores are computed by an analogous biaffine classifier conditioned on the predicted arc. The highest-scoring tree is then recovered with the Eisner algorithm (for projective trees) or Chu-Liu/Edmonds (for non-projective trees).
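To make the decoding step concrete, here is a pure-Python sketch of the Chu-Liu/Edmonds maximum spanning arborescence algorithm over a matrix of arc scores (node 0 as ROOT). It is a compact recursive contraction, written for clarity rather than the O(n^2) efficiency of production parsers, and is not taken from any particular parser's codebase.

```python
import math

def find_cycle(head):
    """Return a list of nodes forming a cycle under `head`, or None."""
    n = len(head)
    for start in range(1, n):
        path, v = set(), start
        while v != 0 and v not in path:
            path.add(v)
            v = head[v]
        if v != 0:                       # the walk re-entered `path`: cycle at v
            cycle, u = [v], head[v]
            while u != v:
                cycle.append(u)
                u = head[u]
            return cycle
    return None

def chu_liu_edmonds(scores):
    """Maximum spanning arborescence. scores[h][d] = score of arc h -> d,
    node 0 is ROOT. Returns head[d] for every node (head[0] is unused)."""
    n = len(scores)
    head = [0] * n
    for d in range(1, n):                # greedily pick the best head per token
        head[d] = max((h for h in range(n) if h != d), key=lambda h: scores[h][d])
    cycle = find_cycle(head)
    if cycle is None:
        return head
    cyc = set(cycle)
    rest = [v for v in range(n) if v not in cyc]   # ROOT stays at index 0
    idx = {v: i for i, v in enumerate(rest)}
    c = len(rest)                        # index of the contracted cycle node
    new = [[-math.inf] * (c + 1) for _ in range(c + 1)]
    enter, leave = {}, {}
    for h in rest:
        for d in rest:
            if h != d:
                new[idx[h]][idx[d]] = scores[h][d]
    for h in rest:                       # best arc from h into the cycle
        best, best_v = -math.inf, cycle[0]
        for v in cycle:
            gain = scores[h][v] - scores[head[v]][v]   # cost of swapping v's head to h
            if gain > best:
                best, best_v = gain, v
        new[idx[h]][c] = best
        enter[h] = best_v
    for d in rest:                       # best arc from the cycle out to d
        best, best_h = -math.inf, cycle[0]
        for h in cycle:
            if scores[h][d] > best:
                best, best_h = scores[h][d], h
        new[c][idx[d]] = best
        leave[d] = best_h
    sub = chu_liu_edmonds(new)           # solve the contracted problem
    result = [0] * n
    for d in rest:
        if d != 0:
            result[d] = leave[d] if sub[idx[d]] == c else rest[sub[idx[d]]]
    entry_h = rest[sub[c]]               # the arc that breaks the cycle
    for v in cycle:
        result[v] = entry_h if v == enter[entry_h] else head[v]
    return result
```

For example, with scores that make two tokens each other's best head, the greedy choice forms a cycle, and the contraction step replaces the weakest arc in that cycle with the best arc entering it from outside.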
Transformer-Based Parsers
More recent work replaces the BiLSTM encoder with pre-trained Transformer models. Using BERT, XLNet, or other large language models as the encoder provides richer contextual representations and better generalization, especially for long-distance dependencies and rare constructions. The biaffine scoring layer remains the same; only the encoder changes. These Transformer-based parsers achieve labeled attachment scores (LAS) above 96% for English and have set new records across most UD treebanks.
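The "only the encoder changes" point can be sketched as an interface: any function mapping a token list to a matrix of contextual vectors can feed the same biaffine scorer. The names below (`predict_heads`, `toy_encode`) are hypothetical; a real Transformer-based parser would substitute, e.g., BERT's final hidden states for the toy random encoder.

```python
from typing import Callable
import numpy as np

def biaffine_scores(H, U, W_head, W_dep):
    """The scoring layer is unchanged regardless of where H comes from."""
    H_head = np.tanh(H @ W_head)
    H_dep = np.tanh(H @ W_dep)
    return H_head @ U @ H_dep.T

def predict_heads(tokens, encode: Callable[[list], np.ndarray], params):
    H = encode(tokens)               # the only line that differs between parser variants
    S = biaffine_scores(H, *params)
    return S.argmax(axis=0)          # greedy head choice (before tree decoding)

# Toy encoder standing in for a BiLSTM or a pre-trained Transformer
rng = np.random.default_rng(1)
d_model, d_arc = 8, 4
toy_encode = lambda toks: rng.standard_normal((len(toks), d_model))
params = (rng.standard_normal((d_arc, d_arc)),
          rng.standard_normal((d_model, d_arc)),
          rng.standard_normal((d_model, d_arc)))
heads = predict_heads(["ROOT", "she", "slept"], toy_encode, params)
```

Keeping the scorer encoder-agnostic is what made the BiLSTM-to-Transformer transition a drop-in swap rather than a redesign.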
Multilingual and Cross-Lingual Parsing
Neural dependency parsers, especially those using multilingual pre-trained models like mBERT and XLM-R, have enabled strong cross-lingual transfer. A parser trained on English UD data with multilingual embeddings can parse other languages with reasonable accuracy even without target-language training data. Joint multilingual training, where a single parser is trained on treebanks from many languages simultaneously, further improves performance for low-resource languages by sharing structural knowledge across typologically similar languages.