Neural dependency parsing applies deep neural networks to predict dependency trees, replacing the hand-crafted feature templates of classical statistical parsers with learned distributed representations. The shift began with Chen and Manning (2014), who showed that a simple feed-forward neural network over dense word and tag embeddings could match the accuracy of heavily feature-engineered parsers while being significantly faster. Subsequent work with recurrent and attention-based architectures has raised accuracy further still.
Biaffine Attention Parser
Head representations: $\mathbf{h}_i^{\text{arc-head}} = \mathrm{MLP}^{\text{arc-head}}(\mathbf{h}_i)$
Dependent representations: $\mathbf{h}_j^{\text{arc-dep}} = \mathrm{MLP}^{\text{arc-dep}}(\mathbf{h}_j)$
Arc score: $s_{\text{arc}}(i, j) = (\mathbf{h}_i^{\text{arc-head}})^\top \mathbf{U}\, \mathbf{h}_j^{\text{arc-dep}} + \mathbf{W}\,(\mathbf{h}_i^{\text{arc-head}} \oplus \mathbf{h}_j^{\text{arc-dep}}) + b$
Tree decoding: Chu-Liu/Edmonds (non-projective) or Eisner (projective) algorithm
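As a toy illustration of the arc-scoring equations, the sketch below computes the bilinear term, the linear term over the concatenation, and the bias for every head-dependent pair at once in NumPy. Random weights stand in for trained parameters, and all sizes and names are illustrative, not from any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_arc = 5, 8, 6          # toy sizes: sentence length, encoder dim, arc-MLP dim

# Stand-in for BiLSTM output: one contextual vector per token
H = rng.standard_normal((n, d_model))

# Single-layer "MLPs" projecting each token into head and dependent roles
W_head = rng.standard_normal((d_model, d_arc))
W_dep = rng.standard_normal((d_model, d_arc))
H_head = np.tanh(H @ W_head)         # rows are h_i^{arc-head}
H_dep = np.tanh(H @ W_dep)           # rows are h_j^{arc-dep}

# Biaffine score s_arc(i, j) = h_i^T U h_j + W (h_i (+) h_j) + b, for all pairs
U = rng.standard_normal((d_arc, d_arc))
w = rng.standard_normal(2 * d_arc)   # the linear weight W, split over the concatenation
b = 0.1
S = H_head @ U @ H_dep.T             # bilinear term: S[i, j] pairs head i with dependent j
S = S + (H_head @ w[:d_arc])[:, None] + (H_dep @ w[d_arc:])[None, :] + b

# Greedy prediction (ignoring the tree constraint): best head for each dependent
pred_heads = S.argmax(axis=0)
```

Because the concatenation $\mathbf{h}_i \oplus \mathbf{h}_j$ is multiplied by a single weight vector, the linear term decomposes into a per-head and a per-dependent contribution, which is what the two broadcasted additions compute.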
The biaffine attention parser of Dozat and Manning (2017) has become the dominant architecture for neural dependency parsing. It uses a multi-layer BiLSTM to produce contextual word representations, then applies separate MLPs to produce distinct head and dependent representations for each word. Arc scores are computed with a biaffine function over each candidate head-dependent pair; label scores are computed by an analogous biaffine classifier conditioned on the predicted arc. The highest-scoring tree is then recovered with the Eisner algorithm (for projective trees) or Chu-Liu/Edmonds (for non-projective trees).
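To make the decoding step concrete, here is a pure-Python sketch of the Chu-Liu/Edmonds maximum spanning arborescence algorithm over a matrix of arc scores (node 0 as ROOT). It is a compact recursive contraction, written for clarity rather than the O(n^2) efficiency of production parsers, and is not taken from any particular parser's codebase.

```python
import math

def find_cycle(head):
    """Return a list of nodes forming a cycle under `head`, or None."""
    n = len(head)
    for start in range(1, n):
        path, v = set(), start
        while v != 0 and v not in path:
            path.add(v)
            v = head[v]
        if v != 0:                       # the walk re-entered `path`: cycle at v
            cycle, u = [v], head[v]
            while u != v:
                cycle.append(u)
                u = head[u]
            return cycle
    return None

def chu_liu_edmonds(scores):
    """Maximum spanning arborescence. scores[h][d] = score of arc h -> d,
    node 0 is ROOT. Returns head[d] for every node (head[0] is unused)."""
    n = len(scores)
    head = [0] * n
    for d in range(1, n):                # greedily pick the best head per token
        head[d] = max((h for h in range(n) if h != d), key=lambda h: scores[h][d])
    cycle = find_cycle(head)
    if cycle is None:
        return head
    cyc = set(cycle)
    rest = [v for v in range(n) if v not in cyc]   # ROOT stays at index 0
    idx = {v: i for i, v in enumerate(rest)}
    c = len(rest)                        # index of the contracted cycle node
    new = [[-math.inf] * (c + 1) for _ in range(c + 1)]
    enter, leave = {}, {}
    for h in rest:
        for d in rest:
            if h != d:
                new[idx[h]][idx[d]] = scores[h][d]
    for h in rest:                       # best arc from h into the cycle
        best, best_v = -math.inf, cycle[0]
        for v in cycle:
            gain = scores[h][v] - scores[head[v]][v]   # cost of swapping v's head to h
            if gain > best:
                best, best_v = gain, v
        new[idx[h]][c] = best
        enter[h] = best_v
    for d in rest:                       # best arc from the cycle out to d
        best, best_h = -math.inf, cycle[0]
        for h in cycle:
            if scores[h][d] > best:
                best, best_h = scores[h][d], h
        new[c][idx[d]] = best
        leave[d] = best_h
    sub = chu_liu_edmonds(new)           # solve the contracted problem
    result = [0] * n
    for d in rest:
        if d != 0:
            result[d] = leave[d] if sub[idx[d]] == c else rest[sub[idx[d]]]
    entry_h = rest[sub[c]]               # the arc that breaks the cycle
    for v in cycle:
        result[v] = entry_h if v == enter[entry_h] else head[v]
    return result
```

For example, with scores that make two tokens each other's best head, the greedy choice forms a cycle, and the contraction step replaces the weakest arc in that cycle with the best arc entering it from outside.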
Transformer-Based Parsers
More recent work replaces the BiLSTM encoder with pre-trained Transformer models. Using BERT, XLNet, or other large language models as the encoder provides richer contextual representations and better generalization, especially for long-distance dependencies and rare constructions. The biaffine scoring layer remains the same; only the encoder changes. These Transformer-based parsers achieve labeled attachment scores (LAS) above 96% for English and have set new records across most UD treebanks.
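The "only the encoder changes" point can be sketched as an interface: any function mapping a token list to a matrix of contextual vectors can feed the same biaffine scorer. The names below (`predict_heads`, `toy_encode`) are hypothetical; a real Transformer-based parser would substitute, e.g., BERT's final hidden states for the toy random encoder.

```python
from typing import Callable
import numpy as np

def biaffine_scores(H, U, W_head, W_dep):
    """The scoring layer is unchanged regardless of where H comes from."""
    H_head = np.tanh(H @ W_head)
    H_dep = np.tanh(H @ W_dep)
    return H_head @ U @ H_dep.T

def predict_heads(tokens, encode: Callable[[list], np.ndarray], params):
    H = encode(tokens)               # the only line that differs between parser variants
    S = biaffine_scores(H, *params)
    return S.argmax(axis=0)          # greedy head choice (before tree decoding)

# Toy encoder standing in for a BiLSTM or a pre-trained Transformer
rng = np.random.default_rng(1)
d_model, d_arc = 8, 4
toy_encode = lambda toks: rng.standard_normal((len(toks), d_model))
params = (rng.standard_normal((d_arc, d_arc)),
          rng.standard_normal((d_model, d_arc)),
          rng.standard_normal((d_model, d_arc)))
heads = predict_heads(["ROOT", "she", "slept"], toy_encode, params)
```

Keeping the scorer encoder-agnostic is what made the BiLSTM-to-Transformer transition a drop-in swap rather than a redesign.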
Multilingual and Cross-Lingual Parsing
Neural dependency parsers, especially those using multilingual pre-trained models like mBERT and XLM-R, have enabled strong cross-lingual transfer. A parser trained on English UD data with multilingual embeddings can parse other languages with reasonable accuracy even without target-language training data. Joint multilingual training, where a single parser is trained on treebanks from many languages simultaneously, further improves performance for low-resource languages by sharing structural knowledge across typologically similar languages.