Coreference resolution is the task of identifying all expressions (mentions) in a text that refer to the same entity and grouping them into coreference clusters. Mentions include proper names ("Barack Obama"), pronouns ("he," "his"), and definite noun phrases ("the president"). Resolving coreference is essential for text understanding: without knowing that "Obama," "he," and "the president" refer to the same person, a system cannot correctly interpret the discourse or extract accurate information. Coreference resolution underlies virtually every NLP application that requires discourse-level understanding, including information extraction, question answering, summarisation, and machine translation.
Mention Detection and Pairwise Models
Mention-ranking: for each mention mᵢ, select a single antecedent:
â = argmax_{a ∈ {ε, m₁, ..., mᵢ₋₁}} P(a | mᵢ)
where ε represents "no antecedent" (mᵢ starts a new entity)
Entity-level: score entire clustering configurations rather than individual antecedent decisions
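The mention-ranking selection above can be sketched directly as code. This is a minimal illustration, not a trained model: `toy_score` is a hypothetical stand-in for the learned scoring function P(a | mᵢ), and `None` plays the role of the dummy antecedent ε.

```python
# Minimal sketch of mention-ranking inference. `score` is a hypothetical
# antecedent scorer standing in for a trained model's P(a | m_i).

def rank_antecedents(mentions, score):
    """For each mention m_i, pick the argmax over {epsilon, m_1, ..., m_{i-1}}.

    The dummy antecedent None represents epsilon ("start a new entity").
    """
    links = []
    for i, mention in enumerate(mentions):
        candidates = [None] + mentions[:i]  # epsilon plus all preceding mentions
        best = max(candidates, key=lambda a: score(a, mention))
        links.append(best)
    return links

# Toy scorer: prefer linking a pronoun to a preceding non-pronoun mention;
# otherwise fall back to epsilon (score 0).
PRONOUNS = {"he", "she", "it", "his", "her"}

def toy_score(antecedent, mention):
    if antecedent is None:
        return 0.0  # epsilon baseline
    if mention.lower() in PRONOUNS and antecedent.lower() not in PRONOUNS:
        return 1.0
    return -1.0

print(rank_antecedents(["Obama", "he"], toy_score))  # → [None, 'Obama']
```

Note that because each mention chooses at most one antecedent, the output is a set of links rather than pairwise decisions, which is what makes this formulation transitively consistent.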
The mention-pair model (Soon et al., 2001) was the foundational approach to coreference resolution. It trains a binary classifier to determine whether two mentions are coreferent, using features such as string matching, distance, number and gender agreement, semantic class compatibility, and syntactic role. At test time, the classifier is applied to all pairs of mentions, and transitive closure groups positively classified pairs into clusters. While simple, the mention-pair model suffers from inconsistency: it may decide that A is coreferent with B and B is coreferent with C, but not A with C, producing contradictory clusters that must be resolved heuristically.
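The classify-all-pairs-then-merge pipeline can be made concrete with a short sketch. The `match` predicate below is a stand-in (exact string match) for the trained binary classifier, and union-find implements the transitive closure step; the inconsistency problem described above arises precisely because the pairwise decisions are made independently before this merging.

```python
# Sketch of the mention-pair pipeline: classify every mention pair, then take
# the transitive closure of positive decisions with union-find. The
# `coreferent` predicate is a stand-in for a trained pairwise classifier.
from itertools import combinations

def transitive_closure(mentions, coreferent):
    parent = list(range(len(mentions)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in combinations(range(len(mentions)), 2):
        if coreferent(mentions[i], mentions[j]):
            parent[find(i)] = find(j)  # union: merge the two clusters

    clusters = {}
    for i in range(len(mentions)):
        clusters.setdefault(find(i), []).append(mentions[i])
    return list(clusters.values())

# Toy predicate: exact lowercase string match.
match = lambda a, b: a.lower() == b.lower()
print(transitive_closure(["Obama", "the president", "obama"], match))
# → [['Obama', 'obama'], ['the president']]
```

With a noisy classifier, the closure step silently merges any chain of positive decisions, which is exactly how the contradictory A–B, B–C, not-A–C outputs get resolved (by forcing A, B, and C together).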
Neural Coreference Resolution
The end-to-end neural coreference model of Lee et al. (2017) transformed the field by jointly learning mention detection and coreference linking without relying on hand-crafted features or a syntactic pipeline. The model considers all spans in the document as potential mentions, scores each span for mention-hood, and for each detected mention, scores all preceding spans as potential antecedents. The mention-ranking formulation naturally handles the transitivity problem, since each mention is linked to at most one antecedent, and clusters are formed by following antecedent links. SpanBERT-based extensions (Joshi et al., 2020) achieve F1 scores above 80% on the OntoNotes benchmark.
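The cluster-formation step of the mention-ranking formulation can be sketched as follows. This is an illustrative decoding routine, not the Lee et al. model itself: it assumes antecedent decisions have already been made and are given as indices (or `None` for the dummy antecedent).

```python
# How antecedent links yield consistent clusters: each mention points to at
# most one antecedent (an earlier index, or None for the dummy epsilon), so
# following the links can never produce the contradictory pairwise decisions
# of the mention-pair model.

def clusters_from_links(antecedents):
    """antecedents[i] is the index of mention i's antecedent, or None."""
    cluster_id = {}
    clusters = []
    for i, a in enumerate(antecedents):
        if a is None:                      # epsilon: start a new entity
            cluster_id[i] = len(clusters)
            clusters.append([i])
        else:                              # join the antecedent's cluster
            cluster_id[i] = cluster_id[a]
            clusters[cluster_id[a]].append(i)
    return clusters

# "Obama(0) ... the president(1) ... he(2) ... Michelle(3)"
print(clusters_from_links([None, 0, 1, None]))  # → [[0, 1, 2], [3]]
```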
Coreference resolution systems exhibit systematic gender biases that reflect stereotypical associations in training data. Zhao et al. (2018) showed that models are more likely to resolve a pronoun to a gender-stereotypical antecedent — linking "nurse" with "she" and "engineer" with "he" — even when the text provides no gendered cues. The WinoBias and Winogender datasets test for such biases using carefully constructed sentence pairs that differ only in the gender of pronouns. Debiasing techniques, including data augmentation with gender-swapped examples and constrained decoding, have been developed to mitigate these biases.
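One of the debiasing strategies mentioned above, data augmentation with gender-swapped examples, is simple enough to sketch. The word list here is a tiny illustrative subset; real augmentation pipelines need a fuller lexicon and part-of-speech disambiguation (e.g. "her" can swap to either "his" or "him"), which this toy version ignores.

```python
# Sketch of gender-swap data augmentation: duplicate each training sentence
# with gendered words swapped. SWAP is a tiny illustrative subset; note the
# simplification that "her" is always mapped to "him", ignoring its
# possessive reading, and capitalization is not preserved.

SWAP = {"he": "she", "she": "he", "him": "her", "her": "him",
        "his": "her", "himself": "herself", "herself": "himself"}

def gender_swap(tokens):
    return [SWAP.get(t.lower(), t) for t in tokens]

def augment(corpus):
    """Return the corpus plus a gender-swapped copy of each sentence."""
    return corpus + [gender_swap(sent) for sent in corpus]

print(augment([["the", "engineer", "fixed", "his", "laptop"]]))
# → [['the', 'engineer', 'fixed', 'his', 'laptop'],
#    ['the', 'engineer', 'fixed', 'her', 'laptop']]
```

Training on the augmented corpus exposes the model to "engineer ... her" as often as "engineer ... his", weakening the stereotypical association that WinoBias-style evaluations probe.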
Coreference resolution involves several linguistically challenging phenomena. Bridging anaphora (also called associative anaphora) links mentions through part-whole or set-member relationships rather than identity: "the car... the engine" refers to the engine of the previously mentioned car. Event coreference determines when different descriptions refer to the same event. Cross-document coreference links mentions across different documents, enabling corpus-level entity tracking. Each of these phenomena requires different types of knowledge and reasoning, and current systems handle identity coreference far better than bridging or event coreference.