Computational Linguistics

Natural Language Inference

Natural language inference (NLI) is the task of determining whether a hypothesis sentence is entailed by, contradicts, or is neutral with respect to a premise sentence, serving as a fundamental test of natural language understanding.

NLI(premise, hypothesis) ∈ {entailment, contradiction, neutral}

Natural language inference (NLI), also known as recognizing textual entailment (RTE), is the task of determining the logical relationship between two sentences: a premise and a hypothesis. The model must classify the relationship as entailment (the premise supports the hypothesis), contradiction (the premise refutes the hypothesis), or neutral (the premise neither supports nor refutes the hypothesis). NLI is considered a central test of natural language understanding because it requires lexical knowledge, syntactic analysis, world knowledge, and logical reasoning.

Datasets and Formulation

NLI Examples

Premise: "A man is playing a guitar on stage."
Hypothesis: "A person is performing music." → Entailment

Premise: "Two dogs are running on the beach."
Hypothesis: "The cats are sleeping indoors." → Contradiction

Premise: "A woman is reading a book in the park."
Hypothesis: "The woman is a student." → Neutral

Classification: P(y | premise, hypothesis), y ∈ {E, C, N}
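The three-way distribution P(y | premise, hypothesis) is typically produced by a softmax over three class scores. A minimal sketch, using made-up logits standing in for a trained classifier's output:

```python
import math

LABELS = ["entailment", "contradiction", "neutral"]

def softmax(logits):
    """Convert raw class scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits a model might emit for the pair
# ("A man is playing a guitar on stage.", "A person is performing music.");
# the numbers are illustrative, not from any real model.
logits = [3.1, -1.4, 0.2]  # scores for E, C, N
probs = softmax(logits)
prediction = LABELS[probs.index(max(probs))]
```

The argmax over the softmax output gives the predicted label; here the entailment logit dominates, so the prediction is "entailment".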

The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) introduced the modern three-way classification formulation with 570,000 human-written sentence pairs. The Multi-Genre NLI (MultiNLI) corpus extends SNLI to 433,000 pairs across ten genres. ANLI (Adversarial NLI) provides progressively harder examples collected through a human-in-the-loop adversarial process. Earlier RTE challenges (2005-2011) used a binary entailment/non-entailment formulation with smaller datasets derived from real-world NLP application scenarios.

Models and Methods

Early NLI systems used feature engineering with lexical overlap, WordNet relations, and syntactic alignment features. The ESIM (Enhanced Sequential Inference Model) introduced cross-attention between premise and hypothesis representations, achieving strong results with BiLSTMs. BERT and subsequent pre-trained Transformers advanced NLI substantially, with cross-encoder architectures that jointly process the premise-hypothesis pair achieving state-of-the-art accuracy above 90% on SNLI. However, performance on adversarial and compositionally challenging test sets remains much lower.
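A cross-encoder processes both sentences as one sequence, so every attention layer can relate premise tokens to hypothesis tokens directly. A minimal sketch of the BERT-style pair encoding (token strings only; a real tokenizer would also produce subword pieces and integer IDs):

```python
def encode_pair(premise_tokens, hypothesis_tokens):
    """Pack a premise/hypothesis pair into a single BERT-style input.

    The [CLS] position's final representation feeds a 3-way
    classification head; segment IDs (0 = premise side,
    1 = hypothesis side) tell the model which sentence each
    token belongs to.
    """
    tokens = ["[CLS]"] + premise_tokens + ["[SEP]"] + hypothesis_tokens + ["[SEP]"]
    segment_ids = [0] * (len(premise_tokens) + 2) + [1] * (len(hypothesis_tokens) + 1)
    return tokens, segment_ids

tokens, segments = encode_pair(["a", "man"], ["a", "person"])
```

This joint encoding is what distinguishes cross-encoders from bi-encoders, which encode each sentence separately and therefore cannot attend across the pair.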

Artifacts and Adversarial Evaluation

NLI datasets contain annotation artifacts: spurious statistical patterns that models can exploit without genuine reasoning. Gururangan et al. (2018) showed that a hypothesis-only model (one that ignores the premise entirely) achieves 67% accuracy on SNLI, far above the 33% random baseline. Negation words strongly predict contradiction, and generic terms like "animal" or "person" predict entailment. The HANS dataset tests whether models rely on syntactic heuristics (lexical overlap, subsequence, constituent), revealing systematic failures in models trained on SNLI/MultiNLI.
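The artifact finding can be illustrated with a premise-blind rule: cue words in the hypothesis alone correlate with labels well enough to beat chance. A toy sketch (the cue lists below are illustrative, not the statistics from the paper):

```python
# Illustrative cue lists; a real analysis would derive these
# from label-conditional word frequencies in the training data.
NEGATION_CUES = {"no", "not", "never", "nobody", "nothing"}
GENERIC_CUES = {"person", "animal", "someone", "outdoors"}

def hypothesis_only_guess(hypothesis):
    """Predict a label from the hypothesis alone, ignoring the premise.

    Demonstrates the annotation-artifact effect: negation cues
    correlate with contradiction, generic hypernyms with entailment.
    """
    words = {w.strip(".,") for w in hypothesis.lower().split()}
    if words & NEGATION_CUES:
        return "contradiction"
    if words & GENERIC_CUES:
        return "entailment"
    return "neutral"
```

A model that learns such shortcuts scores well in-distribution but fails on examples where the cue and the true label disagree, which is exactly what adversarial sets like HANS and ANLI probe.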

Applications and Connections

NLI training data and models serve as building blocks throughout NLP. NLI-trained sentence encoders (InferSent, Sentence-BERT) produce high-quality sentence embeddings for downstream tasks. Zero-shot text classification uses NLI: given a document as premise and a label description as hypothesis, the entailment score indicates the relevance of the label. Fact verification systems frame claim checking as NLI: the claim is the hypothesis and the evidence passage is the premise. In summarization evaluation, NLI-based metrics check whether the summary is entailed by the source document (factual consistency).
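The zero-shot recipe above can be sketched as follows. The sketch assumes some entailment-scoring function from a pretrained NLI model; here `toy_entail_prob` is a canned stub standing in for such a model, and the hypothesis template is one common choice, not a fixed standard:

```python
def zero_shot_classify(document, labels, entail_prob):
    """Rank candidate labels by how strongly the document entails
    a hypothesis of the form 'This text is about {label}.'"""
    template = "This text is about {}."
    scores = {lab: entail_prob(document, template.format(lab)) for lab in labels}
    best = max(scores, key=scores.get)
    return best, scores

# Stub standing in for a real NLI model's entailment probability;
# the canned numbers are made up for illustration.
_CANNED = {"sports": 0.91, "politics": 0.07}

def toy_entail_prob(premise, hypothesis):
    for lab, p in _CANNED.items():
        if lab in hypothesis:
            return p
    return 0.0

best, scores = zero_shot_classify(
    "The team won the championship game in overtime.",
    ["sports", "politics"],
    toy_entail_prob,
)
```

In practice the stub would be replaced by an MNLI-trained cross-encoder scored on each (document, templated hypothesis) pair; this is the scheme behind common zero-shot classification pipelines.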

NLI also connects to formal semantics through the FraCaS test suite, which evaluates systems on linguistically precise entailment patterns involving quantifiers, plurals, anaphora, and temporal expressions. Bridging the gap between the broad coverage of neural NLI models and the precision of formal semantic reasoning remains an important research direction. Hybrid approaches that combine neural language understanding with symbolic reasoning show promise for achieving both coverage and systematic generalization.

References

  1. Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of EMNLP (pp. 632–642). doi:10.18653/v1/D15-1075
  2. Williams, A., Nangia, N., & Bowman, S. R. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT (pp. 1112–1122). doi:10.18653/v1/N18-1101
  3. Chen, Q., Zhu, X., Ling, Z., Wei, S., Jiang, H., & Inkpen, D. (2017). Enhanced LSTM for natural language inference. In Proceedings of ACL (pp. 1657–1668). doi:10.18653/v1/P17-1152
  4. Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., & Kiela, D. (2020). Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of ACL (pp. 4885–4901). doi:10.18653/v1/2020.acl-main.441
