
Information Extraction

Information extraction automatically identifies and structures factual information from unstructured text, transforming natural language into machine-readable representations such as entity-relation triples, event records, and knowledge base entries.

IE: text → {(entity₁, relation, entity₂), ...}

Information extraction (IE) is the task of automatically extracting structured information from unstructured or semi-structured text. The goal is to convert natural language into formal representations — typically entity-relation triples, event records, or slot-value pairs — that can be stored in databases, queried programmatically, and reasoned over by downstream systems. IE encompasses a family of subtasks including named entity recognition, relation extraction, event extraction, coreference resolution, and temporal information extraction, each targeting different aspects of the factual content expressed in text.
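The structured representations described above can be made concrete with a minimal sketch of a triple store: extracted facts held as (subject, relation, object) tuples, indexed so they can be queried programmatically. The class name and example facts are illustrative, not from any particular IE system.

```python
from collections import defaultdict

class TripleStore:
    """Toy store for entity-relation triples, indexed by relation."""

    def __init__(self):
        self.by_relation = defaultdict(set)

    def add(self, subject, relation, obj):
        self.by_relation[relation].add((subject, obj))

    def query(self, relation):
        """Return all (subject, object) pairs holding for a relation."""
        return sorted(self.by_relation[relation])

store = TripleStore()
store.add("Marie Curie", "discovered", "radium")
store.add("Marie Curie", "born_in", "Warsaw")
store.add("Pierre Curie", "discovered", "radium")

print(store.query("discovered"))
# → [('Marie Curie', 'radium'), ('Pierre Curie', 'radium')]
```

A production system would back this with a database or a knowledge graph, but the interface — add triples, query by relation — is the same shape downstream systems reason over.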

Named Entity Recognition

BIO tagging for named entity recognition:

Input:  Marie  Curie  discovered  radium  in  Paris
Tags:   B-PER  I-PER  O           O       O   B-LOC

Sequence labelling: ŷ = argmax_y P(y | x)
CRF: P(y | x) = (1/Z) exp(∑ᵢ ∑ₖ λₖ fₖ(yᵢ₋₁, yᵢ, x, i))

Named entity recognition (NER) identifies mentions of entities in text and classifies them into predefined categories such as person, organisation, location, date, and numeric expression. NER is typically formulated as a sequence labelling task using the BIO (Beginning, Inside, Outside) tagging scheme, where each token receives a tag indicating whether it begins an entity of a given type, continues an entity, or is outside any entity. Conditional random fields (CRFs) and BiLSTM-CRF architectures were the dominant approach before pretrained transformers. Modern NER systems based on BERT and similar models achieve F1 scores above 93% on standard benchmarks such as CoNLL-2003.
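Whatever model produces the tag sequence, the final step is decoding BIO tags back into typed entity spans. The helper below is a small sketch of that decoding (the function name is ours); it handles the standard cases, including a malformed I- tag that starts a new span.

```python
def bio_to_spans(tokens, tags):
    """Convert a BIO tag sequence to (entity_type, start, end) spans,
    with end exclusive. Assumes one tag per token."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        # B- always opens a span; an I- with a mismatched type is
        # treated as implicitly opening one (common lenient decoding).
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if etype is not None:
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
        elif tag == "O":
            if etype is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if etype is not None:               # close a span at end of sentence
        spans.append((etype, start, len(tags)))
    return spans

tokens = "Marie Curie discovered radium in Paris".split()
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
# → [('PER', 0, 2), ('LOC', 5, 6)]
```

Evaluation on benchmarks like CoNLL-2003 scores exactly these spans: a prediction counts as correct only if both the boundaries and the type match, which is why span decoding conventions matter.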

End-to-End Extraction Pipelines

Practical information extraction systems combine multiple components into extraction pipelines. A typical pipeline processes text through sentence segmentation, tokenisation, part-of-speech tagging, NER, coreference resolution, and relation extraction, with each component building on the output of previous stages. Pipeline architectures are simple and modular but suffer from error propagation: mistakes in early stages compound through the pipeline. Joint models that perform multiple extraction tasks simultaneously can mitigate this problem by sharing representations and allowing information to flow between subtasks.
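The pipeline structure above can be sketched as function composition over a shared document record: each stage reads the fields earlier stages produced and adds its own. The stage implementations here are deliberately toy stand-ins (a whitespace tokeniser, a capitalisation-based "NER"), chosen only to show how errors in early stages propagate forward.

```python
def tokenize(doc):
    doc["tokens"] = doc["text"].split()
    return doc

def tag_entities(doc):
    # Toy NER: non-initial capitalised tokens count as entity mentions.
    # A sentence-initial entity like "Marie" is missed — an early-stage
    # error that every later stage inherits.
    doc["entities"] = [t for i, t in enumerate(doc["tokens"])
                       if i > 0 and t[0].isupper()]
    return doc

def extract_relations(doc):
    # Toy relation extraction: link adjacent entity mentions.
    ents = doc["entities"]
    doc["relations"] = [(a, "related_to", b) for a, b in zip(ents, ents[1:])]
    return doc

def run_pipeline(text, stages):
    doc = {"text": text}
    for stage in stages:   # each stage builds on the previous ones
        doc = stage(doc)
    return doc

doc = run_pipeline("Marie Curie worked in Paris",
                   [tokenize, tag_entities, extract_relations])
print(doc["relations"])
# → [('Curie', 'related_to', 'Paris')]
```

A joint model would instead predict entities and relations from shared representations in one step, so the relation extractor is not locked into whatever the NER stage got wrong.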

The MUC Conferences

Information extraction as a research field was largely shaped by the Message Understanding Conferences (MUC), a series of evaluations run by DARPA from 1987 to 1998. The MUC tasks defined template-filling problems: given a corpus of news articles about events such as terrorist attacks or corporate acquisitions, extract structured records with predefined slots (perpetrator, target, instrument, date, location). The MUC evaluations established the IE task definitions, evaluation metrics (precision, recall, F1), and competitive evaluation methodology that continue to define the field.

Knowledge base population (KBP) extends information extraction from individual documents to corpus-level knowledge acquisition. The goal is to extract facts from large text corpora and populate structured knowledge bases such as Wikidata or Freebase. KBP requires resolving entity mentions across documents (entity linking), combining evidence from multiple sources, and handling contradictory or uncertain information. The TAC KBP evaluations have driven progress in this area, promoting research on slot filling (extracting attribute values for entities), entity discovery and linking, and belief and sentiment extraction from text.
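The corpus-level aggregation step can be illustrated with a small sketch: link mention strings to canonical entities through an alias table (a stand-in for a real entity linker), then resolve contradictory per-document extractions by majority vote over the evidence. The aliases, slots, and documents are invented examples.

```python
from collections import Counter, defaultdict

# Hypothetical alias table standing in for an entity linking component.
ALIASES = {"M. Curie": "Marie Curie", "Madame Curie": "Marie Curie"}

def link(mention):
    return ALIASES.get(mention, mention)

def populate(extractions):
    """extractions: (mention, slot, value) triples from many documents.
    Returns a KB mapping (canonical_entity, slot) -> best-supported value."""
    evidence = defaultdict(Counter)
    for mention, slot, value in extractions:
        evidence[(link(mention), slot)][value] += 1
    # Majority vote: keep the value with the most supporting documents.
    return {key: counts.most_common(1)[0][0]
            for key, counts in evidence.items()}

kb = populate([
    ("M. Curie", "born_in", "Warsaw"),
    ("Madame Curie", "born_in", "Warsaw"),
    ("Marie Curie", "born_in", "Paris"),   # contradictory report
])
print(kb)
# → {('Marie Curie', 'born_in'): 'Warsaw'}
```

Real KBP systems replace both stand-ins with learned components — contextual entity linkers and calibrated confidence aggregation — but the shape of the problem is the same: cross-document linking first, then evidence combination.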

References

  1. Grishman, R., & Sundheim, B. (1996). Message Understanding Conference — 6: A brief history. Proceedings of COLING, 466–471. doi:10.3115/992628.992709
  2. Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Proceedings of CoNLL, 142–147.
  3. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. Proceedings of NAACL-HLT, 260–270.
  4. Ji, H., & Grishman, R. (2011). Knowledge base population: Successful approaches and challenges. Proceedings of ACL, 1148–1158.
