Named entity recognition (NER) is the task of locating and classifying named entities in unstructured text into predefined semantic categories. The most common categories are person (PER), organization (ORG), location (LOC), and miscellaneous (MISC), though domain-specific NER systems may recognize entities like gene names, drug names, chemical compounds, or legal citations. NER is a foundational component of information extraction pipelines and serves as input to relation extraction, knowledge base population, and question answering.
Sequence Labeling Formulation

Like chunking, NER is typically formulated as a sequence labeling task using IOB encoding. Each token is assigned a tag indicating whether it begins (B), continues (I), or is outside (O) an entity of a given type. The richer IOBES scheme additionally marks the end (E) of a multi-token entity and single-token entities (S):

IOB encoding:
Barack/B-PER Obama/I-PER was/O born/O in/O Honolulu/B-LOC

IOBES encoding:
Barack/B-PER Obama/E-PER was/O born/O in/O Honolulu/S-LOC

This formulation works well for flat, non-overlapping entities but cannot handle nested entities, e.g., "New York" as a location inside "New York Times" as an organization: [ORG [LOC New York] Times]. Nested NER requires span-based, hypergraph, or sequence-to-sequence approaches.
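Downstream consumers of NER output usually want entity spans, not per-token tags, so IOB sequences are decoded back into (start, end, type) triples. A minimal sketch of such a decoder (the function name `iob_to_spans` is ours; one common convention, followed here, is to treat an I- tag without a matching preceding B- as outside):

```python
def iob_to_spans(tags):
    """Decode a list of IOB tags into (start, end, type) spans,
    where end is exclusive, e.g. (0, 2, "PER") covers tokens 0-1."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close any open entity
                spans.append((start, i, etype))
            start, etype = i, tag[2:]      # open a new entity
        elif tag.startswith("I-") and etype == tag[2:]:
            continue                        # same entity continues
        else:
            if start is not None:          # O tag or type mismatch ends the entity
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:                  # entity running to end of sentence
        spans.append((start, len(tags), etype))
    return spans

tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(iob_to_spans(tags))  # [(0, 2, 'PER'), (5, 6, 'LOC')]
```

The same logic extends straightforwardly to IOBES, where E- and S- tags make entity boundaries explicit and decoding less ambiguous.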
Methods
NER systems have evolved through three generations. Rule-based and gazetteer-based systems used handcrafted patterns and dictionaries. Statistical systems, particularly CRFs with hand-crafted features (orthographic patterns, word shape, gazetteers), dominated from the mid-2000s. The current state of the art uses neural architectures: BiLSTM-CRF models (Lample et al., 2016) with character-level embeddings, and more recently, fine-tuned pre-trained language models like BERT that achieve F1 scores above 93% on the CoNLL-2003 English benchmark.
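To make the first generation concrete, here is a minimal sketch of a gazetteer-based tagger (the function name, the toy gazetteer entries, and the greedy longest-match-first strategy are illustrative assumptions, not a description of any particular historical system):

```python
# Toy gazetteer: multi-token entries map to entity types (illustrative data).
GAZETTEER = {
    ("Barack", "Obama"): "PER",
    ("Honolulu",): "LOC",
    ("New", "York", "Times"): "ORG",
}

def gazetteer_tag(tokens, gazetteer=GAZETTEER, max_len=3):
    """Assign IOB tags by greedy longest-match lookup against a gazetteer."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in gazetteer:
                etype = gazetteer[key]
                tags[i] = "B-" + etype
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + etype
                i += n
                break
        else:
            i += 1  # no match starting here; move on
    return tags

print(gazetteer_tag("Barack Obama was born in Honolulu".split()))
# ['B-PER', 'I-PER', 'O', 'O', 'O', 'B-LOC']
```

The limitations of this approach (no coverage of unseen entities, no use of context to disambiguate) are exactly what motivated the statistical and neural generations that followed.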
Evaluation and Challenges
NER evaluation uses entity-level F1: a predicted entity is correct only if both its boundaries and type match the gold standard exactly. The CoNLL-2003 shared task datasets (English and German) remain the most widely used benchmarks. Key challenges include recognizing entities in informal text (social media, conversational language), handling rare and emerging entities not seen in training, multilingual and cross-lingual NER, and resolving entity ambiguity (e.g., "Washington" as person, location, or organization).
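Entity-level F1 over exact-match spans can be computed directly from sets of (start, end, type) triples. A minimal sketch (the function name `entity_f1` is ours; this assumes each entity appears at most once per set, as in standard span-set evaluation):

```python
def entity_f1(gold, pred):
    """Entity-level precision/recall/F1: a predicted (start, end, type)
    triple counts as correct only if it matches a gold triple exactly."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)                      # exact boundary + type matches
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 2, "PER"), (5, 6, "LOC")]
pred = [(0, 2, "PER"), (5, 6, "ORG")]  # right span, wrong type -> not counted
print(entity_f1(gold, pred))  # 0.5
```

Note how strict this metric is: a prediction with perfect boundaries but the wrong type, or the right type with boundaries off by one token, earns no credit at all.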