
Semantic Textual Similarity

Semantic textual similarity (STS) measures the degree to which two text segments convey the same meaning, providing a continuous similarity score that underpins paraphrase detection, information retrieval, and evaluation of text generation systems.

STS(s1, s2) ∈ [0, 5], predicted as 5 · cos(v_{s1}, v_{s2})

Semantic textual similarity (STS) is the task of assigning a continuous score indicating how similar in meaning two text segments are. Unlike textual entailment, which asks a directional yes/no question about logical implication, STS measures a symmetric, graded similarity. The STS Benchmark, introduced through the SemEval shared tasks (2012-2017), uses a 0-5 scale where 0 indicates completely dissimilar sentences and 5 indicates semantically equivalent sentences. STS is both an intrinsic evaluation of semantic representations and a building block for applications like deduplication, clustering, and retrieval.

Methods and Models

STS Computation

Bi-encoder: STS(s1, s2) = 5 · cos(encode(s1), encode(s2))

Cross-encoder: STS(s1, s2) = MLP(BERT([CLS] s1 [SEP] s2 [SEP]))

Evaluation: Pearson / Spearman correlation with human judgments

STS-B scores: 0 (unrelated) to 5 (equivalent)
"A man is playing a guitar" vs "A man is playing music" → 3.8
"A cat sits on a mat" vs "A dog runs in a park" → 0.6

Two architectural paradigms dominate STS. Bi-encoders independently encode each sentence into a fixed vector and compute cosine similarity between the two vectors, enabling efficient retrieval over large collections. Cross-encoders concatenate the two sentences and pass them jointly through a Transformer, computing a similarity score from the joint representation. Cross-encoders are more accurate because they model fine-grained token-level interactions between the two sentences, but bi-encoders are orders of magnitude faster for retrieval because sentence vectors can be pre-computed and indexed.
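The precompute-then-compare pattern of a bi-encoder can be sketched with a stand-in encoder; here a bag-of-words counter substitutes for the learned model (a real system such as Sentence-BERT would produce dense Transformer embeddings), and the example sentences are illustrative:

```python
from collections import Counter
import math

def encode(sentence):
    # Stand-in "encoder": a bag-of-words count vector. A real bi-encoder
    # would return a dense embedding, but the usage pattern is the same.
    return Counter(sentence.lower().split())

def cosine(u, v):
    # Counter lookups return 0 for missing tokens, so iterating over u
    # covers every nonzero term of the dot product.
    dot = sum(u[t] * v[t] for t in u)
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v))

# Corpus vectors are encoded once and reused for every query.
corpus = ["A man is playing music", "A dog runs in a park"]
index = [encode(s) for s in corpus]
query = encode("A man is playing a guitar")
scores = [5 * cosine(query, v) for v in index]   # rescaled to the 0-5 range
print(scores)
```

A cross-encoder, by contrast, would need one full forward pass per (query, candidate) pair, which is why it cannot exploit a pre-built index.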

Training Objectives

STS models are trained using several objectives. Regression training directly predicts the human similarity score using mean squared error loss. Contrastive learning, used in models like SimCSE, trains the encoder to produce similar vectors for semantically similar pairs and dissimilar vectors for unrelated pairs. Knowledge distillation transfers the quality of cross-encoder scores to more efficient bi-encoder models. Multi-task training on STS, NLI, and paraphrase detection datasets typically yields the strongest general-purpose similarity models.
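The contrastive objective can be sketched as an in-batch InfoNCE loss over normalized embeddings. This is a simplified NumPy version of the SimCSE-style loss, run here on random toy vectors rather than real sentence embeddings:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss, SimCSE-style sketch.

    anchors, positives: (batch, dim) L2-normalized embeddings. Row i of
    `positives` is the positive pair for row i of `anchors`; every other
    row in the batch serves as an in-batch negative.
    """
    sims = anchors @ positives.T / temperature          # (batch, batch)
    sims -= sims.max(axis=1, keepdims=True)             # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                 # NLL of the true pairs

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Perfectly aligned pairs yield a lower loss than mismatched (reversed) pairs.
print(info_nce_loss(emb, emb), info_nce_loss(emb, emb[::-1]))
```

Minimizing this loss pulls positive pairs together and pushes in-batch negatives apart, which is exactly the geometry that makes cosine similarity a usable STS score at inference time.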

STS vs. Paraphrase Detection

STS is closely related to but distinct from paraphrase detection. Paraphrase detection is a binary classification task (paraphrase or not), while STS provides a continuous score. The Microsoft Research Paraphrase Corpus (MRPC) and Quora Question Pairs (QQP) dataset are standard paraphrase benchmarks. In practice, STS scores can be thresholded to perform paraphrase detection, and paraphrase data can augment STS training. The PAWS dataset specifically tests adversarial cases where high lexical overlap does not imply semantic similarity.
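Turning continuous STS scores into binary paraphrase labels is a one-line decision rule once a cut-off is chosen. The sketch below picks the accuracy-maximizing threshold on a tiny hypothetical dev set; in practice the threshold would be tuned on the MRPC or QQP dev split:

```python
def best_threshold(scores, labels):
    # Try each observed score as a cut-off and keep the one that
    # classifies the labeled dev set most accurately.
    def acc(t):
        return sum((s >= t) == l for s, l in zip(scores, labels)) / len(labels)
    return max(sorted(set(scores)), key=acc)

# Hypothetical dev data: STS scores (0-5) with gold paraphrase labels.
dev_scores = [0.6, 1.2, 2.0, 3.8, 4.1, 4.9]
dev_labels = [False, False, False, True, True, True]

t = best_threshold(dev_scores, dev_labels)
print(t, [s >= t for s in dev_scores])
```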

Applications and Challenges

STS underpins many practical applications. In information retrieval, STS between a query and candidate passages determines relevance ranking. In machine translation evaluation, metrics like BERTScore compute STS between candidate and reference translations, correlating with human quality judgments better than n-gram overlap metrics like BLEU. In automatic essay scoring and plagiarism detection, STS identifies semantically similar passages regardless of surface form differences.
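BERTScore's core mechanism is greedy token matching: each token aligns to its most similar token on the other side, and the matched similarities are averaged into precision, recall, and F1. A sketch with a pluggable token-similarity function; with exact match it degenerates to unigram F1, whereas BERTScore proper uses cosine similarity between contextual BERT embeddings:

```python
def greedy_f1(candidate, reference, sim):
    # Greedy matching: every token takes the max similarity against the
    # other sentence; sim(a, b) is any token-level similarity function.
    c, r = candidate.lower().split(), reference.lower().split()
    recall = sum(max(sim(t, u) for u in c) for t in r) / len(r)
    precision = sum(max(sim(t, u) for u in r) for t in c) / len(c)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Exact-match similarity: a degenerate stand-in for embedding cosine.
exact = lambda a, b: 1.0 if a == b else 0.0
f1 = greedy_f1("the cat sat on the mat", "a cat sat on a mat", exact)
print(f1)
```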

Challenges in STS include handling negation (sentences that differ by a single negation are lexically similar but semantically opposite), quantifier sensitivity ("all students passed" vs. "most students passed"), and domain shift (models trained on news text may perform poorly on clinical or legal text). Compositional generalization -- correctly scoring similarity for novel combinations of known words -- remains difficult, and models sometimes rely on superficial lexical cues rather than deep semantic comparison.
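The negation problem is easy to demonstrate with any surface-overlap measure: a bag-of-words cosine assigns a high score to a pair that differs only by "not", even though the meanings are opposite (the sentence pair is illustrative):

```python
from collections import Counter
import math

def lexical_cosine(s1, s2):
    # Cosine over bag-of-words count vectors: pure surface overlap.
    u, v = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(u[t] * v[t] for t in u)
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v))

# One negation word flips the meaning but barely moves lexical similarity.
print(lexical_cosine("the trial was successful",
                     "the trial was not successful"))
```

A well-trained STS model should score such a pair low despite the near-total lexical overlap, which is exactly what adversarial sets like PAWS probe.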



References

  1. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., & Specia, L. (2017). SemEval-2017 Task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. In Proceedings of SemEval (pp. 1–14). doi:10.18653/v1/S17-2001
  2. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of EMNLP-IJCNLP (pp. 3982–3992). doi:10.18653/v1/D19-1410
  3. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating text generation with BERT. In Proceedings of ICLR.
  4. Agirre, E., Cer, D., Diab, M., & Gonzalez-Agirre, A. (2012). SemEval-2012 Task 6: A pilot on semantic textual similarity. In Proceedings of SemEval (pp. 385–393).
