
Naive Bayes for Text

Naive Bayes classifiers apply Bayes' theorem with a strong feature-independence assumption to classify text documents, and achieve surprisingly competitive performance even though the assumption that words occur independently of one another given the class label is clearly unrealistic.

P(c | d) ∝ P(c) ∏ᵢ P(wᵢ | c)

The Naive Bayes classifier is a probabilistic generative model that applies Bayes' theorem to compute the posterior probability of each class given a document, making the simplifying assumption that features (typically words) are conditionally independent given the class label. Despite this assumption being clearly violated in natural language — words are highly correlated with one another — Naive Bayes classifiers perform remarkably well on text classification tasks. This robustness arises because the classifier only needs to rank classes correctly, not estimate calibrated probabilities, and the independence assumption often preserves the correct ranking even when it distorts the magnitudes.

Multinomial and Bernoulli Models

Multinomial Naive Bayes:

P(c | d) ∝ P(c) ∏ᵢ₌₁ⁿ P(wᵢ | c)^tf(wᵢ,d)

where tf(wᵢ, d) is the term frequency of word wᵢ in document d

Parameter estimation with Laplace (add-one) smoothing:
P(wᵢ | c) = (count(wᵢ, c) + 1) / (∑_w count(w, c) + |V|)

Two main variants of Naive Bayes are used for text. The multinomial Naive Bayes model treats a document as a bag of words and models word frequencies, making it natural for longer documents where word repetition carries information. The multivariate Bernoulli model instead represents documents as binary vectors indicating word presence or absence, ignoring frequency. McCallum and Nigam (1998) showed that the multinomial model generally outperforms the Bernoulli model for text classification, particularly on longer documents and larger vocabularies, because it exploits the additional information in word counts.
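The multinomial model and its smoothed estimates above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the function names `train_nb` and `predict_nb` are chosen here for exposition. Training is a single counting pass, and prediction sums log probabilities to avoid underflow.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train multinomial Naive Bayes with Laplace smoothing.
    docs: list of (label, token_list) pairs."""
    class_docs = Counter()                # number of documents per class
    word_counts = defaultdict(Counter)    # word_counts[c][w] = count(w, c)
    vocab = set()
    for label, tokens in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_docs.items()}
    V = len(vocab)
    log_lik = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())      # ∑_w count(w, c)
        # Laplace smoothing: (count(w, c) + 1) / (total + |V|)
        log_lik[c] = {w: math.log((counts[w] + 1) / (total + V)) for w in vocab}
    return log_prior, log_lik, vocab

def predict_nb(model, tokens):
    """Return the class maximising log P(c) + ∑ᵢ log P(wᵢ | c)."""
    log_prior, log_lik, vocab = model
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
    return max(scores, key=scores.get)
```

For example, on a toy spam corpus:

```python
docs = [("spam", "free money now".split()),
        ("spam", "win money free".split()),
        ("ham", "meeting agenda attached".split()),
        ("ham", "lunch meeting tomorrow".split())]
model = train_nb(docs)
predict_nb(model, "free money".split())  # → "spam"
```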

Strengths and Limitations

Naive Bayes has several practical advantages that explain its enduring popularity. Training requires only a single pass through the data to compute word counts per class, making it extremely fast — linear in the number of training documents and vocabulary size. The model is highly interpretable: the most discriminative features for each class can be identified by examining the likelihood ratios P(w | c₁) / P(w | c₂). It also performs well with small training sets, since the independence assumption acts as a strong regulariser that prevents overfitting.
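The interpretability point can be made concrete: given smoothed per-class word likelihoods, ranking words by the log-likelihood ratio log(P(w | c₁) / P(w | c₂)) surfaces the most discriminative features. The probability values below are toy numbers for illustration, not estimates from a real corpus.

```python
import math

# Toy smoothed likelihoods for two classes (illustrative values only)
p_spam = {"free": 0.20, "money": 0.20, "meeting": 0.07}
p_ham  = {"free": 0.07, "money": 0.07, "meeting": 0.20}

# Large positive ratios favour spam; large negative ratios favour ham
ratios = {w: math.log(p_spam[w] / p_ham[w]) for w in p_spam}
top_spam_words = sorted(ratios, key=ratios.get, reverse=True)
```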

Spam Filtering Pioneer

Naive Bayes became widely known through its application to email spam filtering. Sahami et al. (1998) demonstrated that a simple Naive Bayes classifier trained on word features could effectively distinguish spam from legitimate email. Paul Graham's 2002 essay "A Plan for Spam" popularised the approach and led to its adoption in major email clients. The success of Naive Bayes in spam filtering illustrated how a theoretically naive model could solve a practical problem of enormous scale.

The primary limitation of Naive Bayes is the independence assumption itself. Correlated features — such as the bigram "New York" — are treated as independent evidence, which can lead to overconfident predictions. Complement Naive Bayes (Rennie et al., 2003) addresses some of these issues by estimating parameters using data from all classes except the target class, yielding improved performance on imbalanced datasets. Nevertheless, for many text classification tasks, discriminative models such as logistic regression and SVMs achieve higher accuracy by modelling feature interactions implicitly.
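The core idea of Complement Naive Bayes can be sketched as follows: each class's likelihoods are estimated from the pooled counts of every *other* class. This sketch covers only the complement estimate itself and omits the additional weight normalisation that Rennie et al. (2003) also propose; the function name is illustrative.

```python
from collections import Counter

def complement_estimates(word_counts, vocab):
    """Complement Naive Bayes parameters: P(w | c̃) is estimated from the
    counts of every class EXCEPT c, with Laplace smoothing.
    word_counts: dict mapping class -> Counter of word counts."""
    V = len(vocab)
    total_counts = Counter()
    for counts in word_counts.values():
        total_counts.update(counts)
    grand_total = sum(total_counts.values())
    params = {}
    for c, counts in word_counts.items():
        comp_total = grand_total - sum(counts.values())  # tokens outside class c
        params[c] = {w: (total_counts[w] - counts[w] + 1) / (comp_total + V)
                     for w in vocab}
    return params
```

At classification time the document is then assigned to the class whose complement score is *lowest*, i.e. the class the document looks least unlike.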



References

  1. McCallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes text classification. AAAI Workshop on Learning for Text Categorization, 41–48.
  2. Rennie, J. D. M., Shih, L., Teevan, J., & Karger, D. R. (2003). Tackling the poor assumptions of naive Bayes text classifiers. Proceedings of ICML, 616–623.
  3. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
  4. Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian approach to filtering junk e-mail. AAAI Workshop on Learning for Text Categorization, 55–62.
