Neural language models represent a paradigm shift from count-based to prediction-based approaches to language modeling. Rather than storing and smoothing n-gram counts, a neural language model maps words to dense vector representations (embeddings) and uses a neural network to compute the probability of the next word given its context. This approach, introduced by Bengio et al. (2003), addresses two fundamental limitations of n-gram models: the curse of dimensionality (the exponential growth of parameters with context length) and the inability to generalize across semantically similar contexts.
The Neural Probabilistic Language Model
Embedding: x = [C(wₜ₋ₙ₊₁); ...; C(wₜ₋₁)]
Hidden: h = tanh(Hx + d)
Scores: y = Wx + Uh + b
Output: P(wₜ = j | context) = softmax(y)ⱼ = exp(yⱼ) / Σₖ exp(yₖ)
In Bengio's architecture, each word in the vocabulary is mapped to a dense vector of dimension m through a shared embedding matrix C. The context window of n-1 word embeddings is concatenated and passed through a hidden layer with tanh activation, followed by a softmax output layer over the entire vocabulary; optional direct connections (the Wx term above) link the input embeddings straight to the output layer. The model is trained to maximize the log-likelihood of the training data using stochastic gradient descent. A critical feature is that the embedding matrix C is shared across all positions, so the model learns a single representation for each word that must work well in all contexts.
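The forward pass above can be sketched in a few lines of numpy. This is a minimal illustration of the architecture, not the paper's implementation; all dimensions are toy values chosen for readability.

```python
import numpy as np

# Toy sizes (illustrative only): vocabulary V, embedding dim m,
# context of n-1 = 3 words, hidden dim h
V, m, context, hidden = 10, 4, 3, 8
rng = np.random.default_rng(0)

C = rng.normal(0, 0.1, (V, m))                 # shared embedding matrix
H = rng.normal(0, 0.1, (hidden, context * m))  # input-to-hidden weights
d = np.zeros(hidden)                           # hidden bias
U = rng.normal(0, 0.1, (V, hidden))            # hidden-to-output weights
W = rng.normal(0, 0.1, (V, context * m))       # direct input-to-output weights
b = np.zeros(V)                                # output bias

def next_word_probs(context_ids):
    """One forward pass of the feedforward NLM for a single context window."""
    x = np.concatenate([C[w] for w in context_ids])  # concatenated embeddings
    h = np.tanh(H @ x + d)                           # hidden layer
    y = W @ x + U @ h + b                            # unnormalized scores
    e = np.exp(y - y.max())                          # numerically stable softmax
    return e / e.sum()

p = next_word_probs([1, 5, 7])  # P(w_t | three preceding word ids)
```

Training would backpropagate the cross-entropy loss through all of these matrices, including C, which is how the embeddings are learned as a byproduct.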
Advantages over N-Gram Models
The fundamental advantage of neural language models is their ability to generalize. If the model has seen "the cat sat on the mat" during training, it can assign reasonable probability to "the dog sat on the mat" even if this exact sentence never occurred, because "cat" and "dog" share similar embeddings. This generalization through distributed representations solves the core problem of n-gram models, which treat each word as an atomic symbol with no notion of similarity. Neural models also scale better with context length because the number of parameters grows linearly rather than exponentially with the window size.
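The scaling claim can be made concrete with a back-of-the-envelope parameter count. The sizes below are illustrative assumptions, not figures from the original paper; the point is the growth rate, not the absolute numbers.

```python
# Parameter growth with context length: full n-gram table vs neural model.
V = 50_000          # vocabulary size (assumed)
m, hidden = 100, 500  # embedding and hidden dims (assumed)

def ngram_params(n):
    # A full n-gram table has up to V**n entries: exponential in n.
    return V ** n

def neural_params(n):
    context = n - 1
    # C + H + d + U + W + b, following the feedforward architecture
    return (V * m + hidden * context * m + hidden
            + V * hidden + V * context * m + V)

for n in (3, 4, 5):
    print(n, ngram_params(n), neural_params(n))
```

Each extra context word adds a fixed number of parameters to the neural model (one more embedding-width slice of H and W), while the n-gram table multiplies by V.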
The Softmax Bottleneck
Computing the softmax over a large vocabulary (often 50,000-100,000 words) is the computational bottleneck of neural language models, requiring a matrix multiplication with the full vocabulary at every time step. This has motivated various approximation techniques including hierarchical softmax, noise contrastive estimation (NCE), negative sampling, and adaptive softmax. These methods trade off exactness for speed, with some (like adaptive softmax) providing order-of-magnitude speedups with minimal degradation in model quality.
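Negative sampling illustrates the trade-off concisely: instead of normalizing over all V words, the loss scores the true next word against k randomly drawn negatives. The sketch below is a simplified version with assumed sizes and a uniform negative distribution (practical systems typically sample from a smoothed unigram distribution).

```python
import numpy as np

rng = np.random.default_rng(1)
V, hidden, k = 50_000, 256, 10          # vocab, hidden dim, negatives (assumed)
U = rng.normal(0, 0.01, (V, hidden))    # output embedding matrix
h = rng.normal(0, 1.0, hidden)          # hidden state for one context

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(target, h):
    """Approximate the softmax objective: score the target word against
    k sampled negatives instead of all V vocabulary words."""
    negatives = rng.integers(0, V, size=k)       # uniform here for simplicity
    pos = np.log(sigmoid(U[target] @ h))         # push target score up
    neg = np.log(sigmoid(-(U[negatives] @ h))).sum()  # push negatives down
    return -(pos + neg)                          # touches k+1 rows of U, not V

loss = negative_sampling_loss(42, h)
```

Per training example, the cost drops from O(V · hidden) for the full softmax to O(k · hidden), which is where the order-of-magnitude speedups come from.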
Evolution and Impact
The feedforward neural language model of Bengio et al. was followed by recurrent architectures (Mikolov et al., 2010) that removed the fixed context window, LSTM-based models (Sundermeyer et al., 2012) that captured long-range dependencies, and ultimately transformer-based models (Vaswani et al., 2017) that achieved state-of-the-art results through self-attention mechanisms. Each architectural advance brought substantial perplexity reductions: from roughly 150 for smoothed trigrams to 80-90 for feedforward neural models, 60-70 for RNN models, and below 20 for modern transformers on standard benchmarks.
The success of neural language models has had far-reaching consequences for NLP. The word embeddings learned as a byproduct of language modeling proved valuable as general-purpose representations for downstream tasks, inspiring Word2Vec, GloVe, and the broader paradigm of representation learning. The pre-training revolution, in which large language models trained on massive corpora are fine-tuned for specific tasks, can be seen as the culmination of the neural language modeling approach, demonstrating that learning to predict the next word is a powerful general-purpose objective for acquiring linguistic knowledge.