Speaker recognition is the task of automatically determining the identity of a speaker from their voice. The field distinguishes two subtasks: speaker verification (accepting or rejecting a claimed identity based on a speech sample) and speaker identification (determining which speaker from a known set produced a given utterance). Speaker recognition exploits the fact that each person's voice has distinctive acoustic characteristics arising from the unique anatomy of their vocal tract, larynx, and nasal cavity, as well as learned speaking habits including accent, speaking rate, and intonation patterns.
Speaker Embeddings

A verification trial encodes both the enrollment and test utterances, scores the resulting embeddings with cosine similarity, and thresholds the score:

Enroll: e_{enroll} = SpeakerEncoder(x_{enroll})
Test: e_{test} = SpeakerEncoder(x_{test})
Score: s = cos(e_{enroll}, e_{test}) = (e_{enroll} · e_{test}) / (||e_{enroll}|| · ||e_{test}||)
Decision: Accept if s > θ, Reject otherwise
Metrics: EER (Equal Error Rate), minDCF
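The scoring and decision steps above can be sketched in a few lines; this is an illustrative implementation, and the threshold value of 0.5 is an arbitrary placeholder rather than a tuned operating point:

```python
import numpy as np

def cosine_score(e_enroll, e_test):
    """Cosine similarity between an enrollment and a test embedding."""
    e1 = np.asarray(e_enroll, dtype=float)
    e2 = np.asarray(e_test, dtype=float)
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def verify(e_enroll, e_test, threshold=0.5):
    """Accept the claimed identity if the score exceeds the threshold."""
    return cosine_score(e_enroll, e_test) > threshold
```

In practice the threshold θ is calibrated on a development set to hit a target trade-off between false acceptances and false rejections.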
Modern speaker recognition systems represent each utterance as a fixed-dimensional embedding vector that captures speaker-discriminative information while being invariant to lexical content, channel conditions, and background noise. The i-vector framework, dominant from 2010 to 2017, used factor analysis to extract a low-dimensional representation from Gaussian mixture model (GMM) sufficient statistics. The x-vector framework, introduced by Snyder et al. in 2018, replaced this with a time-delay neural network (TDNN) that processes variable-length utterances through frame-level layers, a statistics pooling layer, and segment-level layers to produce a speaker embedding.
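The statistics pooling layer is what turns a variable-length utterance into a fixed-dimensional vector: it concatenates the per-dimension mean and standard deviation of the frame-level features over time. A minimal sketch of just that step (the surrounding TDNN layers are omitted):

```python
import numpy as np

def statistics_pooling(frames):
    """Map frame-level features of shape (T, D) to a fixed-length
    segment-level vector of shape (2D,) by concatenating the mean
    and standard deviation of each dimension over the T frames."""
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])
```

Because the mean and standard deviation are computed over time, the output dimensionality is independent of the utterance length T, which is what lets the segment-level layers operate on a fixed-size input.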
Training and Scoring
Speaker embedding networks are trained with discriminative losses that encourage embeddings from the same speaker to be close together while pushing embeddings from different speakers apart. The additive angular margin (AAM) loss and its variants have proven particularly effective, learning embeddings that are well-separated on the unit hypersphere. Scoring typically uses cosine similarity between enrollment and test embeddings, often enhanced by Probabilistic Linear Discriminant Analysis (PLDA), which models within-speaker and between-speaker variability in the embedding space.
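To make the AAM idea concrete, the following is a simplified numpy sketch of the loss computation for one batch; the scale s and margin m values are illustrative hyperparameters, and a real training implementation would use an autodiff framework:

```python
import numpy as np

def aam_softmax_loss(embeddings, weights, labels, s=30.0, m=0.2):
    """Additive angular margin (AAM) softmax loss, simplified.
    embeddings: (N, D) batch, weights: (C, D) class directions,
    labels: (N,) integer class indices."""
    # L2-normalise so dot products are cosines of angles on the hypersphere
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T                                  # (N, C) cosine similarities
    theta = np.arccos(np.clip(cos, -1.0, 1.0))     # angles
    idx = np.arange(len(labels))
    # add the angular margin m to the target-class angle only,
    # which shrinks the target logit and forces tighter clustering
    logits = s * cos
    logits[idx, labels] = s * np.cos(theta[idx, labels] + m)
    # standard cross-entropy over the margin-adjusted logits
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[idx, labels].mean())
```

Adding the margin makes the loss strictly harder to minimise than plain softmax, so the network must place same-speaker embeddings within an angular band of the class direction rather than merely on the correct side of a boundary.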
As speaker verification systems are deployed for authentication (phone banking, access control), they become targets for spoofing attacks. Replay attacks present recorded speech through a loudspeaker; text-to-speech and voice conversion attacks synthesize speech in the target speaker's voice. The ASVspoof challenge series has driven research into countermeasures that detect artifacts of replay, synthesis, and conversion. Modern systems deploy spoofing countermeasures alongside the speaker verification system, rejecting both impostors with different voices and attackers who synthesize the target voice.
Speaker diarization — determining "who spoke when" in a multi-speaker recording — extends speaker recognition to continuous audio streams. Diarization systems segment the audio into speaker-homogeneous regions and cluster them by speaker identity. Modern neural diarization approaches use speaker embeddings for clustering or directly predict speaker activities with end-to-end neural diarization (EEND) models that output frame-level speaker activity for each potential speaker.
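The embedding-clustering route can be illustrated with a deliberately simple greedy scheme: each segment joins an existing cluster if its embedding is similar enough to that cluster's representative, otherwise it starts a new cluster. This is a toy sketch (real diarization systems typically use agglomerative or spectral clustering, and the 0.7 threshold here is arbitrary):

```python
import numpy as np

def cluster_segments(embeddings, threshold=0.7):
    """Greedily assign segment embeddings to speaker clusters by
    cosine similarity to each cluster's first member."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    representatives, labels = [], []
    for e in np.asarray(embeddings, dtype=float):
        sims = [cos(e, r) for r in representatives]
        if sims and max(sims) > threshold:
            labels.append(int(np.argmax(sims)))   # join the closest cluster
        else:
            representatives.append(e)             # start a new cluster
            labels.append(len(representatives) - 1)
    return labels
```

The output is one speaker label per segment, which, combined with the segment boundaries, yields the "who spoke when" annotation.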
The performance of speaker recognition systems is evaluated using the Equal Error Rate (EER), the operating point where the false acceptance rate equals the false rejection rate, and the minimum detection cost function (minDCF), which weights errors according to application-specific costs. On the VoxCeleb benchmark, current systems achieve EERs below 1%, demonstrating remarkable accuracy, though performance degrades with short utterances, noisy conditions, and cross-channel mismatches between enrollment and test audio.
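The EER can be computed from two score lists (target trials and impostor trials) by sweeping the decision threshold and finding where the two error rates meet; a simple sketch over the observed scores:

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Sweep the threshold over all observed scores and return the
    error rate where false acceptance and false rejection are closest."""
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    best_far, best_frr = 1.0, 0.0
    for t in np.sort(np.concatenate([target, impostor])):
        far = float(np.mean(impostor >= t))   # impostors wrongly accepted
        frr = float(np.mean(target < t))      # targets wrongly rejected
        if abs(far - frr) < abs(best_far - best_frr):
            best_far, best_frr = far, frr
    return (best_far + best_frr) / 2
```

Because scores are discrete samples, the FAR and FRR curves rarely cross exactly; averaging the two rates at the closest point is a common convention, though production evaluation tools interpolate the curves instead.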