Computational Linguistics

Formant Analysis

Formant analysis identifies and tracks the resonant frequencies of the vocal tract, which are the primary acoustic cues for vowel identity and play a crucial role in characterizing consonantal articulation.

F_n = (2n − 1) c / (4L), n = 1, 2, 3, ... (uniform tube of length L, speed of sound c)

Formants are the resonant frequencies of the vocal tract, appearing as peaks in the spectral envelope of voiced speech. They arise from the acoustic properties of the vocal tract as a resonating tube: as the speaker changes the shape of the tract through movements of the tongue, jaw, lips, and velum, the formant frequencies shift, producing the rich variety of vowels and consonants in human languages. Formant analysis — the automatic detection, estimation, and tracking of these resonant frequencies — is a foundational technique in acoustic phonetics, speech recognition, and speaker characterization.

Source-Filter Theory

Formant Frequencies
Uniform tube model: F_n = (2n − 1) · c / (4L), n = 1, 2, 3, ...
c ≈ 35,000 cm/s (speed of sound in air)
L ≈ 17.5 cm (average male vocal tract length)

F1 ≈ 500 Hz, F2 ≈ 1500 Hz, F3 ≈ 2500 Hz

Vowel identity: primarily determined by F1 and F2
F1 inversely correlates with tongue height
F2 correlates with tongue advancement (front/back)
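
The uniform tube figures above can be reproduced with a few lines of Python. This is a minimal sketch of the formula only; the function name and the 14 cm comparison length are illustrative assumptions, not values from the text.

```python
def uniform_tube_formants(length_cm=17.5, speed_cm_s=35_000.0, count=3):
    """Resonances F_n = (2n - 1) * c / (4L) of a uniform tube, in Hz."""
    return [(2 * n - 1) * speed_cm_s / (4 * length_cm) for n in range(1, count + 1)]


# Values from the info box: L = 17.5 cm, c = 35,000 cm/s.
print(uniform_tube_formants())       # [500.0, 1500.0, 2500.0]

# Hypothetical shorter tract (14 cm, an assumed value): every resonance rises.
print(uniform_tube_formants(14.0))   # [625.0, 1875.0, 3125.0]
```

Because every F_n is proportional to 1/L, a shorter tract raises all formants by the same factor.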

The source-filter model of speech production, formalized by Gunnar Fant in 1960, separates the glottal source (periodic vocal fold vibration for voiced sounds) from the vocal tract filter (the resonating cavity from larynx to lips). Formants are properties of the filter: the vocal tract transfer function has peaks at the resonant frequencies, amplifying harmonics of the fundamental frequency that fall near these peaks. This separation of source and filter is the theoretical basis for formant analysis and for many speech coding and synthesis techniques.
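
One consequence of this separation is that the filter is only observed at the harmonics the source provides. The sketch below, under simplifying assumptions, finds the harmonic closest to each formant for two fundamental frequencies. The function name and the 100 Hz and 400 Hz values are illustrative choices, and frequencies are restricted to whole Hz so that `range` can be used.

```python
def nearest_harmonics(f0_hz, formants_hz, max_hz=4000):
    """For each formant, return (nearest harmonic of f0, distance to the formant)."""
    harmonics = list(range(f0_hz, max_hz + 1, f0_hz))
    result = []
    for formant in formants_hz:
        nearest = min(harmonics, key=lambda h: abs(h - formant))
        result.append((nearest, abs(nearest - formant)))
    return result


formants = [500, 1500, 2500]               # uniform-tube values from the text
print(nearest_harmonics(100, formants))    # [(500, 0), (1500, 0), (2500, 0)]
print(nearest_harmonics(400, formants))    # [(400, 100), (1600, 100), (2400, 100)]
```

With a low fundamental a harmonic lands on or near every resonance peak. With a high fundamental the nearest harmonic can sit well away from the peak, so the peak location has to be inferred rather than read off directly.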

Formant Estimation Methods

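The most common computational method for formant estimation is Linear Predictive Coding (LPC), which fits an all-pole model to the speech spectrum. The poles of the LPC transfer function correspond approximately to the formant frequencies and bandwidths. The LPC order determines how many resonances the model can capture, since each resonance requires one complex-conjugate pole pair: a typical order of 10 to 12 for narrowband speech (8 kHz sampling) provides 5 to 6 pole pairs, enough for the roughly four formants that lie below the 4 kHz Nyquist frequency plus one or two pairs that absorb the overall spectral tilt contributed by the glottal source and lip radiation. Root-finding on the LPC polynomial yields the pole locations; a pole at radius r and angle θ (in radians) corresponds to a frequency of θ · fs / (2π) and a bandwidth of approximately −ln(r) · fs / π, where fs is the sampling rate. Poles with frequencies in the expected formant ranges and sufficiently narrow bandwidths are selected as formant candidates.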
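
The LPC procedure can be sketched with NumPy as follows. This is an illustration under stated assumptions, not a production formant estimator. The test frame is a synthetic impulse response of four cascaded resonators rather than real speech. The pre-emphasis and windowing normally applied to speech frames are omitted. The 90 Hz and 400 Hz selection thresholds, and all function names, are assumed for the example.

```python
import numpy as np


def synth_frame(formants_hz, bandwidths_hz, fs, n_samples=1024):
    """Impulse response of cascaded two-pole resonators (a toy 'vowel' frame)."""
    denom = np.array([1.0])
    for f, bw in zip(formants_hz, bandwidths_hz):
        r = np.exp(-np.pi * bw / fs)          # pole radius from bandwidth
        theta = 2 * np.pi * f / fs            # pole angle from frequency
        denom = np.convolve(denom, [1.0, -2 * r * np.cos(theta), r * r])
    h = np.zeros(n_samples)
    for n in range(n_samples):
        past = sum(denom[j] * h[n - j] for j in range(1, min(n, len(denom) - 1) + 1))
        h[n] = (1.0 if n == 0 else 0.0) - past
    return h


def lpc(frame, order):
    """Autocorrelation-method LPC: returns [1, a_1, ..., a_order]."""
    n = len(frame)
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    coeffs = np.linalg.solve(r[lags], -r[1:])  # Toeplitz normal equations
    return np.concatenate(([1.0], coeffs))


def formants_from_lpc(a, fs, min_freq_hz=90.0, max_bandwidth_hz=400.0):
    """Convert LPC poles to (frequencies, bandwidths) and keep plausible formants."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]          # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi
    keep = (freqs > min_freq_hz) & (bws < max_bandwidth_hz)  # assumed thresholds
    order = np.argsort(freqs[keep])
    return freqs[keep][order], bws[keep][order]


fs = 8000
frame = synth_frame([500, 1500, 2500, 3500], [60, 90, 120, 150], fs)
freqs, bws = formants_from_lpc(lpc(frame, order=10), fs)
print(np.round(freqs), np.round(bws))
```

On this idealized all-pole input the estimated frequencies and bandwidths should land very close to the values used for synthesis, and the two surplus poles of the order-10 model are rejected by the bandwidth test. Real speech is not exactly all-pole, so estimates there are approximate and the thresholds need tuning.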

The F1-F2 Vowel Space

Plotting the first formant (F1) against the second formant (F2) for different vowels produces the classic vowel space diagram, one of the most important visualizations in phonetics. The vowels of any language can be located in this two-dimensional space: high vowels like /i/ and /u/ have low F1, low vowels like /a/ have high F1; front vowels like /i/ and /e/ have high F2, back vowels like /u/ and /o/ have low F2. This mapping between articulatory configuration and acoustic output is remarkably consistent across speakers after normalization, and has been used to study dialect variation, language change, and second language acquisition.
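
The mapping from a measured (F1, F2) pair to a region of the vowel space can be sketched as below. The prototype values are the approximate adult male averages commonly quoted from Peterson and Barney (1952). The 500 Hz and 1500 Hz split points (the uniform-tube F1 and F2), the log-frequency distance, and the function names are assumptions made for illustration, not an established classification method.

```python
import math

# Approximate adult male mean (F1, F2) in Hz, as commonly quoted from
# Peterson & Barney (1952); treat these as illustrative prototypes.
PROTOTYPES_HZ = {
    "i": (270, 2290),
    "æ": (660, 1720),
    "ɑ": (730, 1090),
    "u": (300, 870),
}


def describe(f1, f2, f1_split=500.0, f2_split=1500.0):
    """Coarse articulatory label: low F1 means high vowel, high F2 means front vowel."""
    height = "high" if f1 < f1_split else "low"
    backness = "front" if f2 > f2_split else "back"
    return f"{height} {backness}"


def nearest_vowel(f1, f2):
    """Prototype closest to (f1, f2), using distance in log-frequency (an assumption)."""
    def dist(proto):
        p1, p2 = proto
        return math.hypot(math.log(f1 / p1), math.log(f2 / p2))
    return min(PROTOTYPES_HZ, key=lambda v: dist(PROTOTYPES_HZ[v]))


print(describe(280, 2250), nearest_vowel(280, 2250))   # high front i
print(describe(700, 1150), nearest_vowel(700, 1150))   # low back ɑ
```

Raw Hz values from different speakers should be normalized before they are compared this way.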

Formant tracking across continuous speech is considerably more challenging than estimating formants in a single frame. Formants can merge, split, or cross as the vocal tract configuration changes rapidly during articulation. Three families of methods address this challenge: dynamic programming algorithms that enforce continuity constraints, Kalman filtering approaches that model formant dynamics, and, more recently, deep learning methods that learn formant tracking from annotated data. Reliable formant tracking remains an active research problem, particularly for high-pitched voices (children and some women), where the widely spaced harmonics sample the spectral envelope too sparsely for reliable estimation.
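
The dynamic programming idea can be sketched as a Viterbi search that picks one candidate per frame while penalizing large frame-to-frame jumps. This is a toy single-formant version. The cost terms, the weights, and the candidate values are assumptions chosen for illustration and do not reproduce any particular published tracker.

```python
def track_formant(candidates, nominal_hz, jump_weight=1.0, local_weight=0.1):
    """Pick one candidate per frame minimizing local + transition cost (Viterbi)."""
    # cost[t][i]: best cumulative cost of any path ending at candidate i of frame t
    cost = [[local_weight * abs(f - nominal_hz) for f in candidates[0]]]
    back = []  # back[t - 1][i]: best predecessor index for candidate i of frame t
    for t in range(1, len(candidates)):
        row_cost, row_back = [], []
        for f in candidates[t]:
            options = [
                cost[-1][j] + jump_weight * abs(f - g)
                for j, g in enumerate(candidates[t - 1])
            ]
            j_best = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[j_best] + local_weight * abs(f - nominal_hz))
            row_back.append(j_best)
        cost.append(row_cost)
        back.append(row_back)
    # Backtrace from the cheapest final candidate.
    i = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    path = [i]
    for row_back in reversed(back):
        i = row_back[i]
        path.append(i)
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]


# Hypothetical per-frame candidates (Hz); frame 3 contains a spurious 1500 Hz peak.
frames = [[600, 1800], [650, 1750], [700, 1500, 1700], [720, 1650]]
print(track_formant(frames, nominal_hz=1500))   # [1800, 1750, 1700, 1650]
```

Choosing the candidate nearest to 1500 Hz in each frame independently would select the spurious peak in the third frame. The continuity penalty keeps the track on the smoothly falling trajectory instead.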

Formant information is used in numerous applications beyond basic phonetic analysis. In speaker normalization, formant ratios help factor out vocal tract length differences between speakers. In forensic phonetics, formant measurements contribute to speaker comparison evidence. In clinical speech science, formant analysis quantifies articulatory imprecision in disordered speech. In speech synthesis, controlling formant trajectories remains a key mechanism for producing intelligible and natural vowels.
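
The role of formant ratios in speaker normalization can be illustrated with the uniform tube model. This is a sketch under that idealization only. The 14 cm tract length and the function names are assumed for the example, and real vocal tracts do not scale perfectly uniformly.

```python
def tube_formants(length_cm, speed_cm_s=35_000.0, count=3):
    """Uniform-tube resonances F_n = (2n - 1) * c / (4L), in Hz."""
    return [(2 * n - 1) * speed_cm_s / (4 * length_cm) for n in range(1, count + 1)]


def formant_ratios(formants_hz):
    """Ratios of each higher formant to F1."""
    return [f / formants_hz[0] for f in formants_hz[1:]]


long_tract = tube_formants(17.5)    # [500.0, 1500.0, 2500.0]
short_tract = tube_formants(14.0)   # [625.0, 1875.0, 3125.0]

print(formant_ratios(long_tract))   # [3.0, 5.0]
print(formant_ratios(short_tract))  # [3.0, 5.0]
```

The absolute frequencies differ by 25 percent between the two tubes, but the ratios are identical, because a change in length multiplies every resonance by the same factor.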


References

  1. Fant, G. (1960). Acoustic Theory of Speech Production. Mouton & Co. doi:10.1515/9783110873429
  2. Deng, L., Cui, X., Pruvenok, R., Huang, J., Momen, S., Chen, Y., & Alwan, A. (2006). A database of vocal tract resonance trajectories for research in speech processing. Proc. ICASSP, 1, 369–372. doi:10.1109/ICASSP.2006.1660036
  3. Stevens, K. N. (1998). Acoustic Phonetics. MIT Press.
  4. Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24(2), 175–184. doi:10.1121/1.1906875
