The speech component of human-computer interaction has gained considerable momentum and attention in recent years. However, progress in advanced computer speech interfaces is limited in part by incomplete knowledge of the physics of speech production. For computer-generated speech output, this limits the naturalness and intelligibility of synthetic speech. Unvoiced speech sounds such as fricatives are an important example. These sounds are produced by "turbulent" air motion in the vocal tract.
A proper understanding of how unvoiced sounds are produced is thus far lacking because the speech community has for the most part limited its physical picture of air motion in the vocal system to acoustic motion alone. Existing models for unvoiced sounds focus on supplementing the well-established plane-wave acoustic transmission model for the vocal tract with a random noise source. The modeling problem is thus reduced to one of estimating source spectrum, level, impedance, and spatial distribution. However, how these characteristics of the source depend on vocal tract shape, lung pressure, and other speech parameters is not at all clear.
At present, the fundamental challenges of characterizing unvoiced sound sources remain. This is due in part to a limited understanding within the speech community of the fundamental physical mechanisms involved. Fortunately, these mechanisms have been the focus of much attention in other fields of study, particularly aeroacoustics and aerodynamics. As a result, there is no shortage of either theoretical foundations or physical evidence on which to base the development of models suitable for speech production. Some efforts to apply these developments to speech have been undertaken, notably by McGowan (1988), Hirschberg (1992), and Shadle (1999). However, the following questions have yet to be sufficiently addressed: (1) How can non-acoustic motion in speech be characterized? (2) How can the sound generated by this type of motion be predicted? and (3) How can this be done in a manner suitable for speech synthesis (i.e., automatically and efficiently)?
This research aims to (1) describe the physical production mechanism for the fricative sound source in terms of aeroacoustic theory, and (2) implement a model for this mechanism in an articulatory speech synthesizer. Two component models set the new fricative model apart from previous models. These are:
In the flow facility, air from a building air supply is smoothed and quieted by a muffler, honeycomb, and fine mesh screens. The air then passes through a 16:1 area ratio nozzle before entering the main test section. A jet is formed at the inlet and impinges on an obstacle downstream from the jet formation. A picture and a schematic of the pipe flow facility are shown below. Only the main test section is shown in the schematic diagram.
The table below includes audio demonstrations from simulations of the facility with the new fricative model at different Strouhal numbers. A Strouhal number of one (St=1) is typical and should be compared with the measured sound. However, simulations at lower Strouhal numbers are illustrative because they reveal the nature of the vortex-generated sound. The figure below shows a spectral comparison of the sound measured at the outlet of the flow facility and the simulated outlet sound (with St=1). Note that these are power spectrum estimates, and the peaks result from resonances of the 129 cm long duct. QuickTime movies for the St=1 simulation (one with audio annotation, one without) are available in the Demo section below; they show the positions of vortex rings (computed by the jet model) as they convect downstream through the duct. At lower Strouhal numbers, the sound due to individual vortices can be heard as they pass the obstacle. This is demonstrated particularly well in the St=0.01 audio demo.
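As a rough illustration of the two quantities mentioned above, the following sketch estimates a vortex shedding frequency from a Strouhal number and lists idealized duct resonance frequencies. It is not taken from the thesis: it assumes the conventional definition St = f D / U, a hypothetical jet diameter `JET_DIAMETER` of 1 cm, a sound speed of 343 m/s, and lossless textbook boundary conditions (closed-open or open-open) for the duct. Only the jet speed (11.9 m/s) and the duct length (129 cm) come from the text.

```python
# Back-of-the-envelope estimates; all names below are illustrative.

SPEED_OF_SOUND = 343.0  # m/s, assumed (air near 20 degrees C)
DUCT_LENGTH = 1.29      # m, from the text (129 cm duct)
JET_SPEED = 11.9        # m/s, from the demo table
JET_DIAMETER = 0.01     # m, HYPOTHETICAL length scale, not given in the text


def shedding_frequency(strouhal, jet_speed, length_scale):
    """Return f in Hz from the conventional definition St = f * D / U."""
    return strouhal * jet_speed / length_scale


def duct_resonances(length, count, closed_open=True, c=SPEED_OF_SOUND):
    """Return the first `count` resonance frequencies (Hz) of an ideal duct.

    closed_open=True  -> quarter-wave resonator: f_n = (2n - 1) c / (4 L)
    closed_open=False -> half-wave resonator:    f_n = n c / (2 L)
    Losses, end corrections, and the obstacle are ignored.
    """
    if closed_open:
        return [(2 * n - 1) * c / (4 * length) for n in range(1, count + 1)]
    return [n * c / (2 * length) for n in range(1, count + 1)]


if __name__ == "__main__":
    for st in (0.01, 0.02, 0.2, 1.0, 2.0):
        f = shedding_frequency(st, JET_SPEED, JET_DIAMETER)
        print(f"St = {st:<4}: shedding frequency ~ {f:.1f} Hz")
    for f in duct_resonances(DUCT_LENGTH, 4):
        print(f"closed-open resonance ~ {f:.1f} Hz")
```

Under these assumptions, lowering the Strouhal number lowers the rate at which vortices are shed, which is consistent with individual vortices becoming audible in the low-St demos. The actual length scale and duct boundary conditions used in the thesis may differ.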
The table below contains audio demonstrations of both sustained fricatives and fricatives in a vowel context (/aCa/). QuickTime movies are included for two examples showing the positions of vortex rings computed by the jet model. In the sustained /z/ demo, note the pitch-synchronous vortex shedding at the velar constriction. The fricative model is also capable of producing the unvoiced sound due to the release of plosives. Examples of these are included in the audio demos as well. Finally, the potential for using this model to generate aspirative noise near the vocal folds is demonstrated by the "shedding at the vocal folds" movie in the table. This movie shows the simulated vortex ring positions during several cycles of vocal fold vibration. More accurate articulatory models for the vocal tract shape are needed to generate natural-sounding aspirative noise.
Download the full abstract: ASCII, PostScript
Or send a request for a copy of my entire thesis to sinder-at-ieee.org.
The table below provides some audio and video demonstrations of both validation and speech simulations discussed in my thesis.
| Description | Sun Audio File | QuickTime Movie (* with audio annotation) |
|---|---|---|
| Flow facility measurement (jet speed = 11.9 m/s) | | |
| Flow facility simulation (jet speed = 11.9 m/s; St = 0.01) | | |
| Flow facility simulation (jet speed = 11.9 m/s; St = 0.02) | | |
| Flow facility simulation (jet speed = 11.9 m/s; St = 0.2) | | |
| Flow facility simulation (jet speed = 11.9 m/s; St = 1) | | movie1* (32 MB), movie2 (17.8 MB) |
| Flow facility simulation (jet speed = 11.9 m/s; St = 2) | | |
| sustained /s/ | | |
| sustained /sh/ | | |
| sustained /z/ | | movie (14.6 MB) |
| /asa/ | | movie* (17.4 MB) |
| /asha/ | | |
| /aza/ | | |
| /ata/ | | |
| /ada/ | | |
| shedding at the vocal folds | | movie (12.3 MB) |