Speech Synthesis Using an Aeroacoustic Fricative Model

by

Daniel J. Sinder

Thesis Advisor: Dr. James L. Flanagan
Date: October 1999

The following is a brief introduction to my research, providing only the essentials needed to get the gist of my work. A journal article is in the works. If you prefer more details or you would like to see my results, please see the section on obtaining details. On the other hand, you may just want to hear some neat demos.

Contents

  1. Motivation
  2. Approach
  3. Validation
  4. Speech Synthesis
  5. More Details
  6. Results & Demos
  7. Acknowledgments

1. Motivation

The speech component of human-computer interaction has gained considerable momentum and attention in recent years. However, progress in advanced computer speech interfaces is limited in part by incomplete knowledge of the physics of speech production. For computer-generated speech output, this means limitations in the naturalness and intelligibility of synthetic speech. Unvoiced speech sounds such as fricatives are an important example. These sounds are produced by "turbulent" air motion in the vocal tract.

A proper understanding of how unvoiced sounds are produced is thus far lacking because the speech community has for the most part limited its physical picture of air motion in the vocal system to acoustic motion only. Existing models for unvoiced sounds focus on supplementing the well-established plane-wave acoustic transmission model for the vocal tract with a random noise source. The modeling problem is thus reduced to one of estimating source spectrum, level, impedance, and spatial distribution. However, how these source characteristics depend on vocal tract shape, lung pressure, and other speech parameters is not at all clear.

At present, the fundamental challenges of characterizing unvoiced sound sources remain. This is due in part to a limited understanding within the speech community of the fundamental physical mechanisms involved. Thankfully, these mechanisms have been the focus of much attention in other fields of study, particularly aeroacoustics and aerodynamics. As a result, there is no shortage of either theoretical foundations or physical evidence on which to base the development of models suitable for speech production. Some efforts to apply these developments to speech have been undertaken, notably by McGowan (1988), Hirschberg (1992), and Shadle (1999). However, the following questions have yet to be sufficiently addressed: (1) How can non-acoustic motion in speech be characterized? (2) How can the sound generated by this type of motion be predicted? and (3) How can this be done in a manner suitable for speech synthesis (i.e., automatically and efficiently)?

2. Approach

This research aims to (1) describe the physical production mechanism for the fricative sound source in terms of aeroacoustic theory, and (2) implement a model for this mechanism in an articulatory speech synthesizer. Two component models set the new fricative model apart from previous models. These are:

  1. a "jet model" that parameterizes the structure and evolution of vorticity in the turbulent jets characteristic of unvoiced speech production
  2. a model for estimating sound production due to interaction of the jet with the vocal tract geometry

Both component models are based upon a wealth of information and theory available in the aeroacoustics literature. In particular, the sound generation model is based upon Howe's (1975) theory for the generation of sound by vortices convected in a non-uniform duct. Furthermore, the models have been designed to capture only the physics essential for sound production and are thereby computationally efficient. The result is a fricative model which specifies the source as a function of flow conditions and the geometry of the vocal tract downstream from a constriction where a turbulent jet is formed. The model automatically produces a source with the appropriate strength, spectrum, location, and impedance; this information does not need to be known a priori.
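For orientation, the low Mach number form of the vortex-sound equation commonly attributed to Howe (1975) in the aeroacoustics literature can be sketched as follows. The notation is an assumption taken from general textbook usage, not from the thesis: B is the total enthalpy, \boldsymbol{\omega} the vorticity, \mathbf{v} the flow velocity, and c_0 the speed of sound. The formulation actually used in the thesis, which treats vortices convected in a non-uniform duct, may differ in detail.

```latex
% Low Mach number vortex-sound equation (textbook form; notation assumed)
\left( \frac{1}{c_0^{2}} \frac{\partial^{2}}{\partial t^{2}} - \nabla^{2} \right) B
  = \nabla \cdot \left( \boldsymbol{\omega} \times \mathbf{v} \right),
\qquad
B = \int \frac{dp}{\rho} + \tfrac{1}{2} \lvert \mathbf{v} \rvert^{2}
```

Read this way, sound is generated only where vorticity is present and moving, which is consistent with the description above of a source determined by the jet's vorticity and the geometry it passes through.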

3. Validation

As with any numerical and/or reduced model, it is essential to evaluate the validity of the fricative model. The aeroacoustic-based model has been evaluated both by comparison to measurements made in a physical pipe flow facility and by qualitative listening tests. The pipe flow facility was designed to generate sound in a fashion analogous to unvoiced fricative production: a confined turbulent jet is directed at an obstacle in a duct. The dimensions of the flow facility are scaled up from those of the vocal tract to allow accurate measurement of aerodynamic properties.

In the flow facility, air from a building air supply is smoothed and quieted by a muffler, honeycomb, and fine mesh screens. The air then passes through a 16:1 area ratio nozzle before entering the main test section. A jet is formed at the inlet and impinges on an obstacle downstream from the jet formation. A picture and a schematic of the pipe flow facility are shown below. Only the main test section is shown in the schematic diagram.

Pipe Flow Facility Schematic

Included in the table below are audio demonstrations resulting from simulations of the facility with the new fricative model using different Strouhal numbers. A Strouhal number of one (St=1) is typical and should be compared with the measured sound. However, simulations at lower Strouhal numbers are illustrative since they reveal the nature of the vortex-generated sound. The figure below shows a spectral comparison of the sound measured at the outlet of the flow facility and the simulated outlet sound (with St=1). Note that these are power spectrum estimates and the peaks result from resonances of the 129 cm long duct. QuickTime movies for the St=1 simulation (one with audio annotation, one without) are available in the Results & Demos section below, showing the position of vortex rings (computed by the jet model) as they convect downstream through the duct. At lower Strouhal numbers, the sound due to individual vortices can be heard as they pass the obstacle. This is demonstrated particularly well in the St=0.01 audio demo.
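To give a feel for why individual vortices become audible at low Strouhal numbers, here is a minimal sketch based on the usual definition St = f * D / U, where f is the vortex shedding frequency, U the jet speed, and D a characteristic length. The jet speed of 11.9 m/s comes from the demo table; the length scale JET_DIAMETER_M is a hypothetical placeholder, since the facility's jet dimensions are not given here, and the thesis may define its Strouhal number with a different length scale.

```python
def shedding_frequency(strouhal, jet_speed, length_scale):
    """Return the vortex shedding frequency in Hz implied by St = f * D / U.

    strouhal     -- dimensionless Strouhal number (St)
    jet_speed    -- jet speed U in m/s
    length_scale -- characteristic length D in m (assumed: jet diameter)
    """
    if jet_speed <= 0 or length_scale <= 0:
        raise ValueError("jet_speed and length_scale must be positive")
    return strouhal * jet_speed / length_scale


JET_SPEED_M_S = 11.9     # jet speed listed for the flow facility demos
JET_DIAMETER_M = 0.01    # HYPOTHETICAL length scale, for illustration only

if __name__ == "__main__":
    # Strouhal numbers used in the flow facility simulations
    for st in (0.01, 0.02, 0.2, 1.0, 2.0):
        f = shedding_frequency(st, JET_SPEED_M_S, JET_DIAMETER_M)
        print(f"St = {st:<4}: roughly {f:.1f} vortices shed per second")
```

Whatever the true length scale, the shedding rate scales linearly with St, so the St=0.01 simulation sheds vortices one hundred times less often than the St=1 simulation. That is why the passage of single vortices past the obstacle can be picked out by ear in the low-St demos.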

Spectral Comparison of Measurement and Simulation

4. Speech Synthesis

To demonstrate the promise of this approach for speech synthesis, the new fricative model was implemented in a transmission-line articulatory speech synthesizer. It should be emphasized that the transmission-line model computes acoustic wave propagation only (linear and planar at that). The new fricative model computes sound generation due to NON-ACOUSTIC motion and feeds that information to the acoustic model.

The table below contains audio demonstrations of both sustained fricatives and fricatives in a vowel context (/aCa/). QuickTime movies are included for two examples showing the positions of vortex rings computed by the jet model. In the sustained /z/ demo, be sure to note the pitch-synchronous vortex shedding at the velar constriction. The fricative model is also capable of producing unvoiced sound due to the release of plosives. Examples of these are included in the audio demos as well. Finally, the potential for using this model to generate aspirative noise near the vocal folds is demonstrated by the "shedding at the vocal folds" movie in the table. This movie shows the simulated vortex ring positions during several cycles of vocal fold vibration. More accurate articulatory models for the vocal tract shape are needed to generate natural-sounding aspirative noise.

5. More Details

Download the full abstract (ASCII, PostScript), or send a request for a copy of my entire thesis to sinder-at-ieee.org.

6. Results & Demos

The table below provides some audio and video demonstrations of both validation and speech simulations discussed in my thesis.

Description                                                 Sun Audio File  QuickTime Movie
Flow facility measurement (jet speed = 11.9 m/s)            audio           -
Flow facility simulation (jet speed = 11.9 m/s; St = 0.01)  audio           -
Flow facility simulation (jet speed = 11.9 m/s; St = 0.02)  audio           -
Flow facility simulation (jet speed = 11.9 m/s; St = 0.2)   audio           -
Flow facility simulation (jet speed = 11.9 m/s; St = 1)     audio           movie1* (32 MB), movie2 (17.8 MB)
Flow facility simulation (jet speed = 11.9 m/s; St = 2)     audio           -
sustained /s/                                               audio           -
sustained /sh/                                              audio           -
sustained /z/                                               audio           movie (14.6 MB)
/asa/                                                       audio           movie* (17.4 MB)
/asha/                                                      audio           -
/aza/                                                       audio           -
/ata/                                                       audio           -
/ada/                                                       audio           -
shedding at the vocal folds                                 -               movie (12.3 MB)

* with audio annotation

7. Acknowledgments

This work was conducted at CAIP at Rutgers University. Support was provided by the National Science Foundation (contracts IRI-9314946 and IIS-98-00999) and DARPA (contract DABT 63-93-C-0064).