Cluster II. Hypercomputing in Design Tasks Supported by Computational Fluid Dynamics (CFD)

Area II.3 Design of 'voice mimic' speech generation systems

Area Coordinator:
The Adaptive Voice Mimic System

I. Introduction

The research on the adaptive voice mimic aims to advance fundamental understanding of human speech generation and coalesces the problems of speech synthesis, speech recognition, and low bit-rate speech coding into a compact parametric framework. At its core, the mimic system utilizes optimization techniques and a computationally-intensive model of speech generation to provide a high quality estimate, moment by moment, of articulatory parameters from an acoustic speech signal. The estimation of articulatory parameters is accomplished through a two-step process: an open-loop (table look-up based) initial estimation followed by a closed-loop optimization refinement.

II. Articulatory Shape Estimation

Starting from an acoustic input, the open-loop (i.e., with no optimization) estimate of the articulatory parameters is obtained via a table look-up of precomputed synthetic speech representations. Each element in the table is stored with the articulatory parameters from which it was produced. The input speech is compared with the synthetic speech in the table via a spectral representation, and the articulatory shape corresponding to the ``closest'' synthetic speech is selected. Once initial articulatory estimates are found for a series of speech segments, a dynamic programming module provides smooth articulatory trajectories by imposing articulatory constraints. This concludes the open-loop process.
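
The open-loop stage described above can be sketched in a few lines. This is only an illustration under stated assumptions: the Euclidean spectral distance, the number of candidates `k`, the smoothing `weight`, and the use of articulatory distance as the transition cost are all hypothetical choices, not the project's actual settings.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_candidates(frame, codebook, k=3):
    """Return the k codebook entries whose spectra are closest to `frame`.

    Each entry is a (spectrum, articulatory_parameters) pair.
    """
    return sorted(codebook, key=lambda e: distance(frame, e[0]))[:k]

def smooth_path(frames, codebook, k=3, weight=1.0):
    """Viterbi-style dynamic programming over the per-frame candidates.

    Total cost = spectral mismatch of every chosen entry
               + weight * articulatory jump between consecutive frames.
    """
    cands = [nearest_candidates(f, codebook, k) for f in frames]
    cost = [distance(frames[0], spec) for spec, _ in cands[0]]
    back = []
    for t in range(1, len(frames)):
        new_cost, ptr = [], []
        for spec, art in cands[t]:
            trans = [cost[i] + weight * distance(prev_art, art)
                     for i, (_, prev_art) in enumerate(cands[t - 1])]
            best = min(range(len(trans)), key=trans.__getitem__)
            new_cost.append(trans[best] + distance(frames[t], spec))
            ptr.append(best)
        cost = new_cost
        back.append(ptr)
    # Trace the cheapest path backwards through the stored pointers.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [cands[t][j][1] for t, j in enumerate(path)]
```

With a smoothing weight of zero the result reduces to the frame-by-frame closest entries; a positive weight trades a little spectral mismatch for a smoother articulatory trajectory.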

The open-loop estimates initialize a closed-loop optimization by suggesting a starting position that is likely to lie in the vicinity of the (global) optimal solution. Effective open-loop estimates reduce the work required of the computationally costly optimization loop. Within the closed-loop optimization, synthetic speech is generated from a compact set of articulatory parameters and compared with the input speech using a perceptually weighted distance metric. The articulatory parameters are iteratively adjusted based on the result of the comparison so that the weighted spectral distance between the arbitrary speech input and the synthetic speech is driven below a preset threshold.
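
A minimal sketch of such a closed loop follows. The text does not name the optimizer, so a simple coordinate search is used here as an assumption; `toy_synth` is a hypothetical stand-in for the articulatory synthesizer, and the fixed `weights` stand in for the perceptual weighting.

```python
def weighted_distance(a, b, weights):
    """Weighted squared distance between two spectral vectors."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))

def closed_loop(target, synth, start, weights,
                threshold=1e-6, step=0.5, max_iter=200):
    """Adjust parameters until the synthetic spectrum matches the target.

    Starts from the open-loop estimate `start` and stops once the
    weighted distance falls below `threshold` (or after max_iter passes).
    """
    params = list(start)
    best = weighted_distance(synth(params), target, weights)
    for _ in range(max_iter):
        if best < threshold:
            break
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                d = weighted_distance(synth(trial), target, weights)
                if d < best:
                    params, best, improved = trial, d, True
        if not improved:
            step /= 2.0  # refine the search around the current estimate
    return params, best

def toy_synth(p):
    """Hypothetical synthesizer: maps 2 parameters to a 3-point 'spectrum'."""
    return [p[0] + p[1], p[0] - p[1], 2.0 * p[0]]

if __name__ == "__main__":
    target = toy_synth([1.0, 0.25])
    print(closed_loop(target, toy_synth, [0.8, 0.4], [1.0, 1.0, 1.0]))
```

The closer the starting point is to the optimum, the fewer synthesizer evaluations the loop needs, which is why a good open-loop estimate matters when each evaluation is expensive.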

III. Articulatory Speech Synthesis

As part of this research, two methods of high quality speech synthesis from articulatory parameters are studied. The first method is based on linear acoustic theory/models of speech production, and the second method is based on a fluid-dynamic formulation. The techniques for the first method are relatively well established, but the method assumes plane wave propagation inside the vocal tract and also neglects most of the non-linear terms. On the other hand, the second, new method attempts to capture more accurately the physics behind human speech production. This is done by formulating the speech production process as a fluid-dynamic phenomenon. The approach uses a form of the Reynolds-Averaged Navier-Stokes (RANS) equations describing fluid motion to numerically solve for low Mach number, compressible flow in vocal tract geometries. Physical experiments, from which real flow quantities are acquired, support the computational approach by validating numerical results.

Both linear acoustic and fluid-dynamic synthesis use vocal tract shapes defined by means of articulatory models. Two models have been used: Tracttalk (Lin, 1990) and the Flanagan-Ishizaka model (Ishizaka, 1976). Both models provide stylized vocal tract shapes defined by a compact set of parameters. These parameters quantify the position and shape of articulators. For example, the parameters primarily used in this study specify the location and size of the main constriction in the vocal tract, the mouth aperture, and the cross-sectional area of the front cavity. These parameters are shown in the schematic below.

The Flanagan-Ishizaka Vocal Tract Model

IV. Achievements

IV.1. Vowel Recognition
Using a spectral representation based on linear-predictive poles and a reduced number of articulatory parameters, a vowel recognition system based on an articulatory representation of speech signals has been designed. In contrast to this articulatory-based approach, traditional speech recognition systems have relied on spectral and/or cepstral features. Despite considerable efforts seeking more accurate, compact, and reliable features for robust speech recognition, the articulatory representation of speech has not been exploited, owing to the difficulty and computational cost of estimating articulatory parameters from speech waveforms. The adaptive voice mimic, with optimized open-loop steering and efficient closed-loop control, provides a promising solution to this challenge.

A nearly real-time laboratory prototype of the articulatory based recognition system has been implemented and demonstrated. The system can recognize both isolated vowels and vowel strings. A recognition accuracy of more than 97% is obtained. During the recognition computation, dynamically changing sagittal profiles of the vocal tract (corresponding to the input speech) are displayed. The figure below shows the main displays of the recognition prototype.

The Voice Mimic Articulatory Based Vowel Recognition System

IV.2. Mimicking of Unvoiced Fricatives
The adaptive voice mimic system has been extended from vocalic sounds to the mimicking of unvoiced fricative consonants (such as the /s/ in ``sea'' and /f/ in ``fire''). It was found that spectral comparison based on the poles of linear prediction, which works very well for vowels, does not work equally well for fricatives. The major reason is that fricatives exhibit a number of bound pole/zero pairs, and linear prediction fails to provide accurate estimates of these singularities. Therefore, other feature representations have been explored. The cepstrum representation was chosen since it is relatively compact and gives good results.
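
For readers unfamiliar with the cepstrum, the sketch below computes a real cepstrum as the inverse Fourier transform of the log magnitude spectrum. The text does not state which cepstral variant or how many coefficients the system uses, so the function name, the `floor` guard, and the naive DFT (chosen only to keep the example dependency-free) are assumptions.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

def real_cepstrum(frame, n_coeffs, floor=1e-12):
    """First n_coeffs real cepstral coefficients of one speech frame."""
    n = len(frame)
    # Log magnitude spectrum; `floor` avoids log(0) for empty bins.
    log_mag = [math.log(max(abs(s), floor)) for s in dft(frame)]
    # The inverse DFT of a real, even sequence is real: keep the real part.
    return [sum(log_mag[m] * cmath.exp(2j * math.pi * m * k / n)
                for m in range(n)).real / n
            for k in range(n_coeffs)]
```

A useful property for spectral comparison is that a change in overall gain affects only the zeroth coefficient, so the remaining coefficients describe spectral shape independently of level.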

In order to complete the extension of the voice mimic system to fricative sounds, an improved initial estimation of source parameters has been designed to include an efficient voiced/unvoiced decision. Evident discrepancies exist in the frequency content between sounds produced by a source at the glottis (vibration of vocal cords) and sounds produced by a noise source at a constriction in the vocal tract (as is the case for fricatives). These discrepancies make necessary the use of multiple codebooks. The appropriate codebook is selected based on the voiced-unvoiced decision. The estimation of articulatory parameters is then completed by the open-loop steering followed by closed-loop analysis.
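
The codebook selection step can be illustrated as follows. The text does not say how the voiced/unvoiced decision is made; the zero-crossing-rate test and its threshold below are assumptions chosen only because they are simple, and the codebook contents are placeholders.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def select_codebook(frame, codebooks, zcr_threshold=0.3):
    """Pick the codebook matching the (assumed) voicing decision.

    Noise-like fricative frames change sign often, so a high
    zero-crossing rate is treated as 'unvoiced'.
    """
    voiced = zero_crossing_rate(frame) <= zcr_threshold
    label = "voiced" if voiced else "unvoiced"
    return label, codebooks[label]
```

Once the codebook is chosen, the open-loop look-up and closed-loop refinement proceed as for vowels.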

This system has produced vowel/consonant/vowel utterances and short sentences of very encouraging quality. Below are some examples from the voice mimic in which the articulatory parameters estimated from the input speech have been used to re-synthesize the speech.

Examples of the Adaptive Voice Mimic for Fricatives
(Sun Audio, 32 kHz, 16-bit, linear)

Natural Input Speech     Voice Mimic
/usu/                    /usu/
/ushu/                   /ushu/
/ufu/                    /ufu/
she saw a fire           she saw a fire

IV.3. Speaker Identification
Physiological information about a particular speaker's vocal tract is ``hidden'' in their speech signal. Acoustic-to-articulatory mapping provides a means to extract this information and use it to differentiate speakers. In particular, vocal tract parameters can be used to supplement traditional speaker identification methods. The advantage of vocal tract parameters is that they are not affected by emotion or sickness, and they cannot be easily altered for the purpose of impersonation.

Preliminary experiments have been carried out on estimating the vocal tract length from the acoustic signal. This is a critical parameter for differentiating talkers in speaker identification or verification tasks. The estimation is performed using the voice mimic system and a two-step strategy. First, the shape of the vocal tract is determined using a codebook built on a fixed vocal tract length. Then, the vocal tract length is estimated using a detailed codebook comprising variations of the same shape with its length stretched and compressed. Although such an approach requires advance knowledge of which sound is produced, this problem will be overcome in the future by replacing the second codebook with an optimization loop. Initial results have been obtained using a database that pairs X-ray images of the vocal tract with the corresponding speech signal. The vocal tract length estimated by the voice mimic system agrees well with the measured value.
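
The second step can be sketched with a uniform tube, whose resonances (closed at the glottis, open at the lips) are F_n = (2n - 1)c / (4L). The sketch assumes that stretching a fixed shape by a factor s divides every formant by s, which holds for a lossless tube with a uniformly scaled length axis; the scale-factor grid, the log-frequency error, and the speed of sound value are illustrative assumptions, not the project's codebook.

```python
import math

def uniform_tube_formants(length_cm, n=3, c=35000.0):
    """Resonances (Hz) of a uniform tube closed at one end; c in cm/s."""
    return [(2 * k - 1) * c / (4.0 * length_cm) for k in range(1, n + 1)]

def length_codebook(ref_formants, ref_length_cm, scale_factors):
    """Entries (length, formants) for stretched/compressed copies of a shape."""
    return [(ref_length_cm * s, [f / s for f in ref_formants])
            for s in scale_factors]

def estimate_length(observed, entries):
    """Length of the entry whose formants best match the observation."""
    def error(entry):
        return sum((math.log(o) - math.log(f)) ** 2
                   for o, f in zip(observed, entry[1]))
    return min(entries, key=error)[0]

if __name__ == "__main__":
    ref = uniform_tube_formants(17.5)
    entries = length_codebook(ref, 17.5, [0.80 + 0.01 * i for i in range(41)])
    print(estimate_length(uniform_tube_formants(15.0), entries))
```

The resolution of the estimate is limited by the spacing of the scale factors, which is one reason to replace this second codebook with an optimization loop.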

IV.4. Design Improvements for a Fast-Access Articulatory Codebook
Since a codebook is used to obtain the first estimates of the vocal tract shape that may produce a given combination of acoustic parameters, it must be designed such that it spans the natural articulatory space of a speaker. Furthermore, sampling of the space must be fine enough so that an acoustic entry always exists very close to the global optimum. Such codebooks require a large set of matching pairs of vocal tract and acoustic parameters. The complexity of searching a large codebook for all possible vocal tract model shapes becomes an issue. For this reason, the voice mimic system needs, in addition to a good articulatory codebook, an efficient procedure for accessing the codebook.

The number and position of the codebook vectors affect the performance of the voice mimic system through two competing requirements. On the one hand, increasing the size of the codebook makes the access task harder; on the other hand, reducing its size degrades the quality of the solution to the inverse problem.

A new design of an articulatory codebook has been completed in which the acoustic space is sub-sampled on a set of ordered acoustic clusters, giving rise to the acoustic network shown below.

Schematic Representation of Vocal Tract Shape Clustering into an Acoustic Network

The inversion of the articulatory-to-acoustic mapping is processed during the building of the articulatory codebook as follows. For each generated vocal tract shape, acoustic parameters are determined. Using the sub-sampling period for each acoustic parameter, the closest node in the network is determined and notified about the position of the shape in the codebook. Thus, each node of the network points to all the model shapes in the codebook that have acoustic parameters close to the acoustic centroid represented by the node.

Once the codebook is built, the access task simply requires estimating the acoustic parameters for each frame of the speech signal, determining the coordinates of the corresponding cluster node in the network using the sub-sampling period of each parameter, and retrieving all possible vocal tract model shapes to which the acoustic node points. This codebook design allows real-time access to the set of acoustically equivalent shapes, regardless of the size of the codebook.
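
The build and access steps just described amount to quantizing the acoustic parameters onto a grid and using the grid coordinates as a key. The sketch below is a simplified illustration: the function names, the rounding rule, and the toy `synthesize` mapping in the usage lines are assumptions.

```python
from collections import defaultdict

def node_of(acoustic, periods):
    """Grid coordinates of the nearest node: each acoustic parameter
    is divided by its sub-sampling period and rounded."""
    return tuple(int(round(a / p)) for a, p in zip(acoustic, periods))

def build_network(shapes, synthesize, periods):
    """Map every node to the indices of the shapes that land on it."""
    network = defaultdict(list)
    for index, shape in enumerate(shapes):
        network[node_of(synthesize(shape), periods)].append(index)
    return network

def lookup(network, shapes, acoustic, periods):
    """All codebook shapes acoustically close to the observed frame."""
    return [shapes[i] for i in network.get(node_of(acoustic, periods), [])]

if __name__ == "__main__":
    shapes = [(1, 2), (2, 1), (3, 3)]
    toy = lambda s: (100.0 * (s[0] + s[1]), 300.0 * s[0] * s[1])
    net = build_network(shapes, toy, (50.0, 100.0))
    print(lookup(net, shapes, (310.0, 620.0), (50.0, 100.0)))
```

Because access is a single keyed retrieval, its cost does not grow with the number of shapes stored; only the length of the returned candidate list does.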

IV.5. Real-time Method for Eliminating Non-Uniqueness in Articulatory Trajectories
The non-uniqueness of the acoustic-to-articulatory mapping leads to non-uniqueness in the trajectory of vocal tract shapes. This issue must be addressed in order to select the most probable sequence of vocal tract shapes. Based on the slow evolution of the articulation between two successive signal frames, Schroeter and Sondhi (1989) proposed dynamic programming for vocal tract path optimization that relies on the closest vocal tract model shape. This approach was implemented in the CAIP prototype Mimic system. However, this technique imposes a delay on the voice mimic output and does not directly take into account the physical dynamics of the articulators. By studying the articulator motion from muscle activity, Bateson et al. (1993) described a recurrent algorithm to estimate the position of each articulator from continuous EMG signals. A similar network is now implemented in the CAIP system and is shown schematically below.

Network for Dynamic Optimization of Articulatory Trajectories

The network takes into account the dynamic properties of the articulators and performs the forward dynamics of the articulatory parameters, relying on the slow variation of their respective accelerations during speech production. The next position of each articulatory parameter is estimated from its previous position, velocity, and acceleration. The estimate is compared with the parameter positions of the shapes proposed by the articulatory codebook. The shape whose articulatory model parameters best match the estimated positions is then chosen as the next vocal tract model shape. This technique leads to a recurrent algorithm for optimizing the time evolution of the vocal tract model shape.
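
The prediction-and-selection step can be sketched as follows. This is not the network itself: velocity and acceleration are approximated here by finite differences over the last three frames, and the acceleration is assumed constant over the next frame, both of which are simplifying assumptions made for illustration.

```python
def predict_next(history, dt=1.0):
    """Predict the next position of every articulatory parameter.

    `history` holds the last three positions, oldest first.
    """
    oldest, previous, current = history
    prediction = []
    for a, b, c in zip(oldest, previous, current):
        velocity = (c - b) / dt                  # backward difference
        acceleration = (c - 2 * b + a) / dt ** 2
        next_velocity = velocity + acceleration * dt
        prediction.append(c + next_velocity * dt)
    return prediction

def select_next(history, candidates, dt=1.0):
    """Choose the codebook candidate closest to the predicted position."""
    predicted = predict_next(history, dt)
    return min(candidates,
               key=lambda s: sum((x - p) ** 2 for x, p in zip(s, predicted)))
```

Unlike dynamic programming over a whole utterance, this choice uses only past frames, so it adds no look-ahead delay to the output.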

IV.6. Speech Coding
The articulatory representation is one of the most promising techniques for high-quality, very low bit-rate speech coding. It is thought that such a representation can produce speech coders with rates below 1 kbit per second. Thus, the importance of acoustic-to-articulatory mapping for the purpose of coding is apparent. Initial experiments have tested the coding rate limits that can be tolerated before the quality of the synthetic speech degrades. Coding rates below 1 kbit per second have been achieved. Future updates of this web page will include samples of the coded speech at varying bit-rates, down to and including rates at which quality and intelligibility suffer.
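
To see why a compact parametric representation leads to such low rates, consider the arithmetic below. The parameter count, quantizer resolution, and frame rate are purely hypothetical values chosen for illustration; they are not the settings used in the experiments.

```python
def bit_rate(n_params, bits_per_param, frames_per_second):
    """Bits per second for a coder that sends every parameter each frame."""
    return n_params * bits_per_param * frames_per_second

if __name__ == "__main__":
    # Hypothetical budget: 7 parameters, 5 bits each, 25 frames per second.
    print(bit_rate(7, 5, 25), "bits per second")
```

Because articulators move slowly, both the frame rate and the number of bits per parameter can be kept small, which is what keeps the product low.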

IV.7. Speech Synthesis from Fluid Flow Principles
As briefly mentioned above, this research involves the investigation of speech as a fluid flow phenomenon. Research in this area has produced ground-breaking results, synthesizing speech solely from principles of fluid flow. Simulating speech using this approach requires massive amounts of computing power: 8 to 9 hours of CPU time on a Cray C90 to compute only one-half second of synthetic speech. However, these studies reveal information which contributes to a more complete understanding of the physical processes involved in all aspects of speech production.

Simulations of flow through the vocal cords have also been computed. These simulations reveal the fluid-structure interaction involved in the flow-driven oscillations of the vocal cords. In addition to solving the Navier-Stokes equations, these simulations require solving a complex set of equations which describe the motions of the vocal cords. Each vocal cord in these initial simulations is modeled as 48 coupled damped-mass-spring systems, leading to 96 first-order differential equations. Simulations with this level of detail in computing the flow, the vocal cord motion, and their interaction have never been carried out before.
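
The count of 96 equations follows from rewriting each second-order mass-spring equation as two first-order equations, one for position and one for velocity. The sketch below shows that reduction for a chain of 48 masses. The chain coupling, the constants, and the `force` argument (a placeholder for the aerodynamic load supplied by the flow solver) are assumptions; the equations actually used in the simulations are not given here.

```python
def derivatives(state, n=48, mass=1.0, stiffness=1.0,
                damping=0.1, coupling=0.5, force=None):
    """Time derivative of the 2n-element state [x_0..x_{n-1}, v_0..v_{n-1}].

    Each mass obeys  m x'' = -k x - c x' + coupling terms + external force,
    rewritten as  x' = v  and  v' = (sum of forces) / m.
    """
    x, v = state[:n], state[n:]
    force = force or [0.0] * n
    dx = list(v)
    dv = []
    for i in range(n):
        total = -stiffness * x[i] - damping * v[i] + force[i]
        if i > 0:
            total += coupling * (x[i - 1] - x[i])   # spring to left neighbor
        if i < n - 1:
            total += coupling * (x[i + 1] - x[i])   # spring to right neighbor
        dv.append(total / mass)
    return dx + dv

if __name__ == "__main__":
    rest = [0.0] * 96
    print(len(derivatives(rest)))  # 48 positions + 48 velocities
```

Any standard first-order integrator can advance this state vector in time, with the external force updated from the flow solution at each step.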

V. References

VI. Publications

