Cluster II. Hypercomputing in Design Tasks Supported by Computational
Fluid Dynamics (CFD)
Area II.3 Design of 'voice mimic' speech generation systems
Area Coordinator:
I. Introduction
The research on the adaptive voice mimic aims to advance fundamental
understanding of human speech generation and coalesces the problems of
speech synthesis, speech recognition, and low bit-rate speech coding
into a compact parametric framework. At its core, the mimic system
utilizes optimization techniques and a computationally-intensive model
of speech generation to provide a high quality estimate, moment by
moment, of articulatory parameters from an acoustic speech signal.
The estimation of articulatory parameters is accomplished through a
two-step process: an open-loop (table look-up based) initial
estimation followed by a closed-loop optimization refinement.
II. Articulatory Shape Estimation
Starting from an acoustic
input, the open-loop (i.e., with no optimization) estimate of the
articulatory parameters is obtained via a table look-up of precomputed
synthetic speech representations. Each element in the table is stored
with the articulatory parameters from which it was produced. The
input speech is compared with the synthetic speech in the table via a
spectral representation, and the articulatory shape corresponding to
the ``closest'' synthetic speech is selected. Once initial articulatory
estimates are found for a series of speech segments, a dynamic
programming module provides smooth articulatory trajectories by
imposing articulatory constraints. This concludes the open-loop
process.
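The table look-up step can be illustrated with a minimal sketch. The codebook entries, the parameter names, and the plain Euclidean spectral distance below are assumptions made for illustration only; they are not the actual table or metric used by the mimic system.

```python
import math

# Hypothetical toy table: each entry pairs a synthetic spectral vector with
# the articulatory parameters that produced it (all values are illustrative).
CODEBOOK = [
    ([0.30, 1.20, 2.50], {"constriction_pos": 0.2, "constriction_area": 0.5, "mouth_area": 2.0}),
    ([0.70, 1.10, 2.40], {"constriction_pos": 0.6, "constriction_area": 0.8, "mouth_area": 3.0}),
    ([0.25, 2.20, 2.90], {"constriction_pos": 0.8, "constriction_area": 0.3, "mouth_area": 1.0}),
]


def spectral_distance(a, b):
    """Euclidean distance between two spectral vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def open_loop_estimate(input_spectrum, codebook=CODEBOOK):
    """Return the articulatory parameters stored with the closest entry."""
    closest = min(codebook, key=lambda entry: spectral_distance(input_spectrum, entry[0]))
    return closest[1]


estimate = open_loop_estimate([0.31, 1.19, 2.45])
```

No optimization takes place here: the result is simply the stored shape whose synthetic spectrum lies nearest to the input.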
The open-loop estimates initialize a closed-loop optimization by
suggesting a starting position that is likely to lie in the vicinity of
the (global) optimal solution. Effective open-loop estimates reduce the
computation required by the costly optimization loop.
Within the closed-loop optimization, synthetic speech is generated
from a compact set of articulatory parameters and compared with the
input speech using a perceptually weighted distance metric. The
articulatory parameters are iteratively adjusted based on the result
of the comparison so that the weighted spectral distance between the
arbitrary speech input and the synthetic speech is driven to below a
preset threshold.
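The closed-loop refinement can be sketched as a simple analysis-by-synthesis loop. Everything in the block below is an assumption for illustration: `synthesize` is a toy stand-in for the articulatory synthesizer, the weights stand in for the perceptual weighting, and the coordinate search is only one of many possible optimizers, not necessarily the one used in the mimic system.

```python
def synthesize(params):
    """Toy stand-in for the synthesizer: two parameters -> three 'spectral' values."""
    p, a = params
    return [p + a, p * a, p - a]


def weighted_distance(s1, s2, weights=(1.0, 2.0, 1.0)):
    """Weighted squared distance; the weights are hypothetical."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, s1, s2))


def closed_loop(target, start, threshold=1e-6, step=0.1, max_iter=1000):
    """Adjust parameters until the distance to the target drops below threshold."""
    params = list(start)
    best = weighted_distance(synthesize(params), target)
    for _ in range(max_iter):
        if best < threshold:
            break
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                d = weighted_distance(synthesize(trial), target)
                if d < best:
                    params, best, improved = trial, d, True
        if not improved:
            step /= 2.0  # refine the search once no neighbor improves
    return params, best


target = synthesize([1.0, 0.5])           # "input speech" in this toy setting
params, dist = closed_loop(target, start=[0.9, 0.6])
```

The starting point plays the role of the open-loop estimate: the closer it lies to the optimum, the fewer synthesis calls the loop needs.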
III. Articulatory Speech Synthesis
As part of this research, two methods of high quality speech synthesis
from articulatory parameters are studied. The first method is based
on linear acoustic theory/models of speech production, and the second
method is based on a fluid-dynamic formulation. The techniques for the
first method are relatively well established, but the method assumes
plane wave propagation inside the vocal tract and also neglects most
of the non-linear terms. On the other hand, the second, new method
attempts to capture more accurately the physics behind human speech
production. This is done by formulating the speech production process
as a fluid-dynamic phenomenon. The approach uses a form of the
Reynolds-Averaged Navier-Stokes (RANS) equations describing fluid
motion to numerically solve for low Mach number, compressible flow in
vocal tract geometries. Physical experiments, from which real flow
quantities are acquired, support the computational approach by
validating numerical results.
Both linear acoustic and fluid-dynamic synthesis use vocal tract
shapes defined by means of articulatory models. Two models have been
used: Tracttalk (Lin, 1990) and the Flanagan-Ishizaka model
(Ishizaka, 1976). Both models provide stylized vocal tract shapes
defined by a compact set of parameters. These parameters quantify the
position and shape of articulators. For example, the parameters
primarily used in this study specify the location and size of the main
constriction in the vocal tract, the mouth aperture, and the
cross-sectional area of the front cavity. These parameters are shown
in the schematic below.
The Flanagan-Ishizaka Vocal Tract Model
IV. Achievements
IV.1. Vowel Recognition
Using a spectral representation based on linear-predictive poles and
a reduced number of articulatory parameters, a vowel recognition system
based on an articulatory representation of speech signals has been
designed. In contrast to the articulatory-based approach, traditional
speech recognition systems have relied on spectral and/or cepstral
features. Despite considerable efforts seeking more accurate,
compact, and reliable features for robust speech recognition, the
articulatory representation of speech has not been exploited due to
the difficulty and computational intensity involved in estimating
articulatory parameters from speech waveforms. The adaptive voice mimic,
with optimized open-loop steering and efficient closed-loop control,
provides a promising solution to the challenge.
A nearly real-time laboratory prototype of the articulatory-based
recognition system has been implemented and demonstrated. The system
can recognize both isolated vowels and vowel strings. A recognition
accuracy of more than 97% is obtained. During the recognition
computation, dynamically changing sagittal profiles of the vocal tract
(corresponding to the input speech) are displayed. The figure below
shows the main displays of the recognition prototype.
The Voice Mimic Articulatory Based Vowel Recognition System
IV.2. Mimicking of Unvoiced Fricatives
The adaptive voice
mimic system has been extended from vocalic sounds to the mimicking of
unvoiced fricative consonants (such as the /s/ in ``sea'' and /f/ in
``fire''). It was found that spectral comparison based on the poles of
linear prediction, which works excellently for vowels, does not work
equally well for fricatives. The major reason is that fricative
spectra contain a number of bound pole/zero pairs. As a result,
linear prediction fails to provide accurate estimates of these
singularities. Therefore, other feature representations have been
explored. The cepstrum representation was chosen since it is
relatively compact and produces positive results.
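For readers unfamiliar with the cepstrum, the following minimal sketch computes a real cepstrum as the inverse DFT of the log magnitude spectrum. The frame length, the number of coefficients, and the log floor are assumptions; the exact cepstral analysis used in the mimic system is not specified here.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(frame, n_coeffs, floor=1e-12):
    """Inverse DFT of the log magnitude spectrum of a real frame."""
    n = len(frame)
    log_mag = [math.log(max(abs(s), floor)) for s in dft(frame)]
    # The log magnitude of a real signal is even in frequency, so the inverse
    # transform reduces to a cosine sum and the result is real.
    return [sum(log_mag[k] * math.cos(2 * math.pi * k * q / n) for k in range(n)) / n
            for q in range(n_coeffs)]


frame = [1.0, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
coeffs = real_cepstrum(frame, n_coeffs=4)
louder = real_cepstrum([2.0 * s for s in frame], n_coeffs=4)
```

A handful of low-order coefficients summarizes the spectral envelope, and a change of overall gain affects only the zeroth coefficient, which is part of what makes the representation compact.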
In order to complete the extension of the voice mimic system to
fricative sounds, an improved initial estimation of source parameters
has been designed to include an efficient voiced/unvoiced decision.
Evident discrepancies exist in the frequency content between sounds
produced by a source at the glottis (vibration of vocal cords) and
sounds produced by a noise source at a constriction in the vocal tract
(as is the case for fricatives). These discrepancies make necessary
the use of multiple codebooks. The appropriate codebook is selected
based on the voiced-unvoiced decision. The estimation of articulatory
parameters is then completed by the open-loop steering followed by
closed-loop analysis.
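The codebook selection step can be sketched as follows. The text above does not state how the voiced/unvoiced decision is made; the zero-crossing rate used below is a classical cue swapped in purely for illustration, and the threshold, sample rate, and codebook names are hypothetical.

```python
import math

SAMPLE_RATE = 8000  # Hz, assumed for this sketch


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)


def select_codebook(frame, codebooks, zcr_threshold=0.25):
    """Pick the codebook matching a (hypothetical) voiced/unvoiced decision."""
    label = "unvoiced" if zero_crossing_rate(frame) > zcr_threshold else "voiced"
    return label, codebooks[label]


def tone(freq_hz, n=240):
    """Test signal: a pure tone standing in for a speech frame."""
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n)]


codebooks = {"voiced": "vowel codebook", "unvoiced": "fricative codebook"}
low = select_codebook(tone(100), codebooks)    # low-frequency energy
high = select_codebook(tone(3500), codebooks)  # high-frequency energy
```

The low-frequency frame is routed to the voiced codebook and the high-frequency frame to the unvoiced one, mirroring the difference in frequency content between glottal and constriction sources.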
This system has produced vowel/consonant/vowel utterances and short
sentences of very encouraging quality. Below are some
examples from the voice mimic in which the articulatory parameters
estimated from the input speech have been used to re-synthesize the speech.
Examples of the Adaptive Voice Mimic for Fricatives
(Sun Audio, 32kHz, 16-bit, linear)
IV.3. Speaker Identification
Physiological information about
a particular speaker's vocal tract is ``hidden'' in their speech signal.
Acoustic-to-articulatory mapping provides a means to extract
this information and use it to differentiate speakers. In particular,
vocal tract parameters can be used to supplement traditional speaker
identification methods. The advantage of vocal tract parameters is
that they are not affected by emotion or sickness, and they cannot be
easily altered for the purpose of impersonation.
Preliminary experiments have been done towards the estimation of the
vocal tract length from the acoustic signal. This is a critical
parameter for differentiating talkers in speaker identification or
verification tasks. The estimation is performed using the voice mimic
system and a two-step strategy. First, the shape of the vocal tract
is determined using a codebook built on a fixed vocal tract length.
Then, the vocal tract length is estimated using a detailed codebook
comprising variations of the same shape with its length stretched and
compressed. Although such an approach requires advance knowledge of
which sound is produced, this problem will be overcome in the future
by replacing the second codebook with an optimization loop. Initial
results have been obtained using a database which associates X-ray
images of the vocal tract and the corresponding speech signal
produced. It is shown that the vocal tract length estimated by the
voice mimic system agrees well with the measured value.
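The second step of this strategy can be illustrated with a simplified sketch. It assumes a lossless tube whose resonances scale inversely with length when the shape is stretched uniformly, and it uses a uniform tube closed at one end (resonances at odd multiples of c/4L) as the reference shape. The speed of sound, the candidate lengths, and the function names are assumptions, not values from the experiments described above.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # approximate value for warm, moist air


def uniform_tube_formants(length_cm, count=3):
    """Resonances of a uniform tube closed at one end: F_n = (2n - 1) c / (4 L)."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * length_cm)
            for n in range(1, count + 1)]


def estimate_length(measured, reference_formants, reference_length_cm, candidates):
    """Pick the candidate length whose stretched reference best fits the measurement."""
    def error(length_cm):
        scale = reference_length_cm / length_cm  # shorter tract -> higher formants
        return sum((m - f * scale) ** 2 for m, f in zip(measured, reference_formants))
    return min(candidates, key=error)


reference = uniform_tube_formants(17.5)           # about 500, 1500, 2500 Hz
measured = uniform_tube_formants(15.0)            # a "speaker" with a shorter tract
candidates = [14.0 + 0.5 * i for i in range(11)]  # 14.0 cm to 19.0 cm
length = estimate_length(measured, reference, 17.5, candidates)
```

The list of candidate lengths plays the role of the detailed codebook of stretched and compressed variants of one shape.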
IV.4. Design Improvements for a Fast-Access Articulatory Codebook
Since a codebook is used to obtain the first estimates of the vocal
tract shape that may produce a given combination of acoustic
parameters, it must be designed such that it spans the natural
articulatory space of a speaker. Furthermore, sampling of the space
must be fine enough so that an acoustic entry always exists very close
to the global optimum. Such codebooks require a large set of matching
pairs of vocal tract and acoustic parameters. The complexity of
searching a large codebook for all possible vocal tract model shapes
becomes an issue. For this reason, the voice mimic system needs, in
addition to a good articulatory codebook, an efficient procedure for
accessing the codebook.
The number and position of the codebook vectors affect the performance
of the voice mimic system through two competing concerns. On
one hand, increasing the size of the codebook increases the difficulty
of the access task; on the other hand, reducing this size
degrades the quality of the inverse problem solution.
A new design of an articulatory codebook has been completed in which
the acoustic space is sub-sampled on a set of ordered acoustic
clusters, giving rise to the acoustic network shown below.
Schematic Representation of Vocal Tract Shape Clustering into an Acoustic Network
The inversion of the
articulatory-to-acoustic mapping is processed during the building of the
articulatory codebook as follows. For each generated vocal tract shape,
acoustic parameters are determined. Using the sub-sampling period for
each acoustic parameter, the closest node in the network is determined and
notified about the position of the shape in the codebook. Thus,
each node of the network points to all the model shapes in the
codebook that have acoustic parameters close to the acoustic centroid
represented by the node.
Once the codebook is built, the access task simply requires estimating
the acoustic parameters for each frame of the speech signal,
determining the coordinates of the corresponding cluster node in the
network using the sub-sampling period of each parameter, and
retrieving all possible vocal tract model shapes to which the acoustic
node points. This codebook design allows real-time access to the
set of acoustically equivalent shapes, regardless of the size of the
codebook.
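The building and access procedures described above can be sketched with a hash map keyed by quantized acoustic coordinates. The choice of two acoustic parameters, the sub-sampling periods, and the codebook entries are hypothetical; the sketch only illustrates why the access cost does not grow with the codebook size.

```python
from collections import defaultdict


def node_of(acoustic, periods):
    """Coordinates of the nearest network node for a set of acoustic parameters."""
    return tuple(round(a / p) for a, p in zip(acoustic, periods))


def build_network(codebook, periods):
    """Map each node to the indices of all codebook shapes that fall near it."""
    network = defaultdict(list)
    for index, (acoustic, _shape) in enumerate(codebook):
        network[node_of(acoustic, periods)].append(index)
    return network


def lookup(network, codebook, acoustic, periods):
    """Retrieve every shape the matching node points to (possibly none)."""
    return [codebook[i][1] for i in network.get(node_of(acoustic, periods), [])]


PERIODS = (50.0, 100.0)  # assumed sub-sampling periods, e.g. in Hz
CODEBOOK = [             # (acoustic parameters, vocal tract shape label)
    ((510.0, 1480.0), "shape A"),
    ((495.0, 1530.0), "shape B"),
    ((300.0, 2300.0), "shape C"),
]

network = build_network(CODEBOOK, PERIODS)
shapes = lookup(network, CODEBOOK, (505.0, 1510.0), PERIODS)
```

Access amounts to one quantization and one dictionary look-up per frame; all shapes returned for a node are treated as acoustically equivalent candidates.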
IV.5. Real-time Method for Eliminating Non-Uniqueness in Articulatory Trajectories
The non-uniqueness of the acoustic-to-articulatory mappings
leads to a non-uniqueness in the vocal tract shape variation trajectory.
One needs to address this issue to select the most probable
vocal tract shape variation. Based on the slow evolution of the
articulation between two successive signal frames, Schroeter and Sondhi
(1989) proposed dynamic programming for vocal tract path
optimization that relies on the closest vocal tract model shape.
This approach was implemented in the CAIP prototype Mimic system.
However, this technique imposes a delay on the voice mimic
output and does not take into account directly the physical dynamic
features of the articulators.
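A generic dynamic-programming path search of this kind can be sketched as follows. This is a Viterbi-style illustration under assumed costs (a per-candidate acoustic cost plus a squared jump penalty between successive frames); it is not the specific formulation of Schroeter and Sondhi, and all numbers are hypothetical.

```python
def smooth_path(candidates, acoustic_costs, weight=1.0):
    """Choose one candidate per frame minimizing acoustic cost plus jump cost.

    candidates[t] is a list of parameter tuples for frame t, and
    acoustic_costs[t][j] is the acoustic mismatch of candidate j in frame t.
    """
    def jump(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    total = list(acoustic_costs[0])
    back = []
    for t in range(1, len(candidates)):
        new_total, pointers = [], []
        for j, cand in enumerate(candidates[t]):
            costs = [total[i] + weight * jump(prev, cand)
                     for i, prev in enumerate(candidates[t - 1])]
            best_i = min(range(len(costs)), key=costs.__getitem__)
            new_total.append(costs[best_i] + acoustic_costs[t][j])
            pointers.append(best_i)
        total = new_total
        back.append(pointers)

    # Backtracking starts only after the last frame has been seen.
    j = min(range(len(total)), key=total.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[t][j] for t, j in enumerate(path)]


candidates = [[(0.0,), (1.0,)], [(0.1,), (1.0,)], [(0.2,), (1.0,)]]
acoustic_costs = [[0.0, 0.5], [0.2, 0.0], [0.0, 0.5]]
trajectory = smooth_path(candidates, acoustic_costs)
```

In this toy case the acoustically best candidate in the middle frame would force a large articulatory jump, so the search prefers the smooth path. Because backtracking needs the end of the segment, the output is delayed, which is the limitation noted above.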
By studying the articulator motion from muscle activity, Bateson
et al. (1993) described a recurrent algorithm to estimate the
position of each articulator from continuous EMG signals. A similar
network is now implemented in the CAIP system and is shown
schematically below.
Network for Dynamic Optimization of Articulatory Trajectories
The network takes into account the dynamic properties of the
articulators and performs the forward dynamics of the articulatory
parameters according to the slow variation of their respective
acceleration during speech production. The next articulatory
parameter position is then estimated from the previous position and
from the velocity and acceleration of the articulatory parameter. The
estimate is compared to the different parameter positions of the
shapes proposed by the articulatory codebook. Then, the shape whose
articulatory model parameters best match the estimated positions is chosen
as the next vocal tract model shape. This technique leads to a
recurrent algorithm for optimization of the vocal tract model shape
time evolution.
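One step of such a recurrent selection can be sketched as follows. The constant-acceleration extrapolation, the unit frame interval, the finite-difference updates, and the nearest-candidate rule are assumptions made for illustration; the network actually implemented in the CAIP system may differ.

```python
def predict(position, velocity, acceleration, dt=1.0):
    """Extrapolate each parameter assuming slowly varying acceleration."""
    return [p + v * dt + 0.5 * a * dt * dt
            for p, v, a in zip(position, velocity, acceleration)]


def choose_next(position, velocity, acceleration, candidates, dt=1.0):
    """Pick the codebook shape closest to the prediction and update the state."""
    guess = predict(position, velocity, acceleration, dt)
    chosen = min(candidates,
                 key=lambda c: sum((x - g) ** 2 for x, g in zip(c, guess)))
    new_velocity = [(c - p) / dt for c, p in zip(chosen, position)]
    new_acceleration = [(nv - v) / dt for nv, v in zip(new_velocity, velocity)]
    return chosen, new_velocity, new_acceleration


# One parameter, moving upward at 0.5 units per frame.
candidates = [[1.0], [1.4], [2.5]]  # shapes proposed by the codebook (toy values)
chosen, vel, acc = choose_next([1.0], [0.5], [0.0], candidates)
```

Each step uses only past frames, so the selection introduces no look-ahead delay, in contrast to a path search that must wait for the end of a segment.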
IV.6. Speech Coding
The articulatory representation is one of the most promising techniques
for high-quality, very low bit-rate speech coding. It is thought that
such a representation can produce speech coders with rates below 1 kbit per
second. Thus, the importance of acoustic-to-articulatory mapping
for the purpose of coding is apparent.
Initial experiments have tested the coding rate limits that can be
tolerated prior to degradation of synthetic speech quality.
Coding rates below 1 kbit per second have been achieved. Future
updates of this web page will include samples of the coded speech at
varying bit-rates, down to and including rates at which quality and
intelligibility suffer.
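The arithmetic behind such rates is straightforward: the bit rate is the number of transmitted parameters times the bits per parameter times the frame rate. The figures in the sketch below are purely hypothetical and are not the settings used in the experiments; they only show how a small parameter set can fall below 1 kbit per second.

```python
def bit_rate(n_parameters, bits_per_parameter, frames_per_second):
    """Bits per second for a frame-based parametric coder."""
    return n_parameters * bits_per_parameter * frames_per_second


# Hypothetical example: 7 parameters, 5 bits each, 25 frames per second.
rate = bit_rate(n_parameters=7, bits_per_parameter=5, frames_per_second=25)
```

With these illustrative numbers the rate is 875 bits per second.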
IV.7. Speech Synthesis from Fluid Flow Principles
As briefly mentioned above, this research involves the investigation
of speech as a fluid flow phenomenon. Research in this area has
produced ground-breaking results synthesizing speech solely from
principles of fluid flow. Simulating speech using this approach
requires massive amounts of compute power, taking 8 to 9 hours of CPU
time on a Cray C90 to compute only one-half of a second of synthetic
speech. However, these studies reveal information which contributes to
a more complete understanding of the physical processes involved in
all aspects of speech production.
Simulations of flow through the vocal cords have also been
computed. These simulations reveal the fluid-structure interaction
involved in the flow-driven oscillations of the vocal cords. In
addition to solving the Navier-Stokes equations, these experiments
require solving a complex set of equations which describe the motions
of the vocal cords. Each vocal cord in these initial simulations is
modeled as 48 coupled damped-mass-spring systems, leading to 96 first-order
differential equations. Simulations with this level of detail in
computing flow, vocal cord motion, and their interaction have
never been carried out before.
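The count of 96 equations follows from rewriting each second-order mass-spring equation as two first-order equations, one for position and one for velocity. The sketch below shows this reduction for a chain of 48 masses; the chain topology, the constants, and the `force` placeholder for the aerodynamic load are assumptions for illustration and do not reproduce the actual vocal cord model.

```python
N_MASSES = 48


def derivatives(state, mass=1.0, stiffness=1.0, coupling=0.5, damping=0.1, force=None):
    """Time derivative of [x_1..x_n, v_1..v_n] for a chain of coupled oscillators."""
    n = len(state) // 2
    x, v = state[:n], state[n:]
    force = force or [0.0] * n  # placeholder for the flow-induced load
    acc = []
    for i in range(n):
        left = x[i - 1] if i > 0 else x[i]       # end masses have one neighbor
        right = x[i + 1] if i < n - 1 else x[i]
        f = (-stiffness * x[i] - damping * v[i]
             + coupling * (left - x[i]) + coupling * (right - x[i]) + force[i])
        acc.append(f / mass)
    return v + acc  # dx/dt = v, dv/dt = acc


rest = [0.0] * (2 * N_MASSES)
displaced = list(rest)
displaced[0] = 1.0                 # pull the first mass away from equilibrium
rates = derivatives(displaced)
```

The state vector holds 48 positions and 48 velocities, giving the 96 first-order equations for one vocal cord; in a coupled simulation the `force` term would be supplied by the flow solver at every time step.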
V. References
- J. Schroeter and M.M. Sondhi, "Dynamic Programming Search of Articulatory Codebooks," ICASSP, Glasgow, 1989.
- E. Bateson, M. Hirayama, and Y. Wada, "Generating Articulator Motion from Muscle Activity Using Artificial Neural Networks," ATR HIP Res. Labs. 2, pp. 264-274, 1993.
VI. Publications
- S. Chennoukh, D. Sinder, G. Richard, J. Flanagan, ``Voice Mimic
System Using an Articulatory Codebook for Estimation of Vocal Tract
Shape,'' Accepted for publication in Proceedings of Eurospeech
1997, Rhodes, Greece, September 22-25, 1997.
- S. Chennoukh, D. Sinder, G. Richard, and J. Flanagan, ``Methods
for Acoustic-to-Articulatory Mapping and Voice Mimic Systems,''
Presented at the 133rd Meeting of the Acoustical Society of
America, State College, Pennsylvania, June 1997.
- D. Sinder, G. Richard, H. Duncan, J. Flanagan, S. Slimon,
D. Davis, M. Krane, S. Levinson, "Flow Visualization in Stylized Vocal
Tracts," Proceedings of the International Symposium
on Simulation, Visualization and Auralization for Acoustic Research
and Education (ASVA97), April 1997, Tokyo, Japan.
- D. Sinder, M. Krane, S. Chennoukh, G. Richard, J. Flanagan,
S. Levinson, S. Slimon, D. Davis, ``Fluid Dynamic Studies of Speech
Production,'' Presented at the 133rd meeting of the Acoustical
Society of America, State College, Pennsylvania, June 1997.
- G. Richard, M. Goirand, D. Sinder, J. Flanagan, "Simulation and
Visualization of Articulatory Trajectories Estimated from Speech
Signals," Proceedings of the International Symposium on
Simulation, Visualization and Auralization for Acoustic Research and
Education (ASVA97), April 1997, Tokyo, Japan.
- S. Slimon, D. Davis, S. Levinson, M. Krane, G. Richard, D. Sinder,
H. Duncan, Q. Lin, J. Flanagan, ``Low Mach number Flow Through A
Constricted, Stylized Vocal Tract'', American Institute of Aeronautics
and Astronautics Conference (AIAA96), Penn State Univ., PA., May 1996.
- D. Sinder, G. Richard, H. Duncan, Q. Lin, J. Flanagan, S.
Levinson, D. Davis, and S. Slimon, ``A fluid flow approach to speech
generation'', First ESCA Tutorial and Research Workshop on Speech
Production Modeling: From Control Strategies to Acoustics, Autrans,
France, May 21-24, 1996.
- G. Richard, Q. Lin, F. Zussa, D. Sinder, C. Che, and J. Flanagan,
``Vowel recognition using an articulatory representation,''
JASA, Vol. 98, No. 5, Pt. 2, November 1995, p. 2965.
- F. Zussa, Q. Lin, G. Richard, D. Sinder, and J. Flanagan,
``Open-loop acoustic-to-articulatory mapping,'' JASA, Vol. 98,
No. 5, Pt. 2, November 1995, p. 2931.
- G. Richard, M. Liu, D. Sinder, H. Duncan, Q. Lin, J. Flanagan, S.
Levinson, D. Davis, and S. Slimon, ``Numerical simulations of fluid
flow in the vocal tract,'' Proc. of 1995 Eurospeech, pp. 1297-1300.
Madrid, Spain, September 18-21, 1995.
- Q. Lin, G. Richard, J. Zou, D. Sinder, J. Flanagan, ``Use of
TRACTTALK for adaptive voice mimic,'' JASA, Vol. 97, No. 5, Pt.
2, May 1995, p. 3247.
- G. Richard, M. Liu, D. Sinder, H. Duncan, Q. Lin,
J. Flanagan, S. Levinson, D. Davis, S. Slimon, ``Vocal tract
simulations based on fluid dynamic analysis,'' JASA, Vol. 97, No. 5,
Pt. 2, May 1995, p. 3245.