Quad Chart

OBJECTIVE

This research extends the capabilities of DARPA automatic speech recognition technology to a new applications dimension - namely, to a distant-talking mode that accommodates unfavorable acoustic environments. It frees the user from body-worn, hand-held, or tethered microphone equipment, and permits natural mobility in the work place - such as conference rooms, combat information centers, situation rooms, or command centers. Additionally, the technique eliminates the costly and time-consuming requirement of retraining the speech recognizer for each application environment.

APPROACH

DARPA continuous speech recognizers are typically trained for close-talking microphones. Performance and reliability are well characterized for this condition. However, more utility is gained if the user can move freely in the work place, talking at a distance from the microphone. This capability is important in "hands-busy, eyes-busy" activities - as in a command center. The foes of distant talking are ambient acoustic noise and room multi-path (reverberation), which degrade the performance of the recognizer. In this research, two techniques are combined in a synergistic way to enable speech recognizers trained for close-talking to operate reliably for distant-talking. Moreover, no expensive and laborious retraining of the speech recognizer is required.

The method continuously maps speech features distorted by reverberation and noise into the equivalent of those obtained from close-talking. First, a microphone array is used to capture the distant speech source. In the initial implementation, this array is a line array of 23 first-order gradient electret microphones whose outputs are processed for delay-sum beamforming. The array beam mitigates interfering noise and room reverberation, but does not eliminate them completely. Second, a neural network, in the form of a multi-layer perceptron, learns the residual multi-path and noise distortion and adapts its weights to achieve an input-to-output mapping that compensates for the residual acoustic distortions.
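The delay-sum step described above can be illustrated with a minimal sketch. It is not the project's implementation: the function names, the 343 m/s speed of sound, the geometry (microphones on the x-axis), and the frequency-domain (circular) shifting are all assumptions made for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def steering_delays(mic_x, source_xy, c=SPEED_OF_SOUND):
    """Arrival delay (s) at each microphone, relative to the nearest one.

    mic_x: x-coordinates (m) of microphones lying on the x-axis (line array).
    source_xy: (x, y) position (m) of the talker; hypothetical geometry.
    """
    mic_x = np.asarray(mic_x, dtype=float)
    mic_xy = np.stack([mic_x, np.zeros_like(mic_x)], axis=1)
    dist = np.linalg.norm(mic_xy - np.asarray(source_xy, dtype=float), axis=1)
    return (dist - dist.min()) / c


def delay_and_sum(signals, delays, fs):
    """Advance each channel by its arrival delay, then average the channels.

    signals: array of shape (n_mics, n_samples).
    delays: per-channel arrival delays in seconds.
    fs: sampling rate in Hz.
    The shift is applied in the frequency domain, so it is circular and
    also handles fractional-sample delays.
    """
    signals = np.asarray(signals, dtype=float)
    delays = np.asarray(delays, dtype=float)
    n_samples = signals.shape[1]
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    # Multiplying by exp(+j*2*pi*f*tau) advances a channel by tau seconds.
    phase = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft((spectra * phase).mean(axis=0), n=n_samples)
```

For a 23-element array, the delays would be computed once per talker position with steering_delays and passed to delay_and_sum. Speech from the steered position adds coherently across channels, while noise and reflections arriving from other directions add incoherently, which is why the beam mitigates but does not remove them.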

ACCOMPLISHMENTS

The speech features mapped by the neural network are cepstral coefficients and energy. The neural network is trained on a small amount of stereo speech data, composed of simultaneous close-talking and distant-talking samples. This small amount of adaptation data is negligible compared to the hours of data required to retrain the speech recognizer for a specific environment. Performance measurements on continuous speech recognition show that the prototype Microphone Array and Neural Network (MANN) system is capable of elevating the recognition accuracy of DARPA speech recognizers to acceptable operational levels when they are used for distant-talking in noisy and reverberant environments (computer rooms and conference rooms with ambient acoustic noise of 60-70 dBA).
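The stereo-data training described above can be sketched as a regression from distant-talking feature frames to the simultaneous close-talking frames. The sketch below is an illustration under stated assumptions, not the MANN system itself: the single tanh hidden layer, the layer sizes, the learning rate, the mean-squared-error loss, and the synthetic "stereo" data are all hypothetical choices.

```python
import numpy as np


def init_mlp(n_in, n_hidden, n_out, rng):
    """Small random weights for a one-hidden-layer perceptron."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }


def forward(params, x):
    """Map distant-talking feature frames x to close-talking estimates."""
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"], h


def train_step(params, distant, close, lr):
    """One full-batch gradient-descent step on the mean squared error.

    Returns the loss measured before the update.
    """
    y, h = forward(params, distant)
    err = y - close
    n = distant.shape[0]
    dh = (err @ params["W2"].T) * (1.0 - h ** 2)  # backprop through tanh
    params["W2"] -= lr * (h.T @ err) / n
    params["b2"] -= lr * err.mean(axis=0)
    params["W1"] -= lr * (distant.T @ dh) / n
    params["b1"] -= lr * dh.mean(axis=0)
    return 0.5 * np.mean(np.sum(err ** 2, axis=1))


def synthetic_stereo_frames(n_frames, n_feats, rng):
    """Hypothetical stand-in for stereo data: scaled, shifted, noisy copies."""
    close = rng.standard_normal((n_frames, n_feats))
    noise = 0.05 * rng.standard_normal((n_frames, n_feats))
    distant = 0.6 * close + 0.2 + noise
    return distant, close


rng = np.random.default_rng(0)
# 13 features per frame (e.g. 12 cepstra plus energy) is an assumed layout.
distant, close = synthetic_stereo_frames(500, 13, rng)
params = init_mlp(13, 32, 13, rng)
losses = [train_step(params, distant, close, lr=0.05) for _ in range(500)]
mapped, _ = forward(params, distant)  # features to hand to the recognizer
```

Because only the feature mapping is trained, the recognizer itself is left untouched; the mapped frames are passed to it in place of the distorted distant-talking features.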

Gender-dependent neural networks have been trained to accommodate more adaptation data. Experimental results show that the microphone array and neural network speech recognition system using these gender-dependent networks can outperform a recognizer retrained on distant-talking speech, which had previously been regarded as the performance upper bound for distant-talking speech recognizers.

A new objective function for the neural network has been developed, which maximizes the likelihood of distant-talking speech data given the transcription. This Maximum Likelihood Neural Network (MLNN) does NOT require stereo data for training. A simplified version of the MLNN has been implemented on a multiprocessor parallel computer (NCUBE IIs). Performance evaluations demonstrate that the simplified MLNN produces distant-talking speech recognition accuracies comparable to those of the neural network trained on stereo data.

The MLNN has been implemented for workstation environments as well as multiprocessor machines. In this way, the microphone array and the neural network can be operated together on a single workstation.

A real-time distant-talking speech recognizer for conversational-speech interface applications has been implemented. This system aims to demonstrate the synergistic use of the microphone array and the neural network for distant-talking speech recognition, in terms of both performance and speed.

In parallel work, the microphone array has been increased in sophistication to provide sound capture with spatial selectivity in three dimensions. Matched-filter processing of every sensor in a 2-dimensional array provides this capability.
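One common way to realize matched-filter array processing is to filter each sensor signal with the time-reversed impulse response from the chosen focal point to that sensor, and then sum the results. The sketch below assumes this formulation and assumes the impulse responses are known (measured or simulated); the function name and interface are hypothetical and are not taken from the project software.

```python
import numpy as np


def matched_filter_array(signals, impulse_responses):
    """Filter each channel with its time-reversed impulse response and sum.

    signals: list of equal-length 1-D arrays, one per sensor.
    impulse_responses: list of equal-length 1-D arrays holding the
        source-to-sensor responses for the focal point (assumed known).
    Returns the summed output; a source at the focal point appears with an
    overall delay of len(impulse_response) - 1 samples.
    """
    outputs = [
        np.convolve(x, h[::-1])  # correlation with the channel response
        for x, h in zip(signals, impulse_responses)
    ]
    return np.sum(outputs, axis=0)
```

For a source at the focal point, each channel contributes the autocorrelation of its own impulse response, so the direct and reflected paths reinforce one another at a single lag. Sources elsewhere in the room are filtered with mismatched responses and add incoherently, which gives the selectivity in range as well as direction.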

Comprehensive experiments on the combination of Matched-Filter array and Maximum Likelihood Linear Regression have been conducted. Recognition performance is significantly improved by using Matched-Filter processing.

The robust distant-talking speech recognition techniques developed at the CAIP Center have been applied to Broadcast News speech recognition.

The CAIP computer cluster has been maintained in service for the DARPA Speech Community.

CURRENT PLAN

A working prototype of the neural network simulator will be prepared for delivery. A comprehensive final project report and documentation of the neural network simulator will also be delivered.

TECHNOLOGY TRANSITION

A complete line-array microphone system has been exported to the DARPA group at Carnegie Mellon University.

The CAIP Center speech group participated for the first time in the recent DARPA HUB4 competition, using a recognizer system of original design. Design data for this recognizer and the evaluation results have been made available to the community.

The microphone array and neural network speech recognition system has been modified for use as a conversational speech interface for the DARPA IC&V program (DISCIPLE system). An initial effort has commenced to program this complete system for real-time performance.