Microphone Array Research at CAIP
Maintained by Chris Alvino. Last updated 11/97.
Array-Based Speech Recognition
Most contemporary speech recognizers are designed for close-talking
speech and work best in quiet laboratory conditions. These systems clearly
need to be made robust to their acoustic environment. CAIP research
has explored the utility of existing speech recognition
technology in adverse "real-world" environments for distant-talking applications.
A synergistic system consisting of a microphone array and a neural network
(MANN) was used to mitigate the environmental interference introduced by
reverberation, ambient noise, and channel mismatch between training and
testing conditions. The MANN system was evaluated in experiments on continuous
distant-talking speech recognition. The results show that the MANN system
raises word recognition accuracy to a level competitive
with that of a retrained speech recognizer, and that the neural network compensation
performs better than some previously researched techniques.
High-quality speech input and matched training and testing conditions are two important factors in the performance of speech recognition systems. Most existing speech recognizers are therefore designed for close-talking microphone input and work best in quiet laboratory conditions with matched training and testing. Recognition performance typically degrades when these recognizers are deployed directly for distant-talking speech recognition in variable acoustic environments. The degradation is due to (a) a deteriorated speech signal, caused by multi-path distortion and ambient noise interference; and (b) a mismatch between the training and testing conditions of the recognizer.

Research has been conducted on applying existing speech recognition systems, trained on close-talking speech, to distant-talking speech recognition. To mitigate the environmental mismatch between close-talking and distant-talking conditions, a front-end system consisting of a microphone array and a neural network (MANN) was developed. The MANN system includes two synergistic components: (1) speech enhancement by microphone arrays to mitigate room reverberation and noise interference; and (2) feature adaptation by neural network processing to approximate a matched training/testing condition for the recognizer.

The MANN system has the following advantages. It allows the user to speak at a distance from the microphones without being encumbered by hand-held, body-worn, or tethered microphone equipment. This hands-free advantage is appreciated, and sometimes necessary, in many hands-busy, eyes-busy applications. Examples include large-group conferencing, where hands-free sound pickup contributes to a virtually face-to-face meeting atmosphere. The MANN system also allows more efficient adaptation to new application environments, because only the neural network needs to be adapted. Adapting a neural network requires much less speech data than (re-)training a large-vocabulary, speaker-independent speech recognizer.
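
The two-stage front end described above can be illustrated with a small sketch. This page does not specify the array processing or the network architecture used in MANN, so the choices below are assumptions made purely for illustration: a delay-and-sum beamformer with known integer sample delays stands in for the microphone-array enhancement, and a one-hidden-layer network trained on paired distant-talking/close-talking feature vectors stands in for the feature adaptation. The names `delay_and_sum` and `FeatureAdapter`, the layer sizes, and the learning rate are all hypothetical.

```python
import math
import random


def delay_and_sum(channels, delays):
    """Assumed stand-in for array enhancement: align channels, then average.

    channels: equal-length lists of samples, one list per microphone.
    delays:   arrival delay of the talker's signal at each microphone,
              in whole samples (assumed known here).
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for t in range(n - d):
            out[t] += samples[t + d]  # advance the channel by d samples
    return [v / len(channels) for v in out]


class FeatureAdapter:
    """Hypothetical one-hidden-layer network that maps a distant-talking
    feature vector toward its close-talking counterpart."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                   for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)]
                   for _ in range(dim)]
        self.b2 = [0.0] * dim

    def forward(self, x):
        # Hidden layer uses tanh; the output layer is linear.
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = [sum(w * hj for w, hj in zip(row, h)) + b
             for row, b in zip(self.w2, self.b2)]
        return h, y

    def train_step(self, x, target, lr=0.05):
        """One gradient-descent step on 0.5 * squared error; returns the
        loss measured before the update."""
        h, y = self.forward(x)
        err = [yi - ti for yi, ti in zip(y, target)]
        # Backpropagate to the hidden layer before the output weights change.
        grad_h = [sum(err[k] * self.w2[k][j] for k in range(len(err)))
                  * (1.0 - h[j] ** 2) for j in range(len(h))]
        for k, e in enumerate(err):
            for j, hj in enumerate(h):
                self.w2[k][j] -= lr * e * hj
            self.b2[k] -= lr * e
        for j, g in enumerate(grad_h):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * g * xi
            self.b1[j] -= lr * g
        return 0.5 * sum(e * e for e in err)
```

In this sketch only the weights of `FeatureAdapter` change when the system is moved to a new environment; the recognizer itself is left untouched, which mirrors the adaptation advantage described above.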