Microphone Array Research at CAIP 


  • Maintained by Chris Alvino.  Last updated 11/97. 

    Array-Based Speech Recognition 

    Most contemporary speech recognizers are designed for close-talking speech and work best under quiet laboratory conditions. These systems clearly need to be made robust to their operating environment. CAIP research has explored the utility of existing speech recognition technology in adverse "real-world" environments for distant-talking applications. A synergistic system consisting of a Microphone Array and a Neural Network (MANN) was used to mitigate the environmental interference introduced by reverberation, ambient noise, and channel mismatch between training and testing conditions. The MANN system was evaluated in continuous distant-talking speech recognition experiments. The results show that the MANN system raises word recognition accuracy to a level competitive with a retrained speech recognizer, and that the neural network compensation performs better than some previously researched techniques.

    High-quality speech input and matched training/testing conditions are two important factors determining the performance of speech recognition systems. Therefore, most existing speech recognizers are designed for close-talking microphone input and work best under quiet laboratory conditions with matched training and testing. Recognition performance typically degrades when these recognizers are deployed directly for distant-talking speech recognition in variable acoustic environments. The degradation is due to (a) a speech signal deteriorated by multi-path distortion and ambient noise interference; and (b) mismatched training/testing conditions for the recognizers.

    Research has been conducted toward applying existing speech recognition systems, trained on close-talking speech, to distant-talking speech recognition. To mitigate the environmental mismatch between close-talking and distant-talking conditions, a front-end system consisting of a microphone array and a neural network (MANN) was developed. The MANN system includes two synergistic components: (1) speech enhancement by microphone arrays to mitigate room reverberation and noise interference; and (2) feature adaptation by neural network processing to approximate a matched training/testing condition for the recognizer.
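
    The first component, array-based speech enhancement, can be illustrated with a delay-and-sum beamformer. This is only a minimal sketch: the page does not say which array processing method the MANN system uses, so delay-and-sum is an assumed stand-in, and the function name delay_and_sum, the integer-sample delays, and the assumption that the delays are already known are all hypothetical.

```python
import numpy as np


def delay_and_sum(signals, delays_samples):
    """Average the microphone channels after undoing each channel's delay.

    signals:        array of shape (num_mics, num_samples)
    delays_samples: non-negative integer arrival delay of the talker's
                    speech at each microphone, in samples (assumed known)
    """
    signals = np.asarray(signals, dtype=float)
    num_mics, num_samples = signals.shape
    output = np.zeros(num_samples)
    for channel, delay in zip(signals, delays_samples):
        # Advance the channel by its delay so the speech lines up in time.
        aligned = np.roll(channel, -delay)
        if delay > 0:
            # Discard the samples that np.roll wrapped around to the end.
            aligned[-delay:] = 0.0
        output += aligned
    # Aligned speech adds coherently; uncorrelated noise averages down.
    return output / num_mics
```

    With independent noise of equal power on each of M microphones, averaging the aligned channels leaves the speech unchanged and reduces the noise power by roughly a factor of M.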

    The MANN system has the following advantages. It allows the user to speak at a distance from the microphone without being encumbered by hand-held, body-worn, or tethered microphone equipment. This hands-free operation is appreciated, and sometimes required, in many hands-busy, eyes-busy applications. One example is large-group conferencing, where hands-free sound pickup contributes to a virtually face-to-face meeting atmosphere. The MANN system also allows more efficient adaptation to new application environments, because only the neural network needs to be adapted. Adapting a neural network requires much less speech data than (re-)training a large-vocabulary, speaker-independent speech recognizer.
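
    The adaptation step can be sketched as fitting a small network that maps distant-talking feature vectors toward their close-talking counterparts while the recognizer itself stays untouched. Everything in this sketch is an assumption made for illustration: the page does not describe the network architecture, the feature type, or the training procedure, and the name train_feature_mapper, the single tanh hidden layer, the squared-error objective, and the availability of paired (simultaneously recorded) frames are all hypothetical.

```python
import numpy as np


def train_feature_mapper(distant, close, hidden=16, epochs=1000, lr=0.05, seed=0):
    """Fit a one-hidden-layer network mapping distant features to close ones.

    distant, close: arrays of shape (num_frames, num_features) holding
                    paired feature vectors for the same speech frames.
    Returns a function that applies the trained mapping to new frames.
    """
    rng = np.random.default_rng(seed)
    num_frames, dim_in = distant.shape
    dim_out = close.shape[1]
    w1 = 0.1 * rng.standard_normal((dim_in, hidden))
    b1 = np.zeros(hidden)
    w2 = 0.1 * rng.standard_normal((hidden, dim_out))
    b2 = np.zeros(dim_out)

    for _ in range(epochs):
        # Forward pass: tanh hidden layer, linear output layer.
        h = np.tanh(distant @ w1 + b1)
        error = (h @ w2 + b2) - close
        # Backward pass: gradients of half the mean squared error.
        grad_w2 = h.T @ error / num_frames
        grad_b2 = error.mean(axis=0)
        grad_h = (error @ w2.T) * (1.0 - h ** 2)
        grad_w1 = distant.T @ grad_h / num_frames
        grad_b1 = grad_h.mean(axis=0)
        # Plain full-batch gradient descent update.
        w1 -= lr * grad_w1
        b1 -= lr * grad_b1
        w2 -= lr * grad_w2
        b2 -= lr * grad_b2

    def mapper(features):
        return np.tanh(features @ w1 + b1) @ w2 + b2

    return mapper
```

    Only the few hundred weights of the mapping are estimated here, which is the sense in which adapting the front end needs far less data than retraining the recognizer's acoustic models.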