RealMAN: A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization

Read original: arXiv:2406.19959 - Published 7/1/2024 by Bing Yang, Changsheng Quan, Yabo Wang, Pengyu Wang, Yujie Yang, Ying Fang, Nian Shao, Hui Bu, Xin Xu, Xiaofei Li

Overview

  • This paper presents RealMAN, a real-recorded and annotated microphone array dataset for dynamic speech enhancement and localization.
  • The dataset consists of loudspeaker-played speech from static and moving source positions, plus background noise, recorded with a 32-channel array in real indoor, outdoor, semi-outdoor, and transportation scenes.
  • The dataset is designed for benchmarking speech enhancement and source localization algorithms in real scenarios and for supplying real-world training data.

Plain English Explanation

The researchers created a new dataset called RealMAN that contains recordings of speech and background noise made with a 32-channel microphone array in real environments. The speech is played through a loudspeaker that is either kept in place or moved around. This dataset is designed to help researchers improve technologies like speech enhancement (improving the quality of speech recordings) and source localization (determining where sounds are coming from).

This dataset is notable because it captures real-world acoustic conditions rather than relying on simulated room acoustics and noise. According to the paper's abstract, models trained on simulated data can lose performance in real scenarios because of the acoustic mismatch between simulation and reality. Real recordings therefore serve both as a realistic benchmark and as training data. The researchers hope that RealMAN will enable the development of more robust and capable audio processing algorithms.

Technical Explanation

The RealMAN dataset was recorded using a 32-channel array of high-fidelity microphones, with a loudspeaker playing the source speech signals. According to the paper's abstract, it contains:

  • 83 hours of speech (48 hours with a static speaker and 35 hours with a moving speaker), recorded in 32 different scenes
  • 144 hours of background noise, recorded in 31 different scenes
  • Scenes covering common indoor, outdoor, semi-outdoor, and transportation environments

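The recordings are accompanied by task-specific annotations, including:

  • The azimuth angle of the loudspeaker, for source localization
  • Automatic detection of the loudspeaker with an omnidirectional fisheye camera, which is how the azimuth is obtained
  • The direct-path signal as the target clean speech for speech enhancement, obtained by filtering the source speech with an estimated direct-path propagation filter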
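
The paper's abstract describes the speaker-location annotation as an azimuth angle. Localization estimates are commonly scored against such an annotation as an angular error that respects wrap-around (350° and 10° are 20° apart, not 340°). The following is a minimal sketch of that idea, not the paper's evaluation code: the function names and the choice of degrees as the unit are assumptions made for illustration.

```python
def azimuth_error_deg(estimated_deg, annotated_deg):
    """Absolute angular difference in degrees, wrapped into [0, 180].

    Hypothetical helper: assumes both angles are azimuths in degrees.
    """
    # Shift by 180, wrap with modulo 360, shift back: result lies in [-180, 180).
    diff = (estimated_deg - annotated_deg + 180.0) % 360.0 - 180.0
    return abs(diff)


def mean_azimuth_error_deg(estimated, annotated):
    """Mean absolute azimuth error over paired sequences of angles."""
    if len(estimated) != len(annotated):
        raise ValueError("estimated and annotated must have the same length")
    if not estimated:
        raise ValueError("at least one angle pair is required")
    errors = [azimuth_error_deg(e, a) for e, a in zip(estimated, annotated)]
    return sum(errors) / len(errors)
```

For a moving source, the same function could be applied frame by frame to an estimated and an annotated azimuth trajectory, assuming the two are sampled at the same time instants.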

The dataset is designed to support research in source localization and speech enhancement. The researchers hope that by providing a realistic and challenging dataset, they can spur the development of more advanced audio processing algorithms that can handle the complexities of real-world acoustic environments.
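
According to the paper's abstract, the clean target for speech enhancement is the direct-path signal, obtained by filtering the source speech with an estimated direct-path propagation filter. The sketch below illustrates only the filtering step. It is a toy model under stated assumptions: the filter is a single delayed, attenuated impulse derived from a source distance, whereas the paper estimates its filter from the recordings (that estimation is not shown here). The function names, the 1/r gain, the 343 m/s speed of sound, and the 16 kHz sample rate in the usage note are all illustrative choices, not values taken from the paper.

```python
def toy_direct_path_filter(distance_m, sample_rate_hz, speed_of_sound=343.0):
    """Build FIR taps for an idealized direct path: pure delay plus 1/r gain.

    Hypothetical stand-in for the paper's *estimated* filter.
    """
    # Propagation delay rounded to the nearest whole sample.
    delay = round(distance_m / speed_of_sound * sample_rate_hz)
    # Spherical spreading; clamp the distance to avoid division by zero.
    gain = 1.0 / max(distance_m, 1e-3)
    taps = [0.0] * (delay + 1)
    taps[delay] = gain
    return taps


def apply_fir(signal, taps):
    """Causal FIR filtering, with output truncated to the input length."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            # Skip zero taps and samples before the start of the signal.
            if h != 0.0 and n - k >= 0:
                acc += h * signal[n - k]
        out[n] = acc
    return out
```

As a usage example, `apply_fir(source, toy_direct_path_filter(2.0, 16000))` would delay `source` by 93 samples and halve its amplitude, giving a reverberation-free target that is time-aligned with the direct sound at an assumed 2 m distance.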

Critical Analysis

The RealMAN dataset represents a valuable contribution to the field of audio processing research. By capturing real-world acoustic conditions, it addresses limitations of existing datasets that use simulated or simplified environments. This added realism is important for developing audio algorithms that can perform well in practical applications.

However, the dataset has limits. Although it spans 32 speech scenes and 31 noise scenes across indoor, outdoor, semi-outdoor, and transportation environments, the source speech is played through a loudspeaker rather than spoken by live talkers, and the abstract describes the localization annotation only as an azimuth angle. Expanding the dataset to further acoustic environments and source types could improve its utility. Additionally, the annotations, while useful, may not capture all the nuances of the acoustic scene that could matter for advanced audio processing tasks.

Conclusion

The RealMAN dataset provides a valuable new resource for researchers working on audio processing tasks like speech enhancement and source localization. By capturing realistic acoustic conditions, it represents an important step towards developing more robust and capable audio algorithms that can handle the complexities of real-world environments. While the dataset has some limitations, it is a significant contribution to the field and should spur further advancements in this area of research.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

RealMAN: A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization

Bing Yang, Changsheng Quan, Yabo Wang, Pengyu Wang, Yujie Yang, Ying Fang, Nian Shao, Hui Bu, Xin Xu, Xiaofei Li

The training of deep learning-based multichannel speech enhancement and source localization systems relies heavily on the simulation of room impulse response and multichannel diffuse noise, due to the lack of large-scale real-recorded datasets. However, the acoustic mismatch between simulated and real-world data could degrade the model performance when applying in real-world scenarios. To bridge this simulation-to-real gap, this paper presents a new relatively large-scale Real-recorded and annotated Microphone Array speech&Noise (RealMAN) dataset. The proposed dataset is valuable in two aspects: 1) benchmarking speech enhancement and localization algorithms in real scenarios; 2) offering a substantial amount of real-world training data for potentially improving the performance of real-world applications. Specifically, a 32-channel array with high-fidelity microphones is used for recording. A loudspeaker is used for playing source speech signals. A total of 83-hour speech signals (48 hours for static speaker and 35 hours for moving speaker) are recorded in 32 different scenes, and 144 hours of background noise are recorded in 31 different scenes. Both speech and noise recording scenes cover various common indoor, outdoor, semi-outdoor and transportation environments, which enables the training of general-purpose speech enhancement and source localization networks. To obtain the task-specific annotations, the azimuth angle of the loudspeaker is annotated with an omni-direction fisheye camera by automatically detecting the loudspeaker. The direct-path signal is set as the target clean speech for speech enhancement, which is obtained by filtering the source speech signal with an estimated direct-path propagation filter.

7/1/2024

ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data

Zeyi Liu, Cheng Chi, Eric Cousineau, Naveen Kuppuswamy, Benjamin Burchfiel, Shuran Song

Audio signals provide rich information for the robot interaction and object properties through contact. These information can surprisingly ease the learning of contact-rich robot manipulation skills, especially when the visual information alone is ambiguous or incomplete. However, the usage of audio data in robot manipulation has been constrained to teleoperated demonstrations collected by either attaching a microphone to the robot or object, which significantly limits its usage in robot learning pipelines. In this work, we introduce ManiWAV: an 'ear-in-hand' data collection device to collect in-the-wild human demonstrations with synchronous audio and visual feedback, and a corresponding policy interface to learn robot manipulation policy directly from the demonstrations. We demonstrate the capabilities of our system through four contact-rich manipulation tasks that require either passively sensing the contact events and modes, or actively sensing the object surface materials and states. In addition, we show that our system can generalize to unseen in-the-wild environments, by learning from diverse in-the-wild human demonstrations. Project website: https://mani-wav.github.io/

7/1/2024

LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization

Zengrui Jin, Yifan Yang, Mohan Shi, Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Lingwei Meng, Long Lin, Yong Xu, Shi-Xiong Zhang, Daniel Povey

The evolving speech processing landscape is increasingly focused on complex scenarios like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions. Existing methodologies for addressing these challenges fall into two categories: multi-channel and single-channel solutions. Single-channel approaches, notable for their generality and convenience, do not require specific information about microphone arrays. This paper presents a large-scale far-field overlapping speech dataset, crafted to advance research in speech separation, recognition, and speaker diarization. This dataset is a critical resource for decoding "Who said What and When" in multi-talker, reverberant environments, a daunting challenge in the field. Additionally, we introduce a pipeline system encompassing speech separation, recognition, and diarization as a foundational benchmark. Evaluations on the WHAMR! dataset validate the broad applicability of the proposed data.

9/4/2024

Towards Solving Cocktail-Party: The First Method to Build a Realistic Dataset with Ground Truths for Speech Separation

Rawad Melhem, Assef Jafar, Oumayma Al Dakkak

Speech separation is very important in real-world applications such as human-machine interaction, hearing aids devices, and automatic meeting transcription. In recent years, a significant improvement occurred towards the solution based on deep learning. In fact, much attention has been drawn to supervised learning methods using synthetic mixtures datasets despite their being not representative of real-world mixtures. The difficulty in building a realistic dataset led researchers to use unsupervised learning methods, because of their ability to handle realistic mixtures directly. The results of unsupervised learning methods are still unconvincing. In this paper, a method is introduced to create a realistic dataset with ground truth sources for speech separation. The main challenge in designing a realistic dataset is the unavailability of ground truths for speakers signals. To address this, we propose a method for simultaneously recording two speakers and obtaining the ground truth for each. We present a methodology for benchmarking our realistic dataset using a deep learning model based on Bidirectional Gated Recurrent Units (BGRU) and clustering algorithm. The experiments show that our proposed dataset improved SI-SDR (Scale Invariant Signal to Distortion Ratio) by 1.65 dB and PESQ (Perceptual Evaluation of Speech Quality) by approximately 0.5. We also evaluated the effectiveness of our method at different distances between the microphone and the speakers and found that it improved the stability of the learned model.

8/29/2024