An automatic mixing speech enhancement system for multi-track audio

Read original: arXiv:2404.17821 - Published 4/30/2024 by Xiaojing Liu, Angeliki Mourgela, Hongwei Ai, Joshua D. Reiss

Overview

This paper proposes a speech enhancement system for multitrack audio, which aims to minimize auditory masking while allowing listeners to hear multiple simultaneous speakers.
The system can be used in various communication scenarios such as teleconferencing, video gaming, and live streaming.
The ITU-R BS.1387 Perceptual Evaluation of Audio Quality (PEAQ) model is used to evaluate the amount of masking in the audio signals.
Different audio effects, including level balance, equalization, dynamic range compression, and spatialization, are applied via an iterative Harmony Search algorithm to minimize the masking.
Subjective listening tests show that the designed system can compete with mixes by professional sound engineers and outperforms mixes by existing auto-mixing systems.

Plain English Explanation

The proposed speech enhancement system aims to improve the audio quality in situations where multiple people are speaking at the same time, such as in teleconferencing, video gaming, or live streaming. The system uses a specific audio quality model to identify and reduce the amount of "masking" that occurs when multiple voices overlap. Masking happens when one sound makes it harder to hear another sound.

To reduce masking, the system applies a variety of audio processing techniques, such as adjusting the volume levels, equalizing the frequencies, compressing the dynamic range, and spatializing the audio. These adjustments are made using an iterative algorithm that aims to optimize the audio quality.

When tested, the designed system was able to produce audio mixes that were as good as those created by professional sound engineers, and better than mixes created by existing automatic mixing systems. This suggests the system could be useful in a variety of real-world communication scenarios where clear audio is important.

Technical Explanation

The proposed speech enhancement system utilizes the ITU-R BS.1387 Perceptual Evaluation of Audio Quality (PEAQ) model to assess the amount of auditory masking in the input audio signals. This model provides an objective measure of audio quality that considers the human perception of sound.
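To make the idea of a masking measure concrete, here is a minimal sketch of a toy proxy. It is not PEAQ: the real ITU-R BS.1387 model compares excitation patterns produced by an ear model, whereas this example only compares raw per-band powers. The function names `to_db` and `toy_masking_score`, the `margin_db` parameter, and the band values are all hypothetical and chosen for illustration.

```python
import math


def to_db(power, floor=1e-12):
    """Convert a linear power value to decibels, guarding against log(0)."""
    return 10.0 * math.log10(max(power, floor))


def toy_masking_score(target_bands, masker_bands, margin_db=0.0):
    """Sum, over frequency bands, of how far the masker exceeds the target in dB.

    Hypothetical proxy only. PEAQ itself works on excitation patterns from a
    perceptual ear model, not on raw band powers as done here.
    """
    if len(target_bands) != len(masker_bands):
        raise ValueError("band lists must have the same length")
    score = 0.0
    for target_power, masker_power in zip(target_bands, masker_bands):
        excess_db = to_db(masker_power) - to_db(target_power) + margin_db
        # Only bands where the masker dominates contribute to the score.
        score += max(0.0, excess_db)
    return score


# One speaker is the target; the other tracks summed together act as the masker.
target = [1.0, 0.5, 0.1]   # assumed per-band powers of the target speaker
masker = [0.1, 0.5, 1.0]   # assumed per-band powers of the competing tracks
print(toy_masking_score(target, masker))
```

In this toy setup only the third band contributes, because it is the only band where the competing tracks are louder than the target. An optimizer would try to lower such a score by changing the effect parameters of each track.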

The system then applies various audio effects, including level balance, equalization, dynamic range compression, and spatialization, to the input audio. These adjustments are made iteratively using a Harmony Search algorithm, which aims to minimize the PEAQ-based masking metric.
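The kinds of parameters such an optimizer adjusts can be sketched with textbook versions of three of the effects: level (gain in dB), spatialization (constant-power stereo panning), and a static hard-knee compressor curve. This is an illustrative sketch and not the paper's implementation; equalization is omitted for brevity, and every function name, default threshold, and ratio below is an assumption.

```python
import math


def db_to_gain(db):
    """Convert a level change in dB to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def constant_power_pan(sample, pan):
    """Place a mono sample in stereo. pan is in [-1, 1]: -1 left, 0 center, +1 right."""
    angle = (pan + 1.0) * math.pi / 4.0
    return sample * math.cos(angle), sample * math.sin(angle)


def compressor_gain_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Static hard-knee compressor: gain change in dB for a given input level."""
    if level_db <= threshold_db:
        return 0.0  # below threshold the signal passes unchanged
    return (threshold_db - level_db) * (1.0 - 1.0 / ratio)


def mix_tracks(tracks, gains_db, pans):
    """Mix mono tracks (lists of floats) to stereo with per-track gain and pan."""
    length = max((len(track) for track in tracks), default=0)
    left = [0.0] * length
    right = [0.0] * length
    for track, gain_db, pan in zip(tracks, gains_db, pans):
        gain = db_to_gain(gain_db)
        for i, sample in enumerate(track):
            l, r = constant_power_pan(sample * gain, pan)
            left[i] += l
            right[i] += r
    return left, right


# Two hypothetical speakers: one panned left at full level, one panned right at -6 dB.
left, right = mix_tracks([[0.5, 0.5], [0.2, -0.2]], [0.0, -6.0], [-1.0, 1.0])
```

Each track contributes a small vector of numbers (gain, pan, compressor settings), so a full mix is described by one parameter vector that a search algorithm can vary.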

The Harmony Search algorithm is a metaheuristic optimization technique inspired by the improvisation process of musical ensembles. In this case, it is used to efficiently explore the vast parameter space of the audio processing effects to find the optimal combination that reduces masking while preserving the intelligibility of the multiple speakers.
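A minimal sketch of a basic Harmony Search loop is shown below, using the algorithm's standard terminology: harmony memory size (`hms`), harmony memory considering rate (`hmcr`), and pitch adjusting rate (`par`). The paper's actual settings and objective are not given in this summary, so the default values, the `bandwidth` parameter, and the quadratic objective here are assumptions. The quadratic merely stands in for a masking metric.

```python
import random


def harmony_search(objective, bounds, hms=10, hmcr=0.9, par=0.3,
                   bandwidth=0.05, iterations=2000, seed=0):
    """Minimize objective over box bounds with a basic Harmony Search.

    All default values are illustrative assumptions, not the paper's settings.
    """
    rng = random.Random(seed)
    # Harmony memory: a pool of candidate parameter vectors and their scores.
    memory = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(hms)]
    scores = [objective(h) for h in memory]

    for _ in range(iterations):
        new = []
        for d, (lo, hi) in enumerate(bounds):
            if rng.random() < hmcr:
                # Memory consideration: reuse a value from a stored harmony.
                value = memory[rng.randrange(hms)][d]
                if rng.random() < par:
                    # Pitch adjustment: nudge the reused value slightly.
                    value += rng.uniform(-1.0, 1.0) * bandwidth * (hi - lo)
            else:
                # Random improvisation: draw a fresh value from the full range.
                value = rng.uniform(lo, hi)
            new.append(min(hi, max(lo, value)))

        new_score = objective(new)
        worst = max(range(hms), key=scores.__getitem__)
        if new_score < scores[worst]:
            # Replace the worst stored harmony when the new one is better.
            memory[worst] = new
            scores[worst] = new_score

    best = min(range(hms), key=scores.__getitem__)
    return memory[best], scores[best]


def toy_objective(params):
    """Stand-in for a masking metric: squared distance from the point (1, 1, 1)."""
    return sum((p - 1.0) ** 2 for p in params)


best_params, best_score = harmony_search(toy_objective, [(-5.0, 5.0)] * 3)
```

In a mixing context, `params` would hold the effect settings for all tracks and the objective would render the mix and score its masking, which is far more expensive than this toy function.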

Subjective listening tests were conducted to evaluate the performance of the designed system. The results show that the system can produce audio mixes that are comparable to those created by professional sound engineers and that outperform mixes from existing auto-mixing systems.

Critical Analysis

The paper provides a thorough explanation of the proposed speech enhancement system and its evaluation, but there are a few potential areas for further research and improvement:

Scalability: The paper focuses on the performance of the system with a small number of speakers. It would be interesting to see how the system scales to handle a larger number of simultaneous speakers, as this is a common challenge in real-world communication scenarios like large group meetings.
Real-time performance: The paper does not explicitly address the computational requirements and latency of the system, which are crucial factors for real-time applications like teleconferencing. Further research could explore ways to optimize the system for low-latency, high-performance operation.
Personalization: The current system applies a one-size-fits-all approach to audio processing. Incorporating personalization features, such as user preferences or hearing profiles, could potentially improve the subjective experience for individual listeners.
Robustness: The paper does not discuss the system's performance in the presence of various types of noise or other audio distortions. Evaluating the system's robustness to real-world conditions would be an important next step.

Overall, the proposed speech enhancement system shows promising results and could have valuable applications in a variety of communication scenarios. Further research and development in the areas mentioned above could help unlock the full potential of this technology.

Conclusion

This paper presents a speech enhancement system that aims to minimize auditory masking in multitrack audio, allowing listeners to clearly hear multiple simultaneous speakers. The system uses the ITU-R BS.1387 PEAQ model to estimate the amount of masking and applies level balance, equalization, dynamic range compression, and spatialization, with parameters chosen by a Harmony Search algorithm, to optimize the audio mix.

Subjective listening tests indicate that the designed system can produce audio mixes that are comparable to those created by professional sound engineers and that outperform mixes from existing auto-mixing systems. This suggests the system could be a valuable tool for improving audio quality in a wide range of communication applications, from teleconferencing to live streaming.

While the paper provides a solid foundation, further research is needed to explore the scalability, real-time performance, personalization, and robustness of the system. Addressing these areas could help unlock the full potential of this technology and bring high-quality, intelligible audio to an even broader range of communication scenarios.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

An automatic mixing speech enhancement system for multi-track audio

Xiaojing Liu, Angeliki Mourgela, Hongwei Ai, Joshua D. Reiss

We propose a speech enhancement system for multitrack audio. The system will minimize auditory masking while allowing one to hear multiple simultaneous speakers. The system can be used in multiple communication scenarios e.g., teleconferencing, invoice gaming, and live streaming. The ITU-R BS.1387 Perceptual Evaluation of Audio Quality (PEAQ) model is used to evaluate the amount of masking in the audio signals. Different audio effects e.g., level balance, equalization, dynamic range compression, and spatialization are applied via an iterative Harmony searching algorithm that aims to minimize the masking. In the subjective listening test, the designed system can compete with mixes by professional sound engineers and outperforms mixes by existing auto-mixing systems.

4/30/2024

Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance

Manuel Milling, Shuo Liu, Andreas Triantafyllopoulos, Ilhan Aslan, Bjorn W. Schuller

Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination for real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is explicitly used at the front-end of the target audio applications. In this paper, we present an end-to-end learning solution to jointly optimise the models for audio enhancement (AE) and the subsequent applications. To guide the optimisation of the AE module towards a target application, and especially to overcome difficult samples, we make use of the sample-wise performance measure as an indication of sample importance. In experiments, we consider four representative applications to evaluate our training paradigm, i.e., ASR, speech command recognition (SCR), speech emotion recognition (SER), and ASC. These applications are associated with speech and non-speech tasks concerning semantic and non-semantic features, transient and global information, and the experimental results indicate that our proposed approach can considerably boost the noise robustness of the models, especially at low signal-to-noise ratios (SNRs), for a wide range of computer audition tasks in everyday-life noisy environments.

8/13/2024

Flexible Multichannel Speech Enhancement for Noise-Robust Frontend

Ante Jukić, Jagadeesh Balam, Boris Ginsburg

This paper proposes a flexible multichannel speech enhancement system with the main goal of improving robustness of automatic speech recognition (ASR) in noisy conditions. The proposed system combines a flexible neural mask estimator applicable to different channel counts and configurations and a multichannel filter with automatic reference selection. A transform-attend-concatenate layer is proposed to handle cross-channel information in the mask estimator, which is shown to be effective for arbitrary microphone configurations. The presented evaluation demonstrates the effectiveness of the flexible system for several seen and unseen compact array geometries, matching the performance of fixed configuration-specific systems. Furthermore, a significantly improved ASR performance is observed for configurations with randomly-placed microphones.

6/10/2024

Personalized Speech Enhancement Without a Separate Speaker Embedding Model

Tanel Parnamaa, Ando Saabas

Personalized speech enhancement (PSE) models can improve the audio quality of teleconferencing systems by adapting to the characteristics of a speaker's voice. However, most existing methods require a separate speaker embedding model to extract a vector representation of the speaker from enrollment audio, which adds complexity to the training and deployment process. We propose to use the internal representation of the PSE model itself as the speaker embedding, thereby avoiding the need for a separate model. We show that our approach performs equally well or better than the standard method of using a pre-trained speaker embedding model on noise suppression and echo cancellation tasks. Moreover, our approach surpasses the ICASSP 2023 Deep Noise Suppression Challenge winner by 0.15 in Mean Opinion Score.

6/17/2024