Evaluating Speech Enhancement Systems Through Listening Effort

Read original: arXiv:2405.07641 - Published 7/10/2024 by Femke B. Gelderblom, Tron V. Tronstad, Iván López-Espejo

Overview

This study proposes a simple method to simultaneously evaluate speech intelligibility and listening effort (LE) without additional burden on test subjects or operators.
The method is evaluated using data from two independent studies conducted in Norway and Denmark, involving 76 subjects across 9 processing conditions.
Despite differences in the evaluation setups, subject recruitment, and processing systems, the results show strikingly similar trends, demonstrating the robustness and ease of implementation of the proposed method.

Plain English Explanation

Understanding speech can be difficult, especially when the audio quality is poor. The extra listening effort (LE) this demands can be tiring for the listener. Measuring how well speech enhancement systems improve intelligibility and reduce LE shows whether they actually benefit listeners. However, existing methods for measuring LE are complex and not widely applicable.

This study introduces a straightforward approach to assess speech intelligibility and LE at the same time, without adding extra strain on the people involved in the tests. The researchers tested this method using data from two separate studies conducted in Norway and Denmark, with a total of 76 participants and 9 different audio processing conditions.

Even though the studies had some differences in their setups, subject selection, and processing systems, the results showed very similar patterns. This suggests that the proposed method is robust and easy to incorporate into existing practices for evaluating speech enhancement technologies.

Technical Explanation

The paper presents a simple method to simultaneously evaluate speech intelligibility and listening effort (LE) without additional burden on test subjects or operators. The researchers assessed this method using data from two independent studies conducted in Norway and Denmark, involving a total of 76 (50+26) subjects across 9 (6+3) processing conditions.

The experimental setup involved presenting subjects with speech samples processed under different conditions, such as the output of different speech enhancement systems. Subjects were asked to repeat the words they heard (a measure of intelligibility) and also to provide a rating of their perceived LE.
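The scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure from the paper: the trial fields (`condition`, `target`, `response`, `le_rating`), the word-matching rule, the numeric rating scale, and the example data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def word_score(target: str, response: str) -> float:
    """Fraction of target words found in the response (assumed scoring rule)."""
    target_words = target.lower().split()
    remaining = response.lower().split()
    if not target_words:
        return 0.0
    correct = 0
    for word in target_words:
        if word in remaining:
            # Consume the match so a repeated word is not counted twice.
            remaining.remove(word)
            correct += 1
    return correct / len(target_words)


def summarize(trials):
    """Mean intelligibility and mean LE rating per processing condition."""
    by_condition = defaultdict(list)
    for trial in trials:
        by_condition[trial["condition"]].append(trial)
    return {
        condition: {
            "intelligibility": mean(
                word_score(t["target"], t["response"]) for t in items
            ),
            "listening_effort": mean(t["le_rating"] for t in items),
        }
        for condition, items in by_condition.items()
    }


# Hypothetical trials; the ratings use an assumed numeric scale
# where a higher value means more effort.
trials = [
    {"condition": "unprocessed", "target": "the cat sat down",
     "response": "the cat sat", "le_rating": 6},
    {"condition": "unprocessed", "target": "open the door",
     "response": "open door", "le_rating": 5},
    {"condition": "enhanced", "target": "the cat sat down",
     "response": "the cat sat down", "le_rating": 3},
    {"condition": "enhanced", "target": "open the door",
     "response": "open the door", "le_rating": 2},
]

for condition, stats in summarize(trials).items():
    print(condition, stats)
```

Because both measures come from the same trial, each sentence presented gives one intelligibility score and one LE rating, which is why the approach adds no extra test runs.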

Despite differences in the evaluation setups, subject recruitment, and processing systems between the two studies, the results showed strikingly similar trends. This suggests that the proposed method is robust and can be easily integrated into existing practices for evaluating speech processing systems.

Critical Analysis

The paper presents a promising approach for simultaneously assessing speech intelligibility and listening effort, which addresses the limitations of existing complex methods. The use of data from two independent studies strengthens the findings and demonstrates the robustness of the proposed technique.

However, the paper does not provide detailed information on the specific processing conditions or the nature of the speech samples used in the experiments. Additionally, the study only involved a relatively small number of subjects, and the generalizability of the results to a broader population or different types of speech enhancement systems is not fully addressed.

Further research could apply this method to a wider range of speech processing approaches and investigate its sensitivity to different types of speech degradation and processing artifacts. Validating the method with larger and more diverse subject populations would also help establish its reliability and practical utility.

Conclusion

This study presents a simple and efficient method for simultaneously evaluating speech intelligibility and listening effort, which can be a valuable tool for assessing the performance of speech enhancement systems. The robust and consistent results across two independent studies suggest that this approach can be readily integrated into existing practices, providing researchers and developers with a practical way to objectively measure the benefits of their speech processing solutions for end-users.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Evaluating Speech Enhancement Systems Through Listening Effort

Femke B. Gelderblom, Tron V. Tronstad, Iván López-Espejo

Understanding degraded speech is demanding, requiring increased listening effort (LE). Evaluating processed and unprocessed speech with respect to LE can objectively indicate if speech enhancement systems benefit listeners. However, existing methods for measuring LE are complex and not widely applicable. In this study, we propose a simple method to evaluate speech intelligibility and LE simultaneously without additional strain on subjects or operators. We assess this method using results from two independent studies in Norway and Denmark, testing 76 (50+26) subjects across 9 (6+3) processing conditions. Despite differences in evaluation setups, subject recruitment, and processing systems, trends are strikingly similar, demonstrating the proposed method's robustness and ease of implementation into existing practices.

7/10/2024

Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge

Simon Leglaive, Matthieu Fraticelli, Hend ElGhazaly, Léonie Borne, Mostafa Sadeghi, Scott Wisdom, Manuel Pariente, John R. Hershey, Daniel Pressnitzer, Jon P. Barker

Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals. However, the synthetic training conditions may not accurately reflect real-world conditions encountered during testing. This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain. To tackle this issue, the UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models. Specifically, this test domain corresponds to the CHiME-5 dataset, characterized by real multi-speaker and conversational speech recordings made in noisy and reverberant domestic environments, for which ground-truth clean speech signals are not available. In this paper, we present the objective and subjective evaluations of the systems that were submitted to the CHiME-7 UDASE task, and we provide an analysis of the results. This analysis reveals a limited correlation between subjective ratings and several supervised nonintrusive performance metrics recently proposed for speech enhancement. Conversely, the results suggest that more traditional intrusive objective metrics can be used for in-domain performance evaluation using the reverberant LibriCHiME-5 dataset developed for the challenge. The subjective evaluation indicates that all systems successfully reduced the background noise, but always at the expense of increased distortion. Out of the four speech enhancement methods evaluated subjectively, only one demonstrated an improvement in overall quality compared to the unprocessed noisy speech, highlighting the difficulty of the task. The tools and audio material created for the CHiME-7 UDASE task are shared with the community.

7/11/2024

Using Speech Foundational Models in Loss Functions for Hearing Aid Speech Enhancement

Robert Sutherland, George Close, Thomas Hain, Stefan Goetze, Jon Barker

Machine learning techniques are an active area of research for speech enhancement for hearing aids, with one particular focus on improving the intelligibility of a noisy speech signal. Recent work has shown that feature encodings from self-supervised speech representation models can effectively capture speech intelligibility. In this work, it is shown that the distance between self-supervised speech representations of clean and noisy speech correlates more strongly with human intelligibility ratings than other signal-based metrics. Experiments show that training a speech enhancement model using this distance as part of a loss function improves the performance over using an SNR-based loss function, demonstrated by an increase in HASPI, STOI, PESQ and SI-SNR scores. This method takes inference of a high parameter count model only at training time, meaning the speech enhancement model can remain smaller, as is required for hearing aids.

7/19/2024

Ultra-Low Latency Speech Enhancement - A Comprehensive Study

Haibin Wu, Sebastian Braun

Speech enhancement models should meet very low latency requirements typically smaller than 5 ms for hearing assistive devices. While various low-latency techniques have been proposed, comparing these methods in a controlled setup using DNNs remains blank. Previous papers have variations in task, training data, scripts, and evaluation settings, which make fair comparison impossible. Moreover, all methods are tested on small, simulated datasets, making it difficult to fairly assess their performance in real-world conditions, which could impact the reliability of scientific findings. To address these issues, we comprehensively investigate various low-latency techniques using consistent training on large-scale data and evaluate with more relevant metrics on real-world data. Specifically, we explore the effectiveness of asymmetric windows, learnable windows, adaptive time domain filterbanks, and the future-frame prediction technique. Additionally, we examine whether increasing the model size can compensate for the reduced window size, as well as the novel Mamba architecture in low-latency environments.

9/17/2024