Does Whisper understand Swiss German? An automatic, qualitative, and human evaluation

2404.19310

Published 5/10/2024 by Eyal Liron Dolev, Clemens Fidel Lutz, Noemi Aepli

Abstract

Whisper is a state-of-the-art automatic speech recognition (ASR) model (Radford et al., 2022). Although Swiss German dialects are allegedly not part of Whisper's training data, preliminary experiments showed that Whisper can transcribe Swiss German quite well, with the output being a speech translation into Standard German. To gain a better understanding of Whisper's performance on Swiss German, we systematically evaluate it using automatic, qualitative, and human evaluation. We test its performance on three existing test sets: SwissDial (Dogan-Schönberger et al., 2021), STT4SG-350 (Plüss et al., 2023), and Swiss Parliaments Corpus (Plüss et al., 2021). In addition, we create a new test set for this work, based on short mock clinical interviews. For automatic evaluation, we used word error rate (WER) and BLEU. In the qualitative analysis, we discuss Whisper's strengths and weaknesses and analyze some output examples. For the human evaluation, we conducted a survey with 28 participants who were asked to evaluate Whisper's performance. All of our evaluations suggest that Whisper is a viable ASR system for Swiss German, so long as the Standard German output is desired.

Overview

The paper evaluates the performance of the Whisper automatic speech recognition (ASR) model on Swiss German dialects.
The researchers systematically tested Whisper's transcription accuracy using existing test sets and a new custom test set.
They conducted automatic, qualitative, and human evaluations to assess Whisper's strengths and weaknesses for Swiss German.

Plain English Explanation

The paper looks at how well the Whisper speech recognition model can handle Swiss German dialects, even though they reportedly weren't part of the data it was trained on. The researchers ran a bunch of tests to see how Whisper performs, including:

Using standard metrics like word error rate (WER) and BLEU to automatically evaluate the transcription accuracy.
Doing a qualitative analysis to highlight Whisper's strengths and weaknesses, and looking at some example outputs.
Getting 28 people to evaluate Whisper's performance in a survey.
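
To make the BLEU metric from the list above more concrete, here is a minimal sketch of a sentence-level BLEU score in Python. It is an illustration only: the function names `ngrams` and `simple_bleu` are hypothetical, the example sentences are invented, and the paper does not say which BLEU implementation or settings it used, so real scores from a standard toolkit will differ (this sketch has no smoothing and uses plain whitespace tokenization).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU against a single reference.

    Assumption: whitespace tokenization and uniform weights over 1..max_n grams.
    Returns 0.0 as soon as any n-gram order has no match.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Invented Standard German sentences, not taken from the paper's test sets.
    ref = "ich habe seit gestern starke Kopfschmerzen und Fieber"
    hyp = "ich habe seit gestern starke Kopfschmerzen"
    print(round(simple_bleu(ref, hyp), 3))
```

Because Whisper's output is a translation into Standard German rather than a verbatim dialect transcript, an overlap-based translation metric like this is a natural companion to WER.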

Overall, the results suggest that Whisper can be a useful speech recognition system for Swiss German, as long as you're okay with it outputting the transcription in standard German rather than the original Swiss German dialect.

Technical Explanation

The researchers evaluated the performance of the Whisper ASR model on Swiss German dialects, which are reportedly not part of the model's training data. Whisper's output for Swiss German speech is effectively a translation into Standard German. They used three existing test sets - SwissDial, STT4SG-350, and the Swiss Parliaments Corpus - as well as a new custom test set based on short mock clinical interviews.

For the automatic evaluation, they used standard metrics like word error rate (WER) and BLEU. The qualitative analysis looked at Whisper's strengths and weaknesses and provided example outputs. The human evaluation involved a survey with 28 participants.
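
As a rough illustration of the word error rate mentioned above, the following Python sketch computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate`, the lowercasing step, and the example sentences are assumptions made for this sketch; the paper's exact text normalization and tooling are not described in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words.

    Assumption: text is lowercased and split on whitespace; real evaluations
    often also strip punctuation and normalize numbers.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Invented example: one substituted word out of four.
    print(word_error_rate("das ist ein Test", "das war ein Test"))
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, which is one reason it is usually reported alongside a second metric such as BLEU.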

Critical Analysis

The paper provides a thorough evaluation of Whisper's performance on Swiss German, addressing an important gap since Swiss German dialects are reportedly absent from Whisper's training data. The researchers combined several test sets with automatic, qualitative, and human evaluation to gain a well-rounded understanding of Whisper's capabilities.

However, the paper does not delve into the potential reasons why Whisper performs well on Swiss German, despite reportedly not being trained on it. Further research could investigate the underlying factors that enable this cross-lingual transfer. Additionally, the human evaluation was limited to 28 participants, so a larger-scale study may provide more robust insights.

Conclusion

This study demonstrates that the Whisper ASR model can be a viable solution for transcribing Swiss German, provided the output in standard German is acceptable. The comprehensive evaluation approach used by the researchers offers valuable insights into Whisper's strengths and limitations for this task. These findings contribute to our understanding of the capabilities and limitations of state-of-the-art speech recognition models, which is crucial as they become more widely deployed.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Efficient Compression of Multitask Multilingual Speech Models

Thomas Palmeira Ferraz

Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) bias. Despite that, we show that only model-related biases are amplified by quantization, impacting more low-resource languages and smaller models. Searching for a better compression approach, we propose DistilWhisper, an approach that is able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2. This dual approach allows us to effectively boost ASR performance while keeping the robustness inherited from the multitask and multilingual pre-training. Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters, boosting performance in the targeted languages for both in- and out-of-domain test sets, while introducing only a negligible parameter overhead at inference.

5/3/2024

cs.CL cs.AI cs.SD eess.AS

Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults

Ahmed Adel Attia, Jing Liu, Wei Ai, Dorottya Demszky, Carol Espy-Wilson

Recent advancements in Automatic Speech Recognition (ASR) systems, exemplified by Whisper, have demonstrated the potential of these systems to approach human-level performance given sufficient data. However, this progress doesn't readily extend to ASR for children due to the limited availability of suitable child-specific databases and the distinct characteristics of children's speech. A recent study investigated leveraging the My Science Tutor (MyST) children's speech corpus to enhance Whisper's performance in recognizing children's speech. They were able to demonstrate some improvement on a limited testset. This paper builds on these findings by enhancing the utility of the MyST dataset through more efficient data preprocessing. We reduce the Word Error Rate (WER) on the MyST testset from 13.93% to 9.11% with Whisper-Small and from 13.23% to 8.61% with Whisper-Medium and show that this improvement can be generalized to unseen datasets. We also highlight important challenges towards improving children's ASR performance. The results showcase the viable and efficient integration of Whisper for effective children's speech recognition.

5/16/2024

eess.AS cs.CL cs.SD

Prompting Whisper for QA-driven Zero-shot End-to-end Spoken Language Understanding

Mohan Li, Simon Keizer, Rama Doddipatla

Zero-shot spoken language understanding (SLU) enables systems to comprehend user utterances in new domains without prior exposure to training data. Recent studies often rely on large language models (LLMs), leading to excessive footprints and complexity. This paper proposes the use of Whisper, a standalone speech processing model, for zero-shot end-to-end (E2E) SLU. To handle unseen semantic labels, SLU tasks are integrated into a question-answering (QA) framework, which prompts the Whisper decoder for semantics deduction. The system is efficiently trained with prefix-tuning, optimising a minimal set of parameters rather than the entire Whisper model. We show that the proposed system achieves a 40.7% absolute gain for slot filling (SLU-F1) on SLURP compared to a recently introduced zero-shot benchmark. Furthermore, it performs comparably to a Whisper-GPT-2 modular system under both in-corpus and cross-corpus evaluation settings, but with a relative 34.8% reduction in model parameters.

6/24/2024

eess.AS

Keyword-Guided Adaptation of Automatic Speech Recognition

Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet

Automatic Speech Recognition (ASR) technology has made significant progress in recent years, providing accurate transcription across various domains. However, some challenges remain, especially in noisy environments and with specialized jargon. In this paper, we propose a novel approach for improved jargon word recognition by contextually biasing Whisper-based models. We employ a keyword spotting model that leverages the Whisper encoder representation to dynamically generate prompts for guiding the decoder during the transcription process. We introduce two approaches to effectively steer the decoder towards these prompts: KG-Whisper, which is aimed at fine-tuning the Whisper decoder, and KG-Whisper-PT, which learns a prompt prefix. Our results show a significant improvement in the recognition accuracy of specified keywords and in reducing the overall word error rates. Specifically, in unseen language generalization, we demonstrate an average WER improvement of 5.1% over Whisper.

6/6/2024

eess.AS cs.LG cs.SD