Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models

Read original: arXiv:2407.04482 - Published 7/8/2024 by Vyas Raina, Mark Gales

Overview

This paper explores how universal acoustic adversarial attacks can control speech foundation models such as Whisper.
The researchers show that a short adversarial audio segment prepended to the input can override the model's task setting, making Whisper translate speech when it has been set to transcribe it.
The paper presents a detailed technical explanation of the attack methodology and evaluates its effectiveness across various scenarios.
The authors also discuss potential countermeasures and areas for future research to address the security implications of these types of attacks.

Plain English Explanation

The researchers in this paper have found a way to take control of speech foundation models such as the popular Whisper model. They do this by crafting a short audio segment that is played before the actual speech. The segment changes which task the model performs, even though the attacker never touches the model's prompt or settings.

For example, a Whisper system that has been set to transcribe a French recording as French text could instead be pushed into outputting an English translation of it. These attacks are called "universal" because the same segment works across a wide range of audio inputs, not just specific ones.

The researchers show that the audio input alone can be enough to control what the speech model does, which is a risk for any system that relies on the model staying in its configured mode. They also discuss ways that these attacks could be detected and defended against to improve the security and reliability of speech recognition technologies.

Technical Explanation

The core of the attack is a short, universal adversarial audio segment that is prepended to any input speech signal. When the combined audio is fed into the target model (in this case, Whisper), the segment overrides the prompt setting: the model performs speech translation even though it has been configured for speech transcription. The attacker needs no access to the model prompt.
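The paper's abstract describes the attack as prepending one short universal segment to any input speech signal. The sketch below illustrates only that assembly step, using plain Python lists as stand-in waveforms. The function name prepend_segment, the 0.5-second segment length, and the 0.02 amplitude bound are assumptions made for illustration, not values taken from the paper.

```python
from typing import List

SAMPLE_RATE = 16_000  # Whisper operates on 16 kHz audio


def prepend_segment(segment: List[float], speech: List[float],
                    max_amplitude: float = 0.02) -> List[float]:
    """Return segment + speech, with the segment clipped to +/- max_amplitude.

    max_amplitude is a hypothetical bound that keeps the segment quiet.
    """
    clipped = [max(-max_amplitude, min(max_amplitude, s)) for s in segment]
    return clipped + list(speech)


# Hypothetical data: one 0.5 s segment and two 1 s "utterances".
segment = [0.05 if i % 2 == 0 else -0.05 for i in range(SAMPLE_RATE // 2)]
utterances = [[0.1] * SAMPLE_RATE, [-0.2] * SAMPLE_RATE]

# The same segment is reused for every utterance; that reuse is what
# makes the attack "universal".
attacked = [prepend_segment(segment, u) for u in utterances]
```

The speech samples themselves are left untouched; only the audio in front of them changes.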

The researchers use an optimization-based approach to craft the segment. The goal is to find a single segment that, across many training utterances, makes the target model produce the attacker's desired behaviour. They solve this problem with gradient-based techniques, using the differentiability of the speech model to search for an effective segment.
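To make the idea of gradient-based optimization of one shared segment concrete, here is a toy sketch of projected gradient descent. It does not use Whisper: the "model" is an invented logistic scorer whose weights are random, and every constant (SEG_LEN, EPSILON, LEARNING_RATE, STEPS) is a hypothetical choice. It only shows the shape of the loop: average the loss gradient over many utterances, step the segment, and clip it back into an amplitude bound.

```python
import math
import random

random.seed(0)

SEG_LEN = 8          # toy segment length (real segments span thousands of samples)
EPSILON = 0.5        # hypothetical amplitude bound on the segment
LEARNING_RATE = 0.5  # hypothetical step size
STEPS = 200

# Toy stand-in for the model:
#   p(target task) = sigmoid(W . segment + V * mean(speech) + B)
# These weights are invented for illustration and are unrelated to Whisper.
W = [random.uniform(-1.0, 1.0) for _ in range(SEG_LEN)]
V, B = 1.0, -2.0


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def p_target(segment, speech) -> float:
    """Probability that the toy model performs the attacker's target task."""
    score = sum(w * s for w, s in zip(W, segment))
    score += V * (sum(speech) / len(speech)) + B
    return sigmoid(score)


def mean_loss(segment, dataset) -> float:
    """Average negative log-likelihood of the target task over a dataset."""
    return -sum(math.log(p_target(segment, x)) for x in dataset) / len(dataset)


def train_universal_segment(dataset):
    segment = [0.0] * SEG_LEN
    for _ in range(STEPS):
        # d(-log p)/d(segment_i) = -(1 - p) * W_i, averaged over all utterances.
        coeff = sum(1.0 - p_target(segment, x) for x in dataset) / len(dataset)
        grad = [-coeff * w for w in W]
        # Gradient step, then project back into the amplitude bound.
        segment = [max(-EPSILON, min(EPSILON, s - LEARNING_RATE * g))
                   for s, g in zip(segment, grad)]
    return segment


train_set = [[random.uniform(-0.5, 0.5) for _ in range(16)] for _ in range(20)]
held_out = [[random.uniform(-0.5, 0.5) for _ in range(16)] for _ in range(5)]

learned = train_universal_segment(train_set)
baseline = [0.0] * SEG_LEN  # a silent segment, for comparison
```

Comparing mean_loss(learned, held_out) with mean_loss(baseline, held_out) gives a rough check of whether the segment generalizes to utterances it was not trained on, which is the property "universal" refers to. In a real attack the gradient would come from backpropagating through the actual speech model.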

Through extensive experiments, the researchers demonstrate the effectiveness of their universal acoustic attacks across a variety of settings. They show that the attacks can work even when the target model is fine-tuned on specific domains, and that the perturbations are robust to various types of real-world audio distortions.

Critical Analysis

While the researchers' work highlights significant security vulnerabilities in current speech recognition systems, there are some important caveats to consider. First, the attacks require the attackers to have access to the target model and the ability to generate adversarial audio, which may not always be the case in real-world scenarios.

Additionally, the researchers primarily evaluate their attacks on the Whisper model, which is a large, general-purpose speech recognition system. It's unclear how well the attacks would translate to more specialized, domain-specific speech models, which may have different vulnerabilities and defense mechanisms.

The paper also does not explore potential countermeasures in depth. While the authors discuss some high-level approaches, such as adversarial training and input sanitization, more research is needed to develop robust and practical defenses against these types of attacks.

Conclusion

This paper demonstrates the alarming potential of universal acoustic adversarial attacks to manipulate the outputs of speech foundation models like Whisper. The researchers' work highlights the need for continued research into the security and robustness of these models, as they become increasingly ubiquitous in a wide range of applications.

By understanding the vulnerabilities exposed in this paper, researchers and practitioners can work to develop more secure and resilient speech recognition systems that are better equipped to withstand malicious attempts to control their outputs. Addressing these challenges will be crucial as speech-based technologies continue to play a larger role in our lives and the systems we rely on.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models

Vyas Raina, Mark Gales

Speech enabled foundation models, either in the form of flexible speech recognition based systems or audio-prompted large language models (LLMs), are becoming increasingly popular. One of the interesting aspects of these models is their ability to perform tasks other than automatic speech recognition (ASR) using an appropriate prompt. For example, the OpenAI Whisper model can perform both speech transcription and speech translation. With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks. Without any access to the model prompt it is possible to modify the behaviour of the system by appropriately changing the audio input. To illustrate this risk, we demonstrate that it is possible to prepend a short universal adversarial acoustic segment to any input speech signal to override the prompt setting of an ASR foundation model. Specifically, we successfully use a universal adversarial acoustic segment to control Whisper to always perform speech translation, despite being set to perform speech transcription. Overall, this work demonstrates a new form of adversarial attack on multi-tasking speech enabled foundation models that needs to be considered prior to the deployment of this form of model.

7/8/2024

Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models

Vyas Raina, Rao Ma, Charles McGhee, Kate Knill, Mark Gales

Recent developments in large speech foundation models like Whisper have led to their widespread use in many automatic speech recognition (ASR) applications. These systems incorporate 'special tokens' in their vocabulary, such as <|endoftext|>, to guide their language generation process. However, we demonstrate that these tokens can be exploited by adversarial attacks to manipulate the model's behavior. We propose a simple yet effective method to learn a universal acoustic realization of Whisper's <|endoftext|> token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively 'muting' the model. Our experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97% of speech samples. Moreover, we find that this universal adversarial audio segment often transfers to new datasets and tasks. Overall this work demonstrates the vulnerability of Whisper models to 'muting' adversarial attacks, where such attacks can pose both risks and potential benefits in real-world settings: for example the attack can be used to bypass speech moderation systems, or conversely the attack can also be used to protect private speech data.

7/18/2024

Self-Supervised Models in Automatic Whispered Speech Recognition

Aref Farhadipour, Homa Asadi, Volker Dellwo

In automatic speech recognition, any factor that alters the acoustic properties of speech can pose a challenge to the system's performance. This paper presents a novel approach for automatic whispered speech recognition in the Irish dialect using the self-supervised WavLM model. Conventional automatic speech recognition systems often fail to accurately recognise whispered speech due to its distinct acoustic properties and the scarcity of relevant training data. To address this challenge, we utilized a pre-trained WavLM model, fine-tuned with a combination of whispered and normal speech data from the wTIMIT and CHAINS datasets, which include the English language in Singaporean and Irish dialects, respectively. Our baseline evaluation with the OpenAI Whisper model highlighted its limitations, achieving a Word Error Rate (WER) of 18.8% on whispered speech. In contrast, the proposed WavLM-based system significantly improved performance, achieving a WER of 9.22%. These results demonstrate the efficacy of our approach in recognising whispered speech and underscore the importance of tailored acoustic modeling for robust automatic speech recognition systems. This study provides valuable insights into developing effective automatic speech recognition solutions for challenging speech affected by whisper and dialect. The source codes for this paper are freely available.

8/1/2024

Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System

Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng

Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to simultaneously address both tasks. In this study, we propose a pioneering approach to empower Whisper, which is a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. Specifically, (i) we freeze Whisper and plug a Sidecar separator into its encoder to separate mixed embedding for multiple talkers; (ii) a Target Talker Identifier is introduced to identify the embedding flow of the target talker on the fly, requiring only three-second enrollment speech as a cue; (iii) soft prompt tuning for decoder is explored for better task adaptation. Our method outperforms previous methods on two- and three-talker LibriMix and LibriSpeechMix datasets for both tasks, and delivers acceptable zero-shot performance on multi-talker ASR on AishellMix Mandarin dataset.

8/27/2024