Target Speaker ASR with Whisper

Read original: arXiv:2409.09543 - Published 9/17/2024 by Alexander Polok, Dominik Klement, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget

Overview

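The paper proposes a method for target speaker automatic speech recognition (ASR) built on Whisper, a large pre-trained single-speaker ASR model.
It introduces a diarization conditioning technique: the model is conditioned on frame-level diarization outputs so that it transcribes one chosen target speaker in a multi-speaker recording.
Applied to each hypothesized speaker in turn, the method outperforms cascades of speech separation and diarization by 11% absolute ORC-WER on the NOTSOFAR-1 dataset.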

Plain English Explanation

The paper describes a new way to use a powerful speech recognition model called Whisper to perform target speaker ASR. This means the system transcribes the speech of one specific person in a multi-speaker audio recording.

The key innovation is a "diarization conditioning" technique that lets Whisper focus on a target speaker rather than trying to transcribe all speakers at once. Instead of learning what every possible voice sounds like, the model is told, moment by moment, who is speaking, which the authors argue is a much easier thing to learn.

The paper reports that this approach outperforms cascades of speech separation and diarization on the NOTSOFAR-1 dataset. This could be useful for applications like meeting transcription, virtual assistants, and audio analysis, where you want to isolate and transcribe the speech of a particular person.

Technical Explanation

The paper proposes a method for target speaker ASR using Whisper. Whisper is a large, pre-trained speech recognition model that can transcribe speech in a wide range of languages and acoustic conditions, but it is designed for single-speaker input.

The key innovation is a "diarization conditioning" technique that allows the Whisper model to focus on a target speaker during the ASR process. Diarization is the task of identifying which audio segments belong to which speakers in a multi-speaker recording.
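To make the notion of a diarization output concrete, here is a minimal Python sketch that turns "who spoke when" segments into a per-frame activity track for each speaker. The speaker names, segment times, the 20 ms frame hop, and the function name are all illustrative assumptions, not details taken from the paper.

```python
# Hypothetical diarization output: (speaker, start_sec, end_sec) segments.
# The names, times, and 20 ms frame hop are illustrative assumptions.
FRAME_HOP_SEC = 0.02


def segments_to_frame_activity(segments, speakers, duration_sec, hop=FRAME_HOP_SEC):
    """Return {speaker: [0/1 per frame]} marking when each speaker is active."""
    num_frames = round(duration_sec / hop)
    activity = {spk: [0] * num_frames for spk in speakers}
    for spk, start, end in segments:
        # Clamp the segment to the recording and mark its frames as active.
        first = max(0, round(start / hop))
        last = min(num_frames, round(end / hop))
        for t in range(first, last):
            activity[spk][t] = 1
    return activity


segments = [
    ("alice", 0.00, 0.06),
    ("bob", 0.04, 0.10),
]
activity = segments_to_frame_activity(segments, ["alice", "bob"], duration_sec=0.12)
print(activity["alice"])  # [1, 1, 1, 0, 0, 0]
print(activity["bob"])    # [0, 0, 1, 1, 1, 0]
```

In this toy example both speakers are active in the third frame, which is the kind of overlap that makes multi-speaker transcription hard.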

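The method works by first performing speaker diarization to identify the different speakers in the audio. It then conditions the Whisper model on the target speaker's frame-level diarization output, effectively telling the model which parts of the audio to focus on. According to the abstract, adding even a single bias term per diarization output type before the first transformer block is enough to turn a single-speaker ASR model into a target speaker one.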
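The conditioning step can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not the authors' implementation: it assumes each frame is labeled relative to the target speaker as silence, target, non-target, or overlap, and that one bias vector per label is added to that frame's features before they enter the first transformer block. The function names, the two-dimensional features, and the fixed bias values are hypothetical; in a real model the biases would be learned.

```python
# Assumed frame classes relative to the chosen target speaker.
SILENCE, TARGET, NON_TARGET, OVERLAP = range(4)


def frame_classes(activity, target):
    """Label each frame relative to the target speaker."""
    labels = []
    for t in range(len(activity[target])):
        target_on = activity[target][t] == 1
        others_on = any(activity[s][t] == 1 for s in activity if s != target)
        if target_on and others_on:
            labels.append(OVERLAP)
        elif target_on:
            labels.append(TARGET)
        elif others_on:
            labels.append(NON_TARGET)
        else:
            labels.append(SILENCE)
    return labels


def add_class_bias(features, labels, biases):
    """Add the bias vector of each frame's class to that frame's features."""
    return [
        [x + b for x, b in zip(frame, biases[label])]
        for frame, label in zip(features, labels)
    ]


# Toy per-frame speaker activity (1 = speaking), e.g. from a diarization system.
activity = {"alice": [1, 1, 1, 0, 0, 0], "bob": [0, 0, 1, 1, 1, 0]}
# Stand-in for frame features entering the first transformer block.
features = [[0.0, 0.0] for _ in range(6)]
# Fixed illustrative biases; a real model would learn these during fine-tuning.
biases = {
    SILENCE: [0.0, 0.0],
    TARGET: [1.0, 0.0],
    NON_TARGET: [0.0, 1.0],
    OVERLAP: [1.0, 1.0],
}

# One conditioned input per hypothesized speaker gives speaker-attributed ASR.
conditioned = {
    spk: add_class_bias(features, frame_classes(activity, spk), biases)
    for spk in activity
}
print(frame_classes(activity, "alice"))  # [1, 1, 3, 2, 2, 0]
```

The same audio features yield a different conditioned input for each choice of target, which is what lets one model produce a separate transcript per speaker.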

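This diarization conditioning allows the Whisper model to transcribe the speech of the target speaker more accurately, even in the presence of other speakers. Running the model once per hypothesized speaker yields speaker-attributed transcripts, and the authors report that this single-microphone system outperforms cascades of speech separation and diarization by 11% absolute ORC-WER on the NOTSOFAR-1 dataset.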

Critical Analysis

The paper presents a novel and effective approach for target speaker ASR using the Whisper model. The key strength is the diarization conditioning technique, which handles multi-speaker audio with a very small change to the model: it lets a single-speaker recognizer focus on a specific target.

One potential limitation is that the method relies on having accurate speaker diarization information upfront. If the diarization step fails to correctly identify the target speaker, it could degrade the ASR performance. The authors acknowledge this and suggest further research into joint diarization and target speaker ASR.

Additionally, the headline result is reported on the NOTSOFAR-1 dataset. Further testing across other recording conditions, such as highly overlapping speech, background noise, or accented speakers, would help validate the approach's robustness.

Overall, the paper makes a valuable contribution to the target speaker ASR field by demonstrating a state-of-the-art technique using the powerful Whisper model. The diarization conditioning innovation is an important step towards making multi-speaker ASR more practical and useful.

Conclusion

The paper presents a new method for target speaker ASR that combines the Whisper speech recognition model with a diarization conditioning technique. This approach allows the Whisper model to focus on transcribing the speech of a specific target speaker, even in the presence of other speakers.

The authors show this method achieves state-of-the-art performance on standard benchmarks, which could make it valuable for applications like meeting transcription, virtual assistants, and audio analysis. While the reliance on accurate diarization is a potential limitation, the overall innovation represents an important step forward for multi-speaker ASR.

Related Papers

Target Speaker ASR with Whisper

Alexander Polok, Dominik Klement, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget

We propose a novel approach to enable the use of large, single speaker ASR models, such as Whisper, for target speaker ASR. The key insight of this method is that it is much easier to model relative differences among speakers by learning to condition on frame-level diarization outputs than to learn the space of all speaker embeddings. We find that adding even a single bias term per diarization output type before the first transformer block can transform single speaker ASR models into target speaker ASR models. Our target-speaker ASR model can be used for speaker attributed ASR by producing, in sequence, a transcript for each hypothesized speaker in a diarization output. This simplified model for speaker attributed ASR using only a single microphone outperforms cascades of speech separation and diarization by 11% absolute ORC-WER on the NOTSOFAR-1 dataset.

9/17/2024

Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System

Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng

Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to simultaneously address both tasks. In this study, we propose a pioneering approach to empower Whisper, which is a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. Specifically, (i) we freeze Whisper and plug a Sidecar separator into its encoder to separate mixed embedding for multiple talkers; (ii) a Target Talker Identifier is introduced to identify the embedding flow of the target talker on the fly, requiring only three-second enrollment speech as a cue; (iii) soft prompt tuning for decoder is explored for better task adaptation. Our method outperforms previous methods on two- and three-talker LibriMix and LibriSpeechMix datasets for both tasks, and delivers acceptable zero-shot performance on multi-talker ASR on AishellMix Mandarin dataset.

8/27/2024

Keyword-Guided Adaptation of Automatic Speech Recognition

Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet

Automatic Speech Recognition (ASR) technology has made significant progress in recent years, providing accurate transcription across various domains. However, some challenges remain, especially in noisy environments and specialized jargon. In this paper, we propose a novel approach for improved jargon word recognition by contextual biasing Whisper-based models. We employ a keyword spotting model that leverages the Whisper encoder representation to dynamically generate prompts for guiding the decoder during the transcription process. We introduce two approaches to effectively steer the decoder towards these prompts: KG-Whisper, which is aimed at fine-tuning the Whisper decoder, and KG-Whisper-PT, which learns a prompt prefix. Our results show a significant improvement in the recognition accuracy of specified keywords and in reducing the overall word error rates. Specifically, in unseen language generalization, we demonstrate an average WER improvement of 5.1% over Whisper.

6/6/2024

Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement

Tathagata Bandyopadhyay

Recently, attention-based transformers have become a de facto standard in many deep learning applications including natural language processing, computer vision, signal processing, etc. In this paper, we propose a transformer-based end-to-end model to extract a target speaker's speech from a monaural multi-speaker mixed audio signal. Unlike existing speaker extraction methods, we introduce two additional objectives to impose speaker embedding consistency and waveform encoder invertibility and jointly train both speaker encoder and speech separator to better capture the speaker conditional embedding. Furthermore, we leverage a multi-scale discriminator to refine the perceptual quality of the extracted speech. Our experiments show that the use of a dual path transformer in the separator backbone along with the proposed training paradigm improves the CNN baseline by 3.12 dB points. Finally, we compare our approach with recent state-of-the-art methods and show that our model outperforms existing methods by 4.1 dB points on average without creating additional data dependency.

9/4/2024