Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting

Original paper: arXiv:2408.10463 - Published 8/21/2024 by Hyun Jin Park, Dhruuv Agarwal, Neng Chen, Rentao Sun, Kurt Partridge, Justin Chen, Harry Zhang, Pai Zhu, Jacob Bartel, Kyle Kastner, Gary Wang, Andrew Rosenberg, Quan Wang

Overview

This paper explores using adversarial training to improve keyword spotting models trained on text-to-speech (TTS) generated data, reducing the risk of overfitting to TTS artifacts.
The researchers developed a two-stage training process that first trains the keyword spotting model on a combination of real and TTS-generated speech, and then fine-tunes it using an adversarial training approach.
The goal is to make the model more robust to the characteristics of synthetic speech and better generalize to real-world speech data.

Plain English Explanation

Keyword spotting models are used to detect specific words or phrases in spoken audio. These models are often trained on large datasets of speech recordings. However, collecting real human speech recordings can be expensive and time-consuming.

To address this, researchers have started using text-to-speech (TTS) technology to generate synthetic speech data for training keyword spotting models. While this can be a cost-effective approach, there is a risk that the models may become over-reliant on the unique characteristics of the synthetic speech, and not perform as well on real-world speech.

The paper describes an approach that uses adversarial training to make the keyword spotting model more robust to the differences between real and synthetic speech. The key idea is to train the model in two stages:

First, the model is trained on a mix of real and TTS-generated speech data. This allows it to learn the general patterns of speech.
Then, the model undergoes adversarial training. A second network, the discriminator, tries to tell whether the input speech is real or synthetic, and the keyword spotting model is trained to detect the target keywords while also making that distinction hard to draw. This pushes the model toward features that generalize to real-world speech, rather than relying too heavily on TTS-specific artifacts.

The goal of this two-stage approach is to create a keyword spotting model that performs well on real-world speech, even though it was trained using a significant amount of synthetic data. This can make the development of these models more efficient and cost-effective.

Technical Explanation

The paper proposes a two-stage training approach for keyword spotting models to address the issue of overfitting to TTS-generated data:

Initial Training: The keyword spotting model is first trained on a mix of real speech recordings and TTS-generated speech. This allows the model to learn the general characteristics of speech and the target keywords.
Adversarial Fine-tuning: In the second stage, the model undergoes adversarial training. A discriminator network is introduced that tries to classify the input speech as either real or synthetic. The keyword spotting model is then trained to not only detect the target keywords, but also to fool the discriminator into thinking the input is real speech.
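The two objectives in the adversarial stage can be sketched in a few lines of Python. This is an illustration only, not the paper's implementation: the function names, the choice of binary cross-entropy, and the adv_weight parameter are assumptions, and the networks are replaced by plain lists of predicted probabilities.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    eps = 1e-7
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def discriminator_loss(p_real, is_real):
    """Discriminator objective: score real speech as 1 and TTS speech as 0."""
    return sum(bce(p, y) for p, y in zip(p_real, is_real)) / len(p_real)

def kws_model_loss(p_keyword, has_keyword, p_real, adv_weight=0.5):
    """KWS objective: detect keywords and push the discriminator toward 'real'.

    adv_weight is a hypothetical knob balancing the two terms.
    """
    n = len(p_keyword)
    detect = sum(bce(p, y) for p, y in zip(p_keyword, has_keyword)) / n
    fool = sum(bce(p, 1) for p in p_real) / n  # every input should look real
    return detect + adv_weight * fool

# Example: two clips, one real with the keyword, one synthetic without it.
print(discriminator_loss([0.9, 0.2], [1, 0]))
print(kws_model_loss([0.8, 0.1], [1, 0], [0.9, 0.2]))
```

In a setup like this, the two losses would typically be minimized in alternation: the discriminator improves at separating real from synthetic inputs, and the keyword spotting model is then updated so that separation becomes harder.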

This adversarial training process encourages the keyword spotting model to learn features that are more generalizable to real-world speech, rather than relying on TTS-specific artifacts. The researchers hypothesize that this will improve the model's performance on real speech data, even though it was trained on a significant amount of synthetic data.
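One common way to implement this kind of pressure is gradient reversal: the discriminator descends on its real-versus-synthetic loss, while the feature extractor receives the same gradient with its sign flipped. The summary does not say whether the paper uses this mechanism, so the sketch below is an assumption for illustration. It uses a hypothetical one-weight encoder (w) and one-weight discriminator (v) so the gradients can be written out by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def domain_loss(w, v, x, is_real):
    """Cross-entropy of a one-weight discriminator on the feature f = w * x."""
    p = sigmoid(v * w * x)
    return -(is_real * math.log(p) + (1 - is_real) * math.log(1.0 - p))

def gradients(w, v, x, is_real):
    """Analytic gradients of the domain loss w.r.t. encoder w and discriminator v."""
    err = sigmoid(v * w * x) - is_real  # dL/dz for sigmoid plus cross-entropy
    return err * v * x, err * w * x     # (dL/dw, dL/dv)

def adversarial_step(w, v, x, is_real, lr=0.1, reversal=1.0):
    """One update: the discriminator descends, the encoder gets a reversed gradient."""
    grad_w, grad_v = gradients(w, v, x, is_real)
    new_v = v - lr * grad_v                # discriminator minimizes the domain loss
    new_w = w - lr * (-reversal * grad_w)  # encoder moves to increase it
    return new_w, new_v

# A single synthetic sample (is_real = 0) with made-up starting weights.
w, v, x, is_real = 1.0, 1.0, 2.0, 0
new_w, new_v = adversarial_step(w, v, x, is_real)
print(domain_loss(w, v, x, is_real))      # loss before the step
print(domain_loss(new_w, v, x, is_real))  # encoder update alone raises the loss
print(domain_loss(w, new_v, x, is_real))  # discriminator update alone lowers it
```

The opposing updates are the point of the design: features that make the real-versus-synthetic decision easy are penalized in the encoder, which is how TTS-specific artifacts would be discouraged.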

The researchers evaluate their approach on two keyword spotting benchmarks, comparing the performance of models trained with and without the adversarial fine-tuning stage. The results show that the adversarial training approach improves keyword spotting accuracy on real speech data, by up to 12% according to the paper's abstract, which supports the claim that the method reduces overfitting to TTS artifacts.
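A with-and-without comparison of this kind comes down to scoring both models on the same real-speech test set. The sketch below shows the arithmetic with made-up predictions; the numbers are not from the paper, and the summary does not state which accuracy metric or which kind of improvement (relative or absolute) the authors report, so plain accuracy and relative improvement are assumptions here.

```python
def accuracy(predictions, labels):
    """Fraction of clips where the predicted label matches the true label."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

def relative_improvement(baseline_acc, new_acc):
    """Change in accuracy expressed as a fraction of the baseline accuracy."""
    return (new_acc - baseline_acc) / baseline_acc

# Made-up predictions on ten real-speech clips (1 = keyword present).
labels           = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
baseline_pred    = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
adversarial_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

base_acc = accuracy(baseline_pred, labels)
adv_acc = accuracy(adversarial_pred, labels)
print(base_acc, adv_acc, relative_improvement(base_acc, adv_acc))
```

Evaluating only on real speech matters for this comparison, since a model that has overfit to TTS artifacts could still score well on synthetic test clips.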

Critical Analysis

The paper presents a promising approach to improving the robustness of keyword spotting models trained on TTS-generated data. The key strength of the proposed method is the use of adversarial training to force the model to learn more generalizable features, rather than overfitting to the characteristics of synthetic speech.

One potential limitation is that the effectiveness of the adversarial fine-tuning stage may depend on the quality and diversity of the TTS data used. If the synthetic speech data has significant differences from real-world speech, the adversarial training may not be as effective in bridging that gap.

Additionally, the paper does not investigate how the proposed approach would perform in more complex, real-world scenarios, such as with background noise or multiple speakers. Further research is needed to understand the limitations and potential edge cases of this method.

Overall, the paper makes a valuable contribution by demonstrating a practical way to leverage synthetic speech data for keyword spotting while mitigating the risk of overfitting. The adversarial training approach is an interesting and promising direction for improving the robustness of speech recognition models in resource-constrained settings.

Conclusion

This paper presents an adversarial training approach to improve keyword spotting models trained on text-to-speech (TTS) generated data. By incorporating a discriminator network that tries to classify the input as real or synthetic, the keyword spotting model is encouraged to learn more generalizable features that are less reliant on TTS-specific artifacts.

The results show that this two-stage training process, with an initial training on a mix of real and synthetic speech followed by adversarial fine-tuning, can lead to improved keyword spotting accuracy on real-world speech data. This is a valuable contribution towards making the development of these models more efficient and cost-effective, while maintaining high performance on real-world applications.

Further research is needed to understand the limitations of this approach and how it might perform in more complex, real-world scenarios. However, the use of adversarial training to improve the robustness of speech recognition models is an interesting and promising direction for the field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting

Hyun Jin Park, Dhruuv Agarwal, Neng Chen, Rentao Sun, Kurt Partridge, Justin Chen, Harry Zhang, Pai Zhu, Jacob Bartel, Kyle Kastner, Gary Wang, Andrew Rosenberg, Quan Wang

The keyword spotting (KWS) problem requires large amounts of real speech training data to achieve high accuracy across diverse populations. Utilizing large amounts of text-to-speech (TTS) synthesized data can reduce the cost and time associated with KWS development. However, TTS data may contain artifacts not present in real speech, which the KWS model can exploit (overfit), leading to degraded accuracy on real speech. To address this issue, we propose applying an adversarial training method to prevent the KWS model from learning TTS-specific features when trained on large amounts of TTS data. Experimental results demonstrate that KWS model accuracy on real speech data can be improved by up to 12% when adversarial loss is used in addition to the original KWS loss. Surprisingly, we also observed that the adversarial setup improves accuracy by up to 8%, even when trained solely on TTS and real negative speech data, without any real positive examples.

8/21/2024

Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model

Hyun Jin Park, Dhruuv Agarwal, Neng Chen, Rentao Sun, Kurt Partridge, Justin Chen, Harry Zhang, Pai Zhu, Jacob Bartel, Kyle Kastner, Gary Wang, Andrew Rosenberg, Quan Wang

This paper explores the use of TTS synthesized training data for KWS (keyword spotting) task while minimizing development cost and time. Keyword spotting models require a huge amount of training data to be accurate, and obtaining such training data can be costly. In the current state of the art, TTS models can generate large amounts of natural-sounding data, which can help reducing cost and time for KWS model development. Still, TTS generated data can be lacking diversity compared to real data. To pursue maximizing KWS model accuracy under the constraint of limited resources and current TTS capability, we explored various strategies to mix TTS data and real human speech data, with a focus on minimizing real data use and maximizing diversity of TTS output. Our experimental results indicate that relatively small amounts of real audio data with speaker diversity (100 speakers, 2k utterances) and large amounts of TTS synthesized data can achieve reasonably high accuracy (within 3x error rate of baseline), compared to the baseline (trained with 3.8M real positive utterances).

7/29/2024

Disentangled Training with Adversarial Examples For Robust Small-footprint Keyword Spotting

Zhenyu Wang, Li Wan, Biqiao Zhang, Yiteng Huang, Shang-Wen Li, Ming Sun, Xin Lei, Zhaojun Yang

A keyword spotting (KWS) engine that is continuously running on device is exposed to various speech signals that are usually unseen before. It is a challenging problem to build a small-footprint and high-performing KWS model with robustness under different acoustic environments. In this paper, we explore how to effectively apply adversarial examples to improve KWS robustness. We propose datasource-aware disentangled learning with adversarial examples to reduce the mismatch between the original and adversarial data as well as the mismatch across original training datasources. The KWS model architecture is based on depth-wise separable convolution and a simple attention module. Experimental results demonstrate that the proposed learning strategy improves false reject rate by 40.31% at 1% false accept rate on the internal dataset, compared to the strongest baseline without using adversarial examples. Our best-performing system achieves 98.06% accuracy on the Google Speech Commands V1 dataset.

8/27/2024

Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments

Pai Zhu, Dhruuv Agarwal, Jacob W. Bartel, Kurt Partridge, Hyun Jin Park, Quan Wang

One of the challenges in developing a high quality custom keyword spotting (KWS) model is the lengthy and expensive process of collecting training data covering a wide range of languages, phrases and speaking styles. We introduce Synth4Kws - a framework to leverage Text to Speech (TTS) synthesized data for custom KWS in different resource settings. With no real data, we found increasing TTS phrase diversity and utterance sampling monotonically improves model performance, as evaluated by EER and AUC metrics over 11k utterances of the speech command dataset. In low resource settings, with 50k real utterances as a baseline, we found using optimal amounts of TTS data can improve EER by 30.1% and AUC by 46.7%. Furthermore, we mix TTS data with varying amounts of real data and interpolate the real data needed to achieve various quality targets. Our experiments are based on English and single word utterances but the findings generalize to i18n languages and other keyword types.

7/25/2024