WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Read original: arXiv:2408.16532 - Published 8/30/2024 by Shengpeng Ji, Ziyue Jiang, Xize Cheng, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Ruiqi Li, Ziang Zhang, Xiaoda Yang and 6 others

Overview

The paper introduces WavTokenizer, an acoustic discrete codec tokenizer that compresses audio into discrete tokens for audio language modeling.
By compressing both the number of quantizer layers and the temporal dimension, it represents one second of 24 kHz audio with only 40 or 75 tokens from a single quantizer.
According to the abstract, the design relies on a broader vector quantization (VQ) space, extended contextual windows, improved attention networks, a multi-scale discriminator, and an inverse Fourier transform structure.

Plain English Explanation

The paper presents WavTokenizer, a new way to turn audio into tokens that a language model can work with. An acoustic discrete codec converts a raw audio signal into a compact sequence of discrete codes, or "tokens", that capture the essential features of the audio and can be decoded back into sound.

The key idea behind WavTokenizer is to use vector quantization, with the codec trained to reconstruct its own input rather than to predict labels, so that audio is encoded into very few tokens. This represents the audio in a far more compressed form than the raw waveform while still preserving the important acoustic information.

By using this more efficient encoding, the authors aim to make audio easier to handle for downstream language and generative models, which then see much shorter token sequences. The compact representation can also enable faster processing and lower memory requirements compared to working with the full waveform directly.
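
To make the compression claim concrete, the sketch below works through the token-rate arithmetic. The sampling rate and the token counts come from the paper's abstract; the codebook size of 4096 and the helper names are assumptions made for illustration and are not stated in this summary.

```python
import math

SAMPLE_RATE_HZ = 24_000        # from the abstract
TOKENS_PER_SECOND = (40, 75)   # from the abstract
ASSUMED_CODEBOOK_SIZE = 4096   # assumption for illustration, not stated in this summary


def samples_per_token(sample_rate_hz: int, tokens_per_second: int) -> float:
    """Number of waveform samples summarized by a single token."""
    return sample_rate_hz / tokens_per_second


def bitrate_bps(tokens_per_second: int, codebook_size: int, num_quantizers: int = 1) -> float:
    """Bits per second needed to store the token stream."""
    return tokens_per_second * num_quantizers * math.log2(codebook_size)


for rate in TOKENS_PER_SECOND:
    stride = samples_per_token(SAMPLE_RATE_HZ, rate)
    bps = bitrate_bps(rate, ASSUMED_CODEBOOK_SIZE)
    print(f"{rate} tokens/s: {stride:.0f} samples per token, {bps:.0f} bits/s")
```

Under these assumptions each token stands for 600 or 320 waveform samples. The bitrate figures depend entirely on the assumed codebook size, so they should be read as an illustration of the formula rather than as numbers reported by the paper.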

Technical Explanation

The key technical components of the WavTokenizer approach are:

Vector Quantization: The encoder turns the raw waveform into a sequence of frame-level embeddings, and a vector quantization module maps each embedding to the nearest entry in a learned codebook. The continuous audio signal is thus represented as a sequence of discrete codes, here drawn from a single quantizer.
Self-Supervised Learning: The codebook is trained without any labeled data. The model learns the discrete tokens that best reconstruct the original audio, which helps it capture the essential acoustic features. The abstract also mentions a multi-scale discriminator used during training.
Efficient Architecture: The model operates directly on raw waveform data, with no separate feature extraction step, so it can be integrated into end-to-end speech and audio processing pipelines. The abstract names extended contextual windows, improved attention networks, and an inverse Fourier transform structure as the main architectural changes.
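
The vector quantization step can be sketched as a nearest-neighbor lookup. This is a toy illustration in plain Python, not the authors' implementation: the two-dimensional codebook, the frame values, and the function names are all hypothetical, and in a real codec both the frame embeddings and the codebook are learned by a neural network.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each frame embedding to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Replace each token by its codebook vector (the decoder's input)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical 4-entry codebook and four frame embeddings.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.95, 1.1]]

tokens = quantize(frames, codebook)
print(tokens)                        # [0, 1, 2, 3]
print(dequantize(tokens, codebook))  # the quantized approximation of the frames
```

The token list is what a language model would consume. The gap between `frames` and the dequantized vectors is the information lost to quantization, which the decoder has to compensate for when reconstructing audio.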

The paper evaluates WavTokenizer mainly through reconstruction experiments in the domains of speech, audio, and music, and additionally tests semantic information, VQ utilization, and adaptability to generative models. According to the abstract, WavTokenizer performs strongly on objective and subjective metrics against state-of-the-art codec models, with state-of-the-art reconstruction quality as measured by UTMOS despite using far fewer tokens, and ablation studies support the contribution of each module.

Critical Analysis

The paper provides a thorough technical explanation of the WavTokenizer approach and presents convincing experimental results to support its effectiveness. However, a few potential limitations or areas for further research are worth noting:

Generalization to Diverse Audio Domains: The abstract reports reconstruction experiments on speech, audio, and music. It would still be interesting to see how robust WavTokenizer is under harder conditions, such as conversational speech with heavy background noise.
Interpretability of Discrete Tokens: While the discrete representation learned by WavTokenizer is efficient, it may be challenging to interpret the meaning of the individual tokens. Further analysis of the learned codebook could provide insights into the acoustic features captured by the model.
Comparison to Semantic Tokens: The abstract compares WavTokenizer with state-of-the-art acoustic codec models. A more detailed comparison against semantic tokens obtained by quantizing self-supervised models such as HuBERT would also be valuable.
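
One simple starting point for analyzing a learned codebook is to measure how many entries are actually used and how evenly (the abstract lists VQ utilization among the things the authors tested). The sketch below is a generic illustration, not the paper's evaluation code; the function name and the toy token sequence are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def codebook_usage(tokens: Sequence[int], codebook_size: int) -> Tuple[float, float]:
    """Return (fraction of codebook entries used, entropy of token usage in bits)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    counts = Counter(tokens)
    total = len(tokens)
    utilization = len(counts) / codebook_size
    entropy_bits = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return utilization, entropy_bits


# Hypothetical token sequence over a 4-entry codebook; entry 2 is never used.
utilization, entropy_bits = codebook_usage([0, 1, 1, 3, 0, 1, 3, 3], codebook_size=4)
print(f"utilization={utilization:.2f}, entropy={entropy_bits:.2f} bits")
```

Utilization near 1.0 together with entropy near log2 of the codebook size indicates that the tokens are spread evenly over the codebook. Low values suggest that part of the codebook is going unused.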

Overall, WavTokenizer represents an interesting and promising approach to efficient acoustic representation learning, with potential applications in a variety of speech and audio-based machine learning tasks.

Conclusion

The WavTokenizer paper introduces a method for encoding raw audio into a compact discrete representation using a single vector quantizer trained by reconstruction. This efficient acoustic tokenization can benefit downstream applications built on audio language models, such as generative models for speech and audio, by providing a much shorter input sequence that preserves the essential acoustic features.

The experimental results demonstrate the effectiveness of the WavTokenizer approach, and the paper provides a solid technical foundation for further research and development in this area. As the field of audio AI continues to evolve, techniques like WavTokenizer will likely play an important role in enabling more efficient and versatile speech and audio processing systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Shengpeng Ji, Ziyue Jiang, Xize Cheng, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Wen Wang, Zhou Zhao

Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1) extreme compression. By compressing the layers of quantizers and the temporal dimension of the discrete codec, one-second audio of 24kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2) improved subjective quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information. Specifically, we achieve these results by designing a broader VQ space, extended contextual windows, and improved attention networks, as well as introducing a powerful multi-scale discriminator and an inverse Fourier transform structure. We conducted extensive reconstruction experiments in the domains of speech, audio, and music. WavTokenizer exhibited strong performance across various objective and subjective metrics compared to state-of-the-art models. We also tested semantic information, VQ utilization, and adaptability to generative models. Comprehensive ablation studies confirm the necessity of each module in WavTokenizer. The related code, demos, and pre-trained models are available at https://github.com/jishengpeng/WavTokenizer.

8/30/2024

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli

Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: Semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and Neural compression-based tokens (codecs). Although previous studies have benchmarked codec models to identify optimal configurations, the ideal setup for quantizing pretrained SSL models remains unclear. This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks. We propose a scalable solution to train a universal vocoder across multiple SSL layers. Furthermore, an attention mechanism is employed to identify task-specific influential layers, enhancing the adaptability and performance of semantic tokens in diverse audio applications.

6/18/2024

vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

Yiwei Guo, Zhihan Li, Junjie Li, Chenpeng Du, Hankun Wang, Shuai Wang, Xie Chen, Kai Yu

We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task. To amend the loss of speaker timbre in the content tokens, vec2wav 2.0 utilizes the WavLM features to provide strong timbre-dependent information. A novel adaptive Snake activation function is proposed to better incorporate timbre into the waveform reconstruction process. In this way, vec2wav 2.0 learns to alter the speaker timbre appropriately given different reference prompts. Also, no supervised data is required for vec2wav 2.0 to be effectively trained. Experimental results demonstrate that vec2wav 2.0 outperforms all other baselines to a considerable margin in terms of audio quality and speaker similarity in any-to-any VC. Ablation studies verify the effects made by the proposed techniques. Moreover, vec2wav 2.0 achieves competitive cross-lingual VC even only trained on monolingual corpus. Thus, vec2wav 2.0 shows timbre can potentially be manipulated only by speech token vocoders, pushing the frontiers of VC and speech synthesis.

9/12/2024

LAST: Language Model Aware Speech Tokenization

Arnon Turetzky, Yossi Adi

Speech tokenization serves as the foundation of speech language model (LM), enabling them to perform various tasks such as spoken language modeling, text-to-speech, speech-to-text, etc. Most speech tokenizers are trained independently of the LM training process, relying on separate acoustic models and quantization methods. Following such an approach may create a mismatch between the tokenization process and its usage afterward. In this study, we propose a novel approach to training a speech tokenizer by leveraging objectives from pre-trained textual LMs. We advocate for the integration of this objective into the process of learning discrete speech representations. Our aim is to transform features from a pre-trained speech model into a new feature space that enables better clustering for speech LMs. We empirically investigate the impact of various model design choices, including speech vocabulary size and text LM size. Our results demonstrate the proposed tokenization method outperforms the evaluated baselines considering both spoken language modeling and speech-to-text. More importantly, unlike prior work, the proposed method allows the utilization of a single pre-trained LM for processing both speech and text inputs, setting it apart from conventional tokenization approaches.

9/11/2024