Efficient Autoregressive Audio Modeling via Next-Scale Prediction

Read original: arXiv:2408.09027 - Published 8/20/2024 by Kai Qiu, Xiang Li, Hao Chen, Jie Sun, Jinglu Wang, Zhe Lin, Marios Savvides, Bhiksha Raj

Overview

This paper provides instructions for authors using LaTeX to submit papers anonymously to AAAI Press.
It covers guidelines for preparing an anonymous submission, the camera-ready submission process, and copyright information.

Plain English Explanation

The paper outlines the steps authors should follow when submitting a paper to AAAI Press using the LaTeX typesetting system. The key focus is on ensuring the submission remains anonymous during the review process.

Some of the main points covered include:

Preparing an Anonymous Submission: How to remove identifying information from the paper and LaTeX source files to protect the authors' anonymity.
Camera-Ready Guidelines: Instructions for formatting the final, accepted version of the paper to AAAI's specifications.
Copyright: Information about the copyright transfer process and permissions for using copyrighted material.

The goal is to provide clear, step-by-step guidance to help authors successfully navigate the AAAI submission and publication process.

Technical Explanation

The paper outlines the technical requirements for submitting a paper to AAAI Press using LaTeX. It covers several key elements:

Anonymity: Authors must remove all identifying information from the paper and LaTeX source files to ensure the submission remains anonymous during the review process.
Camera-Ready Formatting: Once a paper is accepted, authors must format the final version to AAAI's specifications, including properly setting page margins, font sizes, and other layout elements.
Copyright: Authors must transfer copyright to AAAI Press and obtain permissions for any copyrighted material used in the paper.

The instructions are detailed and comprehensive, providing specific guidelines for each step of the submission and publication process.

Critical Analysis

The paper provides thorough and well-documented instructions for authors submitting to AAAI Press. The focus on anonymity during the review process is an important consideration to ensure a fair and unbiased evaluation.

However, the instructions are quite technical and detailed, which could be challenging for some authors, especially those new to the LaTeX typesetting system. The paper would benefit from additional examples or visual aids to help illustrate key concepts.

Additionally, the paper does not address any potential limitations or caveats of the submission process. It would be helpful for authors to have a better understanding of common issues or pitfalls to avoid.

Conclusion

This paper offers a comprehensive guide for authors using LaTeX to submit papers to AAAI Press. The detailed instructions cover the critical steps of ensuring anonymity, formatting the camera-ready version, and navigating the copyright process.

While the technical nature of the content may present some challenges, the paper provides a valuable resource for authors seeking to publish their work in AAAI conference proceedings. By following these guidelines, authors can help ensure a smooth and successful submission experience.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Efficient Autoregressive Audio Modeling via Next-Scale Prediction

Kai Qiu, Xiang Li, Hao Chen, Jie Sun, Jinglu Wang, Zhe Lin, Marios Savvides, Bhiksha Raj

Audio generation has achieved remarkable progress with the advance of sophisticated generative models, such as diffusion models (DMs) and autoregressive (AR) models. However, due to the naturally significant sequence length of audio, the efficiency of audio generation remains an essential issue to be addressed, especially for AR models that are incorporated in large language models (LLMs). In this paper, we analyze the token length of audio tokenization and propose a novel Scale-level Audio Tokenizer (SAT), with improved residual quantization. Based on SAT, a scale-level Acoustic AutoRegressive (AAR) modeling framework is further proposed, which shifts the next-token AR prediction to next-scale AR prediction, significantly reducing the training cost and inference time. To validate the effectiveness of the proposed approach, we comprehensively analyze design choices and demonstrate the proposed AAR framework achieves a remarkable 35$\times$ faster inference speed and +1.33 Fréchet Audio Distance (FAD) against baselines on the AudioSet benchmark. Code: https://github.com/qiuk2/AAR.

8/20/2024
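
The abstract above contrasts next-token with next-scale prediction. The sketch below is not the authors' SAT or AAR code; it is a generic multi-scale residual quantizer over a toy 1-D latent, written only to show why the number of autoregressive steps can track the number of scales rather than the number of tokens. The function names, the scale schedule, and the scalar codebook are all assumptions made for illustration.

```python
def downsample(x, n):
    """Average-pool a list down to length n (len(x) must be divisible by n)."""
    k = len(x) // n
    return [sum(x[i * k:(i + 1) * k]) / k for i in range(n)]


def upsample(x, n):
    """Nearest-neighbour upsample a list to length n (n divisible by len(x))."""
    k = n // len(x)
    return [v for v in x for _ in range(k)]


def quantize(x, codebook):
    """Map each value to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda j: abs(codebook[j] - v)) for v in x]


def encode(latent, scales, codebook):
    """Quantize the residual at each scale, coarse to fine; one token map per scale."""
    residual = list(latent)
    tokens = []
    for n in scales:
        idx = quantize(downsample(residual, n), codebook)
        tokens.append(idx)
        approx = upsample([codebook[j] for j in idx], len(latent))
        residual = [r - a for r, a in zip(residual, approx)]
    return tokens


def decode(tokens, length, codebook):
    """Sum the upsampled contributions of every scale."""
    out = [0.0] * length
    for idx in tokens:
        approx = upsample([codebook[j] for j in idx], length)
        out = [o + a for o, a in zip(out, approx)]
    return out


# Hypothetical toy values, chosen only for illustration.
codebook = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0]
latent = [0.9, 0.7, 0.4, 0.1, -0.2, -0.5, -0.3, 0.2]
scales = [1, 2, 4, 8]

tokens = encode(latent, scales, codebook)
next_scale_steps = len(tokens)                   # one AR step per scale
next_token_steps = sum(len(t) for t in tokens)   # one AR step per token
```

With these toy settings the token maps have lengths 1, 2, 4, and 8, so a model that emits a whole scale per step would need 4 steps where a token-by-token model would need 15. How the paper actually chooses its scales and codebooks is not stated in the abstract.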

Autoregressive Diffusion Transformer for Text-to-Speech Synthesis

Zhijun Liu, Shuai Wang, Sho Inoue, Qibing Bai, Haizhou Li

Audio language models have recently emerged as a promising approach for various audio generation tasks, relying on audio tokenizers to encode waveforms into sequences of discrete symbols. Audio tokenization often poses a necessary compromise between code bitrate and reconstruction accuracy. When dealing with low-bitrate audio codes, language models are constrained to process only a subset of the information embedded in the audio, which in turn restricts their generative capabilities. To circumvent these issues, we propose encoding audio as vector sequences in continuous space $\mathbb{R}^d$ and autoregressively generating these sequences using a decoder-only diffusion transformer (ARDiT). Our findings indicate that ARDiT excels in zero-shot text-to-speech and exhibits performance that compares to or even surpasses that of state-of-the-art models. High-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing. Our experiments reveal that employing Integral Kullback-Leibler (IKL) divergence for distillation at each autoregressive step significantly boosts the perceived quality of the samples. Simultaneously, it condenses the iterative sampling process of the diffusion model into a single step. Furthermore, ARDiT can be trained to predict several continuous vectors in one step, significantly reducing latency during sampling. Impressively, one of our models can generate $170$ ms of $24$ kHz speech per evaluation step with minimal degradation in performance. Audio samples are available at http://ardit-tts.github.io/ .

6/11/2024

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

Dongchao Yang, Rongjie Huang, Yuanyuan Wang, Haohan Guo, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng

Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective method for improving the diversity and naturalness of synthesized speech. At a high level, previous large-scale TTS models can be categorized into either Auto-regressive (AR) based (e.g., VALL-E) or Non-auto-regressive (NAR) based models (e.g., NaturalSpeech 2/3). Although these works demonstrate good performance, they still have potential weaknesses. For instance, AR-based models are plagued by unstable generation quality and slow generation speed; meanwhile, some NAR-based models need phoneme-level duration alignment information, thereby increasing the complexity of data pre-processing, model design, and loss design. In this work, we build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2. SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods, offering the following key advantages: (1) simplified data preparation; (2) straightforward model and loss design; and (3) stable, high-quality generation performance with fast inference speed. Compared to our previous publication, we present (i) a detailed analysis of the influence of speech tokenizer and noisy label for TTS performance; (ii) four distinct types of sentence duration predictors; (iii) a novel flow-based scalar latent transformer diffusion model. With these improvements, we show significant gains in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models. Furthermore, we show that SimpleSpeech 2 can be seamlessly extended to multilingual TTS by training it on multilingual speech datasets. Demos are available on: https://dongchaoyang.top/SimpleSpeech2_demo/.

8/29/2024

Parallel Synthesis for Autoregressive Speech Generation

Po-chun Hsu, Da-rong Liu, Andy T. Liu, Hung-yi Lee

Autoregressive neural vocoders have achieved outstanding performance in speech synthesis tasks such as text-to-speech and voice conversion. An autoregressive vocoder predicts a sample at some time step conditioned on those at previous time steps. Though it synthesizes natural human speech, the iterative generation inevitably makes the synthesis time proportional to the utterance length, leading to low efficiency. Many works were dedicated to generating the whole speech sequence in parallel and proposed GAN-based, flow-based, and score-based vocoders. This paper proposes a new approach to autoregressive generation. Instead of iteratively predicting samples in a time sequence, the proposed model performs frequency-wise autoregressive generation (FAR) and bit-wise autoregressive generation (BAR) to synthesize speech. In FAR, a speech utterance is split into frequency subbands, and a subband is generated conditioned on the previously generated one. Similarly, in BAR, an 8-bit quantized signal is generated iteratively from the first bit. By redesigning the autoregressive method to compute in domains other than the time domain, the number of iterations in the proposed model is no longer proportional to the utterance length but to the number of subbands/bits, significantly increasing inference efficiency. Besides, a post-filter is employed to sample signals from output posteriors; its training objective is designed based on the characteristics of the proposed methods. Experimental results show that the proposed model can synthesize speech faster than real-time without GPU acceleration. Compared with baseline vocoders, the proposed model achieves better MUSHRA results and shows good generalization ability for unseen speakers and 44 kHz speech.

6/6/2024
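
The bit-wise idea in the abstract above (BAR) can be illustrated with a small sketch. This is not the paper's model: there is no neural network here, the bit planes are simply read back from a known signal, and treating "the first bit" as the most significant bit is an assumption. The point is only that the loop runs once per bit, whatever the signal length.

```python
def to_bit_planes(samples, bits=8):
    """Split unsigned integer samples into bit planes, most significant bit first."""
    return [[(s >> (bits - 1 - b)) & 1 for s in samples] for b in range(bits)]


def generate_bitwise(planes):
    """Rebuild every sample in len(planes) passes, adding one bit plane per pass."""
    signal = [0] * len(planes[0])
    steps = 0
    for plane in planes:
        # A real BAR model would predict this plane from the planes already
        # generated; here the stored plane stands in for that prediction.
        signal = [(v << 1) | bit for v, bit in zip(signal, plane)]
        steps += 1
    return signal, steps


# Hypothetical 8-bit samples, chosen only for illustration.
samples = [0, 17, 128, 200, 255, 42]
planes = to_bit_planes(samples)
rebuilt, steps = generate_bitwise(planes)
```

Here `rebuilt` equals `samples` and `steps` is 8. A longer signal would widen each plane but leave the number of passes unchanged.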