Preference Alignment Improves Language Model-Based TTS

Read original: arXiv:2409.12403 - Published 9/20/2024 by Jinchuan Tian, Chunlei Zhang, Jiatong Shi, Hao Zhang, Jianwei Yu, Shinji Watanabe, Dong Yu

💬

Overview

Researchers developed a method to improve the performance of language model-based text-to-speech (TTS) systems by aligning the model's preferences with human preferences.
The approach uses human feedback to fine-tune the language model, allowing it to generate speech that better matches user preferences.
Experiments showed this "preference alignment" strategy outperformed standard language model-based TTS on various objective and subjective metrics.

Plain English Explanation

The researchers wanted to make language model-based text-to-speech (TTS) systems sound more natural and pleasing to human listeners. They noticed that while these models can generate fluent speech, the output doesn't always match what people prefer to hear.

To address this, the researchers developed a technique called "preference alignment". The key idea is to fine-tune the language model using feedback from humans. By exposing the model to examples of speech that people rate as high-quality, the researchers were able to nudge the model's preferences to better align with human preferences.

In their experiments, the preference-aligned model outperformed standard language model-based TTS on both objective metrics (like audio quality) and subjective ratings from human listeners. This suggests the preference alignment approach can make TTS systems sound more natural and human-like.

Technical Explanation

The researchers used a language model-based TTS pipeline as their starting point. This involves using a large language model to generate text, which is then converted to audio using a standard TTS system.

To improve the model's performance, they fine-tuned the language model using a human feedback-based reinforcement learning approach. Specifically, they collected ratings from humans on the quality of the generated speech samples. They then used these preference labels to update the language model, encouraging it to generate speech that better matched human preferences.

Experiments showed this "preference alignment" strategy led to significant improvements in both objective audio quality metrics and subjective human ratings, compared to the baseline language model-based TTS system.

Critical Analysis

The researchers note that their preference alignment approach relies on having access to human preference labels, which may not always be available in practice. They suggest exploring ways to learn preferences from more implicit feedback, such as conversational interactions.

Additionally, while the experiments demonstrate the effectiveness of preference alignment, the researchers acknowledge that the assessments were conducted in a constrained setting. Further research is needed to evaluate the approach in more diverse, real-world scenarios and understand its broader implications.

Conclusion

This research presents a promising approach to improving the performance of language model-based TTS systems. By aligning the model's preferences with human preferences through fine-tuning, the researchers were able to generate speech that was rated as more natural and pleasing by listeners.

The findings suggest that incorporating human feedback can be a valuable strategy for enhancing the user experience of TTS technologies, which have numerous applications in areas like virtual assistants, audiobook narration, and speech-based user interfaces.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

💬

Preference Alignment Improves Language Model-Based TTS

Jinchuan Tian, Chunlei Zhang, Jiatong Shi, Hao Zhang, Jianwei Yu, Shinji Watanabe, Dong Yu

Recent advancements in text-to-speech (TTS) have shown that language model (LM)-based systems offer competitive performance to their counterparts. Further optimization can be achieved through preference alignment algorithms, which adjust LMs to align with the preferences of reward models, enhancing the desirability of the generated content. This study presents a thorough empirical evaluation of how preference alignment algorithms, particularly Direct Preference Optimization (DPO), enhance LM-based TTS. With a 1.15B parameter LM-based TTS model, we demonstrate that preference alignment consistently improves intelligibility, speaker similarity, and proxy subjective evaluation scores, with the latter two metrics surpassing even human speech in certain evaluations. We also show preference alignment is applicable to low-resource scenarios and effectively generalized to out-of-domain applications.

9/20/2024

Token-level Direct Preference Optimization

Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang

Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs in a token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.

9/2/2024

Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment

Janghwan Lee, Seongmin Park, Sukjin Hong, Minsoo Kim, Du-Seong Chang, Jungwook Choi

The rapid advancement of large language models (LLMs) has facilitated their transformation into conversational chatbots that can grasp contextual nuances and generate pertinent sentences, closely mirroring human values through advanced techniques such as instruction tuning and reinforcement learning from human feedback (RLHF). However, the computational efficiency required for LLMs, achieved through techniques like post-training quantization (PTQ), presents challenges such as token-flipping that can impair chatbot performance. In response, we propose a novel preference alignment approach, quantization-aware direct preference optimization (QDPO), that aligns quantized LLMs with their full-precision counterparts, improving conversational abilities. Evaluated on two instruction-tuned LLMs in various languages, QDPO demonstrated superior performance in improving conversational abilities compared to established PTQ and knowledge-distillation fine-tuning techniques, marking a significant step forward in the development of efficient and effective conversational LLMs.

7/19/2024

New Desiderata for Direct Preference Optimization

Xiangkun Hu, Tong He, David Wipf

Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, directly fine-tuning for human preferences is achieved via the minimization of a single closed-form training objective, a process originally referred to as direct preference optimization (DPO) and followed by several notable descendants. Although effective in certain real-world settings, we introduce new evaluation criteria that serve to highlight unresolved shortcomings in the ability of existing DPO methods to interpolate between a pre-trained reference model and empirical measures of human preferences, as well as unavoidable trade-offs in how low- and high-quality responses are regularized and constraints are handled. Our insights then motivate an alternative DPO-like loss that provably mitigates these limitations. Empirical results serve to corroborate notable aspects of our analyses.

7/15/2024