MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies

Read original: arXiv:2305.16958 - Published 5/28/2024 by Shiyue Zhang, Shijie Wu, Ozan Irsoy, Steven Lu, Mohit Bansal, Mark Dredze, David Rosenberg

Overview

Autoregressive language models are trained to minimize the cross-entropy between the model distribution and the data distribution, which is equivalent to maximum likelihood estimation (MLE).
The authors observed that models trained this way may over-generalize and produce non-human-like text.
They propose an objective called MixCE that mixes the forward cross-entropy (the cross-entropy of the model distribution Q relative to the data distribution P) and the reverse cross-entropy (the cross-entropy of P relative to Q).
Evaluations on synthetic and real data show that models trained with MixCE generate better text without complex decoding strategies.

Plain English Explanation

Language models are trained to predict the next word in a sequence based on the previous words. This is done by minimizing the cross-entropy, which measures how different the model's predictions are from the actual words in the training data. In other words, the model is trying to learn to generate text that is as similar as possible to the training data.
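As a minimal sketch of the idea above, the snippet below computes the cross-entropy of a toy next-word predictor as the average negative log-probability it assigns to the words that actually occur. The vocabulary, the probabilities, and the function name `next_word_cross_entropy` are illustrative assumptions and are not taken from the paper.

```python
import math

def next_word_cross_entropy(predicted_probs, actual_words):
    """Average negative log-probability of the words that actually occurred.

    predicted_probs: list of dicts, one per position, mapping word -> probability.
    actual_words: list of the observed next words, one per position.
    """
    total = 0.0
    for probs, word in zip(predicted_probs, actual_words):
        total += -math.log(probs[word])
    return total / len(actual_words)

# Hypothetical predictions for two positions in a sentence.
confident = [{"cat": 0.9, "dog": 0.1}, {"sat": 0.8, "ran": 0.2}]
unsure = [{"cat": 0.5, "dog": 0.5}, {"sat": 0.5, "ran": 0.5}]
observed = ["cat", "sat"]

loss_confident = next_word_cross_entropy(confident, observed)
loss_unsure = next_word_cross_entropy(unsure, observed)
print(loss_confident, loss_unsure)
```

The predictor that puts more probability on the observed words gets the lower loss, which is what training by cross-entropy minimization rewards.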

However, the authors found that models trained this way may sometimes generate text that doesn't quite sound human-like. They believe this is because the standard cross-entropy objective doesn't fully capture how humans evaluate text.

To address this, the authors propose a new objective called MixCE, which combines the standard forward cross-entropy (which scores the model's probabilities on real text) with a reverse cross-entropy (which scores the model's own generated text against the distribution of real text). The reverse cross-entropy is meant to better reflect how a human would judge the generated text.

The authors evaluate their MixCE approach on both synthetic data (where the true data distribution is known) and real-world data. They find that models trained with MixCE generate text that is more natural and human-like, without requiring complex decoding strategies.

Technical Explanation

Autoregressive language models are typically trained to minimize the forward cross-entropy between the model distribution Q and the data distribution P. This is equivalent to performing maximum likelihood estimation (MLE) to learn the model parameters.

The authors observed that models trained this way may over-generalize and produce text that does not resemble natural human language. They hypothesize that the reverse cross-entropy, i.e., the cross-entropy of P relative to Q, may be a better reflection of how a human would evaluate the generated text.
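For reference, the two quantities can be written out under the standard definition of cross-entropy. The notation H(·,·) is an assumption introduced here for illustration; only the symbols P and Q come from the text above.

```latex
\begin{align*}
\text{forward cross-entropy:}\quad & H(P, Q) = -\mathbb{E}_{x \sim P}\left[\log Q(x)\right] \\
\text{reverse cross-entropy:}\quad & H(Q, P) = -\mathbb{E}_{x \sim Q}\left[\log P(x)\right]
\end{align*}
```

The forward term averages over samples from the data and scores them with the model, so it is large when the model assigns low probability to real text. The reverse term averages over samples from the model and scores them with the data distribution, so it is large when the model generates text that is unlikely under P.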

To this end, the authors propose an objective called MixCE, which combines the forward and reverse cross-entropies. They evaluate their approach on both synthetic data settings (where the true data distribution P is known) and real-world data. The results show that models trained with MixCE generate text that is more natural and human-like, without requiring complex decoding strategies.
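The following toy example sketches how a mixed objective behaves when P is known, as in the synthetic setting. It is not the authors' implementation: the mixing weight `eta`, the function names, and the three-outcome distributions are all hypothetical, chosen only to illustrate the two terms.

```python
import math

def cross_entropy(weights, scores):
    """-sum_x weights(x) * log scores(x) over a shared discrete support."""
    return -sum(w * math.log(scores[x]) for x, w in weights.items() if w > 0)

def mixed_cross_entropy(p, q, eta):
    """Weighted sum of forward and reverse cross-entropy.

    eta is a hypothetical mixing weight in [0, 1]; eta = 1 recovers the
    forward cross-entropy (the MLE objective).
    """
    forward = cross_entropy(p, q)  # expectation under data P, scored by model Q
    reverse = cross_entropy(q, p)  # expectation under model Q, scored by data P
    return eta * forward + (1.0 - eta) * reverse

# Toy "data" distribution: nearly all mass on two plausible sequences.
p = {"good_a": 0.495, "good_b": 0.495, "bad": 0.01}
# An over-generalizing model that spreads mass onto the unlikely sequence.
q_spread = {"good_a": 0.35, "good_b": 0.35, "bad": 0.30}
# A model that matches the data closely.
q_sharp = {"good_a": 0.49, "good_b": 0.49, "bad": 0.02}

for name, q in [("spread", q_spread), ("sharp", q_sharp)]:
    print(name, cross_entropy(p, q), cross_entropy(q, p),
          mixed_cross_entropy(p, q, eta=0.5))
```

In this toy setup both terms prefer the sharper model, but the reverse term separates the two models more strongly, because the spread model often samples the outcome that P considers unlikely. Computing the reverse term directly like this is only possible because P is given explicitly here.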

Critical Analysis

The paper provides a thoughtful approach to improving the text generation capabilities of autoregressive language models. The central hypothesis, that the reverse cross-entropy may better capture human evaluation of text quality, is interesting and plausible.

However, the paper does not delve deeply into the potential limitations or caveats of the MixCE approach. For example, it would be valuable to understand how the relative weighting of the forward and reverse cross-entropies affects the model's performance and the tradeoffs involved.

Additionally, the authors evaluate their approach on only a limited set of datasets and tasks. It would be helpful to see how MixCE-trained models perform on a wider range of real-world text generation tasks, to better understand the method's broader implications and potential limitations.

Overall, the paper presents an intriguing idea and initial results, but further research and analysis would be valuable to fully assess the merits and drawbacks of the MixCE approach.

Conclusion

The authors propose a novel objective called MixCE that combines the forward and reverse cross-entropies for training autoregressive language models. This approach is motivated by the observation that models trained with standard maximum likelihood estimation may over-generalize and produce text that does not sound human-like.

Evaluations on both synthetic and real-world data show that models trained with MixCE generate text that is more natural and coherent, without requiring complex decoding strategies. This suggests that the MixCE objective may be a promising direction for improving the text generation capabilities of language models.

If further research can address the potential limitations and expand the applications of this approach, the MixCE method could have significant implications for the development of more human-like and versatile language models.

Related Papers

MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies

Shiyue Zhang, Shijie Wu, Ozan Irsoy, Steven Lu, Mohit Bansal, Mark Dredze, David Rosenberg

Autoregressive language models are trained by minimizing the cross-entropy of the model distribution Q relative to the data distribution P -- that is, minimizing the forward cross-entropy, which is equivalent to maximum likelihood estimation (MLE). We have observed that models trained in this way may over-generalize, in the sense that they produce non-human-like text. Moreover, we believe that reverse cross-entropy, i.e., the cross-entropy of P relative to Q, is a better reflection of how a human would evaluate text generated by a model. Hence, we propose learning with MixCE, an objective that mixes the forward and reverse cross-entropies. We evaluate models trained with this objective on synthetic data settings (where P is known) and real data, and show that the resulting models yield better generated text without complex decoding strategies. Our code and models are publicly available at https://github.com/bloomberg/mixce-acl2023

5/28/2024

Entropic Distribution Matching in Supervised Fine-tuning of LLMs: Less Overfitting and Better Diversity

Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Ruoyu Sun, Zhi-Quan Luo

Large language models rely on Supervised Fine-Tuning (SFT) to specialize in downstream tasks. Cross Entropy (CE) loss is the de facto choice in SFT, but it often leads to overfitting and limited output diversity due to its aggressive updates to the data distribution. This paper aims to address these issues by introducing the maximum entropy principle, which favors models with flatter distributions that still effectively capture the data. Specifically, we develop a new distribution matching method called GEM, which solves reverse Kullback-Leibler divergence minimization with an entropy regularizer. For the SFT of Llama-3-8B models, GEM outperforms CE in several aspects. First, when applied to the UltraFeedback dataset to develop general instruction-following abilities, GEM exhibits reduced overfitting, evidenced by lower perplexity and better performance on the IFEval benchmark. Furthermore, GEM enhances output diversity, leading to performance gains of up to 7 points on math reasoning and code generation tasks using best-of-n sampling, even without domain-specific data. Second, when fine-tuning with domain-specific datasets for math reasoning and code generation, GEM also shows less overfitting and improvements of up to 10 points compared with CE.

8/30/2024

Are LLM-based Recommenders Already the Best? Simple Scaled Cross-entropy Unleashes the Potential of Traditional Sequential Recommenders

Cong Xu, Zhangchi Zhu, Mo Yu, Jun Wang, Jianyong Wang, Wei Zhang

Large language models (LLMs) have been garnering increasing attention in the recommendation community. Some studies have observed that LLMs, when fine-tuned by the cross-entropy (CE) loss with a full softmax, could achieve 'state-of-the-art' performance in sequential recommendation. However, most of the baselines used for comparison are trained using a pointwise/pairwise loss function. This inconsistent experimental setting leads to the underestimation of traditional methods and further fosters over-confidence in the ranking capability of LLMs. In this study, we provide theoretical justification for the superiority of the cross-entropy loss by demonstrating its two desirable properties: tightness and coverage. Furthermore, this study sheds light on additional novel insights: 1) Taking into account only the recommendation performance, CE is not yet optimal as it is not a very tight bound in terms of some ranking metrics. 2) In scenarios where full softmax cannot be performed, an effective alternative is to scale up the sampled normalizing term. These findings then help unleash the potential of traditional recommendation models, allowing them to surpass LLM-based counterparts. Given the substantial computational burden, existing LLM-based methods are not as effective as claimed for sequential recommendation. We hope that these theoretical understandings in conjunction with the empirical results will facilitate an objective evaluation of LLM-based recommendation in the future.

8/27/2024

SimCE: Simplifying Cross-Entropy Loss for Collaborative Filtering

Xiaodong Yang, Huiyuan Chen, Yuchen Yan, Yuxin Tang, Yuying Zhao, Eric Xu, Yiwei Cai, Hanghang Tong

The learning objective is integral to collaborative filtering systems, where the Bayesian Personalized Ranking (BPR) loss is widely used for learning informative backbones. However, BPR often experiences slow convergence and suboptimal local optima, partially because it only considers one negative item for each positive item, neglecting the potential impacts of other unobserved items. To address this issue, the recently proposed Sampled Softmax Cross-Entropy (SSM) compares one positive sample with multiple negative samples, leading to better performance. Our comprehensive experiments confirm that recommender systems consistently benefit from multiple negative samples during training. Furthermore, we introduce a Simplified Sampled Softmax Cross-Entropy Loss (SimCE), which simplifies the SSM using its upper bound. Our validation on 12 benchmark datasets, using both MF and LightGCN backbones, shows that SimCE significantly outperforms both BPR and SSM.

6/26/2024