Simplified and Generalized Masked Diffusion for Discrete Data

2406.04329

Published 6/7/2024 by Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, Michalis K. Titsias

Abstract

Masked (or absorbing) diffusion is actively explored as an alternative to autoregressive models for generative modeling of discrete data. However, existing work in this area has been hindered by unnecessarily complex model formulations and unclear relationships between different perspectives, leading to suboptimal parameterization, training objectives, and ad hoc adjustments to counteract these issues. In this work, we aim to provide a simple and general framework that unlocks the full potential of masked diffusion models. We show that the continuous-time variational objective of masked diffusion models is a simple weighted integral of cross-entropy losses. Our framework also enables training generalized masked diffusion models with state-dependent masking schedules. When evaluated by perplexity, our models trained on OpenWebText surpass prior diffusion language models at GPT-2 scale and demonstrate superior performance on 4 out of 5 zero-shot language modeling tasks. Furthermore, our models vastly outperform previous discrete diffusion models on pixel-level image modeling, achieving 2.78 (CIFAR-10) and 3.42 (ImageNet 64×64) bits per dimension that are comparable or better than autoregressive models of similar sizes.

Overview

This paper introduces a simplified and generalized framework for masked (absorbing) diffusion, a class of generative models for discrete data that is being explored as an alternative to autoregressive models.
The authors show that the continuous-time variational objective of these models reduces to a simple weighted integral of cross-entropy losses, and they extend the framework to state-dependent masking schedules.
Experiments on text (OpenWebText) and pixel-level images (CIFAR-10 and ImageNet 64×64) show improved performance over previous discrete diffusion models.

Plain English Explanation

The paper describes a way to train generative models on discrete data, such as text or image pixels, using a technique called masked diffusion. Discrete data can take only a limited set of values, like the characters or words that make up text or the integer intensities of pixels. Masked diffusion gradually replaces parts of the data with a special mask symbol and trains a model to fill the masked parts back in.

The authors argue that earlier masked diffusion work relied on unnecessarily complex formulations, so they developed a simpler and more general framework. The key ideas are:

Simplification: The training objective reduces to a weighted combination of ordinary cross-entropy losses (the standard "predict the missing item" loss), taken over different amounts of masking.
Generalization: The framework also supports state-dependent masking schedules, in which how quickly an item gets masked can depend on the item itself.

By testing their method on text and image benchmarks, the authors show that their models outperform previous discrete diffusion models, demonstrating the value of their innovations.

Technical Explanation

The paper introduces a simple and general framework for masked diffusion models of discrete data. In masked (or absorbing) diffusion, tokens are progressively replaced by a mask state, and a network is trained to reverse that process.

The authors argue that existing work is hindered by unnecessarily complex model formulations and unclear relationships between different perspectives, and they make two main contributions:

Simplification: The authors show that the continuous-time variational objective of masked diffusion models is a simple weighted integral of cross-entropy losses. This gives one clean training objective and avoids the suboptimal parameterizations and ad hoc adjustments that the abstract attributes to earlier formulations.
Generalization: The framework enables training generalized masked diffusion models with state-dependent masking schedules, so the masking schedule no longer has to be shared by all states.
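To make the "weighted integral of cross-entropy losses" concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the linear schedule alpha(t) = 1 - t, the resulting weight 1/t, and every name in the code (MASK, alpha, mask_tokens, weighted_ce_loss, predict_probs) are hypothetical choices made for this example. Consult the paper for the exact form of the objective.

```python
import numpy as np

MASK = -1  # hypothetical id for the special mask state


def alpha(t):
    """Assumed linear schedule: probability that a token is still unmasked at time t."""
    return 1.0 - t


def mask_tokens(x0, t, rng):
    """Forward process: independently replace each token with MASK w.p. 1 - alpha(t)."""
    keep = rng.random(x0.shape) < alpha(t)
    return np.where(keep, x0, MASK)


def weighted_ce_loss(x0, predict_probs, rng, num_samples=64):
    """Monte Carlo estimate of a time-weighted cross-entropy on masked tokens.

    Assumption: the weight at time t is -alpha'(t) / (1 - alpha(t)),
    which equals 1 / t for the linear schedule above.
    """
    total = 0.0
    for _ in range(num_samples):
        t = rng.uniform(1e-3, 1.0)       # stay away from t = 0, where 1 / t blows up
        xt = mask_tokens(x0, t, rng)
        probs = predict_probs(xt, t)     # expected shape: (seq_len, vocab_size)
        masked = xt == MASK
        # Cross-entropy of the true token under the model, per position.
        ce = -np.log(probs[np.arange(len(x0)), x0])
        # Only masked positions contribute to the loss.
        total += (ce * masked).sum() / t
    return total / num_samples


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = np.array([3, 1, 4, 1, 5])
    vocab_size = 8
    # Stand-in "model" that ignores its input and predicts a uniform distribution.
    uniform = lambda xt, t: np.full((len(xt), vocab_size), 1.0 / vocab_size)
    print(weighted_ce_loss(x0, uniform, rng))
```

In a real model, predict_probs would be a neural network that sees the partially masked sequence. The estimate is noisy for small t because of the 1/t weight, which is one reason the choice of time sampling matters in practice.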

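The authors evaluate their models on text (OpenWebText) and on pixel-level image modeling (CIFAR-10 and ImageNet 64×64). Measured by perplexity, the models trained on OpenWebText surpass prior diffusion language models at GPT-2 scale and perform better on 4 out of 5 zero-shot language modeling tasks. On images, they achieve 2.78 (CIFAR-10) and 3.42 (ImageNet 64×64) bits per dimension, which is comparable to or better than autoregressive models of similar sizes.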
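The abstract reports results as perplexity (text) and bits per dimension (images). As a reminder of how both relate to a model's negative log-likelihood, here is a small sketch using the standard definitions. The function names are hypothetical, and the inputs are assumed to be total negative log-likelihoods measured in nats.

```python
import math


def bits_per_dim(total_nll_nats, num_dims):
    """Convert a total negative log-likelihood in nats to bits per dimension."""
    return total_nll_nats / (num_dims * math.log(2))


def perplexity(total_nll_nats, num_tokens):
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(total_nll_nats / num_tokens)


if __name__ == "__main__":
    # A CIFAR-10 image has 32 * 32 * 3 = 3072 dimensions (pixels times channels).
    dims = 32 * 32 * 3
    # Total nats per image that corresponds to the reported 2.78 bits per dimension.
    nll = 2.78 * dims * math.log(2)
    print(nll, bits_per_dim(nll, dims))
```

Lower is better for both metrics, since both are monotone functions of the average negative log-likelihood.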

Critical Analysis

The paper presents a solid contribution to the field of discrete data modeling, but there are a few potential limitations and areas for further research:

Generalization Scope: The reported experiments cover two domains, text and pixel-level images. Further research is needed to validate the framework across a broader range of discrete data applications.
Computational Efficiency: The simplification concerns the model formulation and training objective, and the abstract does not report computational costs or memory requirements relative to previous approaches. This information would be valuable for assessing the practical implications of using these models in real-world scenarios.
Interpretability and Explainability: The paper does not discuss the interpretability or explainability of the learned representations in the simplified and generalized masked diffusion model. Understanding the internal workings and decision-making processes of such models is important for building trust and ensuring responsible deployment in various applications.

Conclusion

This paper presents a simplified and generalized framework for masked diffusion models of discrete data. The authors' key contributions, a training objective expressed as a weighted integral of cross-entropy losses and support for state-dependent masking schedules, yield improved performance over previous discrete diffusion models on text and image benchmarks.

While the paper makes a valuable contribution to the field, further research is needed to fully explore the generalization capabilities, computational efficiency, and interpretability of the proposed approach. Overall, the work represents an important step forward in advancing the state-of-the-art in discrete data modeling and self-supervised learning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Simple and Effective Masked Diffusion Language Models

Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, Volodymyr Kuleshov

While diffusion models excel at generating high-quality images, prior work reports a significant performance gap between diffusion and autoregressive (AR) methods in language modeling. In this work, we show that simple masked discrete diffusion is more performant than previously thought. We apply an effective training recipe that improves the performance of masked diffusion models and derive a simplified, Rao-Blackwellized objective that results in additional improvements. Our objective has a simple form -- it is a mixture of classical masked language modeling losses -- and can be used to train encoder-only language models that admit efficient samplers, including ones that can generate arbitrary lengths of text semi-autoregressively like a traditional language model. On language modeling benchmarks, a range of masked diffusion models trained with modern engineering practices achieves a new state-of-the-art among diffusion models, and approaches AR perplexity. We release our code at: https://github.com/kuleshov-group/mdlm

6/12/2024

cs.CL cs.AI cs.LG

Masked Diffusion as Self-supervised Representation Learner

Zixuan Pan, Jianxu Chen, Yiyu Shi

Denoising diffusion probabilistic models have recently demonstrated state-of-the-art generative performance and have been used as strong pixel-level representation learners. This paper decomposes the interrelation between the generative capability and representation learning ability inherent in diffusion models. We present the masked diffusion model (MDM), a scalable self-supervised representation learner for semantic segmentation, substituting the conventional additive Gaussian noise of traditional diffusion with a masking mechanism. Our proposed approach convincingly surpasses prior benchmarks, demonstrating remarkable advancements in both medical and natural image semantic segmentation tasks, particularly in few-shot scenarios.

4/16/2024

cs.CV

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

Aaron Lou, Chenlin Meng, Stefano Ermon

Despite their groundbreaking performance for many generative modeling tasks, diffusion models have fallen short on discrete data domains such as natural language. Crucially, standard diffusion models rely on the well-established theory of score matching, but efforts to generalize this to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel loss that naturally extends score matching to discrete spaces, integrates seamlessly to build discrete diffusion models, and significantly boosts performance. Experimentally, we test our Score Entropy Discrete Diffusion models (SEDD) on standard language modeling tasks. For comparable model sizes, SEDD beats existing language diffusion paradigms (reducing perplexity by 25-75%) and is competitive with autoregressive models, in particular outperforming GPT-2. Furthermore, compared to autoregressive models, SEDD generates faithful text without requiring distribution annealing techniques like temperature scaling (around 6-8× better generative perplexity than un-annealed GPT-2), can trade compute and quality (similar quality with 32× fewer network evaluations), and enables controllable infilling (matching nucleus sampling quality while enabling other strategies besides left to right prompting).

6/10/2024

stat.ML cs.CL cs.LG

Improving Discrete Diffusion Models via Structured Preferential Generation

Severi Rissanen, Markus Heinonen, Arno Solin

In the domains of image and audio, diffusion models have shown impressive performance. However, their application to discrete data types, such as language, has often been suboptimal compared to autoregressive generative models. This paper tackles the challenge of improving discrete diffusion models by introducing a structured forward process that leverages the inherent information hierarchy in discrete categories, such as words in text. Our approach biases the generative process to produce certain categories before others, resulting in a notable improvement in log-likelihood scores on the text8 dataset. This work paves the way for more advances in discrete diffusion models with potentially significant enhancements in performance.

5/29/2024

cs.LG