Dirichlet Flow Matching with Applications to DNA Sequence Design

Read original: arXiv:2402.05841 - Published 6/3/2024 by Hannes Stark, Bowen Jing, Chenyu Wang, Gabriele Corso, Bonnie Berger, Regina Barzilay, Tommi Jaakkola

Overview

The paper explores the use of discrete diffusion or flow models for faster and more controllable sequence generation, in contrast to autoregressive models.
The authors identify issues with a naive linear flow matching approach on the simplex, including discontinuities in the training target and other pathologies.
To address these challenges, the authors develop Dirichlet flow matching on the simplex based on mixtures of Dirichlet distributions as probability paths.
The paper presents a connection between the mixtures' scores and the flow's vector field, enabling both classifier and classifier-free guidance.
The authors also introduce distilled Dirichlet flow matching, which enables one-step sequence generation with minimal performance hits, resulting in $O(L)$ speedups over autoregressive models for sequences of length $L$.
The method is demonstrated on complex DNA sequence generation tasks, outperforming baselines in distributional metrics and achieving desired design targets for generated sequences.
The paper also shows that the classifier-free guidance approach improves unconditional generation and is effective for generating DNA that satisfies design targets.

Plain English Explanation

The researchers in this paper are looking for ways to generate sequences, such as DNA sequences, more quickly and with more control than autoregressive models, which are currently the standard approach. Autoregressive models generate a sequence one element at a time, so generation time grows with the length of the sequence.

The researchers first examined a simpler approach called "linear flow matching" and found that it has problems, including training targets that change discontinuously. To fix this, they developed a new method called "Dirichlet flow matching." It treats each position in a sequence as a point on the simplex (the set of probability vectors over the possible symbols) and uses mixtures of Dirichlet distributions to describe how random noise is gradually turned into a sequence.

This new method allows the researchers to connect the "scores" of the Dirichlet distributions to the "vector field" of the flow, which means they can either use a separate classifier model or avoid using one entirely (called "classifier-free guidance") to guide the sequence generation. This gives them more flexibility and control over the generated sequences.

The researchers also came up with a way to "distill" the Dirichlet flow matching method, which allows them to generate a whole sequence in a single step rather than one element at a time. This costs only a small amount of quality and gives a speed-up over autoregressive models that grows with the length of the sequence.

When the researchers tested this method on complex DNA sequence generation tasks, it outperformed the baselines both in how closely the generated sequences matched the distribution of real sequences and in how well they met the desired design targets. They also showed that the classifier-free guidance approach improved the quality of unconditionally generated sequences (those generated without any specific target).

Technical Explanation

The authors propose using discrete diffusion or flow models for faster and more controllable sequence generation compared to autoregressive models. They identify issues with a naive linear flow matching approach on the simplex, including discontinuities in the training target and other pathologies.
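One way to see a pathology of the linear path is to sample it directly. The sketch below is illustrative and not taken from the paper's code: it assumes uniform Dirichlet noise $x_0$ on the simplex, a one-hot target $x_1 = e_i$, and the interpolation $x_t = (1-t)x_0 + t x_1$. The function names are hypothetical. Under these assumptions the target coordinate always satisfies $x_{t,i} \ge t$, so the set of classes that could have produced a given $x_t$ shrinks abruptly as $t$ grows.

```python
import numpy as np

def linear_path_sample(rng, target, t, num_classes, n):
    """Sample x_t = (1 - t) * x_0 + t * e_target with x_0 ~ Dirichlet(1, ..., 1)."""
    x0 = rng.dirichlet(np.ones(num_classes), size=n)  # uniform noise on the simplex
    x1 = np.eye(num_classes)[target]                  # one-hot vertex of the target class
    return (1.0 - t) * x0 + t * x1

def compatible_classes(x, t, tol=1e-12):
    """Class j can have produced x at time t only if x_j >= t."""
    return x >= t - tol

rng = np.random.default_rng(0)
K, target = 4, 2  # illustrative values (DNA has 4 bases)
for t in (0.1, 0.3, 0.6):
    xt = linear_path_sample(rng, target, t, K, n=10_000)
    mask = compatible_classes(xt, t)
    print(f"t={t:.1f}  mean number of compatible classes: {mask.sum(axis=1).mean():.2f}")
```

For any $t > 1/2$ at most one coordinate can be at least $t$, so under this path the class is fully determined well before the end of the trajectory.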

To address these challenges, the authors develop Dirichlet flow matching on the simplex based on mixtures of Dirichlet distributions as probability paths. This framework allows the authors to derive a connection between the mixtures' scores and the flow's vector field, enabling both classifier and classifier-free guidance.
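A minimal sketch of what a Dirichlet probability path can look like follows. It assumes the conditional path toward class $i$ is $\mathrm{Dir}(\alpha)$ with $\alpha = \mathbf{1} + t\,e_i$ and $t$ running from 0 upward; this summary does not spell out the parameterization, so treat that form as an assumption and the function names as hypothetical. With this choice every sample has full support on the simplex for any finite $t$, while the mass concentrates smoothly on the target vertex.

```python
import numpy as np

def dirichlet_path_sample(rng, target, t, num_classes, n):
    """Sample from Dir(alpha) with alpha = 1 + t * e_target (assumed path form)."""
    alpha = np.ones(num_classes)
    alpha[target] += t
    return rng.dirichlet(alpha, size=n)

def expected_target_mass(t, num_classes):
    """Mean of the target coordinate under Dir(1 + t * e_target): (1 + t) / (K + t)."""
    return (1.0 + t) / (num_classes + t)

rng = np.random.default_rng(0)
K, target = 4, 2  # illustrative values
for t in (0.0, 2.0, 8.0, 32.0):
    x = dirichlet_path_sample(rng, target, t, K, n=20_000)
    print(f"t={t:5.1f}  empirical mean of target coordinate={x[:, target].mean():.3f}  "
          f"closed form={expected_target_mass(t, K):.3f}")
```

At $t = 0$ the path is the uniform distribution on the simplex, and no coordinate is ever forced above a hard threshold, in contrast to the linear path.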

Furthermore, the authors introduce distilled Dirichlet flow matching, which enables one-step sequence generation with minimal performance hits, resulting in $O(L)$ speedups compared to autoregressive models, where $L$ is the sequence length.
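The $O(L)$ figure can be read as a count of sequential network calls. As a rough illustration, assume each forward pass costs about the same time $c$ (a hypothetical constant, not a measured value): an autoregressive model needs $L$ sequential passes, while a one-step model needs one.

```latex
\text{speedup} \;=\; \frac{T_{\text{autoregressive}}}{T_{\text{one-step}}}
\;\approx\; \frac{L \cdot c}{1 \cdot c} \;=\; L
```

For example, under this assumption a sequence of length $L = 1024$ would need 1024 sequential passes autoregressively and a single pass with the distilled model.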

The authors demonstrate the efficacy of their approach on complex DNA sequence generation tasks, where they outperform all baselines in distributional metrics and in achieving desired design targets for the generated sequences. They also show that their classifier-free guidance approach improves unconditional generation and is effective for generating DNA that satisfies design targets.
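Classifier-free guidance is commonly written as a weighted combination of a conditional and an unconditional prediction. The sketch below applies that standard combination to vector fields on the simplex. It is a simplification under stated assumptions, not the paper's procedure: the conditional field toward class $i$ is taken to be $c\,(e_i - x)$ with a constant placeholder coefficient $c$, and the class probabilities stand in for hypothetical model outputs.

```python
import numpy as np

def marginal_field(x, class_probs, coeff=1.0):
    """Posterior-weighted sum of conditional fields coeff * (e_i - x).

    coeff is a constant placeholder for the path-dependent scaling,
    which this sketch does not attempt to reproduce.
    """
    # sum_i p_i * coeff * (e_i - x) = coeff * (p - x), because the p_i sum to 1
    return coeff * (class_probs - x)

def cfg_field(v_cond, v_uncond, gamma):
    """Standard classifier-free guidance mix: gamma * v_cond + (1 - gamma) * v_uncond."""
    return gamma * v_cond + (1.0 - gamma) * v_uncond

x = np.array([0.25, 0.25, 0.25, 0.25])         # current point on the simplex
p_cond = np.array([0.70, 0.10, 0.10, 0.10])    # hypothetical conditional model output
p_uncond = np.array([0.40, 0.30, 0.20, 0.10])  # hypothetical unconditional model output

v = cfg_field(marginal_field(x, p_cond), marginal_field(x, p_uncond), gamma=2.0)
x_next = x + 0.05 * v                          # one Euler step along the guided field
print(v, x_next.sum())
```

Because both fields sum to zero across coordinates, their combination does too, so an Euler step keeps the coordinates summing to one. A guidance weight above 1 pushes further toward the conditional prediction; large weights or step sizes can still drive coordinates negative in this simplified form.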

Separately from this paper's contributions, related work on flow matching reports minimax-optimal convergence rates under certain conditions, and Markovian flow matching uses continuous normalizing flows to accelerate MCMC sampling.

Critical Analysis

The paper presents a novel approach to discrete sequence generation using Dirichlet flow matching, which addresses some of the limitations of naive linear flow matching. The authors provide a thorough technical explanation and demonstrate promising results on DNA sequence generation tasks.

One potential concern is the reliance on the Dirichlet distribution as the underlying probability model. While this choice allows for the derivation of the vector field connection and enables the distilled generation approach, it may limit the expressive power of the model compared to more flexible distributions. It would be interesting to see if the general framework can be extended to other types of probability distributions.

Additionally, the paper focuses on DNA sequence generation, which has its own unique challenges and characteristics. It would be valuable to evaluate the method's performance on a broader range of sequence generation tasks, such as text or protein sequences, to assess its generalizability.

The authors mention potential issues with discontinuities in the training target and other pathologies of the naive linear flow matching approach. While the Dirichlet flow matching resolves these problems, a deeper analysis of the underlying causes and potential mitigation strategies could further strengthen the conceptual understanding and robustness of the proposed method.

Finally, the paper does not provide a comprehensive comparison to other state-of-the-art sequence generation techniques, such as transformer-based models or variational autoencoders. A more extensive benchmarking against a wider range of baselines would help to better position the contributions of the Dirichlet flow matching approach.

Conclusion

The paper introduces a novel Dirichlet flow matching approach for discrete sequence generation, which addresses the limitations of naive linear flow matching and enables faster and more controllable sequence generation compared to autoregressive models. The authors demonstrate the method's effectiveness on complex DNA sequence generation tasks and show the potential of classifier-free guidance for improving unconditional generation.

While the paper provides a strong technical foundation and promising results, further research is needed to explore the generalizability of the method, address potential limitations, and conduct more comprehensive comparisons to other state-of-the-art sequence generation techniques. Nevertheless, the Dirichlet flow matching framework represents an interesting and valuable contribution to the field of generative modeling over discrete spaces.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Dirichlet Flow Matching with Applications to DNA Sequence Design

Hannes Stark, Bowen Jing, Chenyu Wang, Gabriele Corso, Bonnie Berger, Regina Barzilay, Tommi Jaakkola

Discrete diffusion or flow models could enable faster and more controllable sequence generation than autoregressive models. We show that naive linear flow matching on the simplex is insufficient toward this goal since it suffers from discontinuities in the training target and further pathologies. To overcome this, we develop Dirichlet flow matching on the simplex based on mixtures of Dirichlet distributions as probability paths. In this framework, we derive a connection between the mixtures' scores and the flow's vector field that allows for classifier and classifier-free guidance. Further, we provide distilled Dirichlet flow matching, which enables one-step sequence generation with minimal performance hits, resulting in $O(L)$ speedups compared to autoregressive models. On complex DNA sequence generation tasks, we demonstrate superior performance compared to all baselines in distributional metrics and in achieving desired design targets for generated sequences. Finally, we show that our classifier-free guidance approach improves unconditional generation and is effective for generating DNA that satisfies design targets. Code is available at https://github.com/HannesStark/dirichlet-flow-matching.

6/3/2024

Fisher Flow Matching for Generative Modeling over Discrete Data

Oscar Davis, Samuel Kessler, Mircea Petrache, İsmail İlkan Ceylan, Michael Bronstein, Avishek Joey Bose

Generative modeling over discrete data has recently seen numerous success stories, with applications spanning language modeling, biological sequence design, and graph-structured molecular data. The predominant generative modeling paradigm for discrete data is still autoregressive, with more recent alternatives based on diffusion or flow-matching falling short of their impressive performance in continuous data settings, such as image or video generation. In this work, we introduce Fisher-Flow, a novel flow-matching model for discrete data. Fisher-Flow takes a manifestly geometric perspective by considering categorical distributions over discrete data as points residing on a statistical manifold equipped with its natural Riemannian metric: the $\textit{Fisher-Rao metric}$. As a result, we demonstrate discrete data itself can be continuously reparameterised to points on the positive orthant of the $d$-hypersphere $\mathbb{S}^d_+$, which allows us to define flows that map any source distribution to target in a principled manner by transporting mass along (closed-form) geodesics of $\mathbb{S}^d_+$. Furthermore, the learned flows in Fisher-Flow can be further bootstrapped by leveraging Riemannian optimal transport leading to improved training dynamics. We prove that the gradient flow induced by Fisher-Flow is optimal in reducing the forward KL divergence. We evaluate Fisher-Flow on an array of synthetic and diverse real-world benchmarks, including designing DNA Promoter, and DNA Enhancer sequences. Empirically, we find that Fisher-Flow improves over prior diffusion and flow-matching models on these benchmarks.

5/30/2024

Discrete Flow Matching

Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, Yaron Lipman

Despite Flow Matching and diffusion models having emerged as powerful generative paradigms for continuous variables such as images and videos, their application to high-dimensional discrete data, such as language, is still limited. In this work, we present Discrete Flow Matching, a novel discrete flow paradigm designed specifically for generating discrete data. Discrete Flow Matching offers several key contributions: (i) it works with a general family of probability paths interpolating between source and target distributions; (ii) it allows for a generic formula for sampling from these probability paths using learned posteriors such as the probability denoiser ($x$-prediction) and noise-prediction ($\epsilon$-prediction); (iii) practically, focusing on specific probability paths defined with different schedulers considerably improves generative perplexity compared to previous discrete diffusion and flow models; and (iv) by scaling Discrete Flow Matching models up to 1.7B parameters, we reach 6.7% Pass@1 and 13.4% Pass@10 on HumanEval and 6.7% Pass@1 and 20.6% Pass@10 on 1-shot MBPP coding benchmarks. Our approach is capable of generating high-quality discrete data in a non-autoregressive fashion, significantly closing the gap between autoregressive models and discrete flow models.

7/23/2024

Mixed Continuous and Categorical Flow Matching for 3D De Novo Molecule Generation

Ian Dunn, David Ryan Koes

Deep generative models that produce novel molecular structures have the potential to facilitate chemical discovery. Diffusion models currently achieve state of the art performance for 3D molecule generation. In this work, we explore the use of flow matching, a recently proposed generative modeling framework that generalizes diffusion models, for the task of de novo molecule generation. Flow matching provides flexibility in model design; however, the framework is predicated on the assumption of continuously-valued data. 3D de novo molecule generation requires jointly sampling continuous and categorical variables such as atom position and atom type. We extend the flow matching framework to categorical data by constructing flows that are constrained to exist on a continuous representation of categorical data known as the probability simplex. We call this extension SimplexFlow. We explore the use of SimplexFlow for de novo molecule generation. However, we find that, in practice, a simpler approach that makes no accommodations for the categorical nature of the data yields equivalent or superior performance. As a result of these experiments, we present FlowMol, a flow matching model for 3D de novo generative model that achieves improved performance over prior flow matching methods, and we raise important questions about the design of prior distributions for achieving strong performance in flow matching models. Code and trained models for reproducing this work are available at https://github.com/dunni3/FlowMol

5/1/2024