dipeptide
Path Gradients after Flow Matching
Boltzmann Generators have emerged as a promising machine learning tool for generating samples from equilibrium distributions of molecular systems using Normalizing Flows and importance weighting. Recently, Flow Matching has helped speed up Continuous Normalizing Flows (CNFs), scale them to more complex molecular systems, and minimize the length of the flow integration trajectories. We investigate the benefits of using Path Gradients to fine-tune CNFs initially trained by Flow Matching, in a setting where the target energy is known. Our experiments show that this hybrid approach yields up to a threefold increase in sampling efficiency for molecular systems, all while using the same model, a similar computational budget and without the need for additional sampling. Furthermore, by measuring the length of the flow trajectories during fine-tuning, we show that Path Gradients largely preserve the learned structure of the flow.
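As a rough illustration of the importance weighting and "sampling efficiency" mentioned in this abstract, the sketch below computes unnormalized log importance weights and a relative effective sample size. The abstract does not say which efficiency metric is used; the Kish effective sample size is assumed here, and the function names and the `kT` parameter are hypothetical.

```python
import math

def log_importance_weights(energies, log_q, kT=1.0):
    """Unnormalized log weights: log w_i = -U(x_i)/kT - log q(x_i).

    energies: target energies U(x_i) of the generated samples
    log_q:    model log-likelihoods log q(x_i) of the same samples
    """
    return [-u / kT - lq for u, lq in zip(energies, log_q)]

def relative_ess(log_w):
    """Kish effective sample size divided by the number of samples.

    Assumed metric: (sum w)^2 / (N * sum w^2), a value in (0, 1].
    """
    m = max(log_w)                           # shift for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    s = sum(w)
    s2 = sum(x * x for x in w)
    return (s * s) / (s2 * len(w))
```

Equal weights give a relative ESS of 1, while a single dominant weight among N samples gives roughly 1/N.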
Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models
Michael Plainer, Hao Wu, Leon Klein, Stephan Günnemann, Frank Noé
In recent years, diffusion models trained on equilibrium molecular distributions have proven effective for sampling biomolecules. Beyond direct sampling, the score of such a model can also be used to derive the forces that act on molecular systems. However, while classical diffusion sampling usually recovers the training distribution, the corresponding energy-based interpretation of the learned score is often inconsistent with this distribution, even for low-dimensional toy systems. We trace this inconsistency to inaccuracies of the learned score at very small diffusion timesteps, where the model must capture the correct evolution of the data distribution. In this regime, diffusion models fail to satisfy the Fokker-Planck equation, which governs the evolution of the score. We interpret this deviation as one source of the observed inconsistencies and propose an energy-based diffusion model with a Fokker-Planck-derived regularization term to enforce consistency. We demonstrate our approach by sampling and simulating multiple biomolecular systems, including fast-folding proteins, and by introducing a state-of-the-art transferable Boltzmann emulator for dipeptides that supports simulation and achieves improved consistency and efficient sampling.
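The link between score and forces mentioned in this abstract can be sketched for the noise-free case: if p(x) is proportional to exp(-U(x)/kT), then the score is grad log p(x) = -grad U(x)/kT, so the force is F(x) = kT * score(x). The example below checks this on a 1D harmonic potential U(x) = k x^2 / 2. The function names are hypothetical, and a learned score at a small diffusion time is only assumed to approximate this exact score.

```python
def force_from_score(score, kT):
    """Convert a score (gradient of log p) into a force: F = kT * score."""
    return [kT * s for s in score]

def harmonic_score(x, k, kT):
    """Exact score of the Boltzmann distribution for U(x) = 0.5 * k * x^2.

    log p(x) = -0.5 * k * x^2 / kT + const, so d/dx log p(x) = -k * x / kT.
    """
    return [-k * xi / kT for xi in x]

# The recovered force should equal the analytic force -k * x.
positions = [1.0, -2.0]
forces = force_from_score(harmonic_score(positions, k=3.0, kT=0.5), kT=0.5)
```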
From Biased to Unbiased Dynamics: An Infinitesimal Generator Approach
To overcome this bottleneck, data are collected via biased simulations that explore the state space more rapidly. We propose a framework for learning from biased simulations rooted in the infinitesimal generator of the process and the associated resolvent operator. We contrast our approach to more common ones based on the transfer operator, showing that it can provably learn the spectral properties of the unbiased system from biased data.
BoltzNCE: Learning Likelihoods for Boltzmann Generation with Stochastic Interpolants and Noise Contrastive Estimation
Aggarwal, Rishal, Chen, Jacky, Boffi, Nicholas M., Koes, David Ryan
Efficient sampling from the Boltzmann distribution given its energy function is a key challenge for modeling complex physical systems such as molecules. Boltzmann Generators address this problem by leveraging continuous normalizing flows to transform a simple prior into a distribution that can be reweighted to match the target using sample likelihoods. Despite the elegance of this approach, obtaining these likelihoods requires computing costly Jacobians during integration, which is impractical for large molecular systems. To overcome this difficulty, we train an energy-based model (EBM) to approximate likelihoods using both noise contrastive estimation (NCE) and score matching, which we show outperforms the use of either objective in isolation. On 2d synthetic systems where failure can be easily visualized, NCE improves mode weighting relative to score matching alone. On alanine dipeptide, our method yields free energy profiles and energy distributions that closely match those obtained using exact likelihoods while achieving $100\times$ faster inference. By training on multiple dipeptide systems, we show that our approach also exhibits effective transfer learning, generalizing to new systems at inference time and achieving at least a $6\times$ speedup over standard MD. While many recent efforts in generative modeling have prioritized models with fast sampling, our work demonstrates the design of models with accelerated likelihoods, enabling the application of reweighting schemes that ensure unbiased Boltzmann statistics at scale. Our code is available at https://github.com/RishalAggarwal/BoltzNCE.
PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding (Supplementary Material)
For example, the feature of dipeptide "st" is defined by its dipeptide composition ( The Moran feature descriptor defines the distribution of amino acid properties along a protein sequence. It should be noted that there are evident class imbalances in two multi-class classification tasks. Table 1: Balanced metric (weighted F1) compared with accuracy on multi-class classification tasks. We report mean (std) for each experiment. Used as a feature extractor with pre-trained weights frozen.
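The dipeptide composition feature mentioned above is cut off in this text, so its exact formula cannot be recovered here. A minimal sketch, assuming the commonly used definition (the number of occurrences of an adjacent residue pair divided by the number of adjacent pairs, N - 1), might look as follows; the function name is hypothetical.

```python
from collections import Counter

def dipeptide_composition(sequence):
    """Fraction of each adjacent residue pair in a protein sequence.

    Assumed definition: count(pair) / (N - 1), where N is the sequence
    length. Pairs that never occur are simply absent from the result.
    """
    seq = sequence.upper()
    n_pairs = len(seq) - 1
    if n_pairs < 1:
        return {}                            # no adjacent pairs to count
    counts = Counter(seq[i:i + 2] for i in range(n_pairs))
    return {pair: c / n_pairs for pair, c in counts.items()}
```

For the toy sequence "STST" the adjacent pairs are ST, TS and ST, so the feature for "ST" is 2/3 and the feature for "TS" is 1/3.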