Wang, Zun
Enhancing the Scalability and Applicability of Kohn-Sham Hamiltonians for Molecular Systems
Li, Yunyang, Xia, Zaishuo, Huang, Lin, Wei, Xinran, Yang, Han, Harshe, Sam, Wang, Zun, Liu, Chang, Zhang, Jia, Shao, Bin, Gerstein, Mark B.
Density Functional Theory (DFT) is a pivotal method within quantum chemistry and materials science, with its core involving the construction and solution of the Kohn-Sham Hamiltonian. Despite its importance, the application of DFT is frequently limited by the substantial computational resources required to construct the Kohn-Sham Hamiltonian. In response to these limitations, current research has employed deep-learning models to efficiently predict molecular and solid Hamiltonians, with roto-translational symmetries encoded in their neural networks. However, the scalability of prior models may be problematic when applied to large molecules, resulting in non-physical predictions of ground-state properties. In this study, we generate a substantially larger training set (PubChemQH) than used previously and use it to create a scalable model for DFT calculations with physical accuracy. For our model, we introduce a loss function derived from physical principles, which we call Wavefunction Alignment Loss (WALoss). WALoss involves performing a basis change on the predicted Hamiltonian to align it with the observed one; thus, the resulting differences can serve as a surrogate for orbital energy differences, allowing models to make better predictions for molecular orbitals and total energies than previously possible. WALoss also substantially accelerates self-consistent-field (SCF) DFT calculations. Here, we show it achieves a reduction in total energy prediction error by a factor of 1347 and an SCF calculation speed-up by a factor of 18%. These substantial improvements set new benchmarks for achieving accurate and applicable predictions in larger molecular systems.
Potential Score Matching: Debiasing Molecular Structure Sampling with Potential Energy Guidance
Guo, Liya, Wang, Zun, Liu, Chang, Li, Junzhe, Hu, Pipi, Zhu, Yi
The ensemble average of physical properties of molecules is closely related to the distribution of molecular conformations, and sampling such distributions is a fundamental challenge in physics and chemistry. Traditional methods like molecular dynamics (MD) simulations and Markov chain Monte Carlo (MCMC) sampling are commonly used but can be time-consuming and costly. Recently, diffusion models have emerged as efficient alternatives by learning the distribution of training data. Obtaining an unbiased target distribution is still an expensive task, primarily because it requires satisfying ergodicity. To tackle these challenges, we propose Potential Score Matching (PSM), an approach that utilizes the potential energy gradient to guide generative models. PSM does not require exact energy functions and can debias sample distributions even when trained on limited and biased data. Our method outperforms existing state-of-the-art (SOTA) models on the Lennard-Jones (LJ) potential, a commonly used toy model. Furthermore, we extend the evaluation of PSM to high-dimensional problems using the MD17 and MD22 datasets. The results demonstrate that molecular distributions generated by PSM more closely approximate the Boltzmann distribution compared to traditional diffusion models.
UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion
Zhang, Gongbo, Li, Yanting, Luo, Renqian, Hu, Pipi, Zhao, Zeru, Li, Lingbo, Liu, Guoqing, Wang, Zun, Bi, Ran, Gao, Kaiyuan, Guo, Liya, Xie, Yu, Liu, Chang, Zhang, Jia, Xie, Tian, Pinsler, Robert, Zeni, Claudio, Lu, Ziheng, Xia, Yingce, Segler, Marwin, Riechert, Maik, Yuan, Li, Chen, Lei, Liu, Haiguang, Qin, Tao
Unified generation of sequence and structure for scientific data (e.g., materials, molecules, proteins) is a critical task. Existing approaches primarily rely on either autoregressive sequence models or diffusion models, each offering distinct advantages and facing notable limitations. Autoregressive models, such as GPT, Llama, and Phi-4, have demonstrated remarkable success in natural language generation and have been extended to multimodal tasks (e.g., image, video, and audio) using advanced encoders like VQ-VAE to represent complex modalities as discrete sequences. However, their direct application to scientific domains is challenging due to the high precision requirements and the diverse nature of scientific data. On the other hand, diffusion models excel at generating high-dimensional scientific data, such as protein, molecule, and material structures, with remarkable accuracy. Yet, their inability to effectively model sequences limits their potential as general-purpose multimodal foundation models. To address these challenges, we propose UniGenX, a unified framework that combines autoregressive next-token prediction with conditional diffusion models. This integration leverages the strengths of autoregressive models to ease the training of conditional diffusion models, while diffusion-based generative heads enhance the precision of autoregressive predictions. We validate the effectiveness of UniGenX on material and small molecule generation tasks, achieving a significant leap in state-of-the-art performance for material crystal structure prediction and establishing new state-of-the-art results for small molecule structure prediction, de novo design, and conditional generation. Notably, UniGenX demonstrates significant improvements, especially in handling long sequences for complex structures, showcasing its efficacy as a versatile tool for scientific data generation.
HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model
Ma, Mingqian, Liu, Guoqing, Cao, Chuan, Deng, Pan, Dao, Tri, Gu, Albert, Jin, Peiran, Yang, Zhao, Xia, Yingce, Luo, Renqian, Hu, Pipi, Wang, Zun, Chen, Yuan-Jyue, Liu, Haiguang, Qin, Tao
Advances in natural language processing and large language models have sparked growing interest in modeling DNA, often referred to as the "language of life". However, DNA modeling poses unique challenges. First, it requires the ability to process ultra-long DNA sequences while preserving single-nucleotide resolution, as individual nucleotides play a critical role in DNA function. Second, success in this domain requires excelling at both generative and understanding tasks: generative tasks hold potential for therapeutic and industrial applications, while understanding tasks provide crucial insights into biological mechanisms and diseases. To address these challenges, we propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture, seamlessly integrating the strengths of attention mechanisms with selective state-space models. This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution. HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks, and demonstrates exceptional capability in generating synthetic cis-regulatory elements (CREs) with desired properties. Furthermore, we show that HybriDNA adheres to expected scaling laws, with performance improving consistently as the model scales from 300M to 3B and 7B parameters. These findings underscore HybriDNA's versatility and its potential to advance DNA research and applications, paving the way for innovations in understanding and engineering the "language of life".
NatureLM: Deciphering the Language of Nature for Scientific Discovery
Xia, Yingce, Jin, Peiran, Xie, Shufang, He, Liang, Cao, Chuan, Luo, Renqian, Liu, Guoqing, Wang, Yue, Liu, Zequn, Chen, Yuan-Jyue, Guo, Zekun, Bai, Yeqi, Deng, Pan, Min, Yaosen, Lu, Ziheng, Hao, Hongxia, Yang, Han, Li, Jielan, Liu, Chang, Zhang, Jia, Zhu, Jianwei, Wu, Kehan, Zhang, Wei, Gao, Kaiyuan, Pei, Qizhi, Wang, Qian, Liu, Xixian, Li, Yanting, Zhu, Houtian, Lu, Yeqing, Ma, Mingqian, Wang, Zun, Xie, Tian, Maziarz, Krzysztof, Segler, Marwin, Yang, Zhao, Chen, Zilong, Shi, Yu, Zheng, Shuxin, Wu, Lijun, Hu, Chen, Dai, Peggy, Liu, Tie-Yan, Liu, Haiguang, Qin, Tao
Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, and RNA. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce Nature Language Model (briefly, NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) achieving state-of-the-art performance in tasks like SMILES-to-IUPAC translation and retrosynthesis on USPTO-50k. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.
Efficient and Scalable Density Functional Theory Hamiltonian Prediction through Adaptive Sparsity
Luo, Erpai, Wei, Xinran, Huang, Lin, Li, Yunyang, Yang, Han, Wang, Zun, Liu, Chang, Xia, Zaishuo, Zhang, Jia, Shao, Bin
Hamiltonian matrix prediction is pivotal in computational chemistry, serving as the foundation for determining a wide range of molecular properties. While SE(3) equivariant graph neural networks have achieved remarkable success in this domain, their substantial computational cost-driven by high-order tensor product (TP) operations-restricts their scalability to large molecular systems with extensive basis sets. To address this challenge, we introduce SPHNet, an efficient and scalable equivariant network that incorporates adaptive sparsity into Hamiltonian prediction. SPHNet employs two innovative sparse gates to selectively constrain non-critical interaction combinations, significantly reducing tensor product computations while maintaining accuracy. To optimize the sparse representation, we develop a Three-phase Sparsity Scheduler, ensuring stable convergence and achieving high performance at sparsity rates of up to 70 percent. Extensive evaluations on QH9 and PubchemQH datasets demonstrate that SPHNet achieves state-of-the-art accuracy while providing up to a 7x speedup over existing models. Beyond Hamiltonian prediction, the proposed sparsification techniques also hold significant potential for improving the efficiency and scalability of other SE(3) equivariant networks, further broadening their applicability and impact.
E2Former: A Linear-time Efficient and Equivariant Transformer for Scalable Molecular Modeling
Li, Yunyang, Huang, Lin, Ding, Zhihao, Wang, Chu, Wei, Xinran, Yang, Han, Wang, Zun, Liu, Chang, Shi, Yu, Jin, Peiran, Zhang, Jia, Gerstein, Mark, Qin, Tao
Equivariant Graph Neural Networks (EGNNs) have demonstrated significant success in modeling microscale systems, including those in chemistry, biology and materials science. However, EGNNs face substantial computational challenges due to the high cost of constructing edge features via spherical tensor products, making them impractical for large-scale systems. To address this limitation, we introduce E2Former, an equivariant and efficient transformer architecture that incorporates the Wigner $6j$ convolution (Wigner $6j$ Conv). By shifting the computational burden from edges to nodes, the Wigner $6j$ Conv reduces the complexity from $O(|\mathcal{E}|)$ to $ O(| \mathcal{V}|)$ while preserving both the model's expressive power and rotational equivariance. We show that this approach achieves a 7x-30x speedup compared to conventional $\mathrm{SO}(3)$ convolutions. Furthermore, our empirical results demonstrate that the derived E2Former mitigates the computational challenges of existing approaches without compromising the ability to capture detailed geometric information. This development could suggest a promising direction for scalable and efficient molecular modeling.
DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
Wang, Zun, Li, Jialu, Lin, Han, Yoon, Jaehong, Bansal, Mohit
Storytelling video generation (SVG) has recently emerged as a task to create long, multi-motion, multi-scene videos that consistently represent the story described in the input text script. SVG holds great potential for diverse content creation in media and entertainment; however, it also presents significant challenges: (1) objects must exhibit a range of fine-grained, complex motions, (2) multiple objects need to appear consistently across scenes, and (3) subjects may require multiple motions with seamless transitions within a single scene. To address these challenges, we propose DreamRunner, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning as well as fine-grained object-level layout and motion planning. Next, DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module SR3AI for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DreamRunner's robust ability to generate multi-object interactions with qualitative examples.
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
Wang, Zun, Li, Jialu, Hong, Yicong, Li, Songze, Li, Kunchang, Yu, Shoubin, Wang, Yi, Qiao, Yu, Wang, Yali, Bansal, Mohit, Wang, Limin
Creating high-quality data for training robust language-instructed agents is a longlasting challenge in embodied AI. In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale navigational instruction-trajectory pairs by iteratively refining the data pool through the collaboration between two models, the instruction generator and the navigator, without any human-in-the-loop annotation. Specifically, SRDF starts with using a base generator to create an initial data pool for training a base navigator, followed by applying the trained navigator to filter the data pool. This leads to higher-fidelity data to train a better generator, which can, in turn, produce higher-quality data for training the next-round navigator. Such a flywheel establishes a data selfrefining process, yielding a continuously improved and highly effective dataset for large-scale language-guided navigation learning. Our experiments demonstrate that after several flywheel rounds, the navigator elevates the performance boundary from 70% to 78% SPL on the classic R2R test set, surpassing human performance (76%) for the first time. Meanwhile, this process results in a superior generator, evidenced by a SPICE increase from 23.5 to 26.2, better than all previous VLN instruction generation methods. Finally, we demonstrate the scalability of our method through increasing environment and instruction diversity, and the generalization ability of our pre-trained navigator across various downstream navigation tasks, surpassing state-of-the-art methods by a large margin in all cases. Figure 1: (a) Our Pipeline: After using the (instruction) generator to label paths for data augmentation in navigator training, we leverage the trained navigator to filter high-quality data to train a better generator, and the improved generator refines the data pool to train a stronger navigator, iteratively running on the flywheel. It also surpasses human performance on R2R and approaches human-level results on RxR-English and CVDN (for other tasks, human performance is not reported in their paper). The R2R result is from the test set, while others are from val unseen. The lack of high-quality data is one of the main bottlenecks in training embodied agents to complete real-world human activities. Unlike many other discriminative or generative learning problems, where the data itself naturally formulates a self-supervised learning objective (Devlin, 2018; He et al., 2022) or the data labeling can be facilitated by existing models (Ros et al., 2016; Tian et al., 2024), training embodied agents usually requires expensive human annotation on complex visionlinguistic contents and physical interactions.
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
Zhou, Gengze, Hong, Yicong, Wang, Zun, Zhao, Chongyang, Bansal, Mohit, Wu, Qi
Subsequent The academic field of learning instruction-guided visual works leverage generic vision-language representations navigation can be generally categorized into high-level [18, 59, 61, 96, 97] to pretrain vision-language-action category-specific search and low-level language-guided policies [14, 16, 32, 34, 36, 60, 74, 81] (Figure 1b), finetuning navigation, depending on the granularity of language instruction, parameters for specific tasks while maintaining the in which the former emphasizes the exploration same model architecture. In this paper, we argue that the process, while the latter concentrates on following detailed essential difference between these tasks lies in the granularity textual commands. Despite the differing focuses of these of instruction, and the learning problems should be unified tasks, the underlying requirements of interpreting instructions, under the broader concept of language-guided visual comprehending the surroundings, and inferring action navigation (VLN), where the overarching goal is to create decisions remain consistent. This paper consolidates a versatile system that can interpret and execute arbitrary diverse navigation tasks into a unified and generic framework language instructions (Figure 1c).