Maziarz, Krzysztof
NatureLM: Deciphering the Language of Nature for Scientific Discovery
Xia, Yingce, Jin, Peiran, Xie, Shufang, He, Liang, Cao, Chuan, Luo, Renqian, Liu, Guoqing, Wang, Yue, Liu, Zequn, Chen, Yuan-Jyue, Guo, Zekun, Bai, Yeqi, Deng, Pan, Min, Yaosen, Lu, Ziheng, Hao, Hongxia, Yang, Han, Li, Jielan, Liu, Chang, Zhang, Jia, Zhu, Jianwei, Wu, Kehan, Zhang, Wei, Gao, Kaiyuan, Pei, Qizhi, Wang, Qian, Liu, Xixian, Li, Yanting, Zhu, Houtian, Lu, Yeqing, Ma, Mingqian, Wang, Zun, Xie, Tian, Maziarz, Krzysztof, Segler, Marwin, Yang, Zhao, Chen, Zilong, Shi, Yu, Zheng, Shuxin, Wu, Lijun, Hu, Chen, Dai, Peggy, Liu, Tie-Yan, Liu, Haiguang, Qin, Tao
Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, and RNA. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce the Nature Language Model (NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications, including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) achieving state-of-the-art performance in tasks like SMILES-to-IUPAC translation and retrosynthesis on USPTO-50k. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.
Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases
Maziarz, Krzysztof, Liu, Guoqing, Misztela, Hubert, Kornev, Aleksei, Gaiński, Piotr, Hoefling, Holger, Fortunato, Mike, Gupta, Rishi, Segler, Marwin
Planning and conducting chemical syntheses remains a major bottleneck in the discovery of functional small molecules, and prevents fully leveraging generative AI for molecular inverse design. While early work has shown that ML-based retrosynthesis models can predict reasonable routes, their low accuracy for less frequent, yet important, reactions has been pointed out. As multi-step search algorithms are limited to reactions suggested by the underlying model, the applicability of those tools is inherently constrained by the accuracy of retrosynthesis prediction. Inspired by how chemists use different strategies to ideate reactions, we propose Chimera: a framework for building highly accurate reaction models that combine predictions from diverse sources with complementary inductive biases using a learning-based ensembling strategy. We instantiate the framework with two newly developed models, which by themselves already achieve state-of-the-art performance in their respective categories. Through experiments across several orders of magnitude in data scale and time-splits, we show that Chimera outperforms all major models by a large margin, owing both to the strong individual performance of its constituents and to the scalability of our ensembling strategy. Moreover, we find that PhD-level organic chemists prefer predictions from Chimera over those of baselines in terms of quality. Finally, we transfer the largest-scale checkpoint to an internal dataset from a major pharmaceutical company, showing robust generalization under distribution shift. With the new dimension that our framework unlocks, we anticipate further acceleration in the development of even more accurate models.
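As a rough illustration of the ensembling idea described above, the sketch below combines ranked predictions from two hypothetical base models using per-model weights. The model names, weights, and rank-based scoring are illustrative assumptions, not Chimera's actual learned combination rule.

# A minimal sketch of a learned rank-level ensemble over two base models' predictions.
def combine(predictions_per_model, weights):
    """predictions_per_model: {model_name: ranked list of candidate precursor sets}."""
    scores = {}
    for model, ranked in predictions_per_model.items():
        for rank, candidate in enumerate(ranked):
            # A high rank in any base model contributes; the learned weight sets how much.
            scores[candidate] = scores.get(candidate, 0.0) + weights[model] / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

preds = {
    "edit_based":    ["A.B", "C.D", "E.F"],
    "template_free": ["C.D", "A.B", "G.H"],
}
print(combine(preds, {"edit_based": 0.6, "template_free": 0.4}))  # ['A.B', 'C.D', 'E.F', 'G.H']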
Learning to Extend Molecular Scaffolds with Structural Motifs
Maziarz, Krzysztof, Jackson-Flux, Henry, Cameron, Pashmina, Sirockin, Finton, Schneider, Nadine, Stiefl, Nikolaus, Segler, Marwin, Brockschmidt, Marc
Recent advancements in deep learning-based modeling of molecules promise to accelerate in silico drug discovery. A plethora of generative models is available, building molecules either atom-by-atom and bond-by-bond or fragment-by-fragment. However, many drug discovery projects require a fixed scaffold to be present in the generated molecule, and incorporating that constraint has only recently been explored. Here, we propose MoLeR, a graph-based model that naturally supports scaffolds as the initial seed of the generative procedure, which is possible because it is not conditioned on the generation history. Our experiments show that MoLeR performs comparably to state-of-the-art methods on unconstrained molecular optimization tasks, and outperforms them on scaffold-based tasks, while being an order of magnitude faster to train and sample from than existing approaches. Furthermore, we show the influence of a number of seemingly minor design choices on the overall performance.
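A minimal sketch of the scaffold-seeded generation loop the abstract alludes to: because the decoder conditions only on the current partial molecule rather than on the generation history, generation can start from an arbitrary scaffold. The motif vocabulary and pick_next_motif are hypothetical stand-ins for MoLeR's learned decoder step, and the string join is only a placeholder for real attachment-point prediction.

import random

MOTIF_VOCAB = ["c1ccccc1", "C(=O)N", "CCO", None]  # None acts as the stop decision

def pick_next_motif(partial_molecule):
    # Stand-in for the learned decoder: MoLeR predicts this from an encoding of the partial graph.
    return random.choice(MOTIF_VOCAB)

def generate(scaffold):
    partial = [scaffold]                              # start from the scaffold, not the empty graph
    while (motif := pick_next_motif(partial)) is not None:
        partial.append(motif)                         # extend with the chosen structural motif
        if len(partial) > 8:                          # safety cap for this sketch
            break
    return ".".join(partial)                          # placeholder for attaching motifs to the graph

random.seed(0)
print(generate("O=C1CCC(=O)N1"))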
Retro-fallback: retrosynthetic planning in an uncertain world
Tripp, Austin, Maziarz, Krzysztof, Lewis, Sarah, Segler, Marwin, Hernández-Lobato, José Miguel
Retrosynthesis is the task of proposing a series of chemical reactions to create a desired molecule from simpler, buyable molecules. While previous works have proposed algorithms to find optimal solutions for a range of metrics (e.g. shortest, lowest-cost), these works generally overlook the fact that we have imperfect knowledge of the space of possible reactions, meaning plans created by the algorithm may not work in a laboratory. In this paper we propose a novel formulation of retrosynthesis in terms of stochastic processes to account for this uncertainty. We then propose a novel greedy algorithm called retro-fallback which maximizes the probability that at least one synthesis plan can be executed in the lab. Using in-silico benchmarks we demonstrate that retro-fallback generally produces better sets of synthesis plans than the popular MCTS and retro* algorithms.
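A minimal Monte-Carlo sketch of the quantity retro-fallback aims to maximize: the probability that at least one candidate plan can be executed when the plans share reactions whose feasibility is uncertain. The reactions, feasibility probabilities, and plans below are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Each plan is a set of reactions; a plan works only if all of its reactions work.
feasibility = {"r1": 0.9, "r2": 0.5, "r3": 0.7}
plans = [{"r1", "r2"}, {"r1", "r3"}]  # note the shared reaction r1

def prob_any_plan_succeeds(n_samples=100_000):
    successes = 0
    for _ in range(n_samples):
        # Sample one consistent "world" in which each reaction either works or does not.
        outcome = {r: rng.uniform() < p for r, p in feasibility.items()}
        if any(all(outcome[r] for r in plan) for plan in plans):
            successes += 1
    return successes / n_samples

print(round(prob_any_plan_succeeds(), 3))  # roughly 0.9 * (1 - 0.5 * 0.3) = 0.765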
Retrosynthetic Planning with Dual Value Networks
Liu, Guoqing, Xue, Di, Xie, Shufang, Xia, Yingce, Tripp, Austin, Maziarz, Krzysztof, Segler, Marwin, Qin, Tao, Zhang, Zongzhang, Liu, Tie-Yan
Retrosynthesis, which aims to find a route to synthesize a target molecule from commercially available starting materials, is a critical task in drug discovery and materials design. Recently, the combination of ML-based single-step reaction predictors with multi-step planners has led to promising results. However, the single-step predictors are mostly trained offline to optimize the single-step accuracy, without considering complete routes. Here, we leverage reinforcement learning (RL) to improve the single-step predictor, by using a tree-shaped MDP to optimize complete routes. Specifically, we propose a novel online training algorithm, called Planning with Dual Value Networks (PDVN), which alternates between the planning phase and the updating phase. In PDVN, we construct two separate value networks to predict the synthesizability and cost of molecules, respectively. To maintain the single-step accuracy, we design a two-branch network structure for the single-step predictor. On the widely used USPTO dataset, our PDVN algorithm improves the search success rate of existing multi-step planners (e.g., increasing the success rate from 85.79% to 98.95% for Retro*, and reducing the number of model calls by half while solving 99.47% of molecules for RetroGraph). Additionally, PDVN helps find shorter synthesis routes (e.g., reducing the average route length from 5.76 to 4.83 for Retro*, and from 5.63 to 4.78 for RetroGraph).
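A toy sketch of how two value estimates, synthesizability and cost, could be combined across the precursors of a candidate reaction, in the spirit of the dual value networks described above. The stand-in value functions and numbers are purely illustrative; PDVN learns both networks from planning episodes.

def value_syn(mol):   # stand-in for P(molecule can be synthesized); purchasable -> 1.0
    return 1.0 if mol in {"CCO", "CC(=O)O"} else 0.6  # illustrative numbers

def value_cost(mol):  # stand-in for the expected remaining synthesis cost; purchasable -> 0
    return 0.0 if mol in {"CCO", "CC(=O)O"} else 2.5

def score_reaction(precursors, reaction_cost=1.0):
    """The target is makeable only if every precursor is; costs add up along the route."""
    p_syn = 1.0
    cost = reaction_cost
    for mol in precursors:
        p_syn *= value_syn(mol)
        cost += value_cost(mol)
    return p_syn, cost

print(score_reaction(["CCO", "c1ccccc1Br"]))  # (0.6, 3.5)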
Are VAEs Bad at Reconstructing Molecular Graphs?
Muenkler, Hagen, Misztela, Hubert, Pikusa, Michal, Segler, Marwin, Schneider, Nadine, Maziarz, Krzysztof
Many contemporary generative models of molecules are variational auto-encoders of molecular graphs. One term in their training loss pertains to reconstructing the input, yet reconstruction capabilities of state-of-the-art models have not yet been thoroughly compared on a large and chemically diverse dataset. In this work, we show that when several state-of-the-art generative models are evaluated under the same conditions, their reconstruction accuracy is surprisingly low, worse than what was previously reported on seemingly harder datasets. However, we show that improving reconstruction does not directly lead to better sampling or optimization performance. Failed reconstructions from the MoLeR model are usually similar to the inputs, assembling the same motifs in a different way, and possess similar chemical properties such as solubility. Finally, we show that the input molecule and its failed reconstruction are usually mapped by the different encoders to statistically distinguishable posterior distributions, hinting that posterior collapse may not fully explain why VAEs are bad at reconstructing molecular graphs.
Gumbel-Matrix Routing for Flexible Multi-task Learning
Maziarz, Krzysztof, Kokiopoulou, Efi, Gesmundo, Andrea, Sbaiz, Luciano, Bartok, Gabor, Berent, Jesse
This paper proposes a novel per-task routing method for multi-task applications. Multi-task neural networks can learn to transfer knowledge across different tasks by using parameter sharing. However, sharing parameters between unrelated tasks can hurt performance. To address this issue, we advocate the use of routing networks to learn flexible parameter sharing, where each group of parameters is shared with a different subset of tasks in order to better leverage task relatedness. At the same time, routing networks are known to be notoriously hard to train. We propose Gumbel-Matrix routing: a novel multi-task routing method designed to learn fine-grained patterns of parameter sharing. The routing is learned jointly with the model parameters by standard back-propagation, thanks to the Gumbel-Softmax trick. When applied to the Omniglot benchmark, the proposed method reduces the state-of-the-art error rate by 17%.
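A minimal NumPy sketch of the core mechanism: a binary task-by-component routing matrix sampled with Gumbel noise, where the components active for a task are summed. The task/component counts and zero-initialized logits are assumptions, and the differentiable Gumbel-Softmax relaxation used for training the logits is omitted.

import numpy as np

rng = np.random.default_rng(0)
n_tasks, n_components, d = 3, 4, 8

# One pair of logits (use / skip) per (task, component) entry of the routing matrix.
routing_logits = np.zeros((n_tasks, n_components, 2))
components = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_components)]

def sample_routing_matrix(temperature=1.0):
    """Sample a binary task-by-component routing matrix via Gumbel perturbation."""
    gumbel = -np.log(-np.log(rng.uniform(size=routing_logits.shape)))
    return ((routing_logits + gumbel) / temperature).argmax(axis=-1) == 0  # True = component active

def forward(task_id, x, routing):
    """Sum the outputs of the components that are active for this task."""
    active = [components[j] for j in range(n_components) if routing[task_id, j]]
    return sum(x @ w for w in active) if active else np.zeros_like(x)

routing = sample_routing_matrix()
print(routing.astype(int))                              # the sampled sharing pattern
print(forward(0, rng.standard_normal(d), routing).shape)  # (8,)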
Evolutionary-Neural Hybrid Agents for Architecture Search
Maziarz, Krzysztof, Khorlin, Andrey, de Laroussilhe, Quentin, Jastrzębski, Stanisław, Tan, Mingxing, Gesmundo, Andrea
Neural Architecture Search has recently shown potential to automate the design of Neural Networks. The use of Neural Network agents trained with Reinforcement Learning can offer the possibility to learn complex architectural patterns, as well as the ability to explore a vast and compositional search space. On the other hand, evolutionary algorithms offer the sample efficiency needed for such a resource-intensive application. We propose a class of Evolutionary-Neural hybrid agents (Evo-NAS) that retain the qualities of the two approaches. We show that the Evo-NAS agent outperforms both Neural and Evolutionary agents when applied to architecture search for a suite of text classification and image classification benchmarks. On a high-complexity architecture search space for image classification, the Evo-NAS agent surpasses the performance of commonly used agents with only 1/3 of the trials.
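A rough sketch of the hybrid idea above: an evolutionary loop picks which gene of a parent architecture to mutate, while a learned distribution decides the new value, standing in for the neural agent. The search space and the hand-fixed probabilities are purely illustrative assumptions.

import random

SEARCH_SPACE = {"filters": [32, 64, 128], "kernel": [3, 5, 7], "layers": [2, 4, 6]}
# Stand-in for the neural agent's learned preferences over each gene's choices.
learned_probs = {"filters": [0.2, 0.5, 0.3], "kernel": [0.6, 0.3, 0.1], "layers": [0.3, 0.4, 0.3]}

def mutate(parent):
    child = dict(parent)
    gene = random.choice(list(SEARCH_SPACE))            # evolutionary step: pick what to mutate
    child[gene] = random.choices(SEARCH_SPACE[gene],    # "neural" step: pick how, via learned probs
                                 weights=learned_probs[gene])[0]
    return child

random.seed(0)
parent = {g: random.choice(v) for g, v in SEARCH_SPACE.items()}
print(parent, "->", mutate(parent))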
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Shazeer, Noam, Mirhoseini, Azalia, Maziarz, Krzysztof, Davis, Andy, Le, Quoc, Hinton, Geoffrey, Dean, Jeff
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than the state of the art at lower computational cost.
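A minimal NumPy sketch of the sparse gating described above: a gating network scores all experts, only the k best are evaluated, and their outputs are mixed with softmax weights. Shapes, the expert count, and k are illustrative assumptions; the paper's layer additionally uses tunable noise and load-balancing losses, omitted here.

import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 16, 8, 2

# Each "expert" is a tiny feed-forward sub-network (here: a single linear map).
expert_weights = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
gate_weights = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_layer(x):
    """Route one example through the k highest-scoring experts only."""
    logits = x @ gate_weights                      # one score per expert
    top_k = np.argsort(logits)[-k:]                # indices of the k best experts
    gates = np.exp(logits[top_k] - logits[top_k].max())
    gates /= gates.sum()                           # softmax over the selected experts only
    # Only the selected experts are evaluated -- the source of the capacity/compute trade-off.
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gates, top_k))

y = moe_layer(rng.standard_normal(d_model))
print(y.shape)  # (16,)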