Fu, Cong
MedCT: A Clinical Terminology Graph for Generative AI Applications in Healthcare
Chen, Ye, Huang, Dongdong, Xu, Haoyun, Fu, Cong, Sheng, Lin, Zhou, Qingli, Shen, Yuqiang, Wang, Kai
We introduce the world's first clinical terminology for the Chinese healthcare community, namely MedCT, accompanied by a clinical foundation model MedBERT and an entity linking model MedLink. The MedCT system enables standardized and programmable representation of Chinese clinical data, in turn stimulating the development of new medicines, treatment pathways, and better patient outcomes for the populous Chinese community. Moreover, the MedCT knowledge graph provides a principled mechanism to minimize the hallucination problem of large language models (LLMs), thereby achieving significant levels of accuracy and safety in LLM-based clinical applications. By leveraging the LLMs' emergent capabilities of generativeness and expressiveness, we were able to rapidly build a production-quality terminology system and deploy it to real-world clinical settings within three months, whereas classical terminologies like SNOMED CT have gone through more than twenty years of development. Our experiments show that the MedCT system achieves state-of-the-art (SOTA) performance in semantic matching and entity linking tasks, not only for Chinese but also for English. We also conducted a longitudinal field experiment by applying MedCT and LLMs in a representative spectrum of clinical tasks, including electronic health record (EHR) auto-generation and medical document search for diagnostic decision making. Our study shows that MedCT brings a multitude of benefits to clinical workflows and patient outcomes, especially in the new genre of clinical LLM applications. We present our approach in sufficient engineering detail that implementing a clinical terminology for other non-English societies should be readily reproducible. We openly release our terminology, models and algorithms, along with real-world clinical datasets for the development.
BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery
John, Peter St., Lin, Dejun, Binder, Polina, Greaves, Malcolm, Shah, Vega, John, John St., Lange, Adrian, Hsu, Patrick, Illango, Rajesh, Ramanathan, Arvind, Anandkumar, Anima, Brookes, David H, Busia, Akosua, Mahajan, Abhishaike, Malina, Stephen, Prasad, Neha, Sinai, Sam, Edwards, Lindsay, Gaudelet, Thomas, Regep, Cristian, Steinegger, Martin, Rost, Burkhard, Brace, Alexander, Hippe, Kyle, Naef, Luca, Kamata, Keisuke, Armstrong, George, Boyd, Kevin, Cao, Zhonglin, Chou, Han-Yi, Chu, Simon, Costa, Allan dos Santos, Darabi, Sajad, Dawson, Eric, Didi, Kieran, Fu, Cong, Geiger, Mario, Gill, Michelle, Hsu, Darren, Kaushik, Gagan, Korshunova, Maria, Kothen-Hill, Steven, Lee, Youhan, Liu, Meng, Livne, Micha, McClure, Zachary, Mitchell, Jonathan, Moradzadeh, Alireza, Mosafi, Ohad, Nashed, Youssef, Paliwal, Saee, Peng, Yuxing, Rabhi, Sara, Ramezanghorbani, Farhad, Reidenbach, Danny, Ricketts, Camir, Roland, Brian, Shah, Kushal, Shimko, Tyler, Sirelkhatim, Hassan, Srinivasan, Savitha, Stern, Abraham C, Toczydlowska, Dorota, Veccham, Srimukh Prasad, Venanzi, Niccolò Alberto Elia, Vorontsov, Anton, Wilber, Jared, Wilkinson, Isabel, Wong, Wei Jing, Xue, Eva, Ye, Cory, Yu, Xin, Zhang, Yang, Zhou, Guoqing, Zandstein, Becca, Dallago, Christian, Trentini, Bruno, Kucukbenli, Emine, Paliwal, Saee, Rvachov, Timur, Calleja, Eddie, Israeli, Johnny, Clifford, Harry, Haukioja, Risto, Haemel, Nicholas, Tretina, Kyle, Tadimeti, Neha, Costa, Anthony B
Artificial Intelligence models encoding biology and chemistry are opening new routes to high-throughput and high-quality in-silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLM) training on hundreds of graphical processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows and is open to community contributions. We detail technical features of the BioNeMo Framework through use cases such as pLM pre-training and fine-tuning. On 256 NVIDIA A100s, BioNeMo Framework trains a three billion parameter BERT-based pLM on over one trillion tokens in 4.2 days. The BioNeMo Framework is open-source and free for everyone to use.
Residual Multi-Task Learner for Applied Ranking
Fu, Cong, Wang, Kun, Wu, Jiahua, Chen, Yizhou, Huzhang, Guangda, Ni, Yabo, Zeng, Anxiang, Zhou, Zhiming
Modern e-commerce platforms rely heavily on modeling diverse user feedback to provide personalized services. Consequently, multi-task learning has become an integral part of their ranking systems. However, existing multi-task learning methods encounter two main challenges: some lack explicit modeling of task relationships, resulting in inferior performance, while others have limited applicability due to being computationally intensive, having scalability issues, or relying on strong assumptions. To address these limitations and better fit our real-world scenario, pre-rank in Shopee Search, we introduce in this paper ResFlow, a lightweight multi-task learning framework that enables efficient cross-task information sharing via residual connections between corresponding layers of task networks. Extensive experiments on datasets from various scenarios and modalities demonstrate its superior performance and adaptability over state-of-the-art methods. The online A/B tests in Shopee Search showcase its practical value in large-scale industrial applications, evidenced by a 1.29% increase in OPU (order-per-user) without additional system latency. ResFlow is now fully deployed in the pre-rank module of Shopee Search. To facilitate efficient online deployment, we propose a novel offline metric Weighted Recall@K, which aligns well with our online metric OPU, addressing the longstanding online-offline metric misalignment issue. Besides, we propose to fuse scores from the multiple tasks additively when ranking items, which outperforms traditional multiplicative fusion. The code is released at https://github.com/BrunoTruthAlliance/ResFlow
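The cross-task residual idea described in the ResFlow abstract can be illustrated with a minimal sketch. This is not the authors' released implementation: the task names (click, order), layer sizes, fusion weights, and all function names below are hypothetical and chosen only for illustration. Each later task's network adds the hidden output of the previous task's corresponding layer to its own, and the per-task scores are then combined either additively or multiplicatively.

```python
import math
import random


def relu(v):
    return [max(x, 0.0) for x in v]


def matvec(w, v):
    # w is a list of rows with shape (out_dim, in_dim).
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def init_tower(rng, dims):
    # One weight matrix per layer; all towers share the same layer sizes
    # so that corresponding layers can be added together.
    return [[[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
            for d_in, d_out in zip(dims[:-1], dims[1:])]


def forward_with_residuals(x, towers):
    """Run towers in task order; each layer of a later tower receives the
    output of the same layer in the previous tower as a residual."""
    prev_hidden = None
    logits = []
    for weights in towers:
        h, hidden = x, []
        for layer, w in enumerate(weights):
            z = matvec(w, h)
            if layer < len(weights) - 1:
                z = relu(z)  # no activation on the final (logit) layer
            if prev_hidden is not None:
                z = [a + b for a, b in zip(z, prev_hidden[layer])]
            hidden.append(z)
            h = z
        prev_hidden = hidden
        logits.append(h[0])
    return logits


def fuse_additive(scores, weights):
    # Weighted sum of per-task scores (the weights are an assumption).
    return sum(w * s for w, s in zip(weights, scores))


def fuse_multiplicative(scores, weights):
    # Weighted product of per-task scores, shown for comparison.
    return math.prod(s ** w for w, s in zip(weights, scores))


rng = random.Random(0)
dims = [8, 4, 1]
click_tower = init_tower(rng, dims)   # hypothetical earlier task
order_tower = init_tower(rng, dims)   # hypothetical later task
features = [rng.gauss(0.0, 1.0) for _ in range(dims[0])]
click_logit, order_logit = forward_with_residuals(
    features, [click_tower, order_tower])
print(click_logit, order_logit)
```

In this sketch, a later tower with all-zero weights reproduces the earlier task's logit exactly, which shows that the later task only has to learn a correction on top of the earlier one.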
SineNet: Learning Temporal Dynamics in Time-Dependent Partial Differential Equations
Zhang, Xuan, Helwig, Jacob, Lin, Yuchao, Xie, Yaochen, Fu, Cong, Wojtowytsch, Stephan, Ji, Shuiwang
We consider using deep neural networks to solve time-dependent partial differential equations (PDEs), where multi-scale processing is crucial for modeling complex, time-evolving dynamics. While the U-Net architecture with skip connections is commonly used by prior studies to enable multi-scale processing, our analysis shows that the need for features to evolve across layers results in temporally misaligned features in skip connections, which limits the model's performance. To address this limitation, we propose SineNet, consisting of multiple sequentially connected U-shaped network blocks, referred to as waves. In SineNet, high-resolution features are evolved progressively through multiple stages, thereby reducing the amount of misalignment within each stage. We furthermore analyze the role of skip connections in enabling both parallel and sequential processing of multi-scale information. Our method is rigorously tested on multiple PDE datasets, including the Navier-Stokes equations and shallow water equations, showcasing the advantages of our proposed approach over conventional U-Nets with a comparable parameter budget. We further demonstrate that increasing the number of waves in SineNet while maintaining the same number of parameters leads to a monotonically improved performance. The results highlight the effectiveness of SineNet and the potential of our approach in advancing the state-of-the-art in neural PDE solver design. Our code is available as part of AIRS (https://github.com/divelab/AIRS).
Complete and Efficient Graph Transformers for Crystal Material Property Prediction
Yan, Keqiang, Fu, Cong, Qian, Xiaofeng, Qian, Xiaoning, Ji, Shuiwang
Crystal structures are characterized by atomic bases within a primitive unit cell that repeats along a regular lattice throughout 3D space. The periodic and infinite nature of crystals poses unique challenges for geometric graph representation learning. Specifically, constructing graphs that effectively capture the complete geometric information of crystals and handle chiral crystals remains an unsolved and challenging problem. In this paper, we introduce a novel approach that utilizes the periodic patterns of unit cells to establish the lattice-based representation for each atom, enabling efficient and expressive graph representations of crystals. Furthermore, we propose ComFormer, an SE(3) transformer designed specifically for crystalline materials. ComFormer includes two variants: iComFormer, which employs invariant geometric descriptors of Euclidean distances and angles, and eComFormer, which utilizes equivariant vector representations. Experimental results demonstrate the state-of-the-art predictive accuracy of ComFormer variants on various tasks across three widely-used crystal benchmarks. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS). However, the current reliance on traditional, costly, and time-consuming trial-and-error experimental methods poses practical challenges. In this regard, computational approaches based on quantum mechanics, such as density functional theory (DFT), have made significant contributions to predicting the physical and chemical properties of materials to guide materials discovery experiments. However, these crystal graph representations have limitations in distinguishing different crystalline materials. In other words, they are not guaranteed to capture the complete geometric information of input crystal structures, and may map different crystal structures with different properties to the same graph representation and produce identical property predictions.
Illustrative examples and detailed discussions can be found in Appendix A.1. While graph representations that can capture any structural differences in small molecules have been investigated in previous works (Wang et al., 2022; Klicpera et al., 2021), these methods fall short of capturing the periodic patterns of crystals and cannot maintain geometric completeness for crystals.
TigerBot: An Open Multilingual Multitask LLM
Chen, Ye, Cai, Wei, Wu, Liangmin, Li, Xiaowei, Xin, Zhanxuan, Fu, Cong
We release and introduce the TigerBot family of large language models (LLMs), consisting of base and chat models with 7, 13, 70 and 180 billion parameters. We develop our models starting from Llama-2 and BLOOM, and push the boundary further in data, training algorithm, infrastructure, and application tools. Our models yield meaningful performance gains over SOTA open-source models, e.g., Llama-2, specifically a 6% gain in English and a 20% gain in Chinese. The TigerBot model family also achieves leading performance in major academic and industrial benchmarks and leaderboards. We believe that TigerBot represents just a snapshot of the lightning-fast progress in the LLM open-source community. Therefore, we are thrilled to give back by publicly releasing our models and reporting the approach behind them, with additional emphasis on building SOTA LLMs in a democratized way and making LLMs useful in real-world applications.
A Latent Diffusion Model for Protein Structure Generation
Fu, Cong, Yan, Keqiang, Wang, Limei, Au, Wing Yee, McThrow, Michael, Komikado, Tao, Maruhashi, Koji, Uchino, Kanji, Qian, Xiaoning, Ji, Shuiwang
Proteins are complex biomolecules that perform a variety of crucial functions within living organisms. Designing and generating novel proteins can pave the way for many future synthetic biology applications, including drug discovery. However, it remains a challenging computational task due to the large modeling space of protein structures. In this study, we propose a latent diffusion model that can reduce the complexity of protein modeling while flexibly capturing the distribution of natural protein structures in a condensed latent space. Specifically, we propose an equivariant protein autoencoder that embeds proteins into a latent space and then uses an equivariant diffusion model to learn the distribution of the latent protein representations. Experimental results demonstrate that our method can effectively generate novel protein backbone structures with high designability and efficiency.
Artificial Intelligence for Science in Quantum, Atomistic, and Continuum Systems
Zhang, Xuan, Wang, Limei, Helwig, Jacob, Luo, Youzhi, Fu, Cong, Xie, Yaochen, Liu, Meng, Lin, Yuchao, Xu, Zhao, Yan, Keqiang, Adams, Keir, Weiler, Maurice, Li, Xiner, Fu, Tianfan, Wang, Yucheng, Yu, Haiyang, Xie, YuQing, Fu, Xiang, Strasser, Alex, Xu, Shenglong, Liu, Yi, Du, Yuanqi, Saxton, Alexandra, Ling, Hongyi, Lawrence, Hannah, Stärk, Hannes, Gui, Shurui, Edwards, Carl, Gao, Nicholas, Ladera, Adriana, Wu, Tailin, Hofgard, Elyssa F., Tehrani, Aria Mansouri, Wang, Rui, Daigavane, Ameya, Bohde, Montgomery, Kurtin, Jerry, Huang, Qian, Phung, Tuong, Xu, Minkai, Joshi, Chaitanya K., Mathis, Simon V., Azizzadenesheli, Kamyar, Fang, Ada, Aspuru-Guzik, Alán, Bekkers, Erik, Bronstein, Michael, Zitnik, Marinka, Anandkumar, Anima, Ermon, Stefano, Liò, Pietro, Yu, Rose, Günnemann, Stephan, Leskovec, Jure, Ji, Heng, Sun, Jimeng, Barzilay, Regina, Jaakkola, Tommi, Coley, Connor W., Qian, Xiaoning, Qian, Xiaofeng, Smidt, Tess, Ji, Shuiwang
Advances in artificial intelligence (AI) are fueling a new paradigm of discoveries in natural sciences. Today, AI has started to advance natural sciences by improving, accelerating, and enabling our understanding of natural phenomena at a wide range of spatial and temporal scales, giving rise to a new area of research known as AI for science (AI4Science). Being an emerging research paradigm, AI4Science is unique in that it is an enormous and highly interdisciplinary area. Thus, a unified and technical treatment of this field is needed yet challenging. This work aims to provide a technically thorough account of a subarea of AI4Science; namely, AI for quantum, atomistic, and continuum systems. These areas aim at understanding the physical world from the subatomic (wavefunctions and electron density), atomic (molecules, proteins, materials, and interactions), to macro (fluids, climate, and subsurface) scales and form an important subarea of AI4Science. A unique advantage of focusing on these areas is that they largely share a common set of challenges, thereby allowing a unified and foundational treatment. A key common challenge is how to capture physics first principles, especially symmetries, in natural systems by deep learning methods. We provide an in-depth yet intuitive account of techniques to achieve equivariance to symmetry transformations. We also discuss other common technical challenges, including explainability, out-of-distribution generalization, knowledge transfer with foundation and large language models, and uncertainty quantification. To facilitate learning and education, we provide categorized lists of resources that we found to be useful. We strive to be thorough and unified and hope this initial effort may trigger more community interest and efforts to further advance AI4Science.
Group Equivariant Fourier Neural Operators for Partial Differential Equations
Helwig, Jacob, Zhang, Xuan, Fu, Cong, Kurtin, Jerry, Wojtowytsch, Stephan, Ji, Shuiwang
We consider solving partial differential equations (PDEs) with Fourier neural operators (FNOs), which operate in the frequency domain. Since the laws of physics do not depend on the coordinate system used to describe them, it is desirable to encode such symmetries in the neural operator architecture for better performance and easier learning. While encoding symmetries in the physical domain using group theory has been studied extensively, how to capture symmetries in the frequency domain is under-explored. In this work, we extend group convolutions to the frequency domain and design Fourier layers that are equivariant to rotations, translations, and reflections by leveraging the equivariance property of the Fourier transform. The resulting $G$-FNO architecture generalizes well across input resolutions and performs well in settings with varying levels of symmetry. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS).
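The equivariance property of the Fourier transform that the G-FNO abstract relies on can be checked numerically. The following is a minimal sketch of that property only, not of the G-FNO architecture: it assumes an odd-sized square grid with the origin placed at the grid center, and the function names are hypothetical. Under these assumptions, rotating a field by 90 degrees and then taking its discrete Fourier transform gives the same result as taking the transform first and rotating the spectrum.

```python
import cmath
import random


def centered_dft2(f):
    """2D DFT of an odd-sized square grid, with spatial and frequency
    origins both placed at the center sample."""
    n = len(f)
    m = n // 2  # index of the center sample
    out = [[0j] * n for _ in range(n)]
    for a in range(n):          # frequency index along rows
        for b in range(n):      # frequency index along columns
            total = 0j
            for i in range(n):
                for j in range(n):
                    phase = (-2j * cmath.pi
                             * ((a - m) * (i - m) + (b - m) * (j - m)) / n)
                    total += f[i][j] * cmath.exp(phase)
            out[a][b] = total
    return out


def rot90(f):
    """Rotate a square grid by 90 degrees: result[i][j] = f[j][n - 1 - i]."""
    n = len(f)
    return [[f[j][n - 1 - i] for j in range(n)] for i in range(n)]


def max_abs_diff(p, q):
    return max(abs(x - y) for rp, rq in zip(p, q) for x, y in zip(rp, rq))


random.seed(0)
n = 5  # odd size, so that index reversal equals negation about the center
field = [[random.random() for _ in range(n)] for _ in range(n)]

# Transform of the rotated field vs. rotated transform of the original field.
gap = max_abs_diff(centered_dft2(rot90(field)), rot90(centered_dft2(field)))
print(gap)
```

The printed gap should be at the level of floating-point round-off. The centered, odd-sized grid is a convenience of this sketch: with the usual corner-origin indexing, the same relation holds only up to an index shift.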
Collaborative Policy Learning for Open Knowledge Graph Reasoning
Fu, Cong, Chen, Tong, Qu, Meng, Jin, Woojeong, Ren, Xiang
In recent years, there has been a surge of interest in interpretable graph reasoning methods. However, these models often suffer from limited performance when working on sparse and incomplete graphs, due to the lack of evidential paths that can reach target entities. Here we study open knowledge graph reasoning, a task that aims to reason about missing facts over a graph augmented by a background text corpus. A key challenge of the task is to filter out "irrelevant" facts extracted from the corpus, in order to maintain an effective search space during path inference. We propose a novel reinforcement learning framework to jointly train two collaborative agents, i.e., a multi-hop graph reasoner and a fact extractor. The fact extraction agent generates fact triples from corpora to enrich the graph on the fly, while the reasoning agent provides feedback to the fact extractor and guides it towards promoting facts that are helpful for interpretable reasoning. Experiments on two public datasets demonstrate the effectiveness of the proposed approach. Source code and datasets used in this paper can be downloaded at https://github.com/shanzhenren/CPL