Materials
EnzChemRED, a rich enzyme chemistry relation extraction dataset
Lai, Po-Ting, Coudert, Elisabeth, Aimo, Lucila, Axelsen, Kristian, Breuza, Lionel, de Castro, Edouard, Feuermann, Marc, Morgat, Anne, Pourcel, Lucille, Pedruzzi, Ivo, Poux, Sylvain, Redaschi, Nicole, Rivoire, Catherine, Sveshnikova, Anastasia, Wei, Chih-Hsuan, Leaman, Robert, Luo, Ling, Lu, Zhiyong, Bridge, Alan
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 score of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at https://ftp.expasy.org/databases/rhea/nlp/.
Integrating Chemistry Knowledge in Large Language Models via Prompt Engineering
Liu, Hongxuan, Yin, Haoyu, Luo, Zhiyao, Wang, Xiaonan
This paper presents a study on the integration of domain-specific knowledge in prompt engineering to enhance the performance of large language models (LLMs) in scientific domains. A benchmark dataset is curated to encapsulate the intricate physical-chemical properties of small molecules, their druggability for pharmacology, alongside the functional attributes of enzymes and crystal materials, underscoring the relevance and applicability across biological and chemical domains. The proposed domain-knowledge embedded prompt engineering method outperforms traditional prompt engineering strategies on various metrics, including capability, accuracy, F1 score, and hallucination drop. The effectiveness of the method is demonstrated through case studies on complex materials including the MacMillan catalyst, paclitaxel, and lithium cobalt oxide. The results suggest that domain-knowledge prompts can guide LLMs to generate more accurate and relevant responses, highlighting the potential of LLMs as powerful tools for scientific discovery and innovation when equipped with domain-specific prompts. The study also discusses limitations and future directions for domain-specific prompt engineering development.
Combining and Decoupling Rigid and Soft Grippers to Enhance Robotic Manipulation
Keely, Maya, Kim, Yeunhee, Mehta, Shaunak A., Hoegerman, Joshua, Sanchez, Robert Ramirez, Paul, Emily, Mills, Camryn, Losey, Dylan P., Bartlett, Michael D.
For robot arms to perform everyday tasks in unstructured environments, these robots must be able to manipulate a diverse range of objects. Today's robots often grasp objects with either soft grippers or rigid end-effectors. However, purely rigid or purely soft grippers have fundamental limitations: soft grippers struggle with irregular, heavy objects, while rigid grippers often cannot grasp small, numerous items. In this paper we therefore introduce RISOs, a mechanics and controls approach for unifying traditional RIgid end-effectors with a novel class of SOft adhesives. When grasping an object, RISOs can use the rigid end-effector (pinching the item between non-deformable fingers), the soft materials (attaching and releasing items with switchable adhesives), or both. This enhances manipulation capabilities by combining and decoupling rigid and soft mechanisms. With RISOs robots can perform grasps along a spectrum from fully rigid, to fully soft, to rigid-soft, enabling real-time object manipulation across a million-fold range in weight (from 2 mg to 2 kg). To develop RISOs we first model and characterize the soft switchable adhesives. We then mount sheets of these soft adhesives on the surfaces of rigid end-effectors, and develop control strategies that make it easier for robot arms and human operators to utilize RISOs. The resulting RISO grippers were able to pick up, carry, and release a larger set of objects than existing grippers, and participants also preferred using RISO. Overall, our experimental and user study results suggest that RISOs provide an exceptional gripper range in both capacity and object diversity. See videos of our user studies here: https://youtu.be/du085R0gPFI
React-OT: Optimal Transport for Generating Transition State in Chemical Reactions
Duan, Chenru, Liu, Guan-Horng, Du, Yuanqi, Chen, Tianrong, Zhao, Qiyuan, Jia, Haojun, Gomes, Carla P., Theodorou, Evangelos A., Kulik, Heather J.
Transition states (TSs) are transient structures that are key to understanding reaction mechanisms and designing catalysts but are challenging to capture in experiments. Alternatively, many optimization algorithms have been developed to search for TSs computationally. Yet the cost of these algorithms driven by quantum chemistry methods (usually density functional theory) is still high, posing challenges for their applications in building large reaction networks for reaction exploration. Here we developed React-OT, an optimal transport approach for generating unique TS structures from reactants and products. React-OT generates highly accurate TS structures with a median structural root mean square deviation (RMSD) of 0.053 Å and a median barrier height error of 1.06 kcal/mol, requiring only 0.4 seconds per reaction. The RMSD and barrier height error are further improved by roughly 25% through pretraining React-OT on a large reaction dataset obtained with a lower level of theory, GFN2-xTB. We envision the high accuracy and fast inference of React-OT being useful in targeting TSs when exploring chemical reactions with unknown mechanisms.
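The structural RMSD quoted above can be illustrated with a minimal Python sketch. It assumes the two structures list the same atoms in the same order and are already superimposed; no alignment step (such as a Kabsch rotation) is performed, and the abstract does not say how the authors align structures. The function name `rmsd` and the three-atom coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally ordered coordinate sets.

    Each argument is a list of (x, y, z) tuples, here taken to be in angstroms.
    No superposition is done: the structures are assumed pre-aligned.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


# Hypothetical three-atom structures, assumed already superimposed.
reference_ts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
generated_ts = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(rmsd(reference_ts, generated_ts))
```

In this toy case only one atom is displaced, by 0.1 Å, so the RMSD is that displacement divided by the square root of the atom count.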
What Generative Artificial Intelligence Means for Terminological Definitions
This paper examines the impact of Generative Artificial Intelligence (GenAI) tools like ChatGPT on the creation and consumption of terminological definitions. From the terminologist's point of view, the strategic use of GenAI tools can streamline the process of crafting definitions, reducing both time and effort, while potentially enhancing quality. GenAI tools enable AI-assisted terminography, notably post-editing terminography, where the machine produces a definition that the terminologist then corrects or refines. However, the potential of GenAI tools to fulfill all the terminological needs of a user, including term definitions, challenges the very existence of terminological definitions and resources as we know them. Unlike terminological definitions, GenAI tools can describe the knowledge activated by a term in a specific context. However, a main drawback of these tools is that their output can contain errors. For this reason, users requiring reliability will likely still resort to terminological resources for definitions. Nevertheless, with the inevitable integration of AI into terminology work, the distinction between human-created and AI-created content will become increasingly blurred.
Implementing Hottopixx Methods for Endmember Extraction in Hyperspectral Images
Hyperspectral imaging technology has a wide range of applications, including forest management, mineral resource exploration, and Earth surface monitoring. Endmember extraction of hyperspectral images is a key step in leveraging this technology for applications. It aims to identify the spectral signatures of materials, i.e., the major components in the observed scenes. Theoretically speaking, Hottopixx methods should be effective on problems involving extracting endmembers from hyperspectral images. Yet, these methods are challenging to perform in practice, due to high computational costs. They require us to solve LP problems, called Hottopixx models, whose size grows quadratically with the number of pixels in the image. It is thus still unclear whether they are actually effective. This study clarifies this situation. We propose an efficient and effective implementation of Hottopixx. Our implementation follows the framework of column generation, which is known as a classical but powerful means of solving large-scale LPs. We show in experiments that our implementation is applicable to endmember extraction from real hyperspectral images and can provide estimations of endmember signatures with higher accuracy than the existing methods can.
Towards Robust Ferrous Scrap Material Classification with Deep Learning and Conformal Prediction
Santos, Paulo Henrique dos, Santos, Valéria de Carvalho, Luz, Eduardo José da Silva
In the steel production domain, recycling ferrous scrap is essential for environmental and economic sustainability, as it reduces both energy consumption and greenhouse gas emissions. However, the classification of scrap materials poses a significant challenge, requiring advancements in automation technology. Additionally, building trust among human operators is a major obstacle. Traditional approaches often fail to quantify uncertainty and lack clarity in model decision-making, which complicates acceptance. In this article, we describe how conformal prediction can be employed to quantify uncertainty and add robustness in scrap classification. We have adapted the Split Conformal Prediction technique to seamlessly integrate with state-of-the-art computer vision models, such as the Vision Transformer (ViT), Swin Transformer, and ResNet-50, while also incorporating Explainable Artificial Intelligence (XAI) methods. We evaluate the approach using a comprehensive dataset of 8147 images spanning nine ferrous scrap classes. The application of the Split Conformal Prediction method allowed for the quantification of each model's uncertainties, which enhanced the understanding of predictions and increased the reliability of the results. Specifically, the Swin Transformer model demonstrated more reliable outcomes than the others, as evidenced by its smaller average prediction-set size and an average classification accuracy exceeding 95%. Furthermore, the Score-CAM method proved highly effective in clarifying visual features, significantly enhancing the explainability of the classification decisions.
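As a rough illustration of how split conformal prediction turns classifier probabilities into prediction sets, the Python sketch below uses the common nonconformity score of one minus the probability assigned to the true class. This is the textbook variant, not necessarily the adaptation described in the abstract; the function names, the three-class probabilities, and the miscoverage level `alpha` are all hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a score threshold on held-out data.

    The nonconformity score of each calibration example is 1 - p(true class).
    The threshold is the ceil((n + 1) * (1 - alpha))-th smallest score, which
    is the usual finite-sample corrected quantile.
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: every class gets included
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """Return the class indices whose score 1 - p does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


# Hypothetical softmax outputs for four calibration images over three classes.
cal_probs = [
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.30, 0.30, 0.40],
    [0.60, 0.30, 0.10],
]
cal_labels = [0, 1, 2, 0]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
print(prediction_set([0.50, 0.45, 0.05], threshold))
```

A smaller average prediction-set size at the same target coverage indicates a more decisive model, which is the sense in which set size is used as a reliability measure above.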
UAlign: Pushing the Limit of Template-free Retrosynthesis Prediction with Unsupervised SMILES Alignment
Zeng, Kaipeng, Yang, Bo, Zhao, Xin, Zhang, Yu, Nie, Fan, Yang, Xiaokang, Jin, Yaohui, Xu, Yanyan
Motivation: Retrosynthesis planning poses a formidable challenge in the organic chemical industry. Single-step retrosynthesis prediction, a crucial step in the planning process, has witnessed a surge in interest in recent years due to advancements in AI for science. Various deep learning-based methods have been proposed for this task, with varying degrees of dependence on additional chemical knowledge. Results: This paper introduces UAlign, a template-free graph-to-sequence pipeline for retrosynthesis prediction. By combining graph neural networks and Transformers, our method can more effectively leverage the inherent graph structure of molecules. Based on the fact that the majority of molecule structures remain unchanged during a chemical reaction, we propose a simple yet effective SMILES alignment technique to facilitate the reuse of unchanged structures for reactant generation. Extensive experiments show that our method substantially outperforms state-of-the-art template-free and semi-template-based approaches. Importantly, our template-free method achieves effectiveness comparable to, or even surpassing, that of established powerful template-based methods. Scientific contribution: We present a novel graph-to-sequence template-free retrosynthesis prediction pipeline that overcomes the limitations of Transformer-based methods in molecular representation learning and insufficient utilization of chemical information. We propose an unsupervised learning mechanism for establishing product-atom correspondence with reactant SMILES tokens, achieving even better results than supervised SMILES alignment methods. Extensive experiments demonstrate that UAlign significantly outperforms state-of-the-art template-free methods and rivals or surpasses template-based approaches, with up to 5% (top-5) and 5.4% (top-10) increased accuracy over the strongest baseline.
HalluciBot: Is There No Such Thing as a Bad Question?
Hallucination continues to be one of the most critical challenges in the institutional adoption journey of Large Language Models (LLMs). In this context, an overwhelming number of studies have focused on analyzing the post-generation phase - refining outputs via feedback, analyzing logit output values, or deriving clues via the outputs' artifacts. We propose HalluciBot, a model that predicts the probability of hallucination before generation, for any query posed to an LLM. In essence, HalluciBot does not invoke any generation during inference. To derive empirical evidence for HalluciBot, we employ a Multi-Agent Monte Carlo Simulation using a Query Perturbator to craft n variations per query at train time. The construction of our Query Perturbator is motivated by our introduction of a new definition of hallucination - truthful hallucination. Our training methodology generated 2,219,022 estimates for a training corpus of 369,837 queries, spanning 13 diverse datasets and 3 question-answering scenarios. HalluciBot predicts both binary and multi-class probabilities of hallucination, enabling a means to judge the query's quality with regard to its propensity to hallucinate. Therefore, HalluciBot paves the way to revise or cancel a query before generation and the ensuing computational waste. Moreover, it provides a lucid means to measure user accountability for hallucinatory queries.
Adaptive Catalyst Discovery Using Multicriteria Bayesian Optimization with Representation Learning
Chen, Jie, Ou, Pengfei, Chang, Yuxin, Zhang, Hengrui, Li, Xiao-Yan, Sargent, Edward H., Chen, Wei
High-performance catalysts are crucial for sustainable energy conversion and human health. However, the discovery of catalysts faces challenges due to the absence of efficient approaches to navigating vast and high-dimensional structure and composition spaces. In this study, we propose a high-throughput computational catalyst screening approach integrating density functional theory (DFT) and Bayesian Optimization (BO). Within the BO framework, we propose an uncertainty-aware atomistic machine learning model, UPNet, which enables automated representation learning directly from high-dimensional catalyst structures and achieves principled uncertainty quantification. Utilizing a constrained expected improvement acquisition function, our BO framework simultaneously considers multiple evaluation criteria. Using the proposed methods, we explore catalyst discovery for the CO2 reduction reaction. The results demonstrate that our approach achieves high prediction accuracy, facilitates interpretable feature extraction, and enables multicriteria design optimization, leading to significant reduction of computing power and time (10x reduction of required DFT calculations) in high-performance catalyst discovery.
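A constrained expected improvement acquisition function is commonly written as the ordinary expected improvement multiplied by the probability that each constraint is satisfied. The Python sketch below shows that standard form under Gaussian predictive distributions. It is an assumption that this matches the authors' formulation, which the abstract does not spell out; the function names and numeric inputs are hypothetical, and the objective is treated as something to maximize.

```python
import math


def normal_pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best):
    """Expected improvement over `best` for a maximization objective."""
    if std <= 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * normal_cdf(z) + std * normal_pdf(z)


def constrained_ei(mean, std, best, constraints):
    """Expected improvement weighted by the probability of feasibility.

    `constraints` is a list of (c_mean, c_std, upper_limit) tuples, one per
    criterion, each meaning "the predicted quantity should stay below
    upper_limit". Constraints are assumed independent.
    """
    feasibility = 1.0
    for c_mean, c_std, upper_limit in constraints:
        feasibility *= normal_cdf((upper_limit - c_mean) / c_std)
    return expected_improvement(mean, std, best) * feasibility


# Hypothetical candidate: predicted objective 1.2 +/- 0.3 against a best of 1.0,
# with one constraint predicted at 0.4 +/- 0.2 that must stay below 0.5.
print(constrained_ei(1.2, 0.3, 1.0, [(0.4, 0.2, 0.5)]))
```

Candidates that are promising on the objective but unlikely to satisfy the other criteria receive a low score, which is how a single acquisition value can account for several evaluation criteria at once.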