Materials
Foundational Large Language Models for Materials Research
Mishra, Vaibhav, Singh, Somaditya, Ahlawat, Dhruv, Zaki, Mohd, Bihani, Vaibhav, Grover, Hargun Singh, Mishra, Biswajit, Miret, Santiago, Mausam, Krishnan, N. M. Anoop
Materials discovery and development are critical for addressing global challenges. Yet, the exponential growth in materials science literature comprising vast amounts of textual data has created significant bottlenecks in knowledge extraction, synthesis, and scientific reasoning. Large Language Models (LLMs) offer unprecedented opportunities to accelerate materials research through automated analysis and prediction. Still, their effective deployment requires domain-specific adaptation for understanding and solving domain-relevant tasks. Here, we present LLaMat, a family of foundational models for materials science developed through continued pretraining of LLaMA models on an extensive corpus of materials literature and crystallographic data. Through systematic evaluation, we demonstrate that LLaMat excels in materials-specific NLP and structured information extraction while maintaining general linguistic capabilities. The specialized LLaMat-CIF variant demonstrates unprecedented capabilities in crystal structure generation, predicting stable crystals with high coverage across the periodic table. Intriguingly, despite LLaMA-3's superior performance in comparison to LLaMA-2, we observe that LLaMat-2 demonstrates unexpectedly enhanced domain-specific performance across diverse materials science tasks, including structured information extraction from text and tables and, more particularly, crystal structure generation, which suggests a potential adaptation rigidity in overtrained LLMs. Altogether, the present work demonstrates the effectiveness of domain adaptation towards developing practically deployable LLM copilots for materials research. Beyond materials science, our findings reveal important considerations for domain adaptation of LLMs, such as model selection, training methodology, and domain-specific performance, which may influence the development of specialized scientific AI systems.
Language model driven: a PROTAC generation pipeline with dual constraints of structure and property
Shao, Jinsong, Gong, Qineng, Yin, Zeyu, Chen, Yu, Hao, Yajie, Zhang, Lei, Jiang, Linlin, Yao, Min, Li, Jinlong, Wang, Fubo, Wang, Li
The imperfect modeling of ternary complexes has limited the application of computer-aided drug discovery tools in PROTAC research and development. In this study, LM-PROTAC (language model driven Proteolysis Targeting Chimera), an AI-assisted pipeline for PROTAC molecule design, was developed by embedding a transformer-based generative model with dual constraints on structure and properties, referred to as the DCT. The pipeline uses a fragment-based representation of molecules and proceeds in three steps. First, a language model driven protein-compound affinity model screens molecular fragments with high affinity for the target protein. Second, the structural and physicochemical properties of these fragments are constrained during the generation process to meet specific scenario requirements. Finally, the preliminary generated molecules undergo a two-round screening with a multidimensional property prediction model, yielding a batch of PROTAC molecules capable of degrading disease-relevant target proteins for validation in in vitro experiments, thus achieving a complete solution for AI-assisted PROTAC drug generation. Taking the key tumor target Wnt3a as an example, the LM-PROTAC pipeline successfully generated PROTAC molecules capable of inhibiting Wnt3a. The results show that DCT can efficiently generate PROTACs that target and hydrolyse Wnt3a.
OG-RAG: Ontology-Grounded Retrieval-Augmented Generation For Large Language Models
Sharma, Kartik, Kumar, Peeyush, Li, Yunqing
This paper presents OG-RAG, an Ontology-Grounded Retrieval Augmented Generation method designed to enhance LLM-generated responses by anchoring retrieval processes in domain-specific ontologies. While LLMs are widely used for tasks like question answering and search, they struggle to adapt to specialized knowledge, such as industrial workflows or knowledge work, without expensive fine-tuning or sub-optimal retrieval methods. Existing retrieval-augmented models, such as RAG, offer improvements but fail to account for structured domain knowledge, leading to suboptimal context generation. Ontologies, which conceptually organize domain knowledge by defining entities and their interrelationships, offer a structured representation to address this gap. OG-RAG constructs a hypergraph representation of domain documents, where each hyperedge encapsulates clusters of factual knowledge grounded using domain-specific ontology. An optimization algorithm then retrieves the minimal set of hyperedges that constructs a precise, conceptually grounded context for the LLM. This method enables efficient retrieval while preserving the complex relationships between entities. OG-RAG applies to domains where fact-based reasoning is essential, particularly in tasks that require workflows or decision-making steps to follow predefined rules and procedures. These include industrial workflows in healthcare, legal, and agricultural sectors, as well as knowledge-driven tasks such as news journalism, investigative research, consulting and more. Our evaluations demonstrate that OG-RAG increases the recall of accurate facts by 55% and improves response correctness by 40% across four different LLMs. Additionally, OG-RAG enables 30% faster attribution of responses to context and boosts fact-based reasoning accuracy by 27% compared to baseline methods.
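The retrieval step in the OG-RAG abstract above, which selects a minimal set of hyperedges to build the context, resembles a set-cover problem. The following is a minimal sketch under stated assumptions: it uses a greedy heuristic, and it represents each hyperedge as a set of fact identifiers. The abstract does not specify the paper's actual optimization algorithm, so the function name `greedy_hyperedge_cover` and the data layout are hypothetical and for illustration only.

```python
from typing import Dict, List, Set


def greedy_hyperedge_cover(
    hyperedges: Dict[str, Set[str]], relevant: Set[str]
) -> List[str]:
    """Greedily pick hyperedges until every coverable relevant fact is covered.

    hyperedges: hypothetical mapping from hyperedge name to the facts it groups.
    relevant:   facts judged relevant to the query (assumed to be given).
    """
    if not hyperedges:
        return []
    uncovered = set(relevant)
    chosen: List[str] = []
    while uncovered:
        # Choose the hyperedge covering the most still-uncovered facts.
        # Iterating over sorted names makes ties resolve to the smallest name.
        best = max(sorted(hyperedges), key=lambda name: len(hyperedges[name] & uncovered))
        gain = hyperedges[best] & uncovered
        if not gain:
            break  # the remaining facts appear in no hyperedge
        chosen.append(best)
        uncovered -= gain
    return chosen


if __name__ == "__main__":
    edges = {
        "e1": {"crop:wheat", "soil:loam"},
        "e2": {"soil:loam", "pest:aphid"},
        "e3": {"crop:wheat", "soil:loam", "pest:aphid"},
        "e4": {"irrigation:drip"},
    }
    print(greedy_hyperedge_cover(edges, {"crop:wheat", "pest:aphid", "irrigation:drip"}))
```

Greedy selection does not guarantee the true minimum cover, but it is a common approximation when exact set cover is too expensive.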
Quantum Kernel-Based Long Short-term Memory for Climate Time-Series Forecasting
Hsu, Yu-Chao, Chen, Nan-Yow, Li, Tai-Yu, Lee, Po-Heng, Chen, Kuan-Cheng
We present the Quantum Kernel-Based Long Short-Term Memory (QK-LSTM) network, which integrates quantum kernel methods into classical LSTM architectures to enhance predictive accuracy and computational efficiency in climate time-series forecasting tasks, such as Air Quality Index (AQI) prediction. Leveraging quantum kernel methods allows for efficient computation of inner products in quantum spaces, addressing the computational challenges faced by classical models and variational quantum circuit-based models. Designed for the Noisy Intermediate-Scale Quantum (NISQ) era, QK-LSTM supports scalable hybrid quantum-classical implementations. Experimental results demonstrate that QK-LSTM outperforms classical LSTM networks in AQI forecasting, showcasing its potential for environmental monitoring and resource-constrained scenarios, while highlighting the broader applicability of quantum-enhanced machine learning frameworks in tackling large-scale, high-dimensional climate datasets. Climate time-series forecasting is essential for understanding and predicting environmental phenomena, which has significant implications for public health [1], resource management [2], and policy-making [3]. Accurate forecasting of climatic variables such as temperature, precipitation, and pollutant concentrations enables proactive measures to mitigate adverse effects associated with climate variability and change.
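To make the idea of a quantum kernel as an inner product in a quantum feature space concrete, here is a small classically simulated sketch. It assumes a single-qubit RY angle encoding per feature, for which the state overlap has a closed form. The abstract does not state which feature map or circuit QK-LSTM uses, so the encoding and the function names `angle_encoding_kernel` and `kernel_matrix` are assumptions for illustration.

```python
import numpy as np


def angle_encoding_kernel(x: np.ndarray, y: np.ndarray) -> float:
    """Fidelity kernel |<phi(x)|phi(y)>|^2 under an assumed RY angle encoding.

    Each feature x_i is encoded on its own qubit as
    cos(x_i / 2)|0> + sin(x_i / 2)|1>, so the overlap of the two product
    states factorizes into prod_i cos((x_i - y_i) / 2).
    """
    return float(np.prod(np.cos((x - y) / 2.0) ** 2))


def kernel_matrix(X: np.ndarray) -> np.ndarray:
    """Gram matrix of the kernel over the rows of X (shape: n_samples x n_features)."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = angle_encoding_kernel(X[i], X[j])
    return K


if __name__ == "__main__":
    X = np.array([[0.0, 0.5], [1.0, -0.3], [np.pi, 0.5]])
    print(kernel_matrix(X))
```

In a kernel-based recurrent cell, values like these would stand in for the dot products computed inside the gates. How QK-LSTM wires them into the LSTM is not described in the abstract.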
Dual Random Fields and their Application to Mineral Potential Mapping
In various geosciences branches, including mineral exploration, geometallurgical characterization on established mining operations, and remote sensing, the regionalized input variables are spatially well-sampled across the domain of interest, limiting the scope of spatial uncertainty quantification procedures. In turn, response outcomes such as the mineral potential in a given region, mining throughput, metallurgical recovery, or in-situ estimations from remote satellite imagery, are usually modeled from a much-restricted subset of testing samples, collected at certain locations due to accessibility restrictions and the high acquisition costs. Our limited understanding of these functions, in terms of the multi-dimensional complexity of causalities and unnoticed dependencies on inaccessible inputs, may lead to observing changes in such functions based on their geographical location. Pooling together different response functions across the domain is critical to correctly predict outcome responses, the uncertainty associated with these inferred values, and the significance of inputs in such predictions at unexplored areas. This paper introduces the notion of a dual random field (dRF), where the response function itself is considered a regionalized variable. In this way, different established response models across the geographic domain can be considered as observations of a dRF realization, enabling the spatial inference and uncertainty assessment of both response models and their predictions. We explain how dRFs inherit all the properties from classical random fields, allowing the use of standard Gaussian simulation procedures to simulate them. These models are combined to obtain a mineral potential response, providing an example of how to rigorously integrate machine learning approaches with geostatistics.
Combining knowledge graphs and LLMs for hazardous chemical information management and reuse
Da Silveira, Marcos, Deladiennee, Louis, Acem, Kheira, Freudenthal, Oona
Human health is increasingly threatened by exposure to hazardous substances, particularly persistent and toxic chemicals. The link between these substances, often encountered in complex mixtures, and various diseases has been demonstrated in scientific studies. However, this information is scattered across several sources and is hard for both humans and machines to access. This paper evaluates current practices for publishing/accessing information on hazardous chemicals and proposes a novel platform designed to facilitate retrieval of critical chemical data in urgent situations. The platform aggregates information from multiple sources and organizes it into a structured knowledge graph. Users can access this information through a visual interface such as Neo4J Bloom and dashboards, or via natural language queries using a chatbot. Our findings demonstrate a significant reduction in the time and effort required to access vital chemical information when datasets follow FAIR principles. Furthermore, we discuss the lessons learned from the development and implementation of this platform and provide recommendations for data owners and publishers to enhance data reuse and interoperability. This work aims to improve the accessibility and usability of chemical information for healthcare professionals, thereby supporting better health outcomes and informed decision-making for patients exposed to chemical intoxication risks.
Optimization-Driven Design of Monolithic Soft-Rigid Grippers
Mansueto, Pierluigi, Dragusanu, Mihai, Saeed, Anjum, Malvezzi, Monica, Lapucci, Matteo, Salvietti, Gionata
Sim-to-real transfer remains a significant challenge in soft robotics due to the unpredictability introduced by common manufacturing processes such as 3D printing and molding. These processes often result in deviations from simulated designs, requiring multiple prototypes before achieving a functional system. In this study, we propose a novel methodology to address these limitations by combining advanced rapid prototyping techniques and an efficient optimization strategy. Firstly, we employ rapid prototyping methods typically used for rigid structures, leveraging their precision to fabricate compliant components with reduced manufacturing errors. Secondly, our optimization framework minimizes the need for extensive prototyping, significantly reducing the iterative design process. The methodology enables the identification of stiffness parameters that are more practical and achievable within current manufacturing capabilities. The proposed approach demonstrates a substantial improvement in the efficiency of prototype development while maintaining the desired performance characteristics. This work represents a step forward in bridging the sim-to-real gap in soft robotics, paving the way towards a faster and more reliable deployment of soft robotic systems.
The 50 greatest innovations of 2024
In 1988, we launched the Best of What's New Awards. The original list highlighted "the very things that make our lives more comfortable, more rewarding, more exciting, and more fun," to quote then-Publisher Grant A. Burnett. Now, in 2024, we continue our decades-old tradition of honoring big ideas. We even see hints of our original honorees in this year's list: Sea-Doo and Ford made both lists, 36 years apart. We're proud to bring you promising innovations--from things that make life at home easier to literal out-of-this-world explorations. This is the Best of What's New 2024. Had you asked me at the beginning of 2024 what our best gadgets list would look like, I'd have guessed it would be filled with quirky AI-driven devices like the rabbit R1 or the Humane Ai Pin. "Now with AI" is a phrase that has dominated consumer electronics in the 2020s. These devices promised unadulterated access to the power of neural networks in ways that would seamlessly integrate into our lives without relying on phones or smart fridges. Then, the devices came out. The software is slow and buggy, and the hardware is clunky. Maybe the stand-alone AI device will still have its year, and we'll look back and chuckle at these humble beginnings. In reality, 2024's big breakthrough came from Apple in the form of its long-rumored Vision Pro headset. The device has its own hurdles to clear, but after just a few minutes of using it, it was clear that it's something different, important, and honestly pretty amazing. The list also includes Sony's innovative pro-grade camera, the most accessible drone we've ever used, and a no-fun phone--no fun in a good way, of course. Credible rumors of Apple's VR bounced around the gadget blogs and tech sites for nearly a decade. It was consumer tech's sasquatch in that people claimed to have seen it, but no one knew if it even existed. Then, the Vision Pro emerged from the proverbial forest in February with a surprising design and a massive $3,500 price tag. It also came toting a new R-series chip and a dedicated OS meant for spatial computing.
Food for thought: How can machine learning help better predict and understand changes in food prices?
Kupferschmidt, Kristina L., Requiema, James, Simpson, Mya, Varsallay, Zohrah, Jackson, Ethan, Kupferschmidt, Cody, El-Shawa, Sara, Taylor, Graham W.
In this work, we address a lack of systematic understanding of fluctuations in food affordability in Canada. Canada's Food Price Report (CFPR) is an annual publication that predicts food inflation over the next calendar year. The published predictions are a collaborative effort between forecasting teams at Canadian universities that each employ their own approach: Dalhousie University, the University of British Columbia, the University of Saskatchewan, and the University of Guelph/Vector Institute. While the University of Guelph/Vector Institute forecasting team has leveraged machine learning (ML) in previous reports, the most recent editions (2024--2025) have also included a human-in-the-loop approach. For the 2025 report, this focus was expanded to evaluate several different data-centric approaches to improve forecast accuracy. In this study, we evaluate how different types of forecasting models perform when estimating food price fluctuations. We also examine the sensitivity of models that curate time series data representing key factors in food pricing.
Synergistic Development of Perovskite Memristors and Algorithms for Robust Analog Computing
Ye, Nanyang, Sun, Qiao, Wang, Yifei, Yang, Liujia, Zhou, Jundong, Wang, Lei, Yang, Guang-Zhong, Wang, Xinbing, Zhou, Chenghu, Ren, Wei, Gu, Leilei, Wu, Huaqiang, Gu, Qinying
Analog computing using non-volatile memristors has emerged as a promising solution for energy-efficient deep learning. New materials, such as perovskite-based memristors, have recently attracted interest due to their cost-effectiveness, energy efficiency and flexibility. Yet, challenges in material diversity and immature fabrication require extensive experimentation for device development. Moreover, significant non-idealities in these memristors often impede their use for computing. Here, we propose a synergistic methodology to concurrently optimize perovskite memristor fabrication and develop robust analog DNNs that effectively address the inherent non-idealities of these memristors. Employing Bayesian optimization (BO) with a focus on usability, we efficiently identify optimal materials and fabrication conditions for perovskite memristors. Meanwhile, we develop "BayesMulti", a DNN training strategy utilizing BO-guided noise injection to improve the resistance of analog DNNs to memristor imperfections. Our approach theoretically ensures that within a certain range of parameter perturbations due to memristor non-idealities, the prediction outcomes remain consistent. Our integrated approach enables the use of analog computing in much deeper and wider networks, significantly outperforming existing methods in diverse tasks like image classification, autonomous driving, species identification, and large vision-language models, achieving up to 100-fold improvements. We further validate our methodology on a 10$\times$10 optimized perovskite memristor crossbar, demonstrating high accuracy in a classification task and low energy consumption. This study offers a versatile solution for efficient optimization of various analog computing systems, encompassing both devices and algorithms.
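The noise-injection idea mentioned in the abstract above can be illustrated with a short sketch. It assumes multiplicative Gaussian noise on the weights of a single linear layer, with a fixed noise scale `sigma` standing in for whatever value Bayesian optimization would select. The abstract does not describe BayesMulti's actual noise model or search procedure, so the function `noisy_linear` and its parameters are hypothetical.

```python
import numpy as np


def noisy_linear(
    x: np.ndarray, weights: np.ndarray, sigma: float, rng: np.random.Generator
) -> np.ndarray:
    """Linear layer whose weights are perturbed by multiplicative Gaussian noise.

    Each weight w becomes w * (1 + eps) with eps ~ N(0, sigma^2), a simple
    stand-in for device-level conductance variation (an assumption here).
    """
    noise = rng.normal(loc=0.0, scale=sigma, size=weights.shape)
    return x @ (weights * (1.0 + noise))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.ones((2, 3))
    W = np.arange(12.0).reshape(3, 4)
    # In training, a fresh noise sample would be drawn on every forward pass.
    print(noisy_linear(x, W, sigma=0.1, rng=rng))
```

Training against perturbed weights of this kind is a common way to make a network tolerate similar perturbations at inference time. Choosing `sigma` is the part the abstract attributes to BO.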