Collaborating Authors

Sharlin, Samiha


Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

arXiv.org Artificial Intelligence

Here, we present the outcomes from the second Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry, which engaged participants across global hybrid locations, resulting in 34 team submissions. The submissions spanned seven key application areas and demonstrated the diverse utility of LLMs for applications in (1) molecular and material property prediction; (2) molecular and material design; (3) automation and novel interfaces; (4) scientific communication and education; (5) research data management and automation; (6) hypothesis generation and evaluation; and (7) knowledge extraction and reasoning from scientific literature. Each team submission is presented in a summary table with links to the code and as brief papers in the appendix. Beyond team results, we discuss the hackathon event and its hybrid format, which included physical hubs in Toronto, Montreal, San Francisco, Berlin, Lausanne, and Tokyo, alongside a global online hub to enable local and virtual collaboration. Overall, the event highlighted significant improvements in LLM capabilities since the previous year's hackathon, suggesting continued expansion of LLMs for applications in materials science and chemistry research. These outcomes demonstrate the dual utility of LLMs as both multipurpose models for diverse machine learning tasks and as platforms for rapidly prototyping custom applications in scientific research.


In Context Learning and Reasoning for Symbolic Regression with Large Language Models

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are transformer-based machine learning models that have shown remarkable performance on tasks for which they were not explicitly trained. Here, we explore the potential of LLMs to perform symbolic regression (SR) -- a machine-learning method for finding simple and accurate equations from datasets. We prompt GPT-4 to suggest expressions from data, which are then optimized and evaluated using external Python tools. These results are fed back to GPT-4, which proposes improved expressions while optimizing for complexity and loss. Using chain-of-thought prompting, we instruct GPT-4 to analyze the data, prior expressions, and the scientific context (expressed in natural language) for each problem before generating new expressions. We evaluated the workflow on the rediscovery of five well-known scientific equations from experimental data and on an additional dataset without a known equation. GPT-4 successfully rediscovered all five equations and, in general, performed better when prompted to use a scratchpad and to consider scientific context. We also demonstrate how strategic prompting improves the model's performance and how the natural language interface simplifies integrating theory with data. Although this approach does not outperform established SR programs where the target equations are more complex, LLMs can nonetheless iterate toward improved solutions while following instructions and incorporating scientific context expressed in natural language.
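To make the propose-fit-feed-back loop concrete, below is a minimal sketch under stated assumptions, not the authors' implementation: `ask_llm` is a hypothetical stand-in for a GPT-4 API call, `x_data` and `y_data` are assumed 1-D NumPy arrays, and the free constants of each proposed expression are fitted with SciPy's `curve_fit`.

```python
# Hypothetical sketch of the iterative LLM symbolic-regression loop.
# `ask_llm` is a placeholder for a GPT-4 chat call; `x_data`/`y_data` are
# assumed 1-D NumPy arrays. This is not the paper's actual code.
import numpy as np
import sympy as sp
from scipy.optimize import curve_fit

def ask_llm(prompt: str) -> str:
    """Stand-in for the GPT-4 call; should return an expression string
    such as 'c0 * x**2 + c1' with free constants named c0, c1, ..."""
    raise NotImplementedError

def fit_and_score(expr_str, x_data, y_data):
    """Optimize the free constants of a proposed expression against the
    data; return the fitted expression and its mean-squared error."""
    x = sp.symbols("x")
    expr = sp.sympify(expr_str)
    consts = sorted(expr.free_symbols - {x}, key=lambda s: s.name)
    f = sp.lambdify([x, *consts], expr, "numpy")
    if consts:
        # curve_fit infers the parameter count from p0 when using *args
        popt, _ = curve_fit(lambda xv, *c: f(xv, *c), x_data, y_data,
                            p0=np.ones(len(consts)))
    else:
        popt = ()
    mse = float(np.mean((f(x_data, *popt) - y_data) ** 2))
    return expr.subs(dict(zip(consts, popt))), mse

history = []  # (expression, loss) pairs shown to the model each round
for _ in range(10):
    prompt = (
        "Scientific context: <natural-language description of the problem>\n"
        f"Previously tried expressions and losses: {history}\n"
        "Reason step by step on a scratchpad, then propose one new "
        "expression in x with free constants c0, c1, ... that balances "
        "low loss against low complexity."
    )
    fitted, loss = fit_and_score(ask_llm(prompt), x_data, y_data)
    history.append((str(fitted), loss))
```

In this division of labor, the LLM only proposes symbolic forms; parsing, constant optimization, and loss evaluation all run in ordinary Python outside the model, matching the external-tools arrangement described in the abstract.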


Incorporating Background Knowledge in Symbolic Regression using a Computer Algebra System

arXiv.org Artificial Intelligence

Since John Koza pioneered the paradigm of programming by means of natural selection, many applications of symbolic regression (SR) in scientific discovery have emerged [1]. Unlike other applications of machine learning techniques, scientific research demands explanation and verification, both of which are made more feasible by the generation of human-interpretable mathematical models (as opposed to fitting a model with thousands of parameters) [2-4]. Furthermore, SR can be effective even with very small datasets (on the order of 10 items), such as those produced by difficult or expensive experiments that are not easily repeated. The mathematical expressions produced by SR can readily be extrapolated to untested or otherwise unreachable domains beyond a dataset (such as extreme pressures or temperatures). For decades, SR has discovered interesting models from data in many applications, including inferring process models at the Dow Chemical Company [5], rainfall-runoff modeling [6], and rediscovering the equations of double-pendulum motion [7]. Symbolic regression has been applied across all scales of scientific investigation, from the atomistic (interatomic potentials [8]) through the macroscopic (computational fluid dynamics [9]) to the cosmological (dark matter overdensity [10]). Some techniques facilitate searches through billions of candidate expressions, such as the space of nonlinear descriptors of material properties [11]. While most applications of SR in science focus on identifying empirical patterns in data, such "data-only" approaches do not account for potential insights from background theory. In fact, some SR works emphasize their capability to make discoveries "without any prior knowledge about physics, kinematics, or geometry" [7].
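As a concrete illustration of the kind of check a computer algebra system makes possible, the hypothetical sketch below uses SymPy to screen a candidate SR expression against two example background-knowledge constraints (the model must vanish at x = 0 and be non-decreasing for x > 0). The constraints are invented for illustration and are not taken from the paper.

```python
# Illustrative only: a computer algebra system (SymPy) screening candidate
# SR expressions against hypothetical background-knowledge constraints.
import sympy as sp

x = sp.symbols("x")

def satisfies_background_knowledge(expr):
    """Example constraints a domain theory might impose: the model must
    vanish at x = 0 and be non-decreasing on the positive reals."""
    vanishes_at_origin = sp.limit(expr, x, 0, dir="+") == 0
    # ask the CAS to prove d(expr)/dx >= 0 whenever x > 0; `ask` returns
    # True, False, or None (undecided), and we treat None as a failure
    non_decreasing = sp.ask(sp.Q.nonnegative(sp.diff(expr, x)),
                            sp.Q.positive(x))
    return vanishes_at_origin and bool(non_decreasing)

print(satisfies_background_knowledge(2 * x**2))   # True: passes both checks
print(satisfies_background_knowledge(sp.cos(x)))  # False: cos(0) != 0
```

A check like this can prune candidates that fit the data but contradict known theory, which is precisely the gap in "data-only" approaches that the passage above identifies.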