Goto

Collaborating Authors

 Materials


Establishing baselines for generative discovery of inorganic crystals

arXiv.org Artificial Intelligence

Generative artificial intelligence offers a promising avenue for materials discovery, yet its advantages over traditional methods remain unclear. In this work, we introduce and benchmark two baseline approaches - random enumeration of charge-balanced prototypes and data-driven ion exchange of known compounds - against three generative models: a variational autoencoder, a large language model, and a diffusion model. Our results show that established methods such as ion exchange perform comparably well in generating stable materials, although many of these materials tend to closely resemble known compounds. In contrast, generative models excel at proposing novel structural frameworks and, when sufficient training data is available, can more effectively target properties such as electronic band gap and bulk modulus while maintaining a high stability rate. To enhance the performance of both the baseline and generative approaches, we implement a post-generation screening step in which all proposed structures are passed through stability and property filters from pre-trained machine learning models including universal interatomic potentials. This low-cost filtering step leads to substantial improvement in the success rates of all methods, remains computationally efficient, and ultimately provides a practical pathway toward more effective generative strategies for materials discovery.


QuantumBind-RBFE: Accurate Relative Binding Free Energy Calculations Using Neural Network Potentials

arXiv.org Artificial Intelligence

Accurate prediction of protein-ligand binding affinities is crucial in drug discovery, particularly during hit-to-lead and lead optimization phases, however, limitations in ligand force fields continue to impact prediction accuracy. In this work, we validate relative binding free energy (RBFE) accuracy using neural network potentials (NNPs) for the ligands. We utilize a novel NNP model, AceForce 1.0, based on the TensorNet architecture for small molecules that broadens the applicability to diverse drug-like compounds, including all important chemical elements and supporting charged molecules. Using established benchmarks, we show overall improved accuracy and correlation in binding affinity predictions compared with GAFF2 for molecular mechanics and ANI2-x for NNPs. Slightly less accuracy but comparable correlations with OPLS4. We also show that we can run the NNP simulations at 2 fs timestep, at least two times larger than previous NNP models, providing significant speed gains. The results show promise for further evolutions of free energy calculations using NNPs while demonstrating its practical use already with the current generation. The code and NNP model are publicly available for research use.


Learning Chemical Reaction Representation with Reactant-Product Alignment

arXiv.org Artificial Intelligence

Organic synthesis stands as a cornerstone of the chemical industry. The development of robust machine learning models to support tasks associated with organic reactions is of significant interest. However, current methods rely on hand-crafted features or direct adaptations of model architectures from other domains, which lack feasibility as data scales increase or ignore the rich chemical information inherent in reactions. To address these issues, this paper introduces RAlign, a novel chemical reaction representation learning model for various organic reaction-related tasks. By integrating atomic correspondence between reactants and products, our model discerns the molecular transformations that occur during the reaction, thereby enhancing comprehension of the reaction mechanism. We have designed an adapter structure to incorporate reaction conditions into the chemical reaction representation, allowing the model to handle various reaction conditions and to adapt to various datasets and downstream tasks. Additionally, we introduce a reaction-center-aware attention mechanism that enables the model to concentrate on key functional groups, thereby generating potent representations for chemical reactions. Our model has been evaluated on a range of downstream tasks. Experimental results indicate that our model markedly outperforms existing chemical reaction representation learning architectures on most of the datasets. We plan to open-source the code contingent upon the acceptance of the paper.


Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

arXiv.org Artificial Intelligence

Here, we present the outcomes from the second Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry, which engaged participants across global hybrid locations, resulting in 34 team submissions. The submissions spanned seven key application areas and demonstrated the diverse utility of LLMs for applications in (1) molecular and material property prediction; (2) molecular and material design; (3) automation and novel interfaces; (4) scientific communication and education; (5) research data management and automation; (6) hypothesis generation and evaluation; and (7) knowledge extraction and reasoning from scientific literature. Each team submission is presented in a summary table with links to the code and as brief papers in the appendix. Beyond team results, we discuss the hackathon event and its hybrid format, which included physical hubs in Toronto, Montreal, San Francisco, Berlin, Lausanne, and Tokyo, alongside a global online hub to enable local and virtual collaboration. Overall, the event highlighted significant improvements in LLM capabilities since the previous year's hackathon, suggesting continued expansion of LLMs for applications in materials science and chemistry research. These outcomes demonstrate the dual utility of LLMs as both multipurpose models for diverse machine learning tasks and platforms for rapid prototyping custom applications in scientific research.


AIhub monthly digest: December 2024 – attending NeurIPS, multi-agent path finding, and tackling illegal mining

AIHub

Welcome to our monthly digest, where you can catch up with any AIhub stories you may have missed, peruse the latest news, recap recent events, and more. This month, we look back at our week attending NeurIPS, hear about work localising illegal mining sites using machine learning and geospatial data, and discover how a group of agents can minimise their journey length whilst avoiding collisions. We were lucky enough to attend the thirty-eighth Conference on Neural Information Processing Systems (NeurIPS 2024) which took place in Vancouver, Canada, from Tuesday 10 December to Sunday 15 December. On the first day of the event we held a session on science communication for AI researchers. It was great to see so many people there, and so many thoughtful questions following our presentation.


Intelligent Approaches to Predictive Analytics in Occupational Health and Safety in India

arXiv.org Artificial Intelligence

Concerns associated with occupational health and safety (OHS) remain critical and often under-addressed aspects of workforce management. This is especially true for high-risk industries such as manufacturing, construction, and mining. Such industries dominate the economy of India which is a developing country with a vast informal sector. Regulatory frameworks have been strengthened over the decades, particularly with regards to bringing the unorganized sector within the purview of law. Traditional approaches to OHS have largely been reactive and rely on post-incident analysis (which is curative) rather than preventive intervention. This paper portrays the immense potential of predictive analytics in rejuvenating OHS practices in India. Intelligent predictive analytics is driven by approaches like machine learning and statistical modeling. Its data-driven nature serves to overcome the limitations of conventional OHS methods. Predictive analytics approaches to OHS in India draw on global case studies and generative applications of predictive analytics in OHS which are customized to Indian industrial contexts. This paper attempts to explore in what ways it exhibits the potential to address challenges such as fragmented data ecosystems, resource constraints, and the variability of workplace hazards. The paper presents actionable policy recommendations to create conditions conducive to the widespread implementation of predictive analytics, which must be advocated as a cornerstone of OHS strategy. In doing so, the paper aims to spark a collaborational dialogue among policymakers, industry leaders, and technologists. It urges a shift towards intelligent practices to safeguard the well-being of India's workforce.


TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment

arXiv.org Artificial Intelligence

Training language models (LMs) and their application agents is increasingly costly due to large datasets and models, making test failures difficult to bear. Simplified language environments serve as primordial training and testing grounds, retaining essential commonsense and communication skills but in a more digestible form, potentially enhancing the learning efficiency of LMs, and thus reducing the required model size and data volume for effective training and evaluation. In these simplified language environments, workable strategies for small models, datasets, and agents may be adaptable to larger models, datasets, and agents in complex language environments. To create such environments, we focus on two aspects: i) minimizing language dataset noise and complexity, and ii) preserving the essential text distribution characteristics. Unlike previous methods, we propose a pipeline to refine text data by eliminating noise, minimizing vocabulary, and maintaining genre-specific patterns (e.g., for books, conversation, code, etc.). Implementing this pipeline with large LMs, we have created a leaner suite of LM training and evaluation datasets: 71M Leaner-Pretrain, 7M Leaner-Instruct, Leaner-Glue for assessing linguistic proficiency, and Leaner-Eval for testing instruction-following ability. Our experiments show that leaner pre-training boosts LM learning efficiency. Tiny LMs trained on these datasets outperform those trained on original datasets in instruction-following across different language granularity levels. Moreover, the Leaner-Pretrain dataset's alignment with conventional large LM training sets enables resource-optimized analysis of how learning objectives, model architectures, and training techniques impact performance on language modeling and downstream tasks. Our code and datasets are available at https://github.com/EmpathYang/TinyHelen.git.


Exploiting Boundary Loss for the Hierarchical Panoptic Segmentation of Plants and Leaves

arXiv.org Artificial Intelligence

Precision agriculture leverages data and machine learning so that farmers can monitor their crops and target interventions precisely. This enables the precision application of herbicide only to weeds, or the precision application of fertilizer only to undernourished crops, rather than to the entire field. The approach promises to maximize yields while minimizing resource use and harm to the surrounding environment. To this end, we propose a hierarchical panoptic segmentation method that simultaneously determines leaf count (as an identifier of plant growth)and locates weeds within an image. In particular, our approach aims to improve the segmentation of smaller instances like the leaves and weeds by incorporating focal loss and boundary loss. Not only does this result in competitive performance, achieving a PQ+ of 81.89 on the standard training set, but we also demonstrate we can improve leaf-counting accuracy with our method. The code is available at https://github.com/madeleinedarbyshire/HierarchicalMask2Former.


DropMicroFluidAgents (DMFAs): Autonomous Droplet Microfluidic Research Framework Through Large Language Model Agents

arXiv.org Artificial Intelligence

Applying Large language models (LLMs) within specific domains requires substantial adaptation to account for the unique terminologies, nuances, and context-specific challenges inherent to those areas. Here, we introduce DropMicroFluidAgents (DMFAs), an advanced language-driven framework leveraging state-of-the-art pre-trained LLMs. DMFAs employs LLM agents to perform two key functions: (1) delivering focused guidance, answers, and suggestions specific to droplet microfluidics and (2) generating machine learning models to optimise and automate the design of droplet microfluidic devices, including the creation of code-based computer-aided design (CAD) scripts to enable rapid and precise design execution. Experimental evaluations demonstrated that the integration of DMFAs with the LLAMA3.1 model yielded the highest accuracy of 76.15%, underscoring the significant performance enhancement provided by agent integration. This effect was particularly pronounced when DMFAs were paired with the GEMMA2 model, resulting in a 34.47% improvement in accuracy compared to the standalone GEMMA2 configuration. This study demonstrates the effective use of LLM agents in droplet microfluidics research as powerful tools for automating workflows, synthesising knowledge, optimising designs, and interacting with external systems. These capabilities enable their application across education and industrial support, driving greater efficiency in scientific discovery and innovation.


Extracting effective solutions hidden in large language models via generated comprehensive specialists: case studies in developing electronic devices

arXiv.org Artificial Intelligence

Recently, many studies have increasingly explored the use of large language models (LLMs) to generate research ideas and scientific hypotheses. However, real-world research and development often require solving complex, interdisciplinary challenges where solutions may not be readily found through existing knowledge related to the problem. Therefore, it is desirable to leverage the vast, comprehensive knowledge of LLMs to generate effective, breakthrough solutions by integrating various perspectives from other disciplines. Here, we propose SELLM (Solution Enumeration via comprehensive List and LLM), a framework leveraging LLMs and structured guidance using MECE (Mutually Exclusive, Collectively Exhaustive) principles, such as International Patent Classification (IPC) and the periodic table of elements. SELLM systematically constructs comprehensive expert agents from the list to generate cross-disciplinary and effective solutions. To evaluate SELLM's practicality, we applied it to two challenges: improving light extraction in organic light-emitting diode (OLED) lighting and developing electrodes for next-generation memory materials. The results demonstrate that SELLM significantly facilitates the generation of effective solutions compared to cases without specific customization or effort, showcasing the potential of SELLM to enable LLMs to generate effective solutions even for challenging problems.