Materials
Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models
Wadell, Alexius, Bhutani, Anoushka, Viswanathan, Venkatasubramanian
Molecular Foundation Models are emerging as powerful tools for accelerating molecular design, materials science, and cheminformatics, leveraging transformer architectures to speed up the discovery of new materials and drugs while reducing the computational cost of traditional ab initio methods. However, current models are constrained by closed-vocabulary tokenizers that fail to capture the full diversity of molecular structures. In this work, we systematically evaluate thirteen chemistry-specific tokenizers for their coverage of the SMILES language, uncovering substantial gaps. Using N-gram language models, we assessed the impact of tokenizer choice on model performance and quantified the information loss of unknown tokens. We introduce two new tokenizers, smirk and smirk-gpe, which can represent the entirety of the OpenSMILES specification while avoiding the pitfalls of existing tokenizers. Our work highlights the importance of open-vocabulary modeling for molecular foundation models and the need for chemically diverse benchmarks for cheminformatics.
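The coverage gap described in the abstract above can be illustrated with a minimal sketch. It assumes a simplified atom-wise regular expression and a deliberately tiny, hypothetical vocabulary; neither is taken from the paper or from any of the evaluated tokenizers. With a closed vocabulary, every bracketed atom missing from the vocabulary maps to the same unknown token, so chemically distinct inputs can collapse to identical token sequences.

```python
import re

# Simplified atom-wise SMILES pattern (an assumption for illustration, not the
# smirk tokenizer): bracket atoms, two-letter halogens, single-letter atoms,
# bonds, branches, and ring-closure digits. Unmatched characters are skipped.
SMILES_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|[=#\-+()/\\.@%]|\d"
)

# Hypothetical closed vocabulary; real tokenizers ship much larger ones.
VOCAB = {"C", "N", "O", "c", "n", "o", "(", ")", "=", "1", "2", "Cl", "[NH4+]"}
UNK = "[UNK]"


def tokenize(smiles):
    """Split a SMILES string and replace out-of-vocabulary tokens with UNK."""
    tokens = SMILES_PATTERN.findall(smiles)
    return [tok if tok in VOCAB else UNK for tok in tokens]


def unknown_fraction(smiles_list):
    """Fraction of all emitted tokens that are UNK across a list of SMILES."""
    tokens = [tok for s in smiles_list for tok in tokenize(s)]
    return sum(tok == UNK for tok in tokens) / len(tokens)


# Two different metal ions become indistinguishable after tokenization.
ferrous = tokenize("[Fe+2]")
cupric = tokenize("[Cu+2]")
```

Here `ferrous` and `cupric` are both a single unknown token, which is the kind of information loss the abstract refers to. An open-vocabulary design avoids this by never needing the unknown token.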
A Machine Learning-Driven Wireless System for Structural Health Monitoring
Pop, Marius, Tudose, Mihai, Visan, Daniel, Bocioaga, Mircea, Botan, Mihai, Banu, Cesar, Salaoru, Tiberiu
The paper presents a wireless system integrated with a machine learning (ML) model for structural health monitoring (SHM) of carbon fiber reinforced polymer (CFRP) structures, primarily targeting aerospace applications. The system collects data via carbon nanotube (CNT) piezoresistive sensors embedded within CFRP coupons, wirelessly transmitting these data to a central server for processing. A deep neural network (DNN) model predicts mechanical properties and can be extended to forecast structural failures, facilitating proactive maintenance and enhancing safety. The modular design supports scalability and can be embedded within digital twin frameworks, offering significant benefits to aircraft operators and manufacturers. The system utilizes an ML model with a mean absolute error (MAE) of 0.14 on test data for forecasting mechanical properties. Data transmission latency throughout the entire system is less than one second in a LAN setup, highlighting its potential for real-time monitoring applications in aerospace and other industries. However, while the system shows promise, challenges such as sensor reliability under extreme environmental conditions and the need for advanced ML models to handle diverse data streams have been identified as areas for future research.
Navigating Process Mining: A Case study using pm4py
Process-mining techniques have emerged as powerful tools for analyzing event data to gain insights into business processes. In this paper, we present a comprehensive analysis of road traffic fine management processes using the pm4py library in Python. We start by importing an event log dataset and explore its characteristics, including the distribution of activities and process variants. Through filtering and statistical analysis, we uncover key patterns and variations in the process executions. Subsequently, we apply various process-mining algorithms, including the Alpha Miner, Inductive Miner, and Heuristic Miner, to discover process models from the event log data. We visualize the discovered models to understand the workflow structures and dependencies within the process. Additionally, we discuss the strengths and limitations of each mining approach in capturing the underlying process dynamics. Our findings shed light on the efficiency and effectiveness of road traffic fine management processes, providing valuable insights for process optimization and decision-making. This study demonstrates the utility of pm4py in facilitating process mining tasks and its potential for analyzing real-world business processes.
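As a rough illustration of the exploratory steps mentioned above (process variants and the dependencies between activities), the following minimal sketch computes variant counts and a directly-follows relation in plain Python. It does not use the pm4py API, and the toy event log, its activity names, and the function names are illustrative assumptions rather than data or code from the study. Discovery algorithms such as the Alpha Miner and Heuristic Miner start from directly-follows relations of this kind.

```python
from collections import Counter, defaultdict

# Toy event log: (case id, activity, timestamp). Values are illustrative only.
EVENTS = [
    ("c1", "Create Fine", 1), ("c1", "Send Fine", 2), ("c1", "Payment", 3),
    ("c2", "Create Fine", 1), ("c2", "Payment", 2),
    ("c3", "Create Fine", 1), ("c3", "Send Fine", 2), ("c3", "Payment", 3),
]


def traces(events):
    """Group events by case and order each case by timestamp."""
    by_case = defaultdict(list)
    for case_id, activity, timestamp in events:
        by_case[case_id].append((timestamp, activity))
    return {
        case: tuple(activity for _, activity in sorted(case_events))
        for case, case_events in by_case.items()
    }


def variants(events):
    """Count how many cases follow each distinct activity sequence."""
    return Counter(traces(events).values())


def directly_follows(events):
    """Count how often activity a is immediately followed by activity b."""
    dfg = Counter()
    for trace in traces(events).values():
        dfg.update(zip(trace, trace[1:]))
    return dfg
```

Calling `variants(EVENTS)` shows two cases following the full three-step sequence and one case paying immediately, while `directly_follows(EVENTS)` gives the edge counts from which a process graph can be drawn.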
Advancing Towards a Marine Digital Twin Platform: Modeling the Mar Menor Coastal Lagoon Ecosystem in the South Western Mediterranean
Ye, Yu, González-Vidal, Aurora, Cisterna-García, Alejandro, Pérez-Ruzafa, Angel, Izquierdo, Miguel A. Zamora, Skarmeta, Antonio F.
Oceans are vital for sustaining life on Earth and they contribute substantially to global food sources, oxygen production, and carbon dioxide absorption (Riebesell et al., 2009). Marine environments suffer from numerous sources of stress, mostly from human activities in coastal areas, urban, agricultural, and industrial discharges, habitat destruction, introduction of invasive species, and oil spills, which interact synergistically with the consequences of climate change. In addition to classic pollutants, such as heavy metals or pesticides, with a long tradition in human activities such as mining, industry, or agriculture, new emerging pollutants are continually appearing, derived from drugs or cosmetics, whose effects on health are not always [...] require continuous monitoring of various indicators to detect or alert us to changes. Current observational deployments are often restricted to the ocean surface and a few measurable variables and there are limited tools to process the data and extract useful knowledge. This underscores the need for advanced modeling techniques to bridge gaps in our comprehension and to allow intelligent action-taking. But, more importantly, the mere detection of problems may not be sufficient since, on the one hand, the homeorhetic mechanisms of biological systems may mask such indicators until it is too late and, on the other hand, the speed of ecosystem deterioration is often greater than the human capacity to take corrective and management measures.
Stretchable Arduinos embedded in soft robots
Woodman, Stephanie J., Shah, Dylan S., Landesberg, Melanie, Agrawala, Anjali, Kramer-Bottiglio, Rebecca
To achieve real-world functionality, robots must have the ability to carry out decision-making computations. However, soft robots stretch and therefore need a solution other than rigid computers. Examples of embedding computing capacity into soft robots currently include appending rigid printed circuit boards (PCBs) to the robot, integrating soft logic gates, and exploiting material responses for material-embedded computation. Although promising, these approaches introduce limitations such as rigidity, tethers, or low logic gate density. The field of stretchable electronics has sought to solve these challenges, but a complete pipeline for direct integration of single-board computers, microcontrollers, and other complex circuitry into soft robots has remained elusive. We present a generalized method to translate any complex two-layer circuit into a soft, stretchable form. This enabled the creation of stretchable single-board microcontrollers (including Arduinos) and other commercial circuits (including Sparkfun circuits), without design simplifications. As demonstrations of the method's utility, we embed highly stretchable (>300% strain) Arduino Pro Minis into the bodies of multiple soft robots. This makes use of otherwise inert structural material, fulfilling the promise of the stretchable electronics field to integrate state-of-the-art computational power into robust, stretchable systems during active use.
Uncovering the Mechanism of Hepatotoxicity of PFAS Targeting L-FABP Using GCN and Computational Modeling
Jividen, Lucas, Duran, Tibo, Niu, Xi-Zhi, Bai, Jun
Per- and polyfluoroalkyl substances (PFAS) are persistent environmental pollutants with known toxicity and bioaccumulation issues. Their widespread industrial use and resistance to degradation have led to global environmental contamination and significant health concerns. While a minority of PFAS have been extensively studied, the toxicity of many PFAS remains poorly understood due to limited direct toxicological data. This study advances the predictive modeling of PFAS toxicity by combining semi-supervised graph convolutional networks (GCNs) with molecular descriptors and fingerprints. We propose a novel approach to enhance the prediction of PFAS binding affinities: molecular fingerprints are isolated to construct the graphs, and descriptors are then set as the node features. This approach specifically captures the structural, physicochemical, and topological features of PFAS without overfitting due to an abundance of features. Unsupervised clustering then identifies representative compounds for detailed binding studies. Our results enable more accurate estimation of PFAS hepatotoxicity, providing guidance for the chemical discovery of new PFAS and the development of new safety regulations.
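One way to picture the split between fingerprints (graph structure) and descriptors (node features) is the minimal sketch below. It rests on assumptions the abstract does not confirm: compounds are treated as nodes, an edge is drawn when the Tanimoto similarity of two fingerprints reaches a hypothetical threshold, and a single weight-free GCN aggregation step stands in for a trained network. The fingerprints and descriptor values are toy data.

```python
import math


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def build_adjacency(fingerprints, threshold=0.5):
    """Connect compounds whose fingerprint similarity reaches the threshold
    (the threshold rule is an assumption made for this sketch)."""
    n = len(fingerprints)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
                adj[i][j] = adj[j][i] = 1.0
    return adj


def gcn_propagate(adj, features):
    """One weight-free GCN aggregation: D^-1/2 (A + I) D^-1/2 X."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    out = []
    for i in range(n):
        row = [0.0] * len(features[0])
        for j in range(n):
            weight = a_hat[i][j] / math.sqrt(deg[i] * deg[j])
            for k, value in enumerate(features[j]):
                row[k] += weight * value
        out.append(row)
    return out


# Toy inputs: on-bit sets stand in for fingerprints, 2-D vectors for descriptors.
FINGERPRINTS = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
DESCRIPTORS = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]

ADJ = build_adjacency(FINGERPRINTS)
HIDDEN = gcn_propagate(ADJ, DESCRIPTORS)
```

In this toy case the first two compounds are linked and their descriptor vectors are averaged, while the third compound has no neighbors and keeps its own features. Keeping fingerprints out of the node features is what limits the feature count per node.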
How to do impactful research in artificial intelligence for chemistry and materials science
Cheng, Austin, Ser, Cher Tian, Skreta, Marta, Guzmán-Cordero, Andrés, Thiede, Luca, Burger, Andreas, Aldossary, Abdulrahman, Leong, Shi Xuan, Pablo-García, Sergio, Strieth-Kalthoff, Felix, Aspuru-Guzik, Alán
Machine learning (ML) has been applied in many facets of chemistry, and its use is rapidly growing. We argue in this perspective that despite this dramatic growth and impact, ML could be employed better and more extensively. Current work is still far from exhausting the potential of ML to advance theory and application in chemistry in terms of breadth, depth, and scale. In addition, the actual types of problems that ML could tackle, such as hypothesis generation or enabling internalized scientific understanding, are still areas of active research or open problems.
LLM-DER: A Named Entity Recognition Method Based on Large Language Models for Chinese Coal Chemical Domain
Xiao, Le, Xu, Yunfei, Zhao, Jing
Domain-specific Named Entity Recognition (NER), whose goal is to recognize domain-specific entities and their categories, provides important support for constructing domain knowledge graphs. Deep learning-based methods are widely used and effective in NER tasks, but they rely on large-scale labeled data, so the scarcity of labeled data in a specific domain limits their application. Many studies have therefore introduced few-shot methods and achieved some results. However, entity structures in specific domains are often complex, and current few-shot methods adapt poorly to NER tasks with complex features. Taking the Chinese coal chemical industry domain as an example, there exist complex structures in which multiple entities share a single entity, as well as multiple relationships for the same pair of entities, both of which hamper NER under few-sample conditions. In this paper, we propose LLM-DER, a Large Language Model (LLM)-based entity recognition framework for domain-specific entity recognition in Chinese. It enriches entity information by using LLMs to generate a list of relationships containing entity types, and it applies a plausibility and consistency evaluation method to remove misrecognized entities, which effectively addresses the recognition of entities with complex structure in a specific domain. Experimental results on the Resume dataset and the self-constructed coal chemical dataset Coal show that LLM-DER performs outstandingly in domain-specific entity recognition, not only outperforming the existing GPT-3.5-turbo baseline but also exceeding the fully-supervised baseline, verifying its effectiveness in entity recognition.
A Comparative Study of Open Source Computer Vision Models for Application on Small Data: The Case of CFRP Tape Laying
Fraunholz, Thomas, Rall, Dennis, Köhler, Tim, Schuster, Alfons, Mayer, Monika, Larsen, Lars
In the realm of industrial manufacturing, Artificial Intelligence (AI) is playing an increasing role, from automating existing processes to aiding in the development of new materials and techniques. However, a significant challenge arises in smaller, experimental processes characterized by limited training data availability, which calls into question whether AI models can be trained in such small-data contexts. In this work, we explore the potential of Transfer Learning to address this challenge, specifically investigating the minimum amount of data required to develop a functional AI model. For this purpose, we consider the use case of quality control of Carbon Fiber Reinforced Polymer (CFRP) tape laying in aerospace manufacturing using optical sensors. We investigate the behavior of different open-source computer vision models with a continuous reduction of the training data. Our results show that the amount of data required to successfully train an AI model can be drastically reduced, and the use of smaller models does not necessarily lead to a loss of performance.
A clustering adaptive Gaussian process regression method: response patterns based real-time prediction for nonlinear solid mechanics problems
Li, Ming-Jian, Lian, Yanping, Cheng, Zhanshan, Li, Lehui, Wang, Zhidong, Gao, Ruxin, Fang, Daining
Numerical simulation is a powerful tool for studying nonlinear solid mechanics problems. However, mesh-based and particle-based numerical methods share the shortcoming of being time-consuming, particularly for complex problems with real-time analysis requirements. This study presents a clustering adaptive Gaussian process regression (CAG) method aimed at real-time prediction of nonlinear structural responses in solid mechanics. It is a data-driven machine learning method featuring a small sample size, high accuracy, and high efficiency, leveraging nonlinear structural response patterns. Similar to the traditional Gaussian process regression (GPR) method, it operates in offline and online stages. In the offline stage, an adaptive sample generation technique is introduced to cluster datasets into distinct patterns for demand-driven sample allocation. This ensures comprehensive coverage of the critical samples for the solution space of interest. In the online stage, following the divide-and-conquer strategy, a pre-prediction classification assigns each problem to one of the predefined patterns, and the response is then predicted by the trained multi-pattern Gaussian process regressor. In addition, dimension reduction and restoration techniques are employed in the proposed method to enhance its efficiency. A set of problems involving material, geometric, and boundary condition nonlinearities is presented to demonstrate the CAG method's abilities. Within the context of this study, the proposed method can offer predictions within a second and attain high precision with only about 20 samples, outperforming the traditional GPR using uniformly distributed samples with error reductions ranging from 1 to 3 orders of magnitude. The CAG method is expected to offer a powerful tool for real-time prediction of nonlinear solid mechanics problems and to shed light on complex nonlinear structural response patterns.
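The offline/online split described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the CAG method itself: pattern labels are assigned by hand rather than by adaptive clustering, the pre-prediction classifier is a nearest-centroid rule, the kernel is a fixed squared-exponential, and dimension reduction is omitted. All names (`rbf`, `GP`, `fit_patterns`, `predict`) and the toy response are hypothetical.

```python
import numpy as np


def rbf(a, b, length=0.1):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length) ** 2)


class GP:
    """Zero-mean GP regressor (posterior mean only) with a small noise jitter."""

    def __init__(self, x, y, noise=1e-8):
        self.x = x
        self.alpha = np.linalg.solve(rbf(x, x) + noise * np.eye(len(x)), y)

    def predict(self, xq):
        return rbf(xq, self.x) @ self.alpha


def fit_patterns(x, y, labels):
    """Offline stage: one GP and one input centroid per response pattern."""
    models = {}
    for label in np.unique(labels):
        mask = labels == label
        models[int(label)] = (x[mask].mean(), GP(x[mask], y[mask]))
    return models


def predict(models, xq):
    """Online stage: classify each query by nearest centroid, then predict
    with the regressor trained for that pattern."""
    out = np.empty(len(xq))
    for i, q in enumerate(xq):
        label = min(models, key=lambda k: abs(q - models[k][0]))
        out[i] = models[label][1].predict(np.array([q]))[0]
    return out


# Toy response with a kink at x = 0: two patterns, labelled by hand here.
X = np.linspace(-1.0, 1.0, 20)
Y = np.where(X < 0, -X, 3.0 * X)
LABELS = (X >= 0).astype(int)

MODELS = fit_patterns(X, Y, LABELS)
PRED = predict(MODELS, X)
```

Fitting one regressor per pattern keeps each model on a smooth piece of the response, which is the intuition behind the divide-and-conquer strategy. A single GP fitted across the kink would have to compromise between the two regimes.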