Materials
Coffee producers worldwide grapple with new environmental laws aimed at protecting forests
Le Van Tam is no stranger to how the vagaries of global trade can determine the fortunes of small coffee farmers like him. He first planted coffee in a patch of land outside Buon Ma Thuot city in Vietnam's Central Highland region in 1995. For years, his focus was on quantity, not quality. Tam used ample amounts of fertilizer and pesticides to boost his yields, and global prices determined how well he did.
Manufacturing Service Capability Prediction with Graph Neural Networks
Li, Yunqing, Liu, Xiaorui, Starly, Binil
In the current landscape, the predominant methods for identifying manufacturing capabilities from manufacturers rely heavily on keyword matching and semantic matching. However, these methods often fall short by either overlooking valuable hidden information or misinterpreting critical data. Consequently, such approaches result in an incomplete identification of manufacturers' capabilities. This underscores the pressing need for data-driven solutions to enhance the accuracy and completeness of manufacturing capability identification. To address this need, this study proposes a Graph Neural Network-based method for manufacturing service capability identification over a knowledge graph. To enhance the identification performance, this work introduces a novel approach that involves aggregating information from the graph nodes' neighborhoods as well as oversampling the graph data, which can be effectively applied across a wide range of practical scenarios. Evaluations conducted on a Manufacturing Service Knowledge Graph and subsequent ablation studies demonstrate the efficacy and robustness of the proposed approach. This study not only contributes an innovative method for inferring manufacturing service capabilities but also significantly augments the quality of Manufacturing Service Knowledge Graphs.
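The two ingredients named in the abstract, neighborhood aggregation and oversampling, can be illustrated with a minimal sketch. The abstract does not specify the aggregation function or the oversampling scheme, so the version below assumes plain mean aggregation and random oversampling of minority classes; the node names, feature vectors, and capability labels are hypothetical toy data, not taken from the paper.

```python
import random
from collections import Counter


def mean_aggregate(features, neighbors):
    """One aggregation round: average each node's vector with its neighbors' vectors."""
    out = {}
    for node, vec in features.items():
        group = [vec] + [features[n] for n in neighbors.get(node, [])]
        out[node] = [sum(col) / len(group) for col in zip(*group)]
    return out


def oversample(examples, labels, seed=0):
    """Random oversampling: duplicate minority-class examples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    xs, ys = list(examples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(examples, labels) if y == cls]
        for _ in range(target - n):
            xs.append(rng.choice(pool))
            ys.append(cls)
    return xs, ys


# Hypothetical toy graph of four manufacturers (assumed data).
features = {"m1": [1.0, 0.0], "m2": [0.0, 1.0], "m3": [1.0, 1.0], "m4": [0.0, 0.0]}
neighbors = {"m1": ["m2"], "m2": ["m1", "m3"], "m3": ["m2", "m4"], "m4": ["m3"]}
aggregated = mean_aggregate(features, neighbors)

# Imbalanced capability labels: three "milling" nodes, one "casting" node.
nodes = ["m1", "m2", "m3", "m4"]
labels = ["milling", "milling", "milling", "casting"]
balanced_nodes, balanced_labels = oversample(nodes, labels)
print(aggregated["m1"], Counter(balanced_labels))
```

In a trained GNN the aggregation step would be followed by learned transformations; the sketch only shows how neighbor information flows into a node and how a rare capability class can be rebalanced before training.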
Is There a One-Model-Fits-All Approach to Information Extraction? Revisiting Task Definition Biases
Huang, Wenhao, He, Qianyu, Li, Zhixu, Liang, Jiaqing, Xiao, Yanghua
Definition bias is a negative phenomenon that can mislead models. Definition bias in information extraction appears not only across datasets from different domains but also within datasets sharing the same domain. We identify two types of definition bias in IE: bias among information extraction datasets and bias between information extraction datasets and instruction tuning datasets. To systematically investigate definition bias, we conduct three probing experiments to quantitatively analyze it and discover the limitations of unified information extraction and large language models in solving definition bias. To mitigate definition bias in information extraction, we propose a multi-stage framework consisting of definition bias measurement, bias-aware fine-tuning, and task-specific bias mitigation. Experimental results demonstrate the effectiveness of our framework in addressing definition bias. Resources of this paper can be found at https://github.com/EZ-hwh/definition-bias
Ghost Sentence: A Tool for Everyday Users to Copyright Data from Large Language Models
Zhao, Shuai, Zhu, Linchao, Quan, Ruijie, Yang, Yi
Web user data plays a central role in the ecosystem of pre-trained large language models (LLMs) and their fine-tuned variants. Billions of data are crawled from the web and fed to LLMs. How can everyday web users confirm if LLMs misuse their data without permission? In this work, we suggest that users repeatedly insert personal passphrases into their documents, enabling LLMs to memorize them. Once these concealed passphrases in user documents, referred to as ghost sentences, are identified in the generated content of LLMs, users can be sure that their data is used for training. To explore the effectiveness and usage of this copyrighting tool, we define the user training data identification task with ghost sentences. Multiple datasets from various sources at different scales are created and tested with LLMs of different sizes. For evaluation, we introduce a last-k-words verification method along with two metrics: document and user identification accuracy. In the specific case of instruction tuning of a 3B LLaMA model, 11 out of 16 users with ghost sentences identify their data within the generation content. These 16 users contribute 383 examples to ~1.8M training documents. For continuing pre-training of a 1.1B TinyLlama model, 61 out of 64 users with ghost sentences identify their data within the LLM output. These 64 users contribute 1156 examples to ~10M training documents.
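The insertion and verification steps can be sketched as follows. This is one plausible reading of the last k words check, assuming the model is prompted with the ghost sentence minus its final k words and the completion is compared against those k words; the function names, the passphrase, and the stub models are hypothetical and stand in for a real LLM.

```python
def insert_ghost(paragraphs, ghost, every=2):
    """Return a copy of the document with the ghost sentence added after every `every` paragraphs."""
    out = []
    for i, paragraph in enumerate(paragraphs, start=1):
        out.append(paragraph)
        if i % every == 0:
            out.append(ghost)
    return out


def verify_last_k(ghost, k, generate):
    """Prompt with the ghost sentence minus its last k words; pass if the completion starts with them."""
    words = ghost.split()
    prompt, expected = " ".join(words[:-k]), words[-k:]
    completion = generate(prompt).split()
    return completion[:k] == expected


# Hypothetical passphrase chosen by a user (assumed example).
ghost = "ancient lantern whispers beneath purple copper orchard"
document = insert_ghost(["p1", "p2", "p3", "p4"], ghost, every=2)


def memorizing_model(prompt):
    # Stub standing in for an LLM that was trained on the user's documents.
    return "copper orchard and more" if prompt.endswith("beneath purple") else "unrelated text"


def forgetful_model(prompt):
    # Stub standing in for an LLM that never saw the passphrase.
    return "something else entirely"


seen = verify_last_k(ghost, 2, memorizing_model)
unseen = verify_last_k(ghost, 2, forgetful_model)
print(document.count(ghost), seen, unseen)
```

A larger k makes an accidental match less likely, at the cost of requiring the model to have memorized more of the passphrase.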
PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents
Zhang, Nan, Heaton, Connor, Okonsky, Sean Timothy, Mitra, Prasenjit, Toraman, Hilal Ezgi
Optical Character Recognition (OCR) is an established task with the objective of identifying the text present in an image. While many off-the-shelf OCR models exist, they are often trained for either scientific (e.g., formulae) or generic printed English text. Extracting text from chemistry publications requires an OCR model that is capable in both realms. Nougat, a recent tool, exhibits strong ability to parse academic documents, but is unable to parse tables in PubMed articles, which comprises a significant part of the academic community and is the focus of this work. To mitigate this gap, we present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records, and evaluate the efficacy of transformer-based OCR models when trained on this resource. Given that real-world records contain artifacts not present in synthetic records, we propose transformations that mimic such qualities. We perform a suite of experiments to explore the impact of patch size, multi-domain training, and our proposed transformations, ultimately finding that models with a small patch size trained on multiple domains using the proposed transformations yield the best performance. Our dataset and code are available at https://github.com/ZN1010/PEaCE.
Varroa destructor detection on honey bees using hyperspectral imagery
Duma, Zina-Sabrina, Zemcik, Tomas, Bilik, Simon, Sihvonen, Tuomas, Honec, Peter, Reinikainen, Satu-Pia, Horak, Karel
Hyperspectral (HS) imagery in agriculture is becoming increasingly common. These images have the advantage of higher spectral resolution. Advanced spectral processing techniques are required to unlock the information potential in these HS images. The present paper introduces a method rooted in multivariate statistics designed to detect parasitic Varroa destructor mites on the body of western honey bee Apis mellifera, enabling easier and continuous monitoring of the bee hives. The methodology explores unsupervised (K-means++) and recently developed supervised (Kernel Flows - Partial Least-Squares, KF-PLS) methods for parasitic identification. Additionally, in light of the emergence of custom-band multispectral cameras, the present research outlines a strategy for identifying the specific wavelengths necessary for effective bee-mite separation, suitable for implementation in a custom-band camera. Illustrated with a real-case dataset, our findings demonstrate that as few as four spectral bands are sufficient for accurate parasite identification.
Considerations in the use of ML interaction potentials for free energy calculations
Mendible, Orlando A., Whitmer, Jonathan K., Colón, Yamil J.
Machine learning potentials (MLPs) offer the potential to accurately model the energy and free energy landscapes of molecules with the precision of quantum mechanics and an efficiency similar to classical simulations. This research focuses on using equivariant graph neural network MLPs due to their proven effectiveness in modeling equilibrium molecular trajectories. A key issue addressed is the capability of MLPs to accurately predict free energies and transition states by considering both the energy and the diversity of molecular configurations. We examined how the distribution of collective variables (CVs) in the training data affects MLP accuracy in determining the free energy surface (FES) of systems, using Metadynamics simulations for butane and alanine dipeptide (ADP). The study involved training forty-three MLPs, half based on classical molecular dynamics data and the rest on ab initio computed energies. The MLPs were trained using different distributions that aim to replicate hypothetical scenarios of sampled CVs obtained if the underlying FES of the system was unknown. Findings for butane revealed that training data coverage of key FES regions ensures model accuracy regardless of CV distribution. However, missing significant FES regions led to correct potential energy predictions but failed free energy reconstruction. For ADP, models trained on classical dynamics data were notably less accurate, while ab initio-based MLPs predicted potential energy well but faltered on free energy predictions. These results emphasize the challenge of assembling an all-encompassing training set for accurate FES prediction and highlight the importance of understanding the FES in preparing training data. The study points out the limitations of MLPs in free energy calculations, stressing the need for comprehensive data that encompasses the system's full FES for effective model training.
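The link between CV coverage and the free energy surface can be illustrated with the standard Boltzmann inversion F(s) = -k_B T ln P(s) applied to a histogram of CV samples. This is a simpler stand-in for the Metadynamics simulations used in the study, and the dihedral samples below are hypothetical toy values rather than real butane data.

```python
import math

KB = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def free_energy_profile(samples, n_bins, lo, hi, temperature=300.0):
    """Estimate F(s) = -kT ln P(s) per bin, relative to the most populated bin.

    Assumes every sample lies in [lo, hi]. Bins with no samples get +inf,
    since nothing can be said about regions that were never visited.
    """
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for s in samples:
        idx = min(int((s - lo) / width), n_bins - 1)
        counts[idx] += 1
    peak = max(counts)
    kt = KB * temperature
    return [(-kt * math.log(c / peak) if c > 0 else math.inf) for c in counts]


# Toy dihedral-angle samples in degrees (assumed values, three bins of 120 degrees).
full_coverage = [-120.0] * 20 + [0.0] * 5 + [120.0] * 20
partial_coverage = [-120.0] * 20 + [0.0] * 5  # the third region is never sampled

full_profile = free_energy_profile(full_coverage, 3, -180.0, 180.0)
partial_profile = free_energy_profile(partial_coverage, 3, -180.0, 180.0)
print(full_profile, partial_profile)
```

With partial coverage the unsampled bin carries no information at all, which mirrors the abstract's point that a training set missing significant FES regions cannot support free energy reconstruction there.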
How scanning probe microscopy can be supported by Artificial Intelligence and quantum computing
Pregowska, Agnieszka, Roszkiewicz, Agata, Osial, Magdalena, Giersig, Michael
Institute of Fundamental Technological Research, Polish Academy of Sciences, Pawinskiego 5B, 02-106 Warsaw, Poland; aprego@ippt.pan.pl. Abstract: The impact of Artificial Intelligence (AI) is expanding rapidly, revolutionizing both science and society. It is applied to practically all areas of life, science, and technology, including materials science, which continuously needs novel tools for effective materials characterization. One of the widely used techniques is scanning probe microscopy (SPM). SPM has fundamentally changed materials engineering, biology, and chemistry by delivering tools for atomic-precision surface mapping. Besides many advantages, it also has some drawbacks. In this paper, we focus on ways of supporting SPM-based measurements, with emphasis on AI-based algorithms, especially Machine Learning-based algorithms, as well as quantum computing (QC). AI can help automate experimental processes in routine operations, support the algorithmic search for good sample regions, and shed light on structure-property relationships. Thus, it contributes to increasing the efficiency and accuracy of optical nanoscopy scanning probes. Moreover, the combination of AI-based algorithms and QC may have huge potential to increase the practical application of SPM. The limitations of the AI-QC-based approach are also discussed. Finally, we outline a research path for the improvement of AI-QC-powered SPM. I. INTRODUCTION: [...] scanning near field optical microscopy (SNOM) are universal tools for materials' surface characterization. SPM makes it possible to obtain a high-resolution 3D surface profile in a nondestructive measurement.
Synthesizing multi-log grasp poses
Fälldin, Arvid, Wallin, Erik, Löfstedt, Tommy, Servin, Martin
Multi-object grasping is a challenging task. It is important for energy- and cost-efficient operation of industrial crane manipulators, such as those used to collect tree logs off the forest floor and onto forest machines. In this work, we used synthetic data from physics simulations to explore how data-driven modeling can be used to infer multi-object grasp poses from images. We showed that convolutional neural networks can be trained specifically for synthesizing multi-object grasps. Using RGB-Depth images and instance segmentation masks as input, a U-Net model outputs grasp maps with corresponding grapple orientation and opening width. Given an observation of a pile of logs, the model can be used to synthesize and rate the possible grasp poses and select the most suitable one, with the possibility to respect changing operational constraints such as lift capacity and reach. When tested on previously unseen data, the proposed model found successful grasp poses with an accuracy of 95%.
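The final selection step, rating candidate grasps and picking the best one under operational constraints, can be sketched as a filter followed by an argmax. The candidate fields (`score`, `angle_deg`, `width_m`, `load_kg`, `reach_m`) and all values are hypothetical; the abstract does not say how load or reach are estimated, so they are assumed here to be available for each grasp-map pixel.

```python
import math


def select_grasp(candidates, max_load_kg=math.inf, max_reach_m=math.inf):
    """Return the highest-scoring candidate that respects both limits, or None if none does."""
    feasible = [
        c for c in candidates
        if c["load_kg"] <= max_load_kg and c["reach_m"] <= max_reach_m
    ]
    return max(feasible, key=lambda c: c["score"]) if feasible else None


# Hypothetical candidates, each decoded from one pixel of the predicted grasp maps.
candidates = [
    {"pixel": (12, 40), "score": 0.91, "angle_deg": 15.0, "width_m": 0.9, "load_kg": 1400.0, "reach_m": 6.0},
    {"pixel": (30, 22), "score": 0.84, "angle_deg": 80.0, "width_m": 0.7, "load_kg": 800.0, "reach_m": 5.0},
    {"pixel": (5, 9), "score": 0.60, "angle_deg": 45.0, "width_m": 0.5, "load_kg": 400.0, "reach_m": 9.5},
]

best_unconstrained = select_grasp(candidates)
best_limited = select_grasp(candidates, max_load_kg=1000.0, max_reach_m=8.0)
best_impossible = select_grasp(candidates, max_load_kg=100.0)
print(best_unconstrained["pixel"], best_limited["pixel"], best_impossible)
```

Keeping the constraints as arguments rather than baking them into the model reflects the abstract's point that limits such as lift capacity and reach can change between operations.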
NuGraph2: A Graph Neural Network for Neutrino Physics Event Reconstruction
Hewes, V, Aurisano, Adam, Cerati, Giuseppe, Kowalkowski, Jim, Lee, Claire, Liao, Wei-keng, Grzenda, Daniel, Gumpula, Kaushal, Zhang, Xiaohe
Liquid Argon Time Projection Chamber (LArTPC) detector technology offers a wealth of high-resolution information on particle interactions, and leveraging that information to its full potential requires sophisticated automated reconstruction techniques. This article describes NuGraph2, a Graph Neural Network (GNN) for low-level reconstruction of simulated neutrino interactions in a LArTPC detector. Simulated neutrino interactions in the MicroBooNE detector geometry are described as heterogeneous graphs, with energy depositions on each detector plane forming nodes on planar subgraphs. The network utilizes a multi-head attention message-passing mechanism to perform background filtering and semantic labelling on these graph nodes, identifying those associated with the primary physics interaction with 98.0% efficiency and labelling them according to particle type with 94.9% efficiency. The network operates directly on detector observables across multiple 2D representations, but utilizes a 3D-context-aware mechanism to encourage consistency between these representations. Model inference takes 0.12 s/event on a CPU, and 0.005 s/event batched on a GPU. This architecture is designed to be a general-purpose solution for particle reconstruction in neutrino physics, with the potential for deployment across a broad range of detector technologies, and offers a core convolution engine that can be leveraged for a variety of tasks beyond the two described in this article.
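The idea of attention-weighted message passing can be shown on a toy graph. The sketch below is a deliberately reduced version: a single attention head with scaled dot-product scores and no learned projections, which is an assumption made for illustration and not the NuGraph2 architecture. The hit names and feature vectors are hypothetical.

```python
import math


def attention_message_pass(features, neighbors):
    """One message-passing round: each node becomes a softmax-weighted sum of its neighbors."""
    out = {}
    for node, query in features.items():
        nbrs = neighbors.get(node, [])
        if not nbrs:
            out[node] = list(query)  # isolated nodes keep their features
            continue
        # Scaled dot-product score between this node and each neighbor.
        scores = [
            sum(a * b for a, b in zip(query, features[n])) / math.sqrt(len(query))
            for n in nbrs
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
        total = sum(weights)
        weights = [w / total for w in weights]
        out[node] = [
            sum(w * features[n][i] for w, n in zip(weights, nbrs))
            for i in range(len(query))
        ]
    return out


# Hypothetical detector hits on one plane, each with a two-dimensional feature vector.
features = {"h1": [1.0, 0.0], "h2": [1.0, 0.0], "h3": [0.0, 1.0]}
neighbors = {"h1": ["h2", "h3"], "h2": ["h1"], "h3": ["h1"]}
updated = attention_message_pass(features, neighbors)
print(updated["h1"])
```

Node `h1` attends more strongly to `h2`, whose features resemble its own, than to `h3`. A multi-head variant would run several such passes with separate learned projections and combine the results.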