Brazil, Emilio Vital
Improving Molecular Properties Prediction Through Latent Space Fusion
Soares, Eduardo, Kishimoto, Akihiro, Brazil, Emilio Vital, Takeda, Seiji, Kajino, Hiroshi, Cerqueira, Renato
Pre-trained Language Models have emerged as promising tools for predicting molecular properties, yet their development is in its early stages, necessitating further research to enhance their efficacy and address challenges such as generalization and sample efficiency. In this paper, we present a multi-view approach that combines latent spaces derived from state-of-the-art chemical models. Our approach relies on two pivotal elements: the embeddings derived from MHG-GNN, which represent molecular structures as graphs, and MoLFormer embeddings rooted in chemical language. The attention mechanism of MoLFormer is able to identify relations between two atoms even when they are far apart, while the GNN of MHG-GNN can more precisely capture relations among multiple closely located atoms. In this work, we demonstrate the superior performance of our proposed multi-view approach compared to existing state-of-the-art methods, including MoLFormer-XL, which was trained on 1.1 billion molecules, particularly in intricate tasks such as predicting clinical trial drug toxicity and inhibiting HIV replication. We assessed our approach using six benchmark datasets from MoleculeNet, where it outperformed competitors in five of them. Our study highlights the potential of latent space fusion and feature integration for advancing molecular property prediction. In this work, we use small versions of MHG-GNN and MoLFormer, which leaves room for further improvement when our approach is trained on larger-scale datasets.
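The fusion step described in the abstract can be sketched as simple concatenation of the two latent spaces followed by a downstream predictor. The embedding arrays and dimensions below are hypothetical stand-ins, not the paper's actual models or data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical stand-ins for per-molecule embeddings; in practice these
# would come from MHG-GNN (graph view) and MoLFormer (chemical-language view).
n_molecules = 100
graph_emb = rng.normal(size=(n_molecules, 64))      # MHG-GNN latent space
language_emb = rng.normal(size=(n_molecules, 128))  # MoLFormer latent space
y = rng.normal(size=n_molecules)                    # target property values

# Latent space fusion by concatenating the two views per molecule.
fused = np.concatenate([graph_emb, language_emb], axis=1)

# Train a downstream property predictor on the fused representation.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(fused, y)
predictions = model.predict(fused)
```

The fused vectors carry both long-range relational signal (language view) and local structural signal (graph view), which is the intuition behind the multi-view gains reported above.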
Beyond Chemical Language: A Multimodal Approach to Enhance Molecular Property Prediction
Soares, Eduardo, Brazil, Emilio Vital, Gutierrez, Karen Fiorela Aquino, Cerqueira, Renato, Sanders, Dan, Schmidt, Kristin, Zubarev, Dmitry
We present a novel multimodal language model approach for predicting molecular properties by combining chemical language representation with physicochemical features. Our approach, MULTIMODAL-MOLFORMER, utilizes a causal multistage feature selection method that identifies physicochemical features based on their direct causal effect on a specific target property. These causal features are then integrated with the vector space generated by molecular embeddings from MOLFORMER. In particular, we employ Mordred descriptors as physicochemical features and identify the Markov blanket of the target property, which theoretically contains the most relevant features for accurate prediction. Our results demonstrate the superior performance of our proposed approach compared to existing state-of-the-art algorithms, including the chemical language-based MOLFORMER and graph neural networks, in predicting complex tasks such as biodegradability and PFAS toxicity estimation. Moreover, we demonstrate the effectiveness of our feature selection method in reducing the dimensionality of the Mordred feature space while maintaining or improving the model's performance. Our approach opens up promising avenues for future research in molecular property prediction by harnessing the synergistic potential of both chemical language and physicochemical features, leading to enhanced performance and advancements in the field.
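The selection-then-fusion pipeline can be sketched as follows. As a rough proxy for the paper's causal multistage selection, this example ranks hypothetical Mordred-style descriptors by mutual information with the target; a true Markov-blanket discovery algorithm would go further, pruning features whose association vanishes conditional on the retained set. All data here is synthetic:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)

# Hypothetical Mordred-style physicochemical descriptor matrix.
n_molecules, n_descriptors = 200, 50
descriptors = rng.normal(size=(n_molecules, n_descriptors))
# Make the target depend on two descriptors so selection has real signal.
y = 2.0 * descriptors[:, 3] - 1.5 * descriptors[:, 10] \
    + 0.1 * rng.normal(size=n_molecules)

# Proxy for the causal multistage selection: rank descriptors by mutual
# information with the target and keep only the strongest ones, shrinking
# the descriptor space before fusion.
mi = mutual_info_regression(descriptors, y, random_state=0)
selected = np.argsort(mi)[::-1][:5]
reduced = descriptors[:, selected]

# `reduced` would then be concatenated with MoLFormer embeddings
# before training the final property predictor.
```

The point of the reduction is exactly what the abstract claims: a much smaller physicochemical feature block that still carries the predictive signal for the target property.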
Position Paper on Dataset Engineering to Accelerate Science
Brazil, Emilio Vital, Soares, Eduardo, Real, Lucas Villa, Azevedo, Leonardo, Segura, Vinicius, Zerkowski, Luiz, Cerqueira, Renato
Data is a critical element in any discovery process. In the last decades, we observed exponential growth in the volume of available data and the technology to manipulate it. However, data is only practical when one can structure it for a well-defined task. For instance, we need a corpus of text broken into sentences to train a natural language machine-learning model. In this work, we use the term "dataset" to designate a structured set of data built to perform a well-defined task. Moreover, in most cases the dataset will be treated as a blueprint of an entity that can at any moment be stored as a table. Specifically, in science, each area has its own ways of organizing, gathering, and handling its datasets. We believe that datasets must be a first-class entity in any knowledge-intensive process, and all workflows should pay exceptional attention to datasets' lifecycle, from their gathering to their uses and evolution. We advocate that science and engineering discovery processes are extreme instances of the need for such organization of datasets, calling for new approaches and tooling. Furthermore, these requirements are more evident when the discovery workflow uses artificial intelligence methods to empower the subject-matter expert. In this work, we discuss an approach to making datasets a critical entity in the scientific discovery process. We illustrate some concepts using material discovery as a use case. We chose this domain because it raises many significant problems that can be generalized to other science fields.
Knowledge-augmented Risk Assessment (KaRA): a hybrid-intelligence framework for supporting knowledge-intensive risk assessment of prospect candidates
Mendes, Carlos Raoni, Brazil, Emilio Vital, Segura, Vinicius, Cerqueira, Renato
Evaluating the potential of a prospective candidate is a common task in multiple decision-making processes in different industries. We refer to a prospect as something or someone that could potentially produce positive results in a given context, e.g., an area where an oil company could find oil, a compound that, when synthesized, results in a material with required properties, and so on. In many contexts, assessing the Probability of Success (PoS) of prospects heavily depends on experts' knowledge, often leading to biased and inconsistent assessments. We have developed the framework named KaRA (Knowledge-augmented Risk Assessment) to address these issues. It combines multiple AI techniques that consider SMEs' (Subject Matter Experts') feedback on top of a structured domain knowledge base to support risk assessment processes of prospect candidates in knowledge-intensive contexts.
Toward Human-AI Co-creation to Accelerate Material Discovery
Zubarev, Dmitry, Mendes, Carlos Raoni, Brazil, Emilio Vital, Cerqueira, Renato, Schmidt, Kristin, Segura, Vinicius, Ferreira, Juliana Jansen, Sanders, Dan
There is an increasing need in our society to achieve faster advances in Science to tackle urgent problems, such as climate change, environmental hazards, sustainable energy systems, and pandemics, among others. In certain domains like chemistry, scientific discovery carries the extra burden of assessing the risks of proposed novel solutions before moving to the experimental stage. Despite several recent advances in Machine Learning and AI to address some of these challenges, there is still a gap in technologies to support end-to-end discovery applications, integrating the myriad of available technologies into a coherent, orchestrated, yet flexible discovery process. Such applications need to handle complex knowledge management at scale, enabling knowledge consumption and production in a timely and efficient way for subject matter experts (SMEs). Furthermore, the discovery of novel functional materials strongly relies on the development of exploration strategies in the chemical space. For instance, generative models have gained attention within the scientific community due to their ability to generate enormous volumes of novel molecules across material domains. These models exhibit extreme creativity that often translates into low viability of the generated candidates. In this work, we propose a workbench framework that aims to enable human-AI co-creation to reduce the time until the first discovery and the opportunity costs involved. This framework relies on a knowledge base with domain and process knowledge, and user-interaction components to acquire knowledge and advise the SMEs. Currently, the framework supports four main activities: generative modeling, dataset triage, molecule adjudication, and risk assessment.
Workflow Provenance in the Lifecycle of Scientific Machine Learning
Souza, Renan, Azevedo, Leonardo G., Lourenço, Vítor, Soares, Elton, Thiago, Raphael, Brandão, Rafael, Civitarese, Daniel, Brazil, Emilio Vital, Moreno, Marcio, Valduriez, Patrick, Mattoso, Marta, Cerqueira, Renato, Netto, Marco A. S.
Machine Learning (ML) has been fundamentally transforming several industries and businesses in numerous ways. More recently, it has also been impacting computational science and engineering domains, such as geoscience, climate science, material science, and health science. Scientific ML, i.e., ML applied to these domains, is characterized by the combination of data-driven techniques with domain-specific data and knowledge to obtain models of physical phenomena [1], [2], [3], [4], [5]. Obtaining models in scientific ML works similarly to conducting traditional large-scale computational experiments [6], which involve a team of scientists and engineers who formulate hypotheses, design the experiment, predefine parameters and input datasets, analyze the experiment data, make observations, and calibrate initial assumptions in a cycle until they are satisfied with the results. Scientific ML is naturally large-scale because multiple people collaborate in a project, using their multidisciplinary domain-specific knowledge to design and perform data-intensive tasks to curate (i.e., understand, clean, enrich with observations) datasets and prepare them for learning algorithms. They then plan and execute compute-intensive tasks for computational simulations or for training ML models subject to the scientific domain's constraints. They utilize specialized scientific software tools running either on their desktops, on cloud clusters (e.g., Docker-based), or on large HPC machines.
Netherlands Dataset: A New Public Dataset for Machine Learning in Seismic Interpretation
Silva, Reinaldo Mozart, Baroni, Lais, Ferreira, Rodrigo S., Civitarese, Daniel, Szwarcman, Daniela, Brazil, Emilio Vital
Machine learning and, more specifically, deep learning algorithms have seen remarkable growth in their popularity and usefulness in recent years. This is arguably due to three main factors: powerful computers, new techniques to train deeper networks, and larger datasets. Although the first two are readily available in modern computers and ML libraries, the last one remains a challenge for many domains. It is a fact that big data is a reality in almost all fields nowadays, and geosciences are not an exception. However, to achieve the success of general-purpose applications such as ImageNet - for which there are more than 14 million labeled images for 1000 target classes - we not only need more data, we need more high-quality labeled data. When it comes to the Oil&Gas industry, confidentiality issues further hamper the sharing of datasets. In this work, we present the Netherlands interpretation dataset, a contribution to the development of machine learning in seismic interpretation. The Netherlands F3 dataset acquisition was carried out in the North Sea, Netherlands offshore. The data is publicly available and contains post-stack data, 8 horizons, and well logs of 4 wells. For the purposes of our machine learning tasks, the original dataset was reinterpreted, generating 9 horizons separating different seismic facies intervals. The interpreted horizons were used to generate approximately 190,000 labeled images for inlines and crosslines. Finally, we present two deep learning applications in which the proposed dataset was employed and produced compelling results.
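The inline/crossline labeled-image generation described above can be sketched as slicing a 3D cube along its two horizontal axes. The cube shapes and facies counts below are small, synthetic stand-ins for the actual F3 volume:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-ins for a post-stack seismic amplitude cube and the
# facies label cube derived from interpreted horizons
# (axes: inline x crossline x depth).
n_inlines, n_crosslines, n_depth = 20, 30, 100
amplitude = rng.normal(size=(n_inlines, n_crosslines, n_depth)).astype(np.float32)
labels = rng.integers(0, 10, size=(n_inlines, n_crosslines, n_depth))

# Slice the cube into 2D sections along both horizontal axes; each
# section becomes one labeled training image for a segmentation model.
inline_images = [(amplitude[i], labels[i]) for i in range(n_inlines)]
crossline_images = [(amplitude[:, j], labels[:, j]) for j in range(n_crosslines)]

total_sections = len(inline_images) + len(crossline_images)
```

Slicing along both axes, as done here, is how a modest cube yields a large pool of labeled 2D sections for training.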