Equivariant Masked Position Prediction for Efficient Molecular Representation
Junyi An, Chao Qu, Yun-Fei Shi, XinHao Liu, Qianwei Tang, Fenglei Cao, Yuan Qi
arXiv.org Artificial Intelligence
Graph neural networks (GNNs) have shown considerable promise in computational chemistry. However, the limited availability of molecular data raises concerns about GNNs' ability to effectively capture the fundamental principles of physics and chemistry, which constrains their generalization capabilities. To address this challenge, we introduce a novel self-supervised approach termed Equivariant Masked Position Prediction (EMPP), grounded in intramolecular potential and force theory. Unlike conventional attribute-masking techniques, EMPP formulates a nuanced position prediction task that is better defined and enhances the learning of quantum mechanical features. EMPP also bypasses the Gaussian mixture approximation commonly used in denoising methods, allowing physical properties to be acquired more accurately. Experimental results indicate that EMPP significantly enhances the performance of advanced molecular architectures, surpassing state-of-the-art self-supervised approaches.

Graph neural networks (GNNs) have found widespread application in computational chemistry. However, unlike fields such as natural language processing (NLP), the limited availability of molecular data hampers the development of GNNs in this domain. For example, one of the largest molecular datasets, OC20 (Chanussot et al., 2021), contains only 1.38 million samples, and collecting more molecular data with ab initio calculations is both challenging and expensive. To address this limitation, molecular self-supervised learning has gained increasing attention. This approach enables molecular GNNs to learn more general physical and chemical knowledge, enhancing performance in various computational chemistry tasks, such as drug discovery (Hasselgren & Oprea, 2024) and catalyst design (Chanussot et al., 2021).

Current self-supervised methods for molecular learning fall into two mainstream categories: masking and denoising. Masking methods (Hu et al., 2020; Hou et al., 2022; Inae et al., 2023) adapt the concept of masked token prediction from NLP to graph learning, masking graph information, such as node attributes, instead of tokens. However, these methods have two major limitations: underdetermined reconstruction and a lack of deep quantum mechanical (QM) insight.
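To make the masked-position idea concrete, the sketch below hides one atom's 3D coordinates and trains a network to recover them from the rest of the molecule. This is a minimal toy, not the paper's EMPP method: `ToyEncoder`, the centroid anchor, and the MSE objective are all illustrative assumptions, and the toy is only translation-equivariant, whereas EMPP builds on fully equivariant architectures.

```python
import torch

# A minimal sketch of masked position prediction. ToyEncoder, the centroid
# anchor, and the MSE objective are illustrative assumptions, not the
# paper's actual EMPP objective.
class ToyEncoder(torch.nn.Module):
    def __init__(self, num_atom_types: int = 10, hidden: int = 64):
        super().__init__()
        self.embed = torch.nn.Embedding(num_atom_types, hidden)
        self.head = torch.nn.Linear(hidden, 3)  # 3D offset from the anchor

    def forward(self, atom_types, positions, mask_idx):
        # Anchor the prediction at the centroid of the visible atoms so the
        # toy is at least translation-equivariant; a real EMPP-style model
        # would use a rotation-equivariant backbone instead.
        keep = torch.ones(positions.size(0), dtype=torch.bool)
        keep[mask_idx] = False
        anchor = positions[keep].mean(dim=0)
        h = self.embed(atom_types[mask_idx].reshape(1))  # (1, hidden)
        offset = self.head(h).squeeze(0)                 # (3,)
        return anchor + offset  # predicted position of the masked atom


def masked_position_step(model, atom_types, positions):
    """One self-supervised step: hide one atom's position, predict it."""
    mask_idx = int(torch.randint(positions.size(0), (1,)))
    pred = model(atom_types, positions, mask_idx)
    return torch.nn.functional.mse_loss(pred, positions[mask_idx])


# Usage on a random 5-atom "molecule": atom types and 3D coordinates.
model = ToyEncoder()
loss = masked_position_step(model, torch.randint(10, (5,)), torch.randn(5, 3))
loss.backward()
```

Note the contrast with attribute masking: the target here is a continuous 3D position constrained by molecular geometry, rather than a discrete node attribute, which is why the abstract argues the task is better defined.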
Feb-12-2025