Combining feature-based approaches with graph neural networks and symbolic regression for synergistic performance and interpretability

Gouvêa, Rogério Almeida, De Breuck, Pierre-Paul, Pretto, Tatiane, Rignanese, Gian-Marco, Santos, Marcos José Leite

arXiv.org Artificial Intelligence 

To avoid the featurization bottleneck of traditional descriptors, we also leverage GNNs to generate fast, latent-space approximations of MatMiner (ℓ-MM) and Orbital Field Matrix (ℓ-OFM) features. Finally, we augment this feature set with new descriptors derived via symbolic regression. This multifaceted strategy aims to create a more robust, accurate, and versatile featurizer that capitalizes on the distinct strengths of each approach and remains useful across a wider range of dataset sizes. To simplify the generation of all these features, we developed a package named MatterVial, which stands for MATerials feaTuRe Extraction Via Interpretable Artificial Learning. Besides producing all latent-space features from the GNN models, it aids in obtaining the interpretable chemical descriptors that correlate with these high-level features. This is achieved through techniques such as SHapley Additive exPlanations (SHAP) analysis of surrogate models and symbolic regression via Sure Independence Screening and Sparsifying Operator (SISSO) to obtain an approximate formula from the most important features. Our results demonstrate an overall improvement on all analyzed datasets compared with the baseline MatMiner featurizer. In addition, the approach surpassed the performance of the individual GNN models in several cases, indicating that the combination of traditional and latent-space features leads to more robust generalization.
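The surrogate-model step described above can be illustrated with a minimal sketch. It is not MatterVial's API and does not reproduce the authors' pipeline: the data are synthetic, the function name `linear_shap_importance` is hypothetical, and the surrogate is assumed to be a plain linear model, for which (under the additional assumption of independent features) the SHAP value of descriptor j for sample i has the closed form w_j (x_ij - mean_j). The surrogate type actually used for SHAP analysis is not specified in this abstract, and the SISSO step is not shown.

```python
import numpy as np


def linear_shap_importance(X, y):
    """Fit a linear surrogate y ~ X w + b and return (w, shap_values, importance).

    Assumption: features are treated as independent, so the SHAP value of
    feature j for sample i is w_j * (x_ij - mean_j).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Least-squares fit on centered data; the intercept is absorbed by centering.
    w, *_ = np.linalg.lstsq(X_centered, y - y.mean(), rcond=None)
    shap_values = X_centered * w                    # shape (n_samples, n_features)
    importance = np.abs(shap_values).mean(axis=0)   # mean |SHAP| per descriptor
    return w, shap_values, importance


# Synthetic stand-ins: four "interpretable descriptors" and one "latent feature"
# that depends only on descriptors 0 and 2 (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
n_samples = 500
descriptors = rng.normal(size=(n_samples, 4))
latent = (3.0 * descriptors[:, 0]
          - 1.0 * descriptors[:, 2]
          + 0.05 * rng.normal(size=n_samples))

weights, shap_values, importance = linear_shap_importance(descriptors, latent)
ranking = np.argsort(importance)[::-1]  # descriptor indices, most important first
print("descriptor ranking:", ranking)
```

The top-ranked descriptors are the ones that would then be passed to a symbolic-regression method such as SISSO to search for an approximate formula. A nonlinear surrogate would need a general SHAP estimator in place of the closed form used here.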