DeepPNI: Language- and graph-based model for mutation-driven protein-nucleic acid energetics

Somnath Mondal, Tinkal Mondal, Soumajit Pramanik, Rukmankesh Mehra

arXiv.org Artificial Intelligence 

The interaction between proteins and nucleic acids is crucial for processes that sustain cellular function, including DNA maintenance and the regulation of gene expression and translation. Amino acid mutations in protein-nucleic acid complexes often lead to serious diseases. Experimental techniques have their own limitations in assessing mutational effects in protein-nucleic acid complexes. In this study, we compiled a large dataset of 1951 mutations covering both protein-DNA and protein-RNA complexes and integrated structural and sequence features to build a deep learning-based regression model named DeepPNI. The model estimates mutation-induced binding free energy changes in protein-nucleic acid complexes. The structural features are encoded with an edge-aware relational graph convolutional network (RGCN), and the sequence features are extracted with the protein language model ESM-2. The model achieved a high average Pearson correlation coefficient (PCC) of 0.76 on the full dataset in five-fold cross-validation. Consistent performance on the individual protein-DNA and protein-RNA datasets, and on datasets split by experimental temperature, indicates that the model generalizes well. The model also performed well in complex-based five-fold cross-validation, which supports its robustness. In addition, DeepPNI outperformed existing tools in external dataset validation and in direct comparison.
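The complex-based five-fold cross-validation and the fold-averaged PCC mentioned above can be illustrated with a minimal sketch. It assumes that "complex-based" means all mutations from one complex are kept in the same fold, and that the reported average is the mean of per-fold PCC values; the abstract confirms neither detail. The function names (`pearson`, `complex_based_folds`, `mean_fold_pcc`) and the toy values are hypothetical and are not taken from DeepPNI.

```python
import math
import random
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (sd_x * sd_y)


def complex_based_folds(complex_ids, k=5, seed=0):
    """Split mutation indices into k folds so that no complex spans two folds.

    Assumption: each mutation carries the identifier of its parent complex.
    """
    groups = defaultdict(list)
    for idx, cid in enumerate(complex_ids):
        groups[cid].append(idx)
    ids = sorted(groups)
    random.Random(seed).shuffle(ids)  # random tie-breaking between complexes
    ids.sort(key=lambda c: len(groups[c]), reverse=True)  # stable sort
    folds = [[] for _ in range(k)]
    for cid in ids:
        # Greedy balancing: place the complex in the currently smallest fold.
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(groups[cid])
    return folds


def mean_fold_pcc(y_true, y_pred, folds):
    """Average the per-fold PCC, skipping folds where it is undefined."""
    scores = []
    for fold in folds:
        if len(fold) < 2:
            continue
        r = pearson([y_true[i] for i in fold], [y_pred[i] for i in fold])
        if not math.isnan(r):
            scores.append(r)
    return sum(scores) / len(scores) if scores else float("nan")


# Toy illustration with made-up values (not data from the study).
toy_complexes = ["c1", "c1", "c2", "c2", "c3", "c3", "c4", "c4", "c5", "c5"]
toy_measured = [0.5, 1.2, -0.3, 0.8, 2.1, 1.7, 0.0, 0.4, 1.0, 1.5]
toy_predicted = [0.4, 1.0, 0.1, 0.6, 1.8, 1.9, 0.2, 0.3, 0.7, 1.6]
toy_folds = complex_based_folds(toy_complexes, k=5, seed=0)
print(mean_fold_pcc(toy_measured, toy_predicted, toy_folds))
```

Grouping by complex prevents mutations from the same complex appearing in both training and test folds, which makes it a stricter test than a random mutation-level split.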