A million-scale dataset and generalizable foundation model for nanomaterial-protein interactions
Yu, Hengjie; Dawson, Kenneth A.; Yang, Haiyun; Liu, Shuya; Yan, Yan; Jin, Yaochu
arXiv.org Artificial Intelligence
Unlocking the potential of nanomaterials in medicine and environmental science hinges on understanding their interactions with proteins, a complex decision space where AI is poised to make a transformative impact. However, progress has been hindered by limited datasets and the restricted generalizability of existing models. Here, we introduce NanoPro-3M, the largest nanomaterial-protein interaction dataset to date, comprising over 3.2 million samples and 37,000 unique proteins. Leveraging this dataset, we present NanoProFormer, a foundation model that predicts nanomaterial-protein affinities through multimodal representation learning. The model generalizes strongly: it handles missing features as well as nanomaterials or proteins not seen during training. We show that multimodal modeling significantly outperforms single-modality approaches and identifies key determinants of corona formation. Furthermore, we demonstrate the model's applicability to a range of downstream tasks through zero-shot inference and fine-tuning. Together, this work establishes a solid foundation for high-performance, generalizable prediction of nanomaterial-protein interaction endpoints, reducing reliance on experiments and accelerating various in vitro applications.
Jul-22-2025