When More Data Hurts: Optimizing Data Coverage While Mitigating Diversity Induced Underfitting in an Ultra-Fast Machine-Learned Potential

Gibson, Jason B., Janicki, Tesia D., Hire, Ajinkya C., Bishop, Chris, Lane, J. Matthew D., Hennig, Richard G.

Sep-11-2024, arXiv.org Artificial Intelligence

Machine-learned interatomic potentials (MLIPs) are becoming an essential tool in materials modeling. However, optimizing the generation of the training data used to parameterize MLIPs remains a significant challenge, because MLIPs can fail when they encounter local environments too different from those present in the training data. The difficulty of determining a priori which environments will be encountered during a molecular dynamics (MD) simulation necessitates diverse, high-quality training data. This study investigates how training data diversity affects the performance of MLIPs, using the Ultra-Fast Force Field (UF$^3$) to model amorphous silicon nitride. We combine expert-generated and autonomously generated data to build the training set and fit four force-field variants to subsets of that data. Our findings reveal a critical balance in training data diversity: insufficient diversity hinders generalization, while excessive diversity can exceed the MLIP's learning capacity and reduce simulation accuracy. Specifically, we found that the UF$^3$ variant trained on a subset of the training data from which nitrogen-rich structures were removed offered vastly better prediction and simulation accuracy than any other variant. By comparing these UF$^3$ variants, we highlight the nuanced requirements for creating accurate MLIPs and emphasize the importance of application-specific training data for achieving optimal performance in modeling complex material behaviors.
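The subset selection described in the abstract, removing nitrogen-rich structures before fitting, can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not say how "nitrogen-rich" was defined, so the cutoff used here (nitrogen fraction above the Si3N4 stoichiometric value of 4/7) is an assumption, as are the helper names nitrogen_fraction and drop_nitrogen_rich and the toy structures.

```python
from collections import Counter

# Assumption: "nitrogen-rich" means a nitrogen atomic fraction above that of
# stoichiometric Si3N4 (4 N atoms out of 7). The abstract gives no threshold.
STOICHIOMETRIC_N_FRACTION = 4 / 7


def nitrogen_fraction(symbols):
    """Return the atomic fraction of nitrogen in a list of element symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("structure has no atoms")
    return counts["N"] / total


def drop_nitrogen_rich(structures, max_n_fraction=STOICHIOMETRIC_N_FRACTION, tol=1e-9):
    """Keep only structures whose nitrogen fraction does not exceed the cutoff."""
    return [s for s in structures if nitrogen_fraction(s) <= max_n_fraction + tol]


# Hypothetical toy structures, each given as a list of element symbols.
training_set = [
    ["Si"] * 3 + ["N"] * 4,  # stoichiometric Si3N4, kept
    ["Si"] * 4 + ["N"] * 3,  # silicon-rich, kept
    ["Si"] * 2 + ["N"] * 6,  # nitrogen-rich, removed
]
subset = drop_nitrogen_rich(training_set)
```

In practice the structures would come from an atomistic data format rather than bare symbol lists, and the cutoff would be chosen to match the compositions expected in the target simulation.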

Country:
- Europe > Netherlands (0.04)
- North America > United States
  - New Mexico > Bernalillo County
    - Albuquerque (0.04)
  - Florida > Alachua County
    - Gainesville (0.14)

Genre:
- Research Report > New Finding (0.34)

Industry:
- Energy (0.94)
- Government > Regional Government
  - North America Government > United States Government (1.00)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
