An Investigation on Machine Learning Predictive Accuracy Improvement and Uncertainty Reduction using VAE-based Data Augmentation

Alsafadi, Farah, Yaseen, Mahmoud, Wu, Xu

arXiv.org Artificial Intelligence 

A unique challenge in nuclear engineering is data scarcity, because experimentation on nuclear systems is usually more expensive and time-consuming than in most other disciplines. Large amounts of data may be available for certain components, such as pipes, pumps, and turbines, thanks to large networks of sensors, but not for many others, such as critical heat flux in thermal-hydraulics experiments or qualification data for advanced materials like molten salts and multi-principal element alloys. Particularly concerning is the lack of data for advanced reactor design and safety analysis, which makes it difficult to use ML in licensing analyses of advanced nuclear reactors. In these cases, we need to move beyond the "throw more data at the problem and re-train" approach, which is the common solution in areas such as computer vision and natural language processing that have access to "big data". One potential way to address the data scarcity issue is data augmentation using deep generative learning. Deep generative learning is an unsupervised ML technique that uses deep generative models (DGMs) to discover and learn the regularities or patterns in existing data, so that it can generate new samples that could plausibly have been drawn from the real dataset. DGMs are typically neural networks (NNs) trained to learn or approximate the underlying distribution of the training data. This enables them to generate synthetic samples that closely match that distribution. By employing DGMs for data augmentation, one can significantly expand the training dataset available to downstream ML models, such as data-driven predictive models, with the aim of improving their performance. Data augmentation with DGMs is still a relatively new research area in nuclear engineering, but it has been studied for a few years in computer vision and natural language processing for datasets involving images, audio, video, spoken words, etc.
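The augmentation workflow described above (learn the distribution of a small dataset, sample synthetic points from it, and append them to the training set) can be illustrated with a minimal sketch. To stay dependency-free, the sketch swaps the deep generative model for a fitted two-dimensional Gaussian, so it is a stand-in for the idea and not the VAE studied in this work. The dataset, the function names `fit_gaussian_2d` and `sample_gaussian_2d`, and the sample sizes are all hypothetical.

```python
import math
import random


def fit_gaussian_2d(data):
    """Estimate the mean and covariance of 2-D samples.

    This plays the role of "learning the underlying distribution";
    a DGM would learn a far more flexible distribution than a Gaussian.
    """
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def sample_gaussian_2d(mean, cov, n_samples, rng):
    """Draw synthetic samples using the Cholesky factor of the 2x2 covariance."""
    mx, my = mean
    sxx, sxy, syy = cov
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(max(syy - l21 ** 2, 0.0))  # guard against round-off
    samples = []
    for _ in range(n_samples):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        samples.append((mx + l11 * z1, my + l21 * z1 + l22 * z2))
    return samples


rng = random.Random(0)

# Hypothetical scarce "experimental" dataset: 20 noisy (input, output) pairs.
real = [(x, 2.0 * x + rng.gauss(0, 0.1))
        for x in (rng.uniform(0, 1) for _ in range(20))]

mean, cov = fit_gaussian_2d(real)                    # learn the distribution
synthetic = sample_gaussian_2d(mean, cov, 200, rng)  # generate new samples
augmented = real + synthetic                         # expanded training set
```

The `augmented` list would then be used to train a downstream predictive model. Whether the synthetic samples actually improve accuracy depends on how faithfully the generative model captures the real distribution, which is the question a study of this kind has to examine.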