Goto

Collaborating Authors

 Li, Haozhe


Why does Prediction Accuracy Decrease over Time? Uncertain Positive Learning for Cloud Failure Prediction

arXiv.org Artificial Intelligence

With the rapid growth of cloud computing, a variety of software services have been deployed in the cloud. To ensure the reliability of cloud services, prior studies focus on failure instance (disk, node, and switch, etc.) prediction. Once the output of prediction is positive, mitigation actions are taken to rapidly resolve the underlying failure. According to our real-world practice in Microsoft Azure, we find that the prediction accuracy may decrease by about 9% after retraining the models. Considering that the mitigation actions may result in uncertain positive instances since they cannot be verified after mitigation, which may introduce more noise while updating the prediction model. To the best of our knowledge, we are the first to identify this Uncertain Positive Learning (UPLearning) issue in the real-world cloud failure prediction scenario. To tackle this problem, we design an Uncertain Positive Learning Risk Estimator (Uptake) approach. Using two real-world datasets of disk failure prediction and conducting node prediction experiments in Microsoft Azure, which is a top-tier cloud provider that serves millions of users, we demonstrate Uptake can significantly improve the failure prediction accuracy by 5% on average.


Modeling Task Relationships in Multi-variate Soft Sensor with Balanced Mixture-of-Experts

arXiv.org Artificial Intelligence

Accurate estimation of multiple quality variables is critical for building industrial soft sensor models, which have long been confronted with data efficiency and negative transfer issues. Methods sharing backbone parameters among tasks address the data efficiency issue; however, they still fail to mitigate the negative transfer problem. To address this issue, a balanced Mixture-of-Experts (BMoE) is proposed in this work, which consists of a multi-gate mixture of experts (MMoE) module and a task gradient balancing (TGB) module. The MoE module aims to portray task relationships, while the TGB module balances the gradients among tasks dynamically. Both of them cooperate to mitigate the negative transfer problem. Experiments on the typical sulfur recovery unit demonstrate that BMoE models task relationship and balances the training process effectively, and achieves better performance than baseline models significantly.


AttentionMixer: An Accurate and Interpretable Framework for Process Monitoring

arXiv.org Artificial Intelligence

An accurate and explainable automatic monitoring system is critical for the safety of high efficiency energy conversion plants that operate under extreme working condition. Nonetheless, currently available data-driven monitoring systems often fall short in meeting the requirements for either high-accuracy or interpretability, which hinders their application in practice. To overcome this limitation, a data-driven approach, AttentionMixer, is proposed under a generalized message passing framework, with the goal of establishing an accurate and interpretable radiation monitoring framework for energy conversion plants. To improve the model accuracy, the first technical contribution involves the development of spatial and temporal adaptive message passing blocks, which enable the capture of spatial and temporal correlations, respectively; the two blocks are cascaded through a mixing operator. To enhance the model interpretability, the second technical contribution involves the implementation of a sparse message passing regularizer, which eliminates spurious and noisy message passing routes. The effectiveness of the AttentionMixer approach is validated through extensive evaluations on a monitoring benchmark collected from the national radiation monitoring network for nuclear power plants, resulting in enhanced monitoring accuracy and interpretability in practice.


Analyze and Design Network Architectures by Recursion Formulas

arXiv.org Artificial Intelligence

The effectiveness of shortcut/skip-connection has been widely verified, which inspires massive explorations on neural architecture design. This work attempts to find an effective way to design new network architectures. it is discovered that the main difference between network architectures can be reflected in their recursion formulas. Based on this, a methodology is proposed to design novel network architectures from the perspective of mathematical formulas. Afterwards, a case study is provided to generate an improved architecture based on ResNet. Furthermore, the new architecture is compared with ResNet and then tested on ResNet-based networks. Massive experiments are conducted on CIFAR and ImageNet, which witnesses the significant performance improvements provided by the architecture.