Improving Existing Fault Recovery Policies

Apr-6-2023, 13:54:01 GMT–Neural Information Processing Systems

Automated recovery from failures is a key component in the management of large data centers. Such systems typically employ a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we explain how to use data gathered from the interactions of the hand-made controller with the system, to create an optimized controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm.

controller, existing fault recovery policy, hand-made controller

Neural Information Processing Systems

Apr-6-2023, 13:54:01 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)