Goto

Collaborating Authors

 job size



Reinforcement Learning-based Adaptive Mitigation of Uncorrected DRAM Errors in the Field

arXiv.org Artificial Intelligence

Scaling to larger systems, with current levels of reliability, requires cost-effective methods to mitigate hardware failures. One of the main causes of hardware failure is an uncorrected error in memory, which terminates the current job and wastes all computation since the last checkpoint. This paper presents the first adaptive method for triggering uncorrected error mitigation. It uses a prediction approach that considers the likelihood of an uncorrected error and its current potential cost. The method is based on reinforcement learning, and the only user-defined parameters are the mitigation cost and whether the job can be restarted from a mitigation point. We evaluate our method using classical machine learning metrics together with a cost-benefit analysis, which compares the cost of mitigation actions with the benefits from mitigating some of the errors. On two years of production logs from the MareNostrum supercomputer, our method reduces lost compute time by 54% compared with no mitigation and is just 6% below the optimal Oracle method. All source code is open source.


Non-clairvoyant Scheduling with Partial Predictions

arXiv.org Artificial Intelligence

The non-clairvoyant scheduling problem has gained new interest within learning-augmented algorithms, where the decision-maker is equipped with predictions without any quality guarantees. In practical settings, access to predictions may be reduced to specific instances, due to cost or data limitations. Our investigation focuses on scenarios where predictions for only $B$ job sizes out of $n$ are available to the algorithm. We first establish near-optimal lower bounds and algorithms in the case of perfect predictions. Subsequently, we present a learning-augmented algorithm satisfying the robustness, consistency, and smoothness criteria, and revealing a novel tradeoff between consistency and smoothness inherent in the scenario with a restricted number of predictions.


Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning

arXiv.org Artificial Intelligence

The permutation flow shop scheduling (PFSS), aiming at finding the optimal permutation of jobs, is widely used in manufacturing systems. When solving large-scale PFSS problems, traditional optimization algorithms such as heuristics could hardly meet the demands of both solution accuracy and computational efficiency, thus learning-based methods have recently garnered more attention. Some work attempts to solve the problems by reinforcement learning methods, which suffer from slow convergence issues during training and are still not accurate enough regarding the solutions. To that end, we propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately. Moreover, in order to extract better feature representations of input jobs, we incorporate the graph structure as the encoder. The extensive experiments reveal that our proposed model obtains significant promotion and presents excellent generalizability in large-scale problems with up to 1000 jobs. Compared to the state-of-the-art reinforcement learning method, our model's network parameters are reduced to only 37\% of theirs, and the solution gap of our model towards the expert solutions decreases from 6.8\% to 1.3\% on average. The code is available at: \url{https://github.com/longkangli/PFSS-IL}.


Algorithms with Prediction Portfolios

arXiv.org Artificial Intelligence

The research area of algorithms with predictions has seen recent success showing how to incorporate machine learning into algorithm design to improve performance when the predictions are correct, while retaining worst-case guarantees when they are not. Most previous work has assumed that the algorithm has access to a single predictor. However, in practice, there are many machine learning methods available, often with incomparable generalization guarantees, making it hard to pick a best method a priori. In this work we consider scenarios where multiple predictors are available to the algorithm and the question is how to best utilize them. Ideally, we would like the algorithm's performance to depend on the quality of the best predictor. However, utilizing more predictions comes with a cost, since we now have to identify which prediction is the best. We study the use of multiple predictors for a number of fundamental problems, including matching, load balancing, and non-clairvoyant scheduling, which have been well-studied in the single predictor setting. For each of these problems we introduce new algorithms that take advantage of multiple predictors, and prove bounds on the resulting performance.


On the Runtime of Randomized Local Search and Simple Evolutionary Algorithms for Dynamic Makespan Scheduling

AAAI Conferences

Evolutionary algorithms have been frequently used for dynamic optimization problems. With this paper, we contribute to the theoretical understanding of this research area. We present the first computational complexity analysis of evolutionary algorithms for a dynamic variant of a classical combinatorial optimization problem, namely makespan scheduling. We study the model of a strong adversary which is allowed to change one job at regular intervals. Furthermore, we investigate the setting of random changes.