Dai, Dong
ClusterLog: Clustering Logs for Effective Log-based Anomaly Detection
Egersdoerfer, Chris, Dai, Dong, Zhang, Di
With the increasing prevalence of scalable file systems in High Performance Computing (HPC), accurate anomaly detection on runtime logs has become increasingly important. However, many state-of-the-art methods for log-based anomaly detection, such as DeepLog, struggle when applied to logs from parallel file systems (PFSes), largely because PFS log sequences are irregular and ambiguous in time. To circumvent these problems, this study proposes ClusterLog, a log pre-processing method that clusters the temporal sequence of log keys based on their semantic similarity. By grouping semantically and sentimentally similar logs, this approach aims to represent log sequences with the smallest number of unique log keys, intending to improve the ability of a downstream sequence-based model to effectively learn the log patterns. The preliminary results of ClusterLog indicate not only its effectiveness in reducing the granularity of log sequences without losing important sequence information but also its generalizability to logs from different file systems.
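The abstract does not spell out the exact embedding or clustering components, but the core pre-processing idea can be sketched roughly as follows; the TF-IDF vectors, the agglomerative clustering, the distance threshold, and the example log keys below are stand-in assumptions, not ClusterLog's actual pipeline.

    # Hypothetical sketch: cluster semantically similar log keys, then remap a
    # log-key sequence onto cluster IDs so a downstream sequence model sees a
    # smaller alphabet of symbols. TF-IDF plus agglomerative clustering stand
    # in for whatever embedding/clustering ClusterLog actually uses.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import AgglomerativeClustering

    log_keys = [
        "client connection timed out",
        "client connection reset by peer",
        "metadata server started",
        "metadata server restarted",
    ]

    # Embed each unique log key as a vector (assumption: TF-IDF over key text).
    vectors = TfidfVectorizer().fit_transform(log_keys).toarray()

    # Group similar keys; distance_threshold controls cluster granularity.
    clustering = AgglomerativeClustering(
        n_clusters=None, distance_threshold=1.0,
        metric="cosine", linkage="average",
    ).fit(vectors)

    key_to_cluster = {key: int(c) for key, c in zip(log_keys, clustering.labels_)}

    # A raw temporal sequence of log keys becomes a shorter-alphabet sequence.
    raw_sequence = [log_keys[0], log_keys[1], log_keys[2], log_keys[3]]
    clustered_sequence = [key_to_cluster[k] for k in raw_sequence]
    print(clustered_sequence)

The remapped sequence is what would be fed to a sequence-based detector such as DeepLog in place of the raw log-key IDs.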
RLScheduler: Learn to Schedule HPC Batch Jobs Using Deep Reinforcement Learning
Zhang, Di, Dai, Dong, He, Youbiao, Bao, Forrest Sheng
We present RLScheduler, a deep reinforcement learning based job scheduler for scheduling independent batch jobs in high-performance computing (HPC) environments. Starting with no knowledge of scheduling, RLScheduler autonomously learns how to effectively schedule HPC batch jobs toward a given optimization goal. This is achieved through deep reinforcement learning, aided by specially designed neural network structures and various optimizations that stabilize and accelerate the learning. Our results show that RLScheduler outperforms existing heuristic scheduling algorithms, including a manually fine-tuned machine learning-based scheduler, on the same workload. More importantly, we show that RLScheduler does not blindly over-fit the given workload to achieve this optimization; instead, it learns general rules for scheduling batch jobs that can be applied to different workloads and systems to achieve similarly optimized performance. We also demonstrate that RLScheduler can adjust itself to changing goals and workloads, making it an attractive solution for future autonomous HPC management.
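As a rough illustration of the kind of policy such a scheduler learns (not the authors' implementation; the job features, network sizes, and REINFORCE-style usage below are assumptions), one can score each pending job with a small network and sample the next job to run from a softmax over the scores.

    # Minimal sketch of a learned scheduling policy: score pending jobs from a
    # few hand-picked features and sample one to dispatch next.
    import torch
    import torch.nn as nn

    class JobScorer(nn.Module):
        """Scores each pending job; a softmax over scores gives the policy."""
        def __init__(self, n_features: int = 4, hidden: int = 32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_features, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, jobs: torch.Tensor) -> torch.Tensor:
            # jobs: (num_pending, n_features) -> (num_pending,) logits
            return self.net(jobs).squeeze(-1)

    policy = JobScorer()
    # Example queue rows: [requested_nodes, requested_time, wait_time, submit_order]
    pending = torch.tensor([[4., 3600., 120., 0.],
                            [1.,  600.,  30., 1.],
                            [16., 7200.,  10., 2.]])
    dist = torch.distributions.Categorical(logits=policy(pending))
    action = dist.sample()
    log_prob = dist.log_prob(action)
    # During training, log_prob would enter a policy-gradient loss (e.g. REINFORCE
    # or PPO) weighted by a reward derived from the scheduling objective.
    print(int(action))

Scoring jobs individually and combining the scores with a softmax keeps the policy applicable to queues of varying length, which is one way to handle the variable action space of batch scheduling.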
Rare Disease Physician Targeting: A Factor Graph Approach
Cai, Yong, Wang, Yunlong, Dai, Dong
In rare disease physician targeting, a major challenge is identifying physicians who are treating diagnosed or underdiagnosed rare disease patients. Rare diseases have extremely low incidence rates: for a given rare disease, only a small number of patients are affected and only a small fraction of physicians are involved. Existing targeting methodologies, such as segmentation and profiling, were developed under a mass-market assumption and are not suitable for the rare disease market, where the target classes are extremely imbalanced. The authors propose a graphical model approach that predicts targets by jointly modeling physician and patient features from different data spaces and utilizing the extra relational information. Through an empirical example with medical claim and prescription data, the proposed approach demonstrates better accuracy in finding target physicians. The graph representation also provides visual interpretability of the relationships among physicians and patients. The model can be extended to incorporate more complex dependency structures. This article contributes to the literature by exploring the benefit of utilizing relational dependencies among entities in the healthcare industry.
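A toy illustration of the underlying idea (not the paper's actual model; every potential value below is invented) is a two-node factor graph in which unary factors encode each entity's own features and a pairwise factor encodes the treating relation between a physician and a patient.

    # Toy factor-graph sketch: one binary physician label, one binary patient
    # label, unary potentials from each entity's features, and a pairwise
    # potential for the treats-relation. Marginals via brute-force enumeration.
    import itertools
    import numpy as np

    # Unary potentials phi(x) for x in {0: non-target, 1: target}, e.g. derived
    # from claim/prescription features of each entity (values are made up).
    phi_physician = np.array([0.8, 0.2])
    phi_patient   = np.array([0.6, 0.4])

    # Pairwise potential psi(physician, patient): a physician connected to a
    # likely rare-disease patient is more likely to be a target.
    psi = np.array([[0.9, 0.3],
                    [0.4, 1.0]])

    # Unnormalized joint, then the marginal for the physician label.
    marginal = np.zeros(2)
    for d, p in itertools.product(range(2), range(2)):
        marginal[d] += phi_physician[d] * phi_patient[p] * psi[d, p]
    marginal /= marginal.sum()
    print("P(physician is a target) =", round(float(marginal[1]), 3))

In a realistic setting the graph has many physicians and patients linked by claim and prescription records, and exact enumeration is replaced by approximate inference such as loopy belief propagation.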
Deviation optimal learning using greedy Q-aggregation
Dai, Dong, Rigollet, Philippe, Zhang, Tong
Given a finite family of functions, the goal of model selection aggregation is to construct a procedure that mimics the function from this family that is the closest to an unknown regression function. More precisely, we consider a general regression model with fixed design and measure the distance between functions by the mean squared error at the design points. While procedures based on exponential weights are known to solve the problem of model selection aggregation in expectation, they are, surprisingly, sub-optimal in deviation. We propose a new formulation called Q-aggregation that addresses this limitation; namely, its solution leads to sharp oracle inequalities that are optimal in a minimax sense. Moreover, based on the new formulation, we design greedy Q-aggregation procedures that produce sparse aggregation models achieving the optimal rate. The convergence and performance of these greedy procedures are illustrated and compared with other standard methods on simulated examples.
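For concreteness, the usual fixed-design formulation behind these statements can be sketched as follows (the notation is assumed here rather than quoted from the paper). With observations $Y_i = f^*(x_i) + \xi_i$ at fixed design points and the distance
\[
\|f-g\|_n^2 = \frac{1}{n}\sum_{i=1}^n \bigl(f(x_i)-g(x_i)\bigr)^2,
\]
an aggregate $\hat f$ built from the family $\{f_1,\dots,f_M\}$ satisfies a sharp oracle inequality in deviation if, with probability at least $1-\delta$,
\[
\|\hat f - f^*\|_n^2 \;\le\; \min_{1\le j\le M}\|f_j - f^*\|_n^2 \;+\; C\,\frac{\log(M/\delta)}{n},
\]
where the remainder of order $\log M / n$ is the minimax-optimal price for model selection aggregation; exponential-weights procedures attain this in expectation but not with this high-probability guarantee.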
Greedy Model Averaging
Dai, Dong, Zhang, Tong
This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are mis-specified, model averaging is superior to model selection. Specifically, let $n$ be the sample size; then the worst-case regret of the former decays at the rate $O(1/n)$, while the worst-case regret of the latter decays at the rate $O(1/\sqrt{n})$. In the literature, the most important and widely studied model averaging method that achieves the optimal $O(1/n)$ average regret is the exponential weighted model averaging (EWMA) algorithm. However, this method suffers from several limitations. The purpose of this paper is to present a new greedy model averaging procedure that improves on EWMA. We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples.
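To make the rates concrete, the regret in this setting is commonly written as (notation assumed, not quoted from the paper)
\[
R(\hat f) \;=\; \mathbb{E}\,\|\hat f - f^*\|^2 \;-\; \min_{1\le j\le M}\|f_j - f^*\|^2,
\]
where $f^*$ is the true regression function and $f_1,\dots,f_M$ are the candidate models; selecting a single model gives worst-case regret of order $O(1/\sqrt{n})$, whereas averaging over the models, as EWMA and the proposed greedy procedure do, attains the faster $O(1/n)$ rate.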