Wang, Che
Knowledge Graph Construction in Power Distribution Networks
Li, Xiang, Wang, Che, Li, Bing, Chen, Hao, Li, Sizhe
In this paper, we propose a method for knowledge graph construction in power distribution networks. The method leverages entity features, covering their semantic, phonetic, and syntactic characteristics, in both the distribution network knowledge graph and the dispatching texts. An enhanced model based on a Convolutional Neural Network is used to match dispatch text entities with those in the knowledge graph. The effectiveness of this model is evaluated through experiments in real-world power distribution dispatch scenarios. The results indicate that, compared with the baselines, the proposed model excels at linking a variety of entity types and achieves high overall accuracy on the power distribution knowledge graph construction task.
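A minimal sketch of how a convolutional entity matcher of the kind described above might be structured is shown below. The character-level encoding, network sizes, and scoring head are illustrative assumptions, not the paper's implementation; in practice the semantic, phonetic, and syntactic features would be encoded and fused in a similar fashion.

```python
# Illustrative sketch of a CNN-based entity matcher (assumed architecture, not the paper's exact model).
import torch
import torch.nn as nn

class EntityEncoder(nn.Module):
    """Encodes a character-indexed entity name into a fixed-size vector with 1D convolutions."""
    def __init__(self, vocab_size=4000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, hidden_dim, kernel_size=3, padding=1)

    def forward(self, ids):                      # ids: (batch, seq_len)
        x = self.embed(ids).transpose(1, 2)      # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))             # (batch, hidden_dim, seq_len)
        return x.max(dim=2).values               # max-pool over the sequence

class EntityMatcher(nn.Module):
    """Scores how well a dispatch-text entity matches a knowledge-graph entity."""
    def __init__(self):
        super().__init__()
        self.encoder = EntityEncoder()
        self.scorer = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, text_entity_ids, kg_entity_ids):
        a = self.encoder(text_entity_ids)
        b = self.encoder(kg_entity_ids)
        return torch.sigmoid(self.scorer(torch.cat([a, b], dim=1)))  # match probability
```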
Dynamic Fault Characteristics Evaluation in Power Grid
Pei, Hao, Lin, Si, Li, Chuanfu, Wang, Che, Chen, Haoming, Li, Sizhe
To raise the level of intelligence in operation and maintenance, a novel method for fault detection in power grids is proposed. The proposed GNN-based approach first identifies fault nodes through a specialized feature extraction method coupled with a knowledge graph. By incorporating temporal data, the method leverages the status of nodes from preceding and subsequent time periods to aid current fault detection. To validate the effectiveness of the node features, a correlation analysis of the output features from each node was conducted. Experimental results show that the method locates fault nodes in simulation scenarios with remarkable accuracy. Additionally, the graph-neural-network-based feature modeling allows for a qualitative examination of how faults spread across nodes, providing valuable insights for analyzing fault nodes.
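A minimal sketch of a GNN node classifier that uses the status of nodes from the preceding and subsequent time periods, as described above. The layer choices, dimensions, and the simple normalized-adjacency convolution are illustrative assumptions, not the paper's exact model.

```python
# Illustrative sketch of GNN-based fault-node classification with temporal context
# (assumed architecture; the paper's feature extraction and layer choices may differ).
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """One graph convolution: aggregate neighbor features through a normalized adjacency matrix."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):              # x: (nodes, in_dim), adj_norm: (nodes, nodes)
        return torch.relu(self.linear(adj_norm @ x))

class FaultNodeClassifier(nn.Module):
    """Classifies each node as faulty or healthy, given features from the previous,
    current, and next time steps concatenated per node."""
    def __init__(self, feat_dim, hidden_dim=64):
        super().__init__()
        self.gc1 = SimpleGraphConv(3 * feat_dim, hidden_dim)   # 3 time steps stacked
        self.gc2 = SimpleGraphConv(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, 2)                    # fault / no fault

    def forward(self, x_prev, x_curr, x_next, adj_norm):
        x = torch.cat([x_prev, x_curr, x_next], dim=1)
        h = self.gc2(self.gc1(x, adj_norm), adj_norm)
        return self.head(h)                                     # per-node logits
```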
Pre-training with Synthetic Data Helps Offline Reinforcement Learning
Wang, Zecheng, Wang, Che, Dong, Zixuan, Ross, Keith
Recently, it has been shown that for offline deep reinforcement learning (DRL), pre-training Decision Transformer with a large language corpus can improve downstream performance (Reid et al., 2022). A natural question to ask is whether this performance gain can only be achieved with language pre-training, or can be achieved with simpler pre-training schemes which do not involve language. In this paper, we first show that language is not essential for improved performance, and indeed pre-training with synthetic IID data for a small number of updates can match the performance gains from pre-training with a large language corpus; moreover, pre-training with data generated by a one-step Markov chain can further improve the performance. Inspired by these experimental results, we then consider pre-training Conservative Q-Learning (CQL), a popular offline DRL algorithm, which is Q-learning-based and typically employs a Multi-Layer Perceptron (MLP) backbone. Surprisingly, pre-training with simple synthetic data for a small number of updates can also improve CQL, providing consistent performance improvement on D4RL Gym locomotion datasets. The results of this paper not only illustrate the importance of pre-training for offline DRL but also show that the pre-training data can be synthetic and generated with remarkably simple mechanisms.

It is well-known that pre-training can provide significant boosts in performance and robustness for downstream tasks, both for Natural Language Processing (NLP) and Computer Vision (CV). Recently, in the field of Deep Reinforcement Learning (DRL), research on pre-training is also becoming increasingly popular. An important step in the direction of pre-training DRL models is the recent paper by Reid et al. (2022), which showed that for Decision Transformer (Chen et al., 2021), pre-training with the Wikipedia corpus can significantly improve the performance of the downstream offline RL task. Reid et al. (2022) further showed that pre-training on predicting pixel sequences can hurt performance. The authors state that their results indicate "a foreseeable future where everyone should use a pre-trained language model for offline RL".
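The two synthetic pre-training schemes discussed above are simple to reproduce in outline. The sketch below generates IID token sequences and one-step Markov chain sequences; the vocabulary size, sequence length, and transition-matrix construction are illustrative assumptions.

```python
# Illustrative sketch of the two synthetic pre-training data generators described above:
# IID tokens and tokens from a one-step Markov chain (vocabulary size and transition
# matrix construction are assumptions for illustration).
import numpy as np

def generate_iid_sequences(num_seqs, seq_len, vocab_size, rng):
    """Each token drawn independently and uniformly from the vocabulary."""
    return rng.integers(0, vocab_size, size=(num_seqs, seq_len))

def generate_markov_sequences(num_seqs, seq_len, vocab_size, rng):
    """Each token depends only on the previous token through a fixed random transition matrix."""
    transition = rng.dirichlet(np.ones(vocab_size), size=vocab_size)  # row-stochastic matrix
    seqs = np.empty((num_seqs, seq_len), dtype=np.int64)
    seqs[:, 0] = rng.integers(0, vocab_size, size=num_seqs)
    for t in range(1, seq_len):
        for i in range(num_seqs):
            seqs[i, t] = rng.choice(vocab_size, p=transition[seqs[i, t - 1]])
    return seqs

rng = np.random.default_rng(0)
iid_data = generate_iid_sequences(num_seqs=100, seq_len=128, vocab_size=100, rng=rng)
markov_data = generate_markov_sequences(num_seqs=100, seq_len=128, vocab_size=100, rng=rng)
# These sequences would then be used for a small number of next-token-prediction updates
# before fine-tuning on the downstream offline RL task.
```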
VRL3: A Data-Driven Framework for Visual Deep Reinforcement Learning
Wang, Che, Luo, Xufang, Ross, Keith, Li, Dongsheng
We propose VRL3, a powerful data-driven framework with a simple design for solving challenging visual deep reinforcement learning (DRL) tasks. We analyze a number of major obstacles in taking a data-driven approach, and present a suite of design principles, novel findings, and critical insights about data-driven visual DRL. Our framework has three stages: in stage 1, we leverage non-RL datasets (e.g. ImageNet) to learn task-agnostic visual representations; in stage 2, we use offline RL data (e.g. a limited number of expert demonstrations) to convert the task-agnostic representations into more powerful task-specific representations; in stage 3, we fine-tune the agent with online RL. On a set of challenging hand manipulation tasks with sparse reward and realistic visual inputs, compared to the previous SOTA, VRL3 achieves an average of 780% better sample efficiency. And on the hardest task, VRL3 is 1220% more sample efficient (2440% when using a wider encoder) and solves the task with only 10% of the computation. These significant results clearly demonstrate the great potential of data-driven deep reinforcement learning.
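A highly simplified, runnable sketch of the three-stage structure described above is given below. The networks, stand-in data, and objectives are placeholders for illustration only, not the actual VRL3 implementation.

```python
# Runnable, high-level sketch of the three-stage pipeline described above. The networks, data,
# and objectives are simplified stand-ins for illustration, not the actual VRL3 implementation.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())     # task-agnostic visual encoder

# Stage 1: supervised pre-training on a non-RL image dataset (ImageNet stand-in: random tensors here).
cls_head = nn.Linear(16, 10)
images, labels = torch.randn(8, 3, 64, 64), torch.randint(0, 10, (8,))
opt = torch.optim.Adam(list(encoder.parameters()) + list(cls_head.parameters()), lr=1e-4)
loss = nn.functional.cross_entropy(cls_head(encoder(images)), labels)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: offline RL / behavior cloning on a small set of expert demonstrations (stand-in data),
# turning the task-agnostic features into task-specific ones.
policy = nn.Linear(16, 4)                                          # 4 = assumed action dimension
demo_obs, demo_act = torch.randn(8, 3, 64, 64), torch.randn(8, 4)
opt = torch.optim.Adam(list(encoder.parameters()) + list(policy.parameters()), lr=1e-4)
loss = (policy(encoder(demo_obs)) - demo_act).pow(2).mean()
opt.zero_grad(); loss.backward(); opt.step()

# Stage 3 would continue updating the same encoder and policy with online RL interaction.
```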
Aggressive Q-Learning with Ensembles: Achieving Both High Sample Efficiency and High Asymptotic Performance
Wu, Yanqiu, Chen, Xinyue, Wang, Che, Zhang, Yiming, Zhou, Zijian, Ross, Keith W.
Recently, Truncated Quantile Critics (TQC), using a distributional representation of critics, was shown to provide state-of-the-art asymptotic training performance on all environments from the MuJoCo continuous control benchmark suite. Also recently, Randomized Ensembled Double Q-Learning (REDQ), using a high update-to-data ratio and target randomization, was shown to achieve high sample efficiency that is competitive with state-of-the-art model-based methods. In this paper, we propose a novel model-free algorithm, Aggressive Q-Learning with Ensembles (AQE), which improves the sample-efficiency performance of REDQ and the asymptotic performance of TQC, thereby providing overall state-of-the-art performance during all stages of training. Moreover, AQE is very simple, requiring neither a distributional representation of critics nor target randomization.

Off-policy Deep Reinforcement Learning algorithms aim to improve sample efficiency by reusing past experience. A number of off-policy Deep RL algorithms have been proposed for control tasks with continuous state and action spaces, including Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3) and Soft Actor Critic (SAC) (Lillicrap et al., 2016; Fujimoto et al., 2018; Haarnoja et al., 2018a;b). TD3 introduced clipped double-Q learning, and was shown to be significantly more sample efficient than popular on-policy methods for a wide range of MuJoCo benchmarks. SAC has a similar off-policy structure with clipped double-Q learning, but it also employs maximum entropy reinforcement learning. SAC was shown to provide excellent sample efficiency and asymptotic performance in a wide range of MuJoCo environments, including the high-dimensional Humanoid environment for which both DDPG and TD3 perform poorly.
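The clipped double-Q learning mentioned above (introduced by TD3 and shared by SAC) can be sketched compactly. The networks and dimensions below are placeholders, and target-policy smoothing noise is omitted; AQE's own ensemble-based target is not detailed in this abstract, so only the clipped double-Q idea is illustrated.

```python
# Minimal sketch of the clipped double-Q target mentioned above (placeholder networks,
# simplified deterministic target policy; target smoothing noise is omitted).
import torch
import torch.nn as nn

q1, q2 = nn.Linear(6, 1), nn.Linear(6, 1)          # two critics over (state, action); 4+2 dims assumed
target_policy = nn.Linear(4, 2)

def clipped_double_q_target(next_state, reward, done, gamma=0.99):
    next_action = torch.tanh(target_policy(next_state))
    sa = torch.cat([next_state, next_action], dim=1)
    next_q = torch.min(q1(sa), q2(sa))              # take the smaller of the two critic estimates
    return reward + gamma * (1.0 - done) * next_q   # TD target used to update both critics

target = clipped_double_q_target(torch.randn(32, 4), torch.randn(32, 1), torch.zeros(32, 1))
```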
Randomized Ensembled Double Q-Learning: Learning Fast Without a Model
Chen, Xinyue, Wang, Che, Zhou, Zijian, Ross, Keith
Using a high Update-To-Data (UTD) ratio, model-based methods have recently achieved much higher sample efficiency than previous model-free methods for continuous-action DRL benchmarks. In this paper, we introduce a simple model-free algorithm, Randomized Ensembled Double Q-Learning (REDQ), and show that its performance is just as good as, if not better than, a state-of-the-art model-based algorithm for the MuJoCo benchmark. Moreover, REDQ can achieve this performance using fewer parameters than the model-based method, and with less wall-clock run time. REDQ has three carefully integrated ingredients which allow it to achieve its high performance: (i) a UTD ratio >> 1; (ii) an ensemble of Q functions; (iii) in-target minimization across a random subset of Q functions from the ensemble. Through carefully designed experiments, we provide a detailed analysis of REDQ and related model-free algorithms. To our knowledge, REDQ is the first successful model-free DRL algorithm for continuous-action spaces using a UTD ratio >> 1.
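The three ingredients listed above can be sketched as follows; the network sizes, policy, and constants are placeholders chosen for illustration.

```python
# Minimal sketch of the three REDQ ingredients described above: an ensemble of Q functions,
# in-target minimization over a random subset of the ensemble, and a UTD ratio > 1.
import random
import torch
import torch.nn as nn

N, M, UTD = 10, 2, 20                                 # ensemble size, subset size, update-to-data ratio
ensemble = [nn.Linear(6, 1) for _ in range(N)]        # critics over (state, action); 4+2 dims assumed

def redq_target(next_state, next_action, reward, done, gamma=0.99):
    sa = torch.cat([next_state, next_action], dim=1)
    subset = random.sample(ensemble, M)               # in-target minimization over a random subset
    min_q = torch.min(torch.stack([q(sa) for q in subset], dim=0), dim=0).values
    return reward + gamma * (1.0 - done) * min_q

target = redq_target(torch.randn(32, 4), torch.randn(32, 2), torch.randn(32, 1), torch.zeros(32, 1))
# With a UTD ratio of 20, every environment step is followed by 20 gradient updates of all
# N critics toward targets of this form; the policy update typically uses the ensemble mean.
```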
BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning
Chen, Xinyue, Zhou, Zijian, Wang, Zheng, Wang, Che, Wu, Yanqiu, Deng, Qing, Ross, Keith
The field of Deep Reinforcement Learning (DRL) has recently seen a surge in research in batch reinforcement learning, which aims for sample-efficient learning from a given data set without additional interactions with the environment. In the batch DRL setting, commonly employed off-policy DRL algorithms can perform poorly and sometimes even fail to learn altogether. In this paper, we propose a new algorithm, Best-Action Imitation Learning (BAIL), which, unlike many off-policy DRL algorithms, does not involve maximizing Q functions over the action space. Striving for simplicity as well as performance, BAIL first selects from the batch the actions it believes to be high-performing for their corresponding states; it then uses those state-action pairs to train a policy network using imitation learning. Although BAIL is simple, we demonstrate that it achieves state-of-the-art performance on the MuJoCo benchmark.
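The two-phase "select high-performing actions, then imitate" structure described above can be sketched as follows. The selection rule used here (keep the pairs with the highest observed returns) is an illustrative simplification, not BAIL's actual selection mechanism.

```python
# Sketch of the two-phase "select, then imitate" structure described above. The global
# top-k return-based selection below is an illustrative assumption, not BAIL's exact rule.
import torch
import torch.nn as nn

def select_best_pairs(states, actions, returns, keep_ratio=0.25):
    """Phase 1: keep the state-action pairs with the highest observed returns."""
    k = max(1, int(keep_ratio * len(returns)))
    idx = torch.topk(returns, k).indices
    return states[idx], actions[idx]

def imitate(states, actions, epochs=100, lr=1e-3):
    """Phase 2: behavioral cloning on the selected pairs."""
    policy = nn.Sequential(nn.Linear(states.shape[1], 64), nn.ReLU(),
                           nn.Linear(64, actions.shape[1]))
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        loss = (policy(states) - actions).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return policy

s, a, r = torch.randn(500, 11), torch.randn(500, 3), torch.randn(500)   # stand-in batch data
policy = imitate(*select_best_pairs(s, a, r))
```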
Towards Simplicity in Deep Reinforcement Learning: Streamlined Off-Policy Learning
Wang, Che, Wu, Yanqiu, Vuong, Quan, Ross, Keith
The field of Deep Reinforcement Learning (DRL) has recently seen a surge in the popularity of maximum entropy reinforcement learning algorithms. Their popularity stems from the intuitive interpretation of the maximum entropy objective and their superior sample efficiency on standard benchmarks. In this paper, we seek to understand the primary contribution of the entropy term to the performance of maximum entropy algorithms. For the MuJoCo benchmark, we demonstrate that the entropy term in Soft Actor Critic (SAC) principally addresses the bounded nature of the action spaces. With this insight, we propose a simple normalization scheme which allows a streamlined algorithm without entropy maximization to match the performance of SAC. Our experimental results demonstrate a need to revisit the benefits of entropy regularization in DRL. We also propose a simple nonuniform sampling method for selecting transitions from the replay buffer during training. We further show that the streamlined algorithm with the simple nonuniform sampling scheme outperforms SAC and achieves state-of-the-art performance on challenging continuous control tasks.

Off-policy deep Reinforcement Learning (RL) algorithms aim to improve sample efficiency by reusing past experience. Recently, a number of new off-policy Deep Reinforcement Learning algorithms have been proposed for control tasks with continuous state and action spaces, including Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3) (Lillicrap et al., 2015; Fujimoto et al., 2018). TD3, in particular, has been shown to be significantly more sample efficient than popular on-policy methods for a wide range of MuJoCo benchmarks. The field of DRL has also recently seen a surge in the popularity of maximum entropy reinforcement learning algorithms, whose popularity stems from the intuitive interpretation of the maximum entropy objective and their superior sample efficiency on standard benchmarks.
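One way the kind of output normalization discussed above can look is sketched below: the pre-tanh action means are rescaled whenever their average magnitude exceeds 1, counteracting saturation of the bounded action space. The exact formula is an assumption for illustration rather than the paper's precise scheme.

```python
# Sketch of a simple output-normalization scheme of the kind discussed above (assumed formula).
import torch

def normalize_pre_tanh(mu):
    """mu: (batch, action_dim) pre-tanh policy outputs."""
    g = mu.abs().mean(dim=1, keepdim=True)          # average magnitude per sample
    scale = torch.where(g > 1.0, g, torch.ones_like(g))
    return mu / scale                                # leave small outputs untouched

mu = torch.tensor([[3.0, -6.0], [0.2, 0.4]])
actions = torch.tanh(normalize_pre_tanh(mu))        # bounded actions after normalization
```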
Boosting Soft Actor-Critic: Emphasizing Recent Experience without Forgetting the Past
Wang, Che, Ross, Keith
Soft Actor-Critic (SAC) [10, 11] is an off-policy actor-critic deep reinforcement learning (DRL) algorithm based on maximum entropy reinforcement learning. By combining off-policy updates with an actor-critic formulation, SAC achieves state-of-the-art performance on a range of continuous-action benchmark tasks, outperforming prior on-policy and off-policy methods. The off-policy method employed by SAC samples data uniformly from past experience when performing parameter updates. We propose Emphasizing Recent Experience (ERE), a simple but powerful off-policy sampling technique, which emphasizes recently observed data while not forgetting the past. The ERE algorithm samples more aggressively from recent experience, and also orders the updates to ensure that updates from old data do not overwrite updates from new data. We compare vanilla SAC and SAC+ERE, and show that ERE is more sample efficient than vanilla SAC for continuous-action MuJoCo tasks [31]. We also consider combining SAC with Prioritized Experience Replay (PER) [28], a scheme originally proposed for deep Q-learning which prioritizes the data based on temporal-difference (TD) error. We show that SAC+PER can marginally improve the sample efficiency performance of SAC, but much less so than SAC+ERE. Finally, we propose an algorithm which integrates ERE and PER and show that this hybrid algorithm can give the best results for some of the MuJoCo tasks.
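The ERE sampling idea described above can be sketched as follows: over a sequence of K updates, later updates draw from progressively more recent slices of the replay buffer, so the newest data is emphasized without discarding the old. The decay schedule and constants below are illustrative assumptions.

```python
# Sketch of Emphasizing Recent Experience sampling: the k-th of K updates samples uniformly
# from only the most recent c_k transitions (decay schedule and constants are illustrative).
import random

def ere_sample(buffer, k, K, eta=0.996, c_min=5000, batch_size=256):
    """Sample the k-th (1-indexed) of K updates from only the most recent c_k transitions."""
    N = len(buffer)
    c_k = max(int(N * eta ** (k * 1000 / K)), min(c_min, N))
    recent = buffer[-c_k:]                           # most recent c_k transitions
    return random.choices(recent, k=batch_size)

buffer = list(range(100000))                         # stand-in replay buffer of transition indices
early_batch = ere_sample(buffer, k=1, K=60)          # draws from nearly the whole buffer
late_batch = ere_sample(buffer, k=60, K=60)          # draws only from the newest transitions
```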
Portfolio Online Evolution in StarCraft
Wang, Che, Chen, Pan, Li, Yuanda, Holmgård, Christoffer, Togelius, Julian (all New York University)
Portfolio Online Evolution is a novel method for playing real-time strategy games through evolutionary search in the space of assignments of scripts to individual game units. This method builds on and recombines two recently devised methods for playing multi-action games: (1) Portfolio Greedy Search, which searches in the space of heuristics assigned to units rather than in the space of actions, and (2) Online Evolution, which uses evolution rather than tree search to effectively play games where multiple actions per turn lead to enormous branching factors. The combination of the two ideas leads to using evolution to search over which script/heuristic is assigned to which unit. In this paper, we introduce Portfolio Online Evolution and apply it to StarCraft micro, i.e., individual battles. It is shown to outperform all other tested methods in battles of moderate to large size.
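The core idea described above, evolving assignments of scripts to individual units, can be sketched as follows. The script portfolio, mutation scheme, and fitness function are stand-ins; in the paper, fitness would come from forward-simulating the battle with each unit controlled by its assigned script.

```python
# Sketch of evolving script-to-unit assignments, as described above (stand-in portfolio and fitness).
import random

SCRIPTS = ["attack_weakest", "attack_closest", "kite", "no_overkill"]   # illustrative portfolio
NUM_UNITS = 8

def random_genome():
    return [random.randrange(len(SCRIPTS)) for _ in range(NUM_UNITS)]   # one script per unit

def mutate(genome, rate=0.2):
    return [random.randrange(len(SCRIPTS)) if random.random() < rate else g for g in genome]

def fitness(genome):
    """Stand-in evaluation; a real implementation would forward-simulate the battle
    with each unit controlled by its assigned script and score the resulting state."""
    return -sum(genome)                               # placeholder objective

def evolve(pop_size=20, generations=30):
    population = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)    # most-fit genomes first
        elite = population[: pop_size // 4]
        population = elite + [mutate(random.choice(elite)) for _ in range(pop_size - len(elite))]
    return max(population, key=fitness)

best_assignment = evolve()                            # script index for each unit this turn
```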