Markov Models
Domain Knowledge Aids in Signal Disaggregation; the Example of the Cumulative Water Heater
Belikov, Alexander, Matheron, Guillaume, Sassi, Johan
In this article we present an unsupervised low-frequency method aimed at detecting and disaggregating the power used by Cumulative Water Heaters (CWH) in residential homes. Our model circumvents the inherent difficulty of unsupervised signal disaggregation by using both the shape of a power spike and its time of occurrence to identify the contribution of CWH reliably. Indeed, many CHWs in France are configured to turn on automatically during off-peak hours only, and we are able to use this domain knowledge to aid peak identification despite the low sampling frequency. In order to test our model, we equipped a home with sensors to record the ground-truth consumption of a water heater. We then apply the model to a larger dataset of energy consumption of Hello Watt users consisting of one month of consumption data for 5k homes at 30-minute resolution. In this dataset we successfully identified CWHs in the majority of cases where consumers declared using them. The remaining part is likely due to possible misconfiguration of CWHs, since triggering them during off-peak hours requires specific wiring in the electrical panel of the house. Our model, despite its simplicity, offers promising applications: detection of mis-configured CWHs on off-peak contracts and slow performance degradation.
Adaptative clustering by minimization of the mixing entropy criterion
We present a clustering method and provide a theoretical analysis and an explanation to a phenomenon encountered in the applied statistical literature since the 1990's. This phenomenon is the natural adaptability of the order when using a clustering method derived from the famous EM algorithm. We define a new statistic, the relative entropic order, that represents the number of clumps in the target distribution. We prove in particular that the empirical version of this relative entropic order is consistent. Our approach is easy to implement and has a high potential of applications. Perspectives of this works are algorithmic and theoretical, with possible natural extensions to various cases such as dependent or multidimensional data.
A Note on Target Q-learning For Solving Finite MDPs with A Generative Oracle
Q-learning is one of the most simple yet popular algorithms in the reinforcement learning (RL) community [Sutton and Barto, 2018]. However, Q-learning suffers the divergence issue when (linear) function approximation is applied [Baird, 1995, Tsitsiklis and Van Roy, 1997]. To address this instability issue, a technique called target network is proposed in the famous DQN algorithm [Mnih et al., 2015]. In particular, DQN implements a duplication of the main Q-network (i.e., the so-called target network), which is further used to generate the bootstrap signal for updates. One important feature is that the target network is fixed over intervals. Unlike Q-learning, the learning targets do not change during an interval for DQN. In [Mnih et al., 2015, Table 3], it is reported that the target network contributes a lot to the superior performance of DQN.
How to read AI/ML research papers๏ฟผ
Research papers are the wellspring of ideas pushing the frontiers of cutting edge technologies. Scholars around the globe rely on platforms like arXiv, JSTOR, Reddit, PapersWithCode to get up to speed on the latest in AI and data science. But let's face it, research papers are a hard nut to crack. The time and effort you put into unpacking a research paper might go down the drain unless you have a method/reading strategy in place. To that end, we have put together a tutorial to help you get the best out of research papers.
A Class of Two-Timescale Stochastic EM Algorithms for Nonconvex Latent Variable Models
Expectation-Maximization (EM) algorithm is a popular choice for learning latent variable models. Variants of the EM have been initially introduced by Neal and Hinton (1998), using incremental updates to scale to large datasets, and by Wei and Tanner (1990); Delyon et al. (1999), using Monte Carlo (MC) approximations to bypass the intractable conditional expectation of the latent data for most nonconvex models. In this paper, we propose a general class of methods called Two-Timescale EM Methods based on a two-stage approach of stochastic updates to tackle an essential nonconvex optimization task for latent variable models. We motivate the choice of a double dynamic by invoking the variance reduction virtue of each stage of the method on both sources of noise: the index sampling for the incremental update and the MC approximation. We establish finite-time and global convergence bounds for nonconvex objective functions. Numerical applications on various models such as deformable template for image analysis or nonlinear models for pharmacokinetics are also presented to illustrate our findings.
LIGS: Learnable Intrinsic-Reward Generation Selection for Multi-Agent Learning
Mguni, David Henry, Jafferjee, Taher, Wang, Jianhong, Slumbers, Oliver, Perez-Nieves, Nicolas, Tong, Feifei, Yang, Li, Zhu, Jiangcheng, Yang, Yaodong, Wang, Jun
Efficient exploration is important for reinforcement learners to achieve high rewards. In multi-agent systems, coordinated exploration and behaviour is critical for agents to jointly achieve optimal outcomes. In this paper, we introduce a new general framework for improving coordination and performance of multi-agent reinforcement learners (MARL). Our framework, named Learnable Intrinsic-Reward Generation Selection algorithm (LIGS) introduces an adaptive learner, Generator that observes the agents and learns to construct intrinsic rewards online that coordinate the agents' joint exploration and joint behaviour. Using a novel combination of MARL and switching controls, LIGS determines the best states to learn to add intrinsic rewards which leads to a highly efficient learning process. LIGS can subdivide complex tasks making them easier to solve and enables systems of MARL agents to quickly solve environments with sparse rewards. LIGS can seamlessly adopt existing MARL algorithms and, our theory shows that it ensures convergence to policies that deliver higher system performance. We demonstrate its superior performance in challenging tasks in Foraging and StarCraft II.
Policy Learning for Robust Markov Decision Process with a Mismatched Generative Model
Li, Jialian, Ren, Tongzheng, Yan, Dong, Su, Hang, Zhu, Jun
In high-stake scenarios like medical treatment and auto-piloting, it's risky or even infeasible to collect online experimental data to train the agent. Simulation-based training can alleviate this issue, but may suffer from its inherent mismatches from the simulator and real environment. It is therefore imperative to utilize the simulator to learn a robust policy for the real-world deployment. In this work, we consider policy learning for Robust Markov Decision Processes (RMDP), where the agent tries to seek a robust policy with respect to unexpected perturbations on the environments. Specifically, we focus on the setting where the training environment can be characterized as a generative model and a constrained perturbation can be added to the model during testing. Our goal is to identify a near-optimal robust policy for the perturbed testing environment, which introduces additional technical difficulties as we need to simultaneously estimate the training environment uncertainty from samples and find the worst-case perturbation for testing. To solve this issue, we propose a generic method which formalizes the perturbation as an opponent to obtain a two-player zero-sum game, and further show that the Nash Equilibrium corresponds to the robust policy. We prove that, with a polynomial number of samples from the generative model, our algorithm can find a near-optimal robust policy with a high probability. Our method is able to deal with general perturbations under some mild assumptions and can also be extended to more complex problems like robust partial observable Markov decision process, thanks to the game-theoretical formulation.
Near-optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism
Yin, Ming, Duan, Yaqi, Wang, Mengdi, Wang, Yu-Xiang
Offline reinforcement learning, which seeks to utilize offline/historical data to optimize sequential decision-making strategies, has gained surging prominence in recent studies. Due to the advantage that appropriate function approximators can help mitigate the sample complexity burden in modern reinforcement learning problems, existing endeavors usually enforce powerful function representation models (e.g. neural networks) to learn the optimal policies. However, a precise understanding of the statistical limits with function representations, remains elusive, even when such a representation is linear. Towards this goal, we study the statistical limits of offline reinforcement learning with linear model representations. To derive the tight offline learning bound, we design the variance-aware pessimistic value iteration (VAPVI), which adopts the conditional variance information of the value function for time-inhomogeneous episodic linear Markov decision processes (MDPs). VAPVI leverages estimated variances of the value functions to reweight the Bellman residuals in the least-square pessimistic value iteration and provides improved offline learning bounds over the best-known existing results (whereas the Bellman residuals are equally weighted by design). More importantly, our learning bounds are expressed in terms of system quantities, which provide natural instance-dependent characterizations that previous results are short of. We hope our results draw a clearer picture of what offline learning should look like when linear representations are provided.
Markov Decision Processes - DataScienceCentral.com
The Markov Decision Process (MDP) provides a mathematical framework for solving the RL problem. Almost all RL problems can be modeled as an MDP. MDPs are widely used for solving various optimization problems. In this section, we will understand what an MDP is and how it is used in RL. To understand an MDP, first, we need to learn about the Markov property and Markov chain.
Scalable Online Planning for Multi-Agent MDPs
Choudhury, Shushman (Lacuna) | Gupta, Jayesh K. (Microsoft) | Morales, Peter (Microsoft) | Kochenderfer, Mykel J. (Stanford University)
We present a scalable tree search planning algorithm for large multi-agent sequential decision problems that require dynamic collaboration. Teams of agents need to coordinate decisions in many domains, but naive approaches fail due to the exponential growth of the joint action space with the number of agents. We circumvent this complexity through an approach that allows us to trade computation for approximation quality and dynamically coordinate actions. Our algorithm comprises three elements: online planning with Monte Carlo Tree Search (MCTS), factored representations of local agent interactions with coordination graphs, and the iterative Max-Plus method for joint action selection. We evaluate our approach on the benchmark SysAdmin domain with static coordination graphs and achieve comparable performance with much lower computation cost than our MCTS baselines. We also introduce a multi-drone delivery domain with dynamic coordination graphs, and demonstrate how our approach scales to large problems on this domain that are intractable for other MCTS methods.