# Markov Models

### On Connections between Constrained Optimization and Reinforcement Learning

Dynamic Programming (DP) provides standard algorithms to solve Markov Decision Processes. However, these algorithms generally do not optimize a scalar objective function. In this paper, we draw connections between DP and (constrained) convex optimization. Specifically, we show clear links in the algorithmic structure between three DP schemes and optimization algorithms. We link Conservative Policy Iteration to Frank-Wolfe, Mirror-Descent Modified Policy Iteration to Mirror Descent, and Politex (Policy Iteration Using Expert Prediction) to Dual Averaging. These abstract DP schemes are representative of a number of (deep) Reinforcement Learning (RL) algorithms. By highlighting these connections (most of which have been noticed earlier, but in a scattered way), we would like to encourage further studies linking RL and convex optimization, that could lead to the design of new, more efficient, and better understood RL algorithms.
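
The Frank-Wolfe flavour of Conservative Policy Iteration can be sketched on a toy problem: each step moves the current stochastic policy toward the greedy policy by a convex combination, much as Frank-Wolfe moves toward the output of a linear minimization oracle. The following is a minimal sketch, not the paper's algorithm. The two-state MDP, the fixed step size `alpha`, and the exact tabular greedy step are all assumptions made for illustration.

```python
# Hypothetical 2-state, 2-action discounted MDP (illustration only).
GAMMA = 0.9
P = [  # P[s][a][t] = probability of moving from state s to state t under action a
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
R = [[0.0, 1.0], [2.0, 0.0]]  # R[s][a] = immediate reward


def evaluate(policy, sweeps=2000):
    """Iterative policy evaluation; returns (V, Q) for a stochastic policy."""
    n_s, n_a = len(P), len(P[0])

    def backup(s, a, v):
        return R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(n_s))

    v = [0.0] * n_s
    for _ in range(sweeps):
        v = [sum(policy[s][a] * backup(s, a, v) for a in range(n_a))
             for s in range(n_s)]
    q = [[backup(s, a, v) for a in range(n_a)] for s in range(n_s)]
    return v, q


def cpi_step(policy, alpha):
    """Mix the current policy with the greedy policy (Frank-Wolfe-style update)."""
    _, q = evaluate(policy)
    new_policy = []
    for s, probs in enumerate(policy):
        best = max(range(len(probs)), key=lambda a: q[s][a])
        new_policy.append([(1 - alpha) * p + alpha * (1.0 if a == best else 0.0)
                           for a, p in enumerate(probs)])
    return new_policy
```

Starting from the uniform policy `[[0.5, 0.5], [0.5, 0.5]]` and calling `cpi_step(policy, 0.2)` repeatedly keeps every row a valid probability distribution. In this exact tabular setting, the state values do not decrease from one step to the next.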

### 11 Alternatives To Keras For Deep Learning Enthusiasts

Infer.NET is a machine learning framework for running Bayesian inference in graphical models. It provides state-of-the-art message-passing algorithms and statistical routines needed to perform inference for a wide variety of applications. Its features include a rich modelling language, multiple inference algorithms, a design aimed at large-scale inference, and user extensibility. With this framework, Bayesian models such as Bayes Point Machine classifiers, TrueSkill matchmaking, hidden Markov models, and Bayesian networks can be implemented with ease.

### Simple Strategies in Multi-Objective MDPs (Technical Report)

We consider the verification of multiple expected reward objectives at once on Markov decision processes (MDPs). This enables a trade-off analysis among multiple objectives by obtaining the Pareto front. We focus on strategies that are easy to employ and implement, that is, strategies that are pure (no randomization) and have bounded memory. We show that checking whether a point is achievable by a pure stationary strategy is NP-complete, even for two objectives, and we provide an MILP encoding to solve the corresponding problem. The bounded memory case can be reduced to the stationary one by a product construction. Experimental results using Storm and Gurobi show the feasibility of our algorithms.
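
To make the achievability question concrete, the sketch below enumerates every pure stationary strategy of a tiny two-objective MDP, computes its expected total reward vector, and filters the Pareto front. This is brute-force enumeration, which is exponential in the number of states. It is not the MILP encoding from the report. The MDP, its action names, and the assumption that an absorbing state is always reached are hypothetical.

```python
from itertools import product

# Hypothetical MDP: MDP[s][a] = list of (probability, next_state, (reward_1, reward_2)).
# State 2 is absorbing and has no actions, so expected total rewards are finite.
MDP = {
    0: {"safe": [(1.0, 1, (1.0, 0.0))],
        "risky": [(0.5, 1, (0.0, 2.0)), (0.5, 2, (0.0, 0.0))]},
    1: {"left": [(1.0, 2, (2.0, 0.0))],
        "right": [(1.0, 2, (0.0, 1.0))]},
    2: {},
}


def evaluate(strategy, init=0, sweeps=100):
    """Expected total reward vector from `init` under a pure stationary strategy."""
    values = {s: (0.0, 0.0) for s in MDP}
    for _ in range(sweeps):
        new_values = {}
        for s in MDP:
            total = [0.0, 0.0]
            if MDP[s]:  # absorbing states keep value zero
                for prob, nxt, rew in MDP[s][strategy[s]]:
                    for i in range(2):
                        total[i] += prob * (rew[i] + values[nxt][i])
            new_values[s] = tuple(total)
        values = new_values
    return values[init]


def pure_strategies():
    """Yield every mapping from non-absorbing states to one action."""
    states = [s for s in MDP if MDP[s]]
    for choice in product(*(MDP[s] for s in states)):
        yield dict(zip(states, choice))


def dominates(p, q):
    """True if p is at least as good as q everywhere and differs somewhere."""
    return all(a >= b for a, b in zip(p, q)) and p != q


def pareto_front(points):
    unique = set(points)
    return sorted(p for p in unique if not any(dominates(q, p) for q in unique))


def achievable(point, points):
    """A point is achievable if some strategy meets or exceeds it in every objective."""
    return any(all(a >= b for a, b in zip(p, point)) for p in points)


POINTS = [evaluate(strategy) for strategy in pure_strategies()]
```

In this toy model, `achievable((2.0, 0.5), POINTS)` is false for pure stationary strategies. Randomizing evenly between `left` and `right` in state 1 after `safe` would reach exactly that point, which illustrates why restricting to pure strategies changes the problem.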

### Restless Hidden Markov Bandits with Linear Rewards

This paper presents an algorithm and regret analysis for the restless hidden Markov bandit problem with linear rewards. In this problem the reward received by the decision maker is a random linear function which depends on the arm selected and a hidden state. In contrast to previous works on Markovian bandits, we do not assume that the decision maker receives information regarding the state of the system, but has to infer it based on its actions and the received reward. Surprisingly, we can still maintain logarithmic regret in the case of a polyhedral action set. Furthermore, the regret does not depend on the number of extreme points in the action space.
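
Inferring a hidden state from actions and rewards can be illustrated with a standard Bayes filter over the hidden chain. The sketch below is not the paper's algorithm. It assumes a known two-state transition matrix, a known reward vector per hidden state, Gaussian reward noise, and the convention that the state evolves before the reward is observed. All names and numbers are hypothetical.

```python
import math

TRANSITION = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical hidden-state chain
THETA = [[1.0, 0.0], [0.0, 1.0]]       # hypothetical reward vector for each hidden state
SIGMA = 0.5                            # assumed standard deviation of the reward noise


def gaussian_pdf(x, mean, sigma):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def belief_update(belief, action, reward):
    """One predict-then-correct step of a Bayes filter over the hidden state."""
    n = len(belief)
    # Predict: the restless chain moves one step regardless of the chosen action.
    predicted = [sum(belief[i] * TRANSITION[i][j] for i in range(n)) for j in range(n)]
    # Correct: weight each state by the likelihood of the observed linear reward.
    weights = []
    for j in range(n):
        mean = sum(t * a for t, a in zip(THETA[j], action))
        weights.append(predicted[j] * gaussian_pdf(reward, mean, SIGMA))
    total = sum(weights)
    return [w / total for w in weights]
```

For example, starting from the belief `[0.5, 0.5]`, playing the action `[1.0, 0.0]` and observing a reward of `1.0` shifts most of the probability mass to hidden state 0, because only that state predicts a mean reward of 1 for this action.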

### Optimal Immunization Policy Using Dynamic Programming

Decisions in public health are almost always made in the context of uncertainty. Policy makers responsible for making important decisions are faced with the daunting task of choosing from many possible options. This task is called planning under uncertainty, and is particularly acute when addressing complex systems, such as issues of global health and development. Decision making under uncertainty is a challenging task, and all too often this uncertainty is averaged away to simplify results for policy makers. A popular way to approach this task is to formulate the problem at hand as a (partially observable) Markov decision process, (PO)MDP. This work aims to apply these AI efforts to challenging problems in health and development. In this paper, we developed a framework for optimal health policy design in a dynamic setting. We apply a stochastic dynamic programming approach to identify both the optimal time to change the health intervention policy and the optimal time to collect decision-relevant information.

### Multi Label Restricted Boltzmann Machine for Non-Intrusive Load Monitoring

Increasing population indicates that energy demands need to be managed in the residential sector. Prior studies have reflected that customers tend to reduce their energy consumption significantly if they are provided with appliance-level feedback. This observation has increased the relevance of load monitoring in today's tech-savvy world. Most of the previously proposed solutions claim to perform load monitoring without intrusion, but they are not completely non-intrusive. These methods require historical appliance-level data for training the model for each of the devices. This data is gathered by putting a sensor on each of the appliances present in the home, which causes intrusion in the building. Some recent studies have proposed that if we frame Non-Intrusive Load Monitoring (NILM) as a multi-label classification problem, the need for appliance-level data can be avoided. In this paper, we propose a Multi-label Restricted Boltzmann Machine (ML-RBM) for NILM and report an experimental evaluation of the proposed and state-of-the-art techniques.

### Audio-Conditioned U-Net for Position Estimation in Full Sheet Images

The goal of score following is to track a musical performance, usually in the form of audio, in a corresponding score representation. Established methods mainly rely on computer-readable scores in the form of MIDI or MusicXML and achieve robust and reliable tracking results. Recently, multimodal deep learning methods have been used to follow along musical performances in raw sheet images. Among the current limits of these systems is that they require a non-trivial number of preprocessing steps that unravel the raw sheet image into a single long system of staves. The current work is an attempt at removing this particular limitation. We propose an architecture capable of estimating matching score positions directly within entire unprocessed sheet images. We argue that this is a necessary first step towards a fully integrated score following system that does not rely on any preprocessing steps such as optical music recognition.

### Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes

Model-free reinforcement learning is known to be memory and computation efficient and more amenable to large-scale problems. In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Processes (MDPs). The first algorithm reduces the problem to the discounted-reward version and achieves $\mathcal{O}(T^{2/3})$ regret after $T$ steps, under the minimal assumption of weakly communicating MDPs. The second algorithm makes use of recent advances in adaptive algorithms for adversarial multi-armed bandits and improves the regret to $\mathcal{O}(\sqrt{T})$, albeit with a stronger ergodic assumption. To the best of our knowledge, these are the first model-free algorithms with sub-linear regret (that is polynomial in all parameters) in the infinite-horizon average-reward setting.
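
A short calculation shows what the two rates mean in practice. Sub-linear cumulative regret implies that the average regret per step vanishes as $T$ grows, and the exponent determines how fast. The helper below is a hypothetical illustration that ignores all constants and problem-dependent factors hidden in the $\mathcal{O}(\cdot)$ notation.

```python
def average_regret_rate(horizon, exponent):
    """Per-step regret implied by a cumulative bound of order horizon**exponent."""
    return horizon ** exponent / horizon


# Compare the two rates from the abstract at a few horizons (constants ignored).
for horizon in (10**3, 10**6, 10**9):
    slow = average_regret_rate(horizon, 2 / 3)  # O(T^{2/3}) cumulative regret
    fast = average_regret_rate(horizon, 1 / 2)  # O(sqrt(T)) cumulative regret
    print(f"T={horizon:>10}: T^(2/3)/T ~ {slow:.1e}, sqrt(T)/T ~ {fast:.1e}")
```

At $T = 10^6$ the first rate gives a per-step regret of order $10^{-2}$ and the second of order $10^{-3}$, so under these idealized bounds the gap widens as the horizon grows.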

### Understanding the Curse of Horizon in Off-Policy Evaluation via Conditional Importance Sampling

Due in part to the growing sources of data about past sequences of decisions and their outcomes - from marketing to energy management to healthcare - there is increasing interest in developing accurate and efficient algorithms for off-policy policy evaluation. For Markov Decision Processes, this problem was addressed (Precup et al., 2000) early on by importance sampling (IS) (Rubinstein, 1981), a method prone to large variance due to rare events (Glynn, 1994; L'Ecuyer et al., 2009). The per-decision importance sampling estimator of Precup et al. (2000) tries to mitigate this problem by leveraging the temporal structure of the domain - earlier rewards cannot depend on later decisions. While neither importance sampling (IS) nor per-decision IS (PDIS) assumes the underlying domain is Markov, more recently, a new class of estimators (Hallak and Mannor, 2017; Liu et al., 2018; Gelada and Bellemare, 2019) has been proposed that leverages the Markovian structure. In particular, these approaches propose performing importance sampling over the stationary state-action distributions induced by the corresponding Markov chain for a particular policy. By avoiding the explicit accumulation of likelihood ratios along the trajectories, it is hypothesized that such ratios of stationary distributions could substantially reduce the variance of the resulting estimator, thereby overcoming the "curse of horizon" (Liu et al., 2018) plaguing off-policy evaluation. The recent flurry of empirical results shows significant performance improvements over the alternative methods on a variety of simulation domains. Yet so far there has not been a formal analysis of the accuracy of IS, PDIS, and stationary state-action IS that would strengthen our understanding of their properties, benefits and limitations.
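
The difference between ordinary IS and PDIS can be seen in a few lines. The sketch below uses the textbook forms of both estimators. The trajectory format, a list of `(target_prob, behavior_prob, reward)` tuples for the actions actually taken, is an assumption made for illustration. The stationary-distribution estimators mentioned above are not shown.

```python
def trajectory_is(trajectories, gamma=1.0):
    """Ordinary IS: weight each full return by the product of all likelihood ratios."""
    total = 0.0
    for trajectory in trajectories:
        weight, ret = 1.0, 0.0
        for t, (target_prob, behavior_prob, reward) in enumerate(trajectory):
            weight *= target_prob / behavior_prob
            ret += gamma ** t * reward
        total += weight * ret
    return total / len(trajectories)


def per_decision_is(trajectories, gamma=1.0):
    """PDIS: weight the reward at step t only by the ratios of steps 0..t."""
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0
        for t, (target_prob, behavior_prob, reward) in enumerate(trajectory):
            weight *= target_prob / behavior_prob  # earlier rewards ignore later ratios
            total += gamma ** t * weight * reward
    return total / len(trajectories)
```

For the single trajectory `[(0.5, 0.25, 1.0), (0.25, 0.5, 2.0)]` the per-step ratios are 2 and 0.5. Ordinary IS multiplies the whole return of 3 by their product, while PDIS weights the first reward by 2 and the second by 1. In both estimators the product of ratios still grows with the trajectory length, which is the source of the variance problem the abstract refers to.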

### Hierarchical Hidden Markov Jump Processes for Cancer Screening Modeling

Authors: Rui Meng, Soper Braden, Jan Nygard, Mari Nygrad, Herbert Lee (UCSC; LLNL; Cancer Registry of Norway; UCSC).

Hidden Markov jump processes are an attractive approach for modeling clinical disease progression data because they are explainable and capable of handling both irregularly sampled and noisy data. Most applications in this context consider time-homogeneous models due to their relative computational simplicity. However, the time-homogeneous assumption is too strong to accurately model the natural history of many diseases. Moreover, the population at risk is not homogeneous either, since disease exposure and susceptibility can vary considerably. In this paper, we propose a piece-wise stationary transition matrix to explain the heterogeneity in time. We propose a hierarchical structure for the heterogeneity in population, where prior information is considered to deal with unbalanced data. Moreover, an efficient, scalable EM algorithm is proposed for inference. We demonstrate the feasibility and superiority of our model on a cervical cancer screening dataset from the Cancer Registry of Norway. Experiments show that our model outperforms state-of-the-art recurrent neural network models in terms of prediction accuracy and significantly outperforms a standard hidden Markov jump process in generating Kaplan-Meier estimators.

1 Introduction: Population-based screening programs for identifying undiagnosed individuals have a long history in improving public health. Examples include screening pro-