Learning Graphical Models
214cfbe603b7f9f9bc005d5f53f7a1d3-Paper.pdf
In this paper, we investigate the question: Given a small number of datapoints, for example N = 30, how tight can PAC-Bayes and test set bounds be made? For such small datasets, test set bounds adversely affect generalisation performance by withholding data from the training procedure. In this setting, PAC-Bayes bounds are especially attractive, due to their ability to use all the data to simultaneously learn a posterior and bound its generalisation risk. We focus on the case of i.i.d.
Towards Instance-Optimal Offline Reinforcement Learning with Pessimism
We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP) using the data coming from a policy µ. In particular, we consider the sample complexity problems of offline RL for finite-horizon MDPs. Prior works study this problem based on different data-coverage assumptions, and their learning guarantees are expressed by the covering coefficients which lack the explicit characterization of system quantities.
Markov locality and relating it to p-locality
To gain intuition for how p-locality functions, we introduce another notion of locality, called Markov locality, which uses the language of Markov blankets. We will prove that, under relatively relaxed conditions, p-locality and Markov locality are equivalent. This will allow us to relate the notion of locality to various graph structures commonly used to represent probability distributions, and will be a key step in proving Properties 2.1 and 2.2.

We start by defining the Markov boundary, M(X,S), of a random variable X contained in a set of random variables S, as a minimal set such that p(X|S) = p(X|M(X,S)). The Markov boundary is thus a minimal set of variables such that, once we condition on it, conditioning on any additional random variables in S does not change the probability of X [39]. Similarly, we define the Markov blanket, M(X,S), for X in S as any set of variables such that conditioning on M(X,S) makes X conditionally independent of all other variables [39]. In this way, every Markov boundary is a Markov blanket, but not all blankets are boundaries.

Markov locality: Given a probability distribution p(Z) and a function $f: \mathbb{R}^{N_X+N_\Theta} \to \mathbb{R}^{N_\Theta}$, the update function f(Z) is Markov-local with respect to the distribution p over Z if and only if k: Z Ω s.t.

A Markov boundary can be thought of as the set of variables that 'locally' communicate with the parameter Θ_k, thus providing a natural measure of locality. Importantly, for Markov locality to be of use, we would like the Markov boundaries of random variables in the model of interest to be unique.
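To make the blanket/boundary distinction concrete, here is a minimal sketch for the special case of a Bayesian network. It rests on a standard fact that the text above does not state: in a directed acyclic graph, the parents, children, and co-parents of a node form a Markov blanket, and for a distribution that is faithful to the graph this set is also the unique Markov boundary. The function name `markov_blanket` and the toy network are hypothetical, chosen only for illustration.

```python
from typing import Dict, Set


def markov_blanket(parents: Dict[str, Set[str]], node: str) -> Set[str]:
    """Return parents, children, and co-parents of `node` in a DAG.

    `parents` maps every node to the set of its parents (assumed format).
    """
    # Children are the nodes that list `node` among their parents.
    children = {child for child, ps in parents.items() if node in ps}
    # Co-parents share at least one child with `node`.
    co_parents = {p for child in children for p in parents[child]}
    return (parents.get(node, set()) | children | co_parents) - {node}


# Hypothetical toy network: A -> C <- B, C -> D
toy_dag = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}

blanket_a = markov_blanket(toy_dag, "A")  # child C and co-parent B
blanket_c = markov_blanket(toy_dag, "C")  # parents A, B and child D
blanket_d = markov_blanket(toy_dag, "D")  # parent C only
```

Adding any further variable to one of these sets still yields a Markov blanket, but no longer a minimal one, which is why the boundary rather than an arbitrary blanket is the natural measure of locality.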
Model Adaptation: Historical Contrastive Learning for Unsupervised Domain Adaptation without Source Data Supplemental Materials
A.1 Proof of Proposition 1

Proposition 1 The historical contrastive instance discrimination (HCID) can be modelled as a maximum likelihood problem optimized via Expectation Maximization.

Maximum likelihood (ML) is a concept to describe the theoretic insights of clustering algorithms. ... $\sum_{n=1}^{N} Z(k_n) = 1$), and the last step of the derivation employs Jensen's inequality [6, 7, 4].

... $Z(k_n) \log p(x_q, k_n; \theta_E)$ (5)

The expectation step focuses on estimating the posterior probability $p(k_n; x_q, \theta_E)$. We first generate keys by a historical encoder: $k_n^{t-m} = E^{t-m}(x^t)$, with $x^t \in X_{tgt}$. Then, we calculate $p(k_n; x_q, \theta_E) = p(k_n^{t-m}; x_q, \theta_E) = \mathbb{1}(x_q, k_n^{t-m})$, where $\mathbb{1}(x_q, k_n^{t-m}) = 1$ if both belong to the positive pair; otherwise, $\mathbb{1}(x_q, k_n^{t-m}) = 0$. Please note that the notation "$t-m$" shows that the key $k$ is encoded by a historical encoder.
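For orientation, the Jensen's-inequality step referred to in the proof has the following generic form in a standard EM derivation. This is a sketch of the textbook bound, not a reconstruction of the authors' equations; it assumes $Z$ is a distribution over the $N$ keys with $Z(k_n) > 0$ and $\sum_{n=1}^{N} Z(k_n) = 1$.

```latex
\begin{align*}
\log \sum_{n=1}^{N} p(x_q, k_n; \theta_E)
  &= \log \sum_{n=1}^{N} Z(k_n)\, \frac{p(x_q, k_n; \theta_E)}{Z(k_n)} \\
  &\geq \sum_{n=1}^{N} Z(k_n) \log \frac{p(x_q, k_n; \theta_E)}{Z(k_n)} \\
  &= \sum_{n=1}^{N} Z(k_n) \log p(x_q, k_n; \theta_E)
     \;-\; \sum_{n=1}^{N} Z(k_n) \log Z(k_n).
\end{align*}
```

The inequality holds because the logarithm is concave. The second term of the last line does not depend on $\theta_E$, so maximizing the bound over $\theta_E$ amounts to maximizing a weighted sum of the terms $Z(k_n) \log p(x_q, k_n; \theta_E)$.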
Learning with little mixing
We study square loss in a realizable time-series framework with martingale difference noise. Our main result is a fast rate excess risk bound which shows that whenever a trajectory hypercontractivity condition holds, the risk of the least-squares estimator on dependent data matches the i.i.d. rate order-wise after a burn-in time. In comparison, many existing results in learning from dependent data have rates where the effective sample size is deflated by a factor of the mixing time of the underlying process, even after the burn-in time. Furthermore, our results allow the covariate process to exhibit long-range correlations which are substantially weaker than geometric ergodicity. We call this phenomenon learning with little mixing, and present several examples for when it occurs: bounded function classes for which the L^2 and L^{2+ε} norms are equivalent, ergodic finite-state Markov chains, various parametric models, and a broad family of infinite-dimensional ℓ^2(N) ellipsoids. By instantiating our main result to system identification of nonlinear dynamics with generalized linear model transitions, we obtain a nearly minimax optimal excess risk bound after only a polynomial burn-in time.
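As a toy illustration of least squares on a single dependent trajectory, the sketch below fits a scalar linear autoregression. It is far simpler than the setting of the abstract: the model, the i.i.d. Gaussian noise (one particular case of martingale difference noise), the parameter values, and the helper names `simulate_ar1` and `least_squares_ar1` are all assumptions made for illustration, and the code does not check hypercontractivity or reproduce any bound.

```python
import random
from typing import List


def simulate_ar1(a_true: float, n_steps: int, noise_std: float, seed: int = 0) -> List[float]:
    """Simulate x[t+1] = a_true * x[t] + w[t] with i.i.d. Gaussian noise w[t]."""
    rng = random.Random(seed)
    xs = [0.0]
    for _ in range(n_steps):
        xs.append(a_true * xs[-1] + rng.gauss(0.0, noise_std))
    return xs


def least_squares_ar1(xs: List[float]) -> float:
    """Least-squares estimate of the coefficient from consecutive pairs."""
    # Minimizes sum_t (x[t+1] - a * x[t])^2 over a, in closed form.
    num = sum(x_prev * x_next for x_prev, x_next in zip(xs[:-1], xs[1:]))
    den = sum(x_prev * x_prev for x_prev in xs[:-1])
    return num / den


# All samples come from one trajectory, so consecutive covariates are correlated.
trajectory = simulate_ar1(a_true=0.9, n_steps=5000, noise_std=1.0)
a_hat = least_squares_ar1(trajectory)
```

The covariates here are strongly correlated over time, yet the estimator is computed exactly as it would be on independent samples; the abstract's claim concerns when its risk also matches the independent-data rate.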