Szpruch, Lukasz, Treetanthiploet, Tanut, Zhang, Yufei

We develop a probabilistic framework for analysing model-based reinforcement learning in the episodic setting. We then apply it to study finite-time horizon stochastic control problems with linear dynamics but unknown coefficients and a convex, but possibly irregular, objective function. Using probabilistic representations, we study regularity of the associated cost functions and establish precise estimates for the performance gap between applying optimal feedback control derived from estimated and true model parameters. We identify conditions under which this performance gap is quadratic, improving on the linear performance gap in recent work [X. Guo, A. Hu, and Y. Zhang, arXiv preprint, arXiv:2104.09311, (2021)] and matching the results obtained for stochastic linear-quadratic problems. Next, we propose a phase-based learning algorithm for which we show how to optimise the exploration-exploitation trade-off and achieve sublinear regret both with high probability and in expectation. When the assumptions needed for the quadratic performance gap hold, the algorithm achieves, over $N$ episodes, a high-probability regret of order $\mathcal{O}(\sqrt{N} \ln N)$ in the general case and an expected regret of order $\mathcal{O}((\ln N)^2)$ in the self-exploration case, matching the best possible results from the literature. The analysis requires novel concentration inequalities for correlated continuous-time observations, which we derive.

This note displays an interesting phenomenon for percentiles of independent but non-identical random variables. Let $X_1,\cdots,X_n$ be independent random variables obeying non-identical continuous distributions and $X^{(1)}\geq \cdots\geq X^{(n)}$ be the corresponding order statistics. For any $p\in(0,1)$, we investigate the $100(1-p)$%-th percentile $X^{(pn)}$ and prove non-asymptotic bounds for $X^{(pn)}$. In particular, for a wide class of distributions, we discover an intriguing connection between their median and the harmonic mean of the associated standard deviations. For example, if $X_k\sim\mathcal{N}(0,\sigma_k^2)$ for $k=1,\cdots,n$ and $p=\frac{1}{2}$, we show that the median satisfies $\big|{\rm Med}\big(X_1,\cdots,X_n\big)\big|= O_P\Big(n^{1/2}\cdot\big(\sum_{k=1}^n\sigma_k^{-1}\big)^{-1}\Big)$ as long as $\{\sigma_k\}_{k=1}^n$ satisfies a certain mild non-dispersion property.
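The Gaussian example can be checked numerically. The following is a minimal simulation sketch, not the note's method: the function names, the grid of standard deviations, the number of trials and the seed are all assumptions chosen for illustration. It compares the average of $|{\rm Med}(X_1,\cdots,X_n)|$ over repeated draws with the scale $n^{1/2}\cdot\big(\sum_{k=1}^n\sigma_k^{-1}\big)^{-1}$.

```python
import numpy as np


def median_scale(sigmas):
    """Scale n^{1/2} * (sum_k 1/sigma_k)^{-1} appearing in the stated bound."""
    sigmas = np.asarray(sigmas, dtype=float)
    return np.sqrt(sigmas.size) / np.sum(1.0 / sigmas)


def simulate_abs_median(sigmas, trials=2000, seed=0):
    """Draw X_k ~ N(0, sigma_k^2) independently and return |median| per trial."""
    rng = np.random.default_rng(seed)
    sigmas = np.asarray(sigmas, dtype=float)
    # Each row is one independent realisation of (X_1, ..., X_n).
    samples = rng.standard_normal((trials, sigmas.size)) * sigmas
    return np.abs(np.median(samples, axis=1))


# Hypothetical, moderately spread standard deviations (n is odd, so the
# sample median is a single order statistic).
sigmas = np.linspace(1.0, 5.0, 201)
abs_medians = simulate_abs_median(sigmas)

# If the bound captures the right scale, this ratio should be of order one.
ratio = np.mean(abs_medians) / median_scale(sigmas)
print(f"mean |median| / scale = {ratio:.3f}")
```

When all $\sigma_k$ equal a common $\sigma$, `median_scale` reduces to $\sigma/\sqrt{n}$, the familiar rate for the median of identically distributed Gaussians.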

Maurer, Andreas, Pontil, Massimiliano

The popular bounded difference inequality [11] has become a standard tool in the analysis of algorithms. It bounds the deviation probability of a function of independent random variables from its mean in terms of the sum of conditional ranges, and may not be applied when these ranges are infinite. This hampers the utility of the inequality in certain situations. It may happen that the conditional ranges are infinite, but the conditional versions (the random variables obtained by fixing all but one of the arguments of the function) have light tails with exponential decay. In this case we might still expect exponential concentration, but the bounded difference inequality is of no help.
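For reference, the bounded difference inequality is commonly stated as follows. The notation is introduced here and is not taken from the abstract: $f$ is the function, $X=(X_1,\dots,X_n)$ are the independent variables, $c_i$ bounds the $i$-th conditional range, and $t>0$ is the deviation. The ranges enter through the sum of their squares.

$$
\sup_{x_1,\dots,x_n,\,x_i'}\big|f(x_1,\dots,x_i,\dots,x_n)-f(x_1,\dots,x_i',\dots,x_n)\big|\le c_i \quad (i=1,\dots,n)
\;\Longrightarrow\;
\Pr\big\{f(X)-\mathbb{E}f(X)\ge t\big\}\le \exp\Big(\frac{-2t^2}{\sum_{i=1}^n c_i^2}\Big).
$$

If any $c_i$ is infinite, for instance when $f$ is the sample mean of Gaussian variables, the right-hand side equals one and the bound is trivial, even though such a mean is clearly concentrated.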

Calliess, Jan-Peter, Papachristodoulou, Antonis, Roberts, Stephen J.

This work proposes a new method for simultaneous probabilistic identification and control of an observable, fully-actuated mechanical system. Identification is achieved by conditioning stochastic process priors on observations of configurations and noisy estimates of configuration derivatives. In contrast to previous work that has used stochastic processes for identification, we leverage the structural knowledge afforded by Lagrangian mechanics and learn the drift and control input matrix functions of the control-affine system separately. We utilise feedback-linearisation to reduce, in expectation, the uncertain nonlinear control problem to one that is easy to regulate in a desired manner. Thereby, our method combines the flexibility of nonparametric Bayesian learning with epistemological guarantees on the expected closed-loop trajectory. We illustrate our method in the context of torque-actuated pendula where the dynamics are learned with a combination of normal and log-normal processes.
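As a rough illustration of the feedback-linearisation step only, the sketch below regulates a torque-actuated pendulum whose dynamics are assumed to be known exactly. It does not implement the probabilistic identification described above: the model $m l^2\ddot\theta = -m g l\sin\theta - b\dot\theta + u$, the parameter values, the gains and all function names are assumptions made for this example.

```python
import math

# Hypothetical pendulum parameters (mass, length, gravity, viscous friction).
M, L, G, B = 1.0, 1.0, 9.81, 0.1


def pendulum_accel(theta, omega, torque):
    """Angular acceleration from m l^2 theta'' = -m g l sin(theta) - b omega + u."""
    return (-M * G * L * math.sin(theta) - B * omega + torque) / (M * L * L)


def feedback_linearising_torque(theta, omega, kp=4.0, kd=4.0):
    """Cancel the nonlinear drift, then impose the linear law theta'' = v."""
    v = -kp * theta - kd * omega  # desired linear closed-loop acceleration
    return M * L * L * v + M * G * L * math.sin(theta) + B * omega


def simulate(theta0=2.0, omega0=0.0, dt=1e-3, steps=10_000):
    """Explicit Euler simulation of the closed loop; returns the final state."""
    theta, omega = theta0, omega0
    for _ in range(steps):
        u = feedback_linearising_torque(theta, omega)
        accel = pendulum_accel(theta, omega, u)
        theta, omega = theta + dt * omega, omega + dt * accel
    return theta, omega


final_theta, final_omega = simulate()
print(f"final angle = {final_theta:.2e}, final velocity = {final_omega:.2e}")
```

With exact model knowledge the closed loop is the linear system $\ddot\theta = -k_p\theta - k_d\dot\theta$. In the setting of the abstract the drift and input terms are learned, so this cancellation would hold only in expectation.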

The paper deals with conditional linear information inequalities valid for entropy functions induced by discrete random variables. Specifically, the so-called conditional Ingleton inequalities are the center of interest: these are valid under conditional independence assumptions on the inducing random variables. We discuss five inequalities of this particular type, four of which have appeared earlier in the literature. Besides the proof of the new fifth inequality, simpler proofs of (some of) the former inequalities are presented. These five information inequalities are used to characterize all conditional independence structures induced by four discrete random variables.
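For orientation, the unconditional Ingleton expression for four random variables is commonly written in terms of (conditional) mutual information as below. The labels $A,B,C,D$ are notation introduced here, and the five conditional inequalities themselves are not stated in this abstract.

$$
I(A;B)\;\le\; I(A;B\mid C)+I(A;B\mid D)+I(C;D).
$$

This inequality does not hold for all entropy functions of four discrete random variables. A conditional Ingleton inequality asserts that it does hold whenever the variables satisfy specified conditional independence constraints.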