tsde
- North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
Scalar Posterior Sampling with Applications
Georgios Theocharous, Zheng Wen, Yasin Abbasi Yadkori, Nikos Vlassis
Our algorithm termed deterministic schedule PSRL (DS-PSRL) is efficient in terms of time, sample, and space complexity. We prove a Bayesian regret bound under mild assumptions. Our result is more generally applicable to multiple parameters and continuous state action problems. We compare our algorithm with state-of-the-art PSRL algorithms on standard discrete and continuous problems from the literature.
- North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
- North America > Canada > Quebec > Montreal (0.04)
Reviews: Learning Unknown Markov Decision Processes: A Thompson Sampling Approach
The paper proposes TSDE, a posterior sampling algorithm for RL in the average reward infinite horizon setting. This algorithm uses dynamic episodes but unlike Lazy-PSRL avoids technical issues by not only terminating an episode when an observation count doubled but also terminating episodes when they become too long. This ensures that the episode length cannot grow faster than linear and ultimately a Bayesian regret bound of O(HS(AT) .5) is shown. Posterior sampling methods typically outperform UCB-type algorithms and therefore a posterior sampling algorithm for non-episodic RL with rigorous regret bounds is desirable. This paper proposes such an algorithm, which is of high interest.
Learning Unknown Markov Decision Processes: A Thompson Sampling Approach
Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, Rahul Jain
We consider the problem of learning an unknown Markov Decision Process (MDP) that is weakly communicating in the infinite horizon setting. We propose a Thompson Sampling-based reinforcement learning algorithm with dynamic episodes (TSDE). At the beginning of each episode, the algorithm generates a sample from the posterior distribution over the unknown model parameters. It then follows the optimal stationary policy for the sampled model for the rest of the episode. The duration of each episode is dynamically determined by two stopping criteria.
- North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
Self-Supervised Learning of Time Series Representation via Diffusion Process and Imputation-Interpolation-Forecasting Mask
Senane, Zineb, Cao, Lele, Buchner, Valentin Leonhard, Tashiro, Yusuke, You, Lei, Herman, Pawel, Nordahl, Mats, Tu, Ruibo, von Ehrenheim, Vilhelm
Time Series Representation Learning (TSRL) focuses on generating informative representations for various Time Series (TS) modeling tasks. Traditional Self-Supervised Learning (SSL) methods in TSRL fall into four main categories: reconstructive, adversarial, contrastive, and predictive, each with a common challenge of sensitivity to noise and intricate data nuances. Recently, diffusion-based methods have shown advanced generative capabilities. However, they primarily target specific application scenarios like imputation and forecasting, leaving a gap in leveraging diffusion models for generic TSRL. Our work, Time Series Diffusion Embedding (TSDE), bridges this gap as the first diffusion-based SSL TSRL approach. TSDE segments TS data into observed and masked parts using an Imputation-Interpolation-Forecasting (IIF) mask. It applies a trainable embedding function, featuring dual-orthogonal Transformer encoders with a crossover mechanism, to the observed part. We train a reverse diffusion process conditioned on the embeddings, designed to predict noise added to the masked part. Extensive experiments demonstrate TSDE's superiority in imputation, interpolation, forecasting, anomaly detection, classification, and clustering. We also conduct an ablation study, present embedding visualizations, and compare inference speed, further substantiating TSDE's efficiency and validity in learning representations of TS data.
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.45)
A relaxed technical assumption for posterior sampling-based reinforcement learning for control of unknown linear systems
Gagrani, Mukul, Sudhakara, Sagar, Mahajan, Aditya, Nayyar, Ashutosh, Ouyang, Yi
We revisit the Thompson sampling algorithm to control an unknown linear quadratic (LQ) system recently proposed by Ouyang et al (arXiv:1709.04047). The regret bound of the algorithm was derived under a technical assumption on the induced norm of the closed loop system. In this technical note, we show that by making a minor modification in the algorithm (in particular, ensuring that an episode does not end too soon), this technical assumption on the induced norm can be replaced by a milder assumption in terms of the spectral radius of the closed loop system. The modified algorithm has the same Bayesian regret of $\tilde{\mathcal{O}}(\sqrt{T})$, where $T$ is the time-horizon and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms in~$T$.
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- North America > Canada > Quebec > Montreal (0.14)
- North America > United States > New York (0.04)
- (2 more...)
Thompson Sampling in Non-Episodic Restless Bandits
Jung, Young Hun, Abeille, Marc, Tewari, Ambuj
Restless bandit problems assume time-varying reward distributions of the arms, which adds flexibility to the model but makes the analysis more challenging. We study learning algorithms over the unknown reward distributions and prove a sub-linear, $O(\sqrt{T}\log T)$, regret bound for a variant of Thompson sampling. Our analysis applies in the infinite time horizon setting, resolving the open question raised by Jung and Tewari (2019) whose analysis is limited to the episodic case. We adopt their policy mapping framework, which allows our algorithm to be efficient and simultaneously keeps the regret meaningful. Our algorithm adapts the TSDE algorithm of Ouyang et al. (2017) in a non-trivial manner to account for the special structure of restless bandits. We test our algorithm on a simulated dynamic channel access problem with several policy mappings, and the empirical regrets agree with the theoretical bound regardless of the choice of the policy mapping.
- North America > United States > Michigan (0.04)
- North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Learning Unknown Markov Decision Processes: A Thompson Sampling Approach
Ouyang, Yi, Gagrani, Mukul, Nayyar, Ashutosh, Jain, Rahul
We consider the problem of learning an unknown Markov Decision Process (MDP) that is weakly communicating in the infinite horizon setting. We propose a Thompson Sampling-based reinforcement learning algorithm with dynamic episodes (TSDE). At the beginning of each episode, the algorithm generates a sample from the posterior distribution over the unknown model parameters. It then follows the optimal stationary policy for the sampled model for the rest of the episode. The duration of each episode is dynamically determined by two stopping criteria. The first stopping criterion controls the growth rate of episode length. The second stopping criterion happens when the number of visits to any state-action pair is doubled. We establish $\tilde O(HS\sqrt{AT})$ bounds on expected regret under a Bayesian setting, where $S$ and $A$ are the sizes of the state and action spaces, $T$ is time, and $H$ is the bound of the span. This regret bound matches the best available bound for weakly communicating MDPs. Numerical results show it to perform better than existing algorithms for infinite horizon MDPs.
- North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)