A Further related works
Neural Information Processing Systems
We now take a moment to discuss a small sample of other related works.

The asymptotic convergence of Q-learning was established in the classical literature (Tsitsiklis, 1994; Jaakkola et al., 1994; Szepesvári, 1997); as a model-free algorithm, Q-learning enjoys a space complexity of O(|S||A|), as it only needs to maintain an estimate of the Q-function. Finite-time guarantees for other variants of Q-learning have also been developed; partial examples include speedy Q-learning (Azar et al., 2011) and double Q-learning. A common theme is to augment the original model-free update rule (e.g., the Q-learning update rule) with an exploration bonus, which typically takes the form of certain upper confidence bounds (UCBs) motivated by the bandit literature (Lai and Robbins, 1985; Auer and Ortner, 2010).

Model-based RL is known to be minimax-optimal in the presence of a simulator (Azar et al., 2013; Agarwal et al., 2020; Li et al., 2020a), beating the state-of-the-art model-free algorithms by achieving optimality for the entire sample size range (Li et al., 2020a). When it comes to online episodic RL, Azar et al. (2017) was the first work that managed to achieve asymptotically minimax-optimal regret. The way to construct hard MDPs in Jaksch et al. (2010) has since been adapted by Jin et al. (2018) to exhibit a regret lower bound for episodic MDPs (with a sketched proof provided therein).
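To make the UCB-augmented update rule concrete, the following is a minimal illustrative sketch of one tabular Q-learning step with an exploration bonus added to the target. The function name, the 1/N learning rate, and the c/sqrt(N) bonus are our own simplifications for illustration; analyzed variants (e.g., Jin et al., 2018) use horizon-dependent step sizes and logarithmic-factor bonuses.

```python
import numpy as np

def ucb_q_update(Q, N, s, a, r, s_next, gamma=0.9, c=1.0):
    """One tabular Q-learning step augmented with a UCB-style bonus.

    Q : (num_states, num_actions) array of optimistic Q-estimates.
    N : (num_states, num_actions) array of visit counts.
    The bonus c * sqrt(1 / N[s, a]) shrinks as (s, a) is visited more,
    encouraging exploration of rarely tried state-action pairs.
    """
    N[s, a] += 1
    lr = 1.0 / N[s, a]                   # step size (illustrative choice)
    bonus = c * np.sqrt(1.0 / N[s, a])   # exploration bonus
    # Standard Q-learning target, shifted upward by the bonus.
    target = r + bonus + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])
    return Q, N
```

Because the bonus enters the target, the maintained Q-function is an optimistic upper bound on the optimal Q-function with high probability, which is what drives the regret analyses cited above.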