suboptimality
- Asia > China > Hong Kong (0.04)
- North America > United States (0.04)
- Asia > China > Hong Kong (0.05)
- North America > United States (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Asia > China > Hong Kong (0.04)
- Asia > Middle East > Jordan (0.04)
- (2 more...)
7 Appendix A
Vectors are written in boldface, with the dimension inferred from context. Recall from eq. (2) that the population expected 0-1 loss of a policy $\pi$ is denoted $L(\pi)$. The quantity in eq. (19) is at most
$$\min\left\{\mu,\; \frac{\mu\,|\mathcal{S}|\,H}{N}\right\}, \tag{20}$$
where the last inequality uses [22, Theorem 6.1]. This concludes the proof of Theorem 2.

7.3 Proof of Theorem 3

In the lower bound instances we construct, the expert policy is deterministic and is first sampled uniformly at random. Intuitively, since the bad state is never observed in the dataset, the learner cannot do better than guess the expert's action there. Using [22, Lemma A.14], the conditional distribution of the expert's policy given the expert dataset $D$ can be characterized.
Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning
Current approaches to model-based offline reinforcement learning often incorporate uncertainty-based reward penalization to address the distributional shift problem. These approaches, commonly known as pessimistic value iteration, use Monte Carlo sampling to estimate the Bellman target to perform temporal difference-based policy evaluation. We find that the randomness introduced by this sampling step significantly delays convergence. We present a theoretical result demonstrating the strong dependency of suboptimality on the number of Monte Carlo samples taken per Bellman target calculation. Our main contribution is a deterministic approximation to the Bellman target that uses progressive moment matching, a method originally developed for deterministic variational inference. The resulting algorithm, which we call Moment Matching Offline Model-Based Policy Optimization (MOMBO), propagates the uncertainty of the next state through a nonlinear Q-network in a deterministic fashion by approximating the distributions of hidden-layer activations with normal distributions. We show that it is possible to provide tighter guarantees for the suboptimality of MOMBO than for the existing Monte Carlo sampling approaches. We also observe MOMBO to converge faster than these approaches in a large set of benchmark tasks.
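A minimal sketch of the moment-matching idea, in the spirit of deterministic variational inference: push a factorized Gaussian belief over the next state through a linear layer exactly, and through a ReLU via its closed-form Gaussian moments. The function names and the diagonal-covariance simplification are illustrative assumptions, not MOMBO's exact implementation.

```python
import numpy as np
from scipy.stats import norm

def propagate_linear(mean, var, W, b):
    """Push a factorized Gaussian N(mean, diag(var)) through y = W x + b.

    Treating input dimensions as independent, the output mean is W @ mean + b
    and the (diagonal) output variance is (W**2) @ var.
    """
    return W @ mean + b, (W ** 2) @ var

def propagate_relu(mean, var, eps=1e-12):
    """Closed-form first two moments of ReLU(a) for a ~ N(mean, var).

    With z = mean / std:
      E[ReLU(a)]   = mean * Phi(z) + std * phi(z)
      E[ReLU(a)^2] = (mean^2 + var) * Phi(z) + mean * std * phi(z)
    The output is re-approximated as a Gaussian with these moments
    (moment matching), keeping the whole forward pass deterministic.
    """
    std = np.sqrt(var + eps)
    z = mean / std
    new_mean = mean * norm.cdf(z) + std * norm.pdf(z)
    second_moment = (mean ** 2 + var) * norm.cdf(z) + mean * std * norm.pdf(z)
    new_var = np.maximum(second_moment - new_mean ** 2, 0.0)
    return new_mean, new_var

# Deterministically propagate next-state uncertainty through a toy 2-layer MLP:
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, 8)), np.zeros(64)
W2, b2 = rng.normal(size=(1, 64)), np.zeros(1)
s_mean, s_var = rng.normal(size=8), 0.1 * np.ones(8)   # model's next-state belief
h_mean, h_var = propagate_relu(*propagate_linear(s_mean, s_var, W1, b1))
q_mean, q_var = propagate_linear(h_mean, h_var, W2, b2)
print(q_mean, q_var)  # mean and variance of the Bellman target, no MC sampling
```

The payoff is that the Bellman target's mean and an uncertainty penalty come out of a single deterministic forward pass, rather than from averaging many sampled next states.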
On the Value of Interaction and Function Approximation in Imitation Learning
We study statistical guarantees for the Imitation Learning (IL) problem in episodic MDPs. Rajaraman et al. (2020) show an information-theoretic lower bound: in the worst case, even a learner that can actively query the expert policy suffers suboptimality growing quadratically in the horizon length $H$. We study imitation learning under the $\mu$-recoverability assumption of Ross et al. (2011), which assumes that the difference in $Q$-value under the expert policy across different actions in a state does not deviate beyond $\mu$ from the maximum. We show that the reduction proposed by Ross et al. (2010) is statistically optimal: the resulting algorithm, upon interacting with the MDP for $N$ episodes, achieves a suboptimality bound of $\widetilde{\mathcal{O}} \left( \mu |\mathcal{S}| H / N \right)$, which we show is optimal up to log-factors. In contrast, we show that any algorithm which does not interact with the MDP and uses an offline dataset of $N$ expert trajectories must incur suboptimality growing as $\gtrsim |\mathcal{S}| H^2/N$, even under the $\mu$-recoverability assumption. This establishes a clear and provable separation of the minimax rates between the active setting and the no-interaction setting.
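To make the separation concrete, the two rates from the abstract can be set side by side:
$$
\underbrace{\widetilde{\mathcal{O}}\!\left(\frac{\mu\,|\mathcal{S}|\,H}{N}\right)}_{\text{interactive (Ross et al. reduction)}}
\quad\text{vs.}\quad
\underbrace{\Omega\!\left(\frac{|\mathcal{S}|\,H^{2}}{N}\right)}_{\text{offline, $N$ expert trajectories}}
$$
When $\mu$ is small relative to $H$ (e.g., a constant), interaction improves the horizon dependence from $H^2$ to $H$, a full factor of $H/\mu$.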
Stopping Rules for Stochastic Gradient Descent via Anytime-Valid Confidence Sequences
Liviu Aolaritei, Michael I. Jordan
We study stopping rules for stochastic gradient descent (SGD) for convex optimization from the perspective of anytime-valid confidence sequences. Classical analyses of SGD provide convergence guarantees in expectation or at a fixed horizon, but offer no statistically valid way to assess, at an arbitrary time, how close the current iterate is to the optimum. We develop an anytime-valid, data-dependent upper confidence sequence for the weighted average suboptimality of projected SGD, constructed via nonnegative supermartingales and requiring no smoothness or strong convexity. This confidence sequence yields a simple stopping rule that is provably $\varepsilon$-optimal with probability at least $1-\alpha$, with explicit bounds on the stopping time under standard stochastic approximation stepsizes. To the best of our knowledge, these are the first rigorous, time-uniform performance guarantees and finite-time $\varepsilon$-optimality certificates for projected SGD with general convex objectives, based solely on observable trajectory quantities.
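A schematic sketch of how such a stopping rule plugs into projected SGD. The boundary below is a generic iterated-logarithm-style time-uniform width, not the paper's supermartingale-based, data-dependent sequence; `projected_sgd_with_stopping` and all constants here are illustrative assumptions.

```python
import numpy as np

def projected_sgd_with_stopping(grad_oracle, project, x0, G, D,
                                eps=1e-2, alpha=0.05, max_iter=100_000):
    """Projected SGD with an anytime-valid epsilon-stopping rule (sketch).

    grad_oracle(x): unbiased stochastic subgradient with norm roughly <= G.
    project(x): projection onto a convex feasible set of diameter <= D.
    Stops as soon as a time-uniform upper bound on the average iterate's
    suboptimality drops below eps; the bound shape is ILLUSTRATIVE only.
    """
    x = project(np.asarray(x0, dtype=float))
    x_avg = x.copy()
    for t in range(1, max_iter + 1):
        eta = D / (G * np.sqrt(t))            # standard 1/sqrt(t) stepsize
        x = project(x - eta * grad_oracle(x))
        x_avg += (x - x_avg) / (t + 1)        # running average of iterates
        # Hypothetical anytime-valid width, valid-looking sqrt(loglog t / t) shape:
        width = G * D * np.sqrt((2.0 * np.log(np.log(np.e * t) / alpha) + 2.0) / t)
        if width <= eps:                      # certify eps-optimality and stop
            return x_avg, t
    return x_avg, max_iter

# Example: minimize f(x) = E[(x - Z)^2], Z ~ N(0.3, 1), over the interval [-1, 1].
rng = np.random.default_rng(1)
oracle = lambda x: 2.0 * (x - (0.3 + rng.normal()))
proj = lambda x: np.clip(x, -1.0, 1.0)
x_hat, stop_t = projected_sgd_with_stopping(oracle, proj, x0=[0.9],
                                            G=4.0, D=2.0, eps=0.5)
print(x_hat, stop_t)
```

The point of the anytime-valid construction is that the check `width <= eps` may be performed at every iteration without a union-bound penalty, so the first crossing time is itself a valid certificate.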
- Asia > Middle East > Jordan (0.40)
- North America > United States > California > Alameda County > Berkeley (0.40)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Without-Replacement Sampling for Stochastic Gradient Methods
Stochastic gradient methods for machine learning and optimization problems are usually analyzed assuming data points are sampled with replacement. In contrast, sampling without replacement is far less understood, yet in practice it is very common, often easier to implement, and usually performs better. In this paper, we provide competitive convergence guarantees for without-replacement sampling under several scenarios, focusing on the natural regime of few passes over the data.
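A minimal sketch contrasting the two sampling schemes over a few passes; the function name, the least-squares setup, and the stepsize are illustrative, not from the paper.

```python
import numpy as np

def sgd_epochs(grads, x0, lr, n_epochs, without_replacement=True, seed=0):
    """Run SGD on a finite sum f(x) = (1/n) * sum_i f_i(x) for a few passes.

    grads: list of per-example gradient functions f_i'(x).
    Without replacement: reshuffle the indices each epoch and visit every
    example exactly once (the common 'random shuffling' implementation).
    With replacement: draw an independent uniform index at every step.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = len(grads)
    for _ in range(n_epochs):
        if without_replacement:
            order = rng.permutation(n)          # one pass, each point once
        else:
            order = rng.integers(0, n, size=n)  # i.i.d. uniform indices
        for i in order:
            x = x - lr * grads[i](x)
    return x

# Toy least squares: f_i(x) = 0.5 * (a_i . x - b_i)^2
rng = np.random.default_rng(42)
A, b = rng.normal(size=(100, 5)), rng.normal(size=100)
grads = [lambda x, a=A[i], bi=b[i]: a * (a @ x - bi) for i in range(100)]
x_shuffled = sgd_epochs(grads, np.zeros(5), lr=0.01, n_epochs=3,
                        without_replacement=True)
x_iid = sgd_epochs(grads, np.zeros(5), lr=0.01, n_epochs=3,
                   without_replacement=False)
```

The shuffled variant is the one practitioners typically run; the paper's contribution is convergence guarantees for it that are competitive with the with-replacement analysis in the few-pass regime.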
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Asia > Middle East > Israel (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- North America > United States > Washington > King County > Bellevue (0.04)
- (5 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.92)
- Marketing (0.34)
- Information Technology > Services (0.34)