Search
Hybrid Search for Efficient Planning with Completeness Guarantees
Solving complex planning problems has been a long-standing challenge in computer science. Learning-based subgoal search methods have shown promise in tackling these problems, but they often suffer from a lack of completeness guarantees, meaning that they may fail to find a solution even if one exists. In this paper, we propose an efficient approach to augment a subgoal search method to achieve completeness in discrete action spaces. Specifically, we augment the high-level search with low-level actions to execute a multi-level (hybrid) search, which we call complete subgoal search. This solution achieves the best of both worlds: the practical efficiency of high-level search and the completeness of low-level search. We apply the proposed search method to a recently proposed subgoal search algorithm and evaluate the algorithm trained on offline data on complex planning problems.
Scalable Neural Network Verification with Branch-and-bound Inferred Cutting Planes
Recently, cutting-plane methods such as GCP-CROWN have been explored to enhance neural network verifiers and made significant advancements. However, GCP-CROWN currently relies on {\it generic} cutting planes ("cuts") generated from external mixed integer programming (MIP) solvers. In this paper, we exploit the structure of the neural network verification problem to generate efficient and scalable cutting planes {\it specific} to this problem setting. We propose a novel approach, Branch-and-bound Inferred Cuts with COnstraint Strengthening (BICCOS), that leverages the logical relationships of neurons within verified subproblems in the branch-and-bound search tree, and we introduce cuts that preclude these relationships in other subproblems. We develop a mechanism that assigns influence scores to neurons in each path to allow the strengthening of these cuts.
Near-Optimal Distributed Minimax Optimization under the Second-Order Similarity
This paper considers the distributed convex-concave minimax optimization under the second-order similarity.We propose stochastic variance-reduced optimistic gradient sliding (SVOGS) method, which takes the advantage of the finite-sum structure in the objective by involving the mini-batch client sampling and variance reduction.We prove SVOGS can achieve the \varepsilon -duality gap within communication rounds of {\mathcal O}(\delta D 2/\varepsilon), communication complexity of {\mathcal O}(n \sqrt{n}\delta D 2/\varepsilon),and local gradient calls of \tilde{\mathcal O}(n (\sqrt{n}\delta L)D 2/\varepsilon\log(1/\varepsilon)), where n is the number of nodes, \delta is the degree of the second-order similarity, L is the smoothness parameter and D is the diameter of the constraint set.We can verify that all of above complexity (nearly) matches the corresponding lower bounds.For the specific \mu -strongly-convex- \mu -strongly-convex case, our algorithm has the upper bounds on communication rounds, communication complexity, and local gradient calls of \mathcal O(\delta/\mu\log(1/\varepsilon)), {\mathcal O}((n \sqrt{n}\delta/\mu)\log(1/\varepsilon)), and \tilde{\mathcal O}(n (\sqrt{n}\delta L)/\mu)\log(1/\varepsilon)) respectively, which are also nearly tight.Furthermore, we conduct the numerical experiments to show the empirical advantages of proposed method.
Achieving Tractable Minimax Optimal Regret in Average Reward MDPs
In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs).However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies.In this paper, we present the first *tractable* algorithm with minimax optimal regret of \mathrm{O}\left(\sqrt{\mathrm{sp}(h *) S A T \log(SAT)}\right) where \mathrm{sp}(h *) is the span of the optimal bias function h *, S\times A is the size of the state-action space and T the number of learning steps. Remarkably, our algorithm does not require prior information on \mathrm{sp}(h *) . Our algorithm relies on a novel subroutine, **P**rojected **M**itigated **E**xtended **V**alue **I**teration ( PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to obtain improved regret bounds.
First-Order Minimax Bilevel Optimization
Multi-block minimax bilevel optimization has been studied recently due to its great potential in multi-task learning, robust machine learning, and few-shot learning. However, due to the complex three-level optimization structure, existing algorithms often suffer from issues such as high computing costs due to the second-order model derivatives or high memory consumption in storing all blocks' parameters. In this paper, we tackle these challenges by proposing two novel fully first-order algorithms named FOSL and MemCS. FOSL features a fully single-loop structure by updating all three variables simultaneously, and MemCS is a memory-efficient double-loop algorithm with cold-start initialization. We provide a comprehensive convergence analysis for both algorithms under full and partial block participation, and show that their sample complexities match or outperform those of the same type of methods in standard bilevel optimization.
Corruption-Robust Linear Bandits: Minimax Optimality and Gap-Dependent Misspecification
In linear bandits, how can a learner effectively learn when facing corrupted rewards? While significant work has explored this question, a holistic understanding across different adversarial models and corruption measures is lacking, as is a full characterization of the minimax regret bounds. In this work, we compare two types of corruptions commonly considered: strong corruption, where the corruption level depends on the learner's chosen action, and weak corruption, where the corruption level does not depend on the learner's chosen action. We provide a unified framework to analyze these corruptions. For stochastic linear bandits, we fully characterize the gap between the minimax regret under strong and weak corruptions.
Achieving Near-Optimal Convergence for Distributed Minimax Optimization with Adaptive Stepsizes
In this paper, we show that applying adaptive methods directly to distributed minimax problems can result in non-convergence due to inconsistency in locally computed adaptive stepsizes. To address this challenge, we propose D-AdaST, a Distributed Adaptive minimax method with Stepsize Tracking. The key strategy is to employ an adaptive stepsize tracking protocol involving the transmission of two extra (scalar) variables. This protocol ensures the consistency among stepsizes of nodes, eliminating the steady-state error due to the lack of coordination of stepsizes among nodes that commonly exists in vanilla distributed adaptive methods, and thus guarantees exact convergence. For nonconvex-strongly-concave distributed minimax problems, we characterize the specific transient times that ensure time-scale separation of stepsizes and quasi-independence of networks, leading to a near-optimal convergence rate of \tilde{\mathcal{O}} \left( \epsilon {-\left( 4 \delta \right)} \right) for any small \delta 0, matching that of the centralized counterpart.
Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage
We consider offline reinforcement learning (RL) where we only have only access to offline data. In contrast to numerous offline RL algorithms that necessitate the uniform coverage of the offline data over state and action space, we propose value-based algorithms with PAC guarantees under partial coverage, specifically, coverage of offline data against a single policy, and realizability of soft Q-function (a.k.a., entropy-regularized Q-function) and another function, which is defined as a solution to a saddle point of certain minimax optimization problem). Furthermore, we show the analogous result for Q-functions instead of soft Q-functions. To attain these guarantees, we use novel algorithms with minimax loss functions to accurately estimate soft Q-functions and Q-functions with -convergence guarantees measured on the offline data. We introduce these loss functions by casting the estimation problems into nonlinear convex optimization problems and taking the Lagrange functions.
Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to induce an optimistic SSP problem whose associated value iteration scheme is guaranteed to converge. We prove that EB-SSP achieves the minimax regret rate \widetilde{O}(B_{\star} \sqrt{S A K}), where K is the number of episodes, S is the number of states, A is the number of actions and B_{\star} bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of B_{\star}, nor of T_{\star}, which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., positive costs, or general costs when an order-accurate estimate of T_{\star} is available) where the regret only contains a logarithmic dependence on T_{\star}, thus yielding the first (nearly) horizon-free regret bound beyond the finite-horizon MDP setting.
Sequential Probability Assignment with Contexts: Minimax Regret, Contextual Shtarkov Sums, and Contextual Normalized Maximum Likelihood
We study the fundamental problem of sequential probability assignment, also known as online learning with logarithmic loss, with respect to an arbitrary, possibly nonparametric hypothesis class. Our goal is to obtain a complexity measure for the hypothesis class that characterizes the minimax regret and to determine a general, minimax optimal algorithm. Notably, the sequential \ell_{\infty} entropy, extensively studied in the literature (Rakhlin and Sridharan, 2015, Bilodeau et al., 2020, Wu et al., 2023), was shown to not characterize minimax regret in general. Inspired by the seminal work of Shtarkov (1987) and Rakhlin, Sridharan, and Tewari (2010), we introduce a novel complexity measure, the \emph{contextual Shtarkov sum}, corresponding to the Shtarkov sum after projection onto a multiary context tree, and show that the worst case log contextual Shtarkov sum equals the minimax regret. Using the contextual Shtarkov sum, we derive the minimax optimal strategy, dubbed \emph{contextual Normalized Maximum Likelihood} (cNML).