






Behavior Transformers: Cloning k modes with one stone

Neural Information Processing Systems

In fact, modeling multi-modal ...

[Figure: k-means encoder and k-means decoder. Labels: continuous action dataset (|A| x a); clustering into k bins; continuous action (1 x a); categorical action bin (1 x k); action offset (1 x a).]
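The figure labels suggest an encoder that splits a continuous action into a categorical bin plus a continuous offset, and a decoder that recombines them. The following is a minimal sketch of that idea, not the authors' implementation: the function names `fit_kmeans`, `encode`, and `decode` are hypothetical, and the plain Lloyd's iteration is an assumption chosen to keep the example dependency-free.

```python
import numpy as np


def fit_kmeans(actions, k, iters=50, seed=0):
    """Cluster an (|A| x a) action dataset into k bins with Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct data points (an assumed, simple choice).
    centers = actions[rng.choice(len(actions), size=k, replace=False)].copy()
    for _ in range(iters):
        # Distance from every action to every center: shape (|A|, k).
        dists = np.linalg.norm(actions[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = actions[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centers[j] = members.mean(axis=0)
    return centers


def encode(action, centers):
    """Map a continuous action (a,) to a one-hot bin (k,) and an offset (a,)."""
    bin_idx = int(np.linalg.norm(centers - action, axis=1).argmin())
    one_hot = np.zeros(len(centers))
    one_hot[bin_idx] = 1.0
    offset = action - centers[bin_idx]
    return one_hot, offset


def decode(one_hot, offset, centers):
    """Recombine a bin and an offset into a continuous action (a,)."""
    return centers[int(one_hot.argmax())] + offset
```

Because the offset is defined as the residual to the chosen bin center, decoding an encoded action recovers it exactly; a learned model would instead predict the bin and the offset.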




26657d5ff9020d2abefe558796b99584-Paper.pdf

Neural Information Processing Systems

Specifically, there now exists a tight relaxation for verifying the robustness of a neural network to $\ell_\infty$ input perturbations, as well as efficient primal and dual solvers for the relaxation. Buoyed by this success, we consider the problem of developing similar techniques for verifying robustness to input perturbations within the probability simplex. We prove a somewhat surprising result that, in this case, not only can one design a tight relaxation that overcomes the convex barrier, but the size of the relaxation remains linear in the number of neurons, thereby leading to simpler and more efficient algorithms.


Offline Reinforcement Learning as One Big Sequence Modeling Problem

Neural Information Processing Systems

Reinforcement learning (RL) is typically concerned with estimating stationary policies or single-step models, leveraging the Markov property to factorize problems in time. However, we can also view RL as a generic sequence modeling problem, with the goal being to produce a sequence of actions that leads to a sequence of high rewards.


Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum

Kumar, Navdeep, Dahan, Tehila, Cohen, Lior, Barua, Ananyabrata, Ramponi, Giorgia, Levy, Kfir Yehuda, Mannor, Shie

arXiv.org Machine Learning

We establish an optimal sample complexity of $O(ε^{-2})$ for obtaining an $ε$-optimal global policy using a single-timescale actor-critic (AC) algorithm in infinite-horizon discounted Markov decision processes (MDPs) with finite state-action spaces, improving upon the prior state of the art of $O(ε^{-3})$. Our approach applies STORM (STOchastic Recursive Momentum) to reduce variance in the critic updates. However, because samples are drawn from a nonstationary occupancy measure induced by the evolving policy, variance reduction via STORM alone is insufficient. To address this challenge, we maintain a buffer containing a small fraction of recent samples and sample uniformly from it for each critic update. Importantly, these mechanisms are compatible with existing deep learning architectures and require only minor modifications, without compromising practical applicability.
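For readers unfamiliar with STORM, the generic estimator keeps a running gradient estimate and corrects it using the same random sample evaluated at the current and previous iterates. The sketch below shows only that generic recursion on a toy noisy quadratic; it is not the paper's actor-critic algorithm and omits the sample buffer. The objective, the function names `noisy_grad` and `storm_minimize`, and the constant step size `lr` and momentum parameter `a` are all assumptions made for illustration.

```python
import numpy as np


def noisy_grad(x, noise):
    """Stochastic gradient of the toy objective 0.5 * ||x||^2 (assumed)."""
    return x + noise


def storm_minimize(x0, steps=500, lr=0.1, a=0.1, sigma=0.1, seed=0):
    """Run the STORM recursion
        d_t = g(x_t; xi_t) + (1 - a) * (d_{t-1} - g(x_{t-1}; xi_t))
    followed by the step x_{t+1} = x_t - lr * d_t.
    """
    rng = np.random.default_rng(seed)
    x_prev = np.asarray(x0, dtype=float)
    # Initialise the estimate with a single stochastic gradient.
    d = noisy_grad(x_prev, sigma * rng.normal(size=x_prev.shape))
    x = x_prev - lr * d
    for _ in range(steps):
        # One fresh sample, reused at both the current and previous iterate.
        noise = sigma * rng.normal(size=x.shape)
        d = noisy_grad(x, noise) + (1.0 - a) * (d - noisy_grad(x_prev, noise))
        x_prev, x = x, x - lr * d
    return x
```

Reusing the same sample at both iterates is what distinguishes this recursion from ordinary momentum: the correction term cancels most of the sample noise when consecutive iterates are close.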