 Markov Models


TD(0) Learning converges for Polynomial mixing and non-linear functions

arXiv.org Machine Learning

Theoretical work on Temporal Difference (TD) learning has provided finite-sample and high-probability guarantees for data generated from Markov chains. However, these bounds typically require linear function approximation, instance-dependent step sizes, algorithmic modifications, and restrictive mixing rates. We present theoretical findings for TD learning under more applicable assumptions, including instance-independent step sizes, full data utilization, and polynomial ergodicity, applicable to both linear and non-linear functions. To our knowledge, this is the first proof of TD(0) convergence on Markov data under universal and instance-independent step sizes. While each contribution is significant on its own, their combination allows these bounds to be effectively utilized in practical application settings. Our results include bounds for linear models and for non-linear models under generalized gradients and Hölder continuity.
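
For readers unfamiliar with the setting, the following is a minimal sketch of TD(0) with linear function approximation run on a single Markov trajectory, using every sample and a step size that depends only on the iteration count. The two-state chain, the rewards, the one-hot features, and the schedule alpha_t = t^(-0.6) are illustrative assumptions, not the paper's construction or its step-size choice.

```python
import random

def td0_linear(P, rewards, features, gamma, n_steps, seed=0):
    """TD(0) with linear estimates V(s) = theta . phi(s) on one trajectory."""
    rng = random.Random(seed)
    states = range(len(P))
    theta = [0.0] * len(features[0])
    state = 0
    for t in range(1, n_steps + 1):
        # Sample the next state from the current row of the transition matrix.
        next_state = rng.choices(states, weights=P[state])[0]
        phi, phi_next = features[state], features[next_state]
        v = sum(w * x for w, x in zip(theta, phi))
        v_next = sum(w * x for w, x in zip(theta, phi_next))
        # One-step temporal-difference error.
        td_error = rewards[state] + gamma * v_next - v
        # Assumed schedule: depends on t only, not on properties of the chain.
        alpha = t ** -0.6
        theta = [w + alpha * td_error * x for w, x in zip(theta, phi)]
        state = next_state
    return theta

# Hypothetical toy problem (all numbers are assumptions for illustration).
P = [[0.9, 0.1], [0.2, 0.8]]   # transition matrix
R = [1.0, 0.0]                 # reward received in each state
PHI = [[1.0, 0.0], [0.0, 1.0]] # one-hot features, so theta[s] estimates V(s)
theta = td0_linear(P, R, PHI, gamma=0.9, n_steps=200_000)
print(theta)
```

Solving V = r + gamma P V exactly for this toy chain gives values of roughly 7.57 and 4.86, which the learned weights can be compared against.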


TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Although Deep Reinforcement Learning (DRL) and Large Language Models (LLMs) each show promise in addressing decision-making challenges in autonomous driving, DRL often suffers from high sample complexity, while LLMs have difficulty ensuring real-time decision making. To address these limitations, we propose TeLL-Drive, a hybrid framework that integrates a Teacher LLM to guide an attention-based Student DRL policy. By incorporating risk metrics, historical scenario retrieval, and domain heuristics into context-rich prompts, the LLM produces high-level driving strategies through chain-of-thought reasoning. A self-attention mechanism then fuses these strategies with the DRL agent's exploration, accelerating policy convergence and boosting robustness across diverse driving conditions. Our experimental results, evaluated across multiple traffic scenarios, show that TeLL-Drive outperforms existing baseline methods, including other LLM-based approaches, in terms of success rates, average returns, and real-time feasibility. Ablation studies underscore the importance of each model component, especially the synergy between the attention mechanism and LLM-driven guidance. These findings suggest that TeLL-Drive significantly enhances both the adaptability and safety of autonomous driving systems, while offering a more efficient and scalable approach for policy learning. Full validation results are available on our website.


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

Thanks to all 6 reviewers for your helpful comments; we try to address all points raised in what follows. To Reviewer 1, * thank you for the positive feedback. As for your last comment regarding theoretical guarantees, this is something we are currently looking into, considering, among other things, the "convergent EP" of Heskes and Zoeter, 2002. To Reviewer 2, * re: "better explanation of how to find the proposal": the proposal at each node at a given step is the current approximation of the belief on that node, i.e., the product of the approximated messages coming into that node. This is explained in lines 196-209 (page 4); in particular, equation 14 gives the form of the proposals. This is one of the main points of our submission, i.e., suggesting to construct a proposal on the go using EP. We could try to make sure this point is clearer.


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

A nice advantage of predictive representations of stochastic processes is that they can be expressed in terms of families of linear operators --- the "observable operators" of Jaeger (oddly, not cited in this paper; also, see Upper, and the appendix to Shalizi and Crutchfield). This paper proposes (following some earlier work) to exploit this fact, by using the instrumental variables technique from econometrics to simplify the estimation of such models. Doing so results in an estimation procedure very similar to that of Langford et al. from 2009 (reference [16] in the paper), but with some advantages in terms of avoiding iterative re-estimation. However, there seems to be an important issue which isn't (that I saw) addressed here. The instrumental variable needs to be correlated with the input variable to the regression, but independent of the noise in the regression.
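
The requirement stated in the last sentence can be illustrated with a minimal sketch of the textbook single-regressor instrumental-variable estimator, cov(z, y) / cov(z, x), compared against ordinary least squares. The data-generating process and all variable names here are assumptions made for illustration; this is not the estimation procedure of the paper under review.

```python
import random

def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
n = 20_000
true_beta = 2.0                          # assumed true regression coefficient

z = [rng.gauss(0, 1) for _ in range(n)]  # instrument: independent of the noise u
u = [rng.gauss(0, 1) for _ in range(n)]  # regression noise, which also drives x
x = [zi + ui for zi, ui in zip(z, u)]    # input: correlated with both z and u
y = [true_beta * xi + ui for xi, ui in zip(x, u)]

# OLS is biased here because x is correlated with the noise u.
beta_ols = covariance(x, y) / covariance(x, x)
# The IV estimate uses only the variation in x that is explained by z.
beta_iv = covariance(z, y) / covariance(z, x)
print(beta_ols, beta_iv)
```

In this construction the OLS estimate tends toward 2.5 rather than 2, since cov(x, u) / var(x) = 1/2, while the IV estimate stays close to the true coefficient. If z were itself correlated with u, the IV estimate would be biased as well, which is the concern the review raises.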


Review for NeurIPS paper: Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes

Neural Information Processing Systems

Additional Feedback: Response to author feedback: From the informal discussion about the cross-component counters, I'm getting that it's somehow bad if different components have been explored unevenly and therefore encouraging more balanced exploration (pairwise) reduces overall variance in the amount of exploration between components. I'm sure there's a lot I'm not getting, but that helps a bit. I think it should be the case that you recover an object when you multiply its factors together (for the appropriate definition of "multiply"). There are papers (well, just one I can think of) that deal with truly factored MDPs that are the product of simpler MDPs. They correctly call their MDPs factored.


Review for NeurIPS paper: Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes

Neural Information Processing Systems

While this paper initially had some mild divergence of opinion among the reviewers, after the author response and some detailed discussion, it was agreed that this paper makes a solid contribution (please see the revised reviews). It is certainly of relevance to NeurIPS. After discussion, there was agreement on the significance of the conceptual contribution, namely the treatment of the cross-component bonuses. Several reviewers note that the mathematics is fairly "standard" (Bernstein-bound machinery), though in the end that should not be considered a drawback. At least one reviewer notes that the 31pp appendix means that it is not possible to verify the mathematical results during the review period.


Review for NeurIPS paper: Model-based Reinforcement Learning for Semi-Markov Decision Processes with Neural ODEs

Neural Information Processing Systems

Summary and Contributions: The paper proposes a method for utilizing ODEs to represent dynamics for continuous-time decision-making problems with the aim of They also target filling a perceived gap in the literature of Deep RL for continuous-time problems, where most publications are model-free and discretize time if it is continuous. They claim that their approach leads to lower dependence on vast amounts of training data, better performance and that the model-based approach is well-founded. I tend to agree, although this is not exactly my area. I also believe the importance of connecting ODEs and other explicit models is critical for extending RL methods to important problems in physics, chemistry, epidemiology and population modelling.



Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

SUMMARY Hamiltonian MCMC methods sample from a probability distribution by treating its negative log as a "potential energy" function over the state space, augmenting the space with extra "momentum variables" and their associated "kinetic energy", and evolving the state of the Markov process by integrating the physical Hamiltonian equations of motion of the system. Each step of the Markov chain is accomplished by numerically integrating the Hamiltonian equations forward in time. However, if the energy function is non-differentiable, the integral is not well-defined. The rejection step that is used to counteract numerical inaccuracies in the integration also accounts for such non-differentiable regions, but at the cost of slowing down the mixing rate of the Markov chain. This paper suggests physically-inspired "reflections" and "refractions" of the trajectory of the system that occur whenever the state crosses a discontinuity in the energy function. It applies to target distributions that are differentiable everywhere except on the boundaries of certain polytopes; the reflection or refraction occurs whenever the trajectory of the system crosses such a boundary.
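
The baseline scheme described in this summary can be sketched as a standard HMC step with leapfrog integration and a Metropolis rejection step. This is a minimal one-dimensional illustration for a smooth target; it does not implement the reflections or refractions the paper proposes. The standard normal target, the step size, and the number of leapfrog steps are assumptions chosen for the example.

```python
import math
import random

def hmc_step(q, potential, grad_potential, step_size, n_leapfrog, rng):
    """One HMC transition: resample momentum, leapfrog, accept or reject."""
    p = rng.gauss(0.0, 1.0)          # fresh momentum, kinetic energy p^2 / 2
    q_new, p_new = q, p
    # Leapfrog: half momentum step, alternating full steps, half momentum step.
    p_new -= 0.5 * step_size * grad_potential(q_new)
    for i in range(n_leapfrog):
        q_new += step_size * p_new
        if i < n_leapfrog - 1:
            p_new -= step_size * grad_potential(q_new)
    p_new -= 0.5 * step_size * grad_potential(q_new)
    # The rejection step corrects for the numerical integration error.
    h_old = potential(q) + 0.5 * p * p
    h_new = potential(q_new) + 0.5 * p_new * p_new
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new
    return q

# Assumed target: standard normal, so U(q) = q^2 / 2 up to a constant.
potential = lambda q: 0.5 * q * q
grad_potential = lambda q: q

rng = random.Random(0)
q, samples = 0.0, []
for _ in range(5000):
    q = hmc_step(q, potential, grad_potential, 0.2, 10, rng)
    samples.append(q)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```

Because the leapfrog update relies on `grad_potential`, a target with discontinuities in its energy has no valid gradient at the boundary, which is the situation the proposed reflections and refractions are meant to handle.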


Mixing Time Estimation in Reversible Markov Chains from a Single Sample Path

Neural Information Processing Systems

This article provides the first procedure for computing a fully data-dependent interval that traps the mixing time t_{mix} of a finite reversible ergodic Markov chain at a prescribed confidence level. The interval is computed from a single finite-length sample path from the Markov chain, and does not require the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which either only provide point estimates, or require a reset mechanism, or additional prior knowledge. The interval is constructed around the relaxation time t_{relax}, which is strongly related to the mixing time, and the width of the interval converges to zero roughly at a \sqrt{n} rate, where n is the length of the sample path. Upper and lower bounds are given on the number of samples required to achieve constant-factor multiplicative accuracy.
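
To make the quantity t_{relax} concrete, here is a minimal sketch that computes a naive plug-in point estimate of the relaxation time from one sample path of a two-state chain, using t_{relax} = 1 / (1 - max(lambda_2, |lambda_d|)) and the fact that a two-state chain has a single non-trivial eigenvalue, 1 - a - b. This is only a point estimate of the kind the abstract contrasts against; it is not the paper's confidence-interval procedure. The chain parameters and function name are assumptions for illustration.

```python
import random

def plug_in_relaxation_time(path):
    """Plug-in estimate of t_relax for a two-state chain from one sample path."""
    counts = [[0, 0], [0, 0]]
    for s, s_next in zip(path, path[1:]):
        counts[s][s_next] += 1
    # Empirical switching probabilities 0 -> 1 and 1 -> 0.
    a_hat = counts[0][1] / sum(counts[0])
    b_hat = counts[1][0] / sum(counts[1])
    # The second eigenvalue of [[1 - a, a], [b, 1 - b]] is 1 - a - b.
    lambda_2 = 1.0 - a_hat - b_hat
    return 1.0 / (1.0 - abs(lambda_2))

# Assumed chain: switch 0 -> 1 with probability 0.1 and 1 -> 0 with 0.2.
rng = random.Random(0)
a, b = 0.1, 0.2
path = [0]
for _ in range(100_000):
    s = path[-1]
    flip = rng.random() < (a if s == 0 else b)
    path.append(1 - s if flip else s)

t_relax_hat = plug_in_relaxation_time(path)
print(t_relax_hat)
```

For these parameters the true relaxation time is 1 / (0.1 + 0.2), about 3.33. The plug-in value carries no indication of its own accuracy, which is the gap a data-dependent confidence interval addresses.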