Markovian Score Climbing: Variational Inference with KL(p||q)

Naesseth, Christian A., Lindsten, Fredrik, Blei, David

Neural Information Processing Systems

Algorithm (MSC):
Output: λ_K, θ_K
1: for k = 1, ..., K do
2:   Sample z[k] ~ M(· | z[k-1]; λ_{k-1}, θ_{k-1})
3:   Compute s(z[k]; λ_{k-1}) = ∇_λ log q(z[k]; λ_{k-1})
4:   Compute ĝ_ML(θ_{k-1}) = ∇_θ log p(z[k], x; θ_{k-1})
5:   Set λ_k = λ_{k-1} + ε_k s(z[k]; λ_{k-1})
6:   Set θ_k = θ_{k-1} + α_k ĝ_ML(θ_{k-1})
7: end for

We compare MSC with SMC-based approaches [22] using [29].
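The λ-update in the loop above can be sketched on a toy problem. This is an illustrative sketch only: it assumes a one-dimensional Gaussian target standing in for p(z, x), uses an independence Metropolis-Hastings kernel as the Markov kernel M (the paper uses a conditional importance sampling kernel), uses a constant step size in place of ε_k, and omits the model-parameter (θ) update entirely.

```python
import math
import random

random.seed(0)

def log_p(z):
    # Toy unnormalized target, a stand-in for log p(z, x): N(2, 1).
    return -0.5 * (z - 2.0) ** 2

def log_q(z, lam):
    # Variational family q(z; lam) = N(lam, 1); its score in lam is (z - lam).
    return -0.5 * (z - lam) ** 2

def imh_step(z_prev, lam):
    # Independence Metropolis-Hastings with q as proposal: one choice of
    # Markov kernel M(.|z_prev; lam) that leaves the target invariant.
    z_prop = random.gauss(lam, 1.0)
    log_ratio = (log_p(z_prop) - log_q(z_prop, lam)) - \
                (log_p(z_prev) - log_q(z_prev, lam))
    if random.random() < math.exp(min(0.0, log_ratio)):
        return z_prop
    return z_prev

lam, z = 0.0, 0.0
for k in range(5000):
    z = imh_step(z, lam)      # line 2: sample z[k] ~ M
    score = z - lam           # line 3: grad_lam log q for a unit-variance Gaussian
    lam += 0.01 * score       # line 5: stochastic ascent with a constant step
print(round(lam, 1))          # lam drifts toward the target mean 2.0
```

Because the kernel keeps the target distribution invariant, the averaged score pushes λ toward the mean of p, which is the stationary point of the KL(p||q) objective in this family.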






Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms

Neural Information Processing Systems

The goal of reinforcement learning (RL) [39] is to maximize the expected total reward by taking actions according to a policy in a stochastic environment, which is modelled as a Markov decision process (MDP) [4]. To obtain an optimal policy, one popular method is the direct maximization of the expected total reward via gradient ascent, which is referred to as the policy gradient (PG) method [40, 47].
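The PG idea described above can be sketched with the score-function (REINFORCE) estimator on a two-armed bandit, the simplest possible MDP. This is a toy illustration of gradient ascent on expected reward, not the paper's (natural) actor-critic algorithm; the bandit, rewards, and step size are all invented for the example.

```python
import math
import random

random.seed(0)

def softmax(th):
    ez = [math.exp(t) for t in th]
    s = sum(ez)
    return [e / s for e in ez]

# Toy "MDP": a two-armed bandit where arm 1 always pays 1 and arm 0 pays 0.
rewards = [0.0, 1.0]
theta = [0.0, 0.0]   # softmax policy parameters
alpha = 0.1          # step size

for _ in range(2000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1
    r = rewards[a]
    # Score-function gradient: grad_theta log pi(a) = onehot(a) - probs.
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * r * grad  # gradient ascent on expected reward

print(softmax(theta)[1])  # probability of the rewarding arm grows toward 1
```

Actor-critic methods, the subject of this paper, replace the raw reward r in the update with a learned value estimate (the critic) to reduce the variance of this estimator.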


Persuading Farsighted Receivers in MDPs: the Power of Honesty

Neural Information Processing Systems

Bayesian persuasion studies the problem faced by an informed sender who strategically discloses information to influence the behavior of an uninformed receiver. Recently, growing attention has been devoted to settings where the sender and the receiver interact sequentially, in which the receiver's decision-making problem is usually modeled as a Markov decision process (MDP). However, the literature focuses on computing optimal information-revelation policies (a.k.a.


Accelerated Distributional Temporal Difference Learning with Linear Function Approximation

Jin, Kaicheng, Peng, Yang, Yang, Jiansheng, Zhang, Zhihua

arXiv.org Machine Learning

In this paper, we study the finite-sample statistical rates of distributional temporal difference (TD) learning with linear function approximation. The purpose of distributional TD learning is to estimate the return distribution of a discounted Markov decision process for a given policy. Previous works on statistical analysis of distributional TD learning focus mainly on the tabular case. We first consider the linear function approximation setting and conduct a fine-grained analysis of the linear-categorical Bellman equation. Building on this analysis, we further incorporate variance reduction techniques in our new algorithms to establish tight sample complexity bounds independent of the support size $K$ when $K$ is large. Our theoretical results imply that, when employing distributional TD learning with linear function approximation, learning the full distribution of the return function from streaming data is no more difficult than learning its expectation. This work provides new insights into the statistical efficiency of distributional reinforcement learning algorithms.
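The categorical distributional TD update discussed above can be sketched in the tabular case (which is the linear setting with one-hot features). The sketch below assumes a two-state deterministic chain invented for the example and uses a standard C51-style projection of the distributional Bellman target onto a fixed support; it is not the authors' algorithm, which additionally handles general linear features and variance reduction.

```python
import math

# K atoms on [0, 2]; probs[s] is the categorical estimate of the
# return distribution at state s.
K = 11
support = [2.0 * i / (K - 1) for i in range(K)]
dz = support[1] - support[0]
probs = [[1.0 / K] * K for _ in range(2)]

gamma, lr = 0.5, 0.5
reward = [1.0, 0.0]
nxt = [1, 0]  # deterministic chain: s0 -> s1 -> s0 -> ...

def categorical_td(s):
    # Distributional Bellman target: shift/scale the next state's atoms by
    # r + gamma*z, then project back onto the fixed support (C51-style).
    target = [0.0] * K
    for j, p in enumerate(probs[nxt[s]]):
        tz = min(max(reward[s] + gamma * support[j], support[0]), support[-1])
        b = (tz - support[0]) / dz
        lo, hi = int(math.floor(b)), int(math.ceil(b))
        if lo == hi:
            target[lo] += p
        else:  # split mass between neighboring atoms, preserving the mean
            target[lo] += p * (hi - b)
            target[hi] += p * (b - lo)
    for i in range(K):
        probs[s][i] += lr * (target[i] - probs[s][i])

for _ in range(200):
    categorical_td(0)
    categorical_td(1)

mean0 = sum(p * z for p, z in zip(probs[0], support))
print(round(mean0, 2))  # mean approaches the true return 4/3 from state 0
```

Because the projection preserves the mean of each shifted atom, the implied value estimate obeys the ordinary Bellman equation, illustrating the abstract's point that learning the distribution need not be harder than learning its expectation.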


Finite-time Convergence Analysis of Actor-Critic with Evolving Reward

Hu, Rui, Chen, Yu, Huang, Longbo

arXiv.org Artificial Intelligence

Many popular practical reinforcement learning (RL) algorithms employ evolving reward functions (through techniques such as reward shaping, entropy regularization, or curriculum learning), yet their theoretical foundations remain underdeveloped. This paper provides the first finite-time convergence analysis of a single-timescale actor-critic algorithm in the presence of an evolving reward function under Markovian sampling. We consider a setting where the reward parameters may change at each time step, affecting both policy optimization and value estimation. Under standard assumptions, we derive non-asymptotic bounds for both actor and critic errors. Our result shows that an $O(1/\sqrt{T})$ convergence rate is achievable, matching the best-known rate for static rewards, provided the reward parameters evolve slowly enough. This rate is preserved when the reward is updated via a gradient-based rule with bounded gradient and on the same timescale as the actor and critic, offering a theoretical foundation for many popular RL techniques. As a secondary contribution, we introduce a novel analysis of distribution mismatch under Markovian sampling, improving the best-known rate by a factor of $\log^2 T$ in the static-reward case.
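The setting above (actor, critic, and a slowly drifting reward parameter, all updated on a single timescale) can be sketched on a toy one-state problem. Everything here is an invented illustration: the bandit, the decaying entropy-bonus reward as the "evolving reward", and the shared step size are assumptions of the sketch, not the paper's construction.

```python
import math
import random

random.seed(1)

def softmax(th):
    ez = [math.exp(t) for t in th]
    s = sum(ez)
    return [e / s for e in ez]

theta = [0.0, 0.0]   # actor: softmax policy over two actions
v = 0.0              # critic: value baseline for the single state
beta = 0.5           # evolving reward parameter: entropy-bonus weight
alpha = 0.05         # single step size shared by all updates (single timescale)

for _ in range(4000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1
    env_r = 1.0 if a == 1 else 0.0
    # Evolving reward: environment reward plus a shrinking entropy bonus.
    r = env_r - beta * math.log(probs[a])
    delta = r - v                        # TD error (one state, no bootstrap)
    v += alpha * delta                   # critic update
    for i in range(2):                   # actor update with critic baseline
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * delta * grad
    beta *= 1.0 - alpha / 10.0           # reward parameter evolves slowly

print(softmax(theta)[1])  # policy concentrates on the rewarding arm
```

The point of the sketch is the structure: the reward parameter beta changes at every step alongside the actor and critic, but its per-step drift is small, mirroring the paper's condition that slow reward evolution preserves the static-reward convergence rate.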