AITopics | td algorithm

Collaborating Authors

td algorithm

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

e2c4a40d50b47094f571e40efead3900-AuthorFeedback.pdf

Neural Information Processing SystemsFeb-14-2026, 17:22:17 GMT

assumption, formula, mean square error, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.54)

Add feedback

sponse addressing one common point raised by Reviewer 1 and Reviewer 3 regarding how to handle the case where 2 null

Neural Information Processing SystemsAug-20-2025, 06:29:57 GMT

We thank all the reviewers for their careful feedback and will revise our paper accordingly. Such a fact is presented in the classic paper "An analysis of temporal-difference learning with function Similar facts can be found for other TD algorithms (e.g. Reviewer 1 is correct in that a discount factor is needed. Now we address specific reviewer comments below. A reference for this is the classic paper "An Finally, the "-" sign in Line 213 is due to the Hurwtiz assumption.

assumption, reviewer 1, reviewer 3, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Stabilizing Temporal Difference Learning via Implicit Stochastic Recursion

Kim, Hwanwoo, Toulis, Panos, Laber, Eric

arXiv.org Machine LearningJun-24-2025

Temporal difference (TD) learning is a foundational algorithm in reinforcement learning (RL). For nearly forty years, TD learning has served as a workhorse for applied RL as well as a building block for more complex and specialized algorithms. However, despite its widespread use, TD procedures are generally sensitive to step size specification. A poor choice of step size can dramatically increase variance and slow convergence in both on-policy and off-policy evaluation tasks. In practice, researchers use trial and error to identify stable step sizes, but these approaches tend to be ad hoc and inefficient. As an alternative, we propose implicit TD algorithms that reformulate TD updates into fixed point equations. Such updates are more stable and less sensitive to step size without sacrificing computational efficiency. Moreover, we derive asymptotic convergence guarantees and finite-time error bounds for our proposed implicit TD algorithms, which include implicit TD(0), TD($λ$), and TD with gradient correction (TDC). Our results show that implicit TD algorithms are applicable to a much broader range of step sizes, and thus provide a robust and versatile framework for policy evaluation and value approximation in modern RL tasks. We demonstrate these benefits empirically through extensive numerical examples spanning both on-policy and off-policy tasks.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

2505.01361

Country:

North America > Canada > Alberta (0.13)
North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.45)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP

Sangadi, Tejaram, Prashanth, L. A., Jagannathan, Krishna

arXiv.org Artificial IntelligenceJun-12-2024

In the standard reinforcement learning (RL) setting, the objective is to learn a policy that maximizes the value function, which is the expectation of the cumulative reward that is obtained over a finite or infinite time horizon. However, in several practical scenarios including finance, automated driving and drug testing, a risk sensitive learning paradigm assumes importance, wherein the value function, which is an expectation, needs to be traded off suitably with an appropriate risk metric associated with the reward distribution. One way to achieve this is to solve a constrained optimization problem with this risk metric as a constraint, and the value function as the objective. Variance is a popular risk measure, which is usually incorporated into a risk-sensitive optimization problem as a constraint, with the usual expected value as the objective. Such a mean-variance formulation was studied in the seminal work of Markowitz [10]. In the context of RL, mean-variance optimization has been considered in several previous works, cf.

finite time analysis, mean-variance, temporal difference learning, (12 more...)

arXiv.org Artificial Intelligence

2406.07892

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning Models

Pan, Yangchen, Wen, Junfeng, Xiao, Chenjun, Torr, Philip

arXiv.org Artificial IntelligenceMay-9-2024

In traditional statistical learning, data points are usually assumed to be independently and identically distributed (i.i.d.) following an unknown probability distribution. This paper presents a contrasting viewpoint, perceiving data points as interconnected and employing a Markov reward process (MRP) for data modeling. We reformulate the typical supervised learning as an on-policy policy evaluation problem within reinforcement learning (RL), introducing a generalized temporal difference (TD) learning algorithm as a resolution. Theoretically, our analysis draws connections between the solutions of linear TD learning and ordinary least squares (OLS). We also show that under specific conditions, particularly when noises are correlated, the TD's solution proves to be a more effective estimator than OLS. Furthermore, we establish the convergence of our generalized TD algorithms under linear function approximation. Empirical studies verify our theoretical results, examine the vital design of our TD algorithm and show practical utility across various datasets, encompassing tasks such as regression and image classification with deep learning.

algorithm, dataset, matrix, (14 more...)

arXiv.org Artificial Intelligence

2404.15518

Country:

Oceania > Australia (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > Canada > Ontario > National Capital Region > Ottawa (0.14)
North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Banking & Finance (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback

Sample and Communication Efficient Fully Decentralized MARL Policy Evaluation via a New Approach: Local TD update

Hairi, Fnu, Zhang, Zifan, Liu, Jia

arXiv.org Artificial IntelligenceMar-23-2024

In actor-critic framework for fully decentralized multi-agent reinforcement learning (MARL), one of the key components is the MARL policy evaluation (PE) problem, where a set of $N$ agents work cooperatively to evaluate the value function of the global states for a given policy through communicating with their neighbors. In MARL-PE, a critical challenge is how to lower the sample and communication complexities, which are defined as the number of training samples and communication rounds needed to converge to some $\epsilon$-stationary point. To lower communication complexity in MARL-PE, a "natural'' idea is to perform multiple local TD-update steps between each consecutive rounds of communication to reduce the communication frequency. However, the validity of the local TD-update approach remains unclear due to the potential "agent-drift'' phenomenon resulting from heterogeneous rewards across agents in general. This leads to an interesting open question: Can the local TD-update approach entail low sample and communication complexities? In this paper, we make the first attempt to answer this fundamental question. We focus on the setting of MARL-PE with average reward, which is motivated by many multi-agent network optimization problems. Our theoretical and experimental results confirm that allowing multiple local TD-update steps is indeed an effective approach in lowering the sample and communication complexities of MARL-PE compared to consensus-based MARL-PE algorithms. Specifically, the local TD-update steps between two consecutive communication rounds can be as large as $\mathcal{O}(1/\epsilon^{1/2}\log{(1/\epsilon)})$ in order to converge to an $\epsilon$-stationary point of MARL-PE. Moreover, we show theoretically that in order to reach the optimal sample complexity, the communication complexity of local TD-update approach is $\mathcal{O}(1/\epsilon^{1/2}\log{(1/\epsilon)})$.

algorithm, communication round, complexity, (12 more...)

arXiv.org Artificial Intelligence

2403.15935

Country:

Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Wisconsin (0.04)
(2 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

Neural Information Processing SystemsFeb-16-2024, 11:07:20 GMT

We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD( \lambda), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al (2009a,b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman-error, and algorithms that perform stochastic gradient-descent on this function. In this paper, we generalize their work to nonlinear function approximation.

algorithm, arbitrary smooth function approximation, convergent temporal-difference learning, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Bayesian Approach to Linear Bayesian Networks

Hwang, Seyong, Lee, Kyoungjae, Oh, Sunmin, Park, Gunwoong

arXiv.org Machine LearningNov-27-2023

This study proposes the first Bayesian approach for learning high-dimensional linear Bayesian networks. The proposed approach iteratively estimates each element of the topological ordering from backward and its parent using the inverse of a partial covariance matrix. The proposed method successfully recovers the underlying structure when Bayesian regularization for the inverse covariance matrix with unequal shrinkage is applied. Specifically, it shows that the number of samples $n = \Omega( d_M^2 \log p)$ and $n = \Omega(d_M^2 p^{2/m})$ are sufficient for the proposed algorithm to learn linear Bayesian networks with sub-Gaussian and 4m-th bounded-moment error distributions, respectively, where $p$ is the number of nodes and $d_M$ is the maximum degree of the moralized graph. The theoretical findings are supported by extensive simulation studies including real data analysis. Furthermore the proposed method is demonstrated to outperform state-of-the-art frequentist approaches, such as the BHLSM, LISTEN, and TD algorithms in synthetic data.

artificial intelligence, bayesian inference, machine learning, (19 more...)

arXiv.org Machine Learning

2311.1561

Country:

Asia > South Korea > Seoul > Seoul (0.05)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Spain > Canary Islands (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Add feedback

Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation

Patil, Gandharv, A., Prashanth L., Nagaraj, Dheeraj, Precup, Doina

arXiv.org Machine LearningSep-11-2023

We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal $O\left(1/t\right)$ rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation. From analysis, we conclude that the regularised version of TD is useful for problems with ill-conditioned features.

artificial intelligence, machine learning, reinforcement learning, (20 more...)

arXiv.org Machine Learning

2210.05918

Country:

North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.41)

Add feedback

Fairness and Bias in Truth Discovery Algorithms: An Experimental Analysis

Lazier, Simone, Thirumuruganathan, Saravanan, Anahideh, Hadis

arXiv.org Artificial IntelligenceApr-25-2023

Machine learning (ML) based approaches are increasingly being used in a number of applications with societal impact. Training ML models often require vast amounts of labeled data, and crowdsourcing is a dominant paradigm for obtaining labels from multiple workers. Crowd workers may sometimes provide unreliable labels, and to address this, truth discovery (TD) algorithms such as majority voting are applied to determine the consensus labels from conflicting worker responses. However, it is important to note that these consensus labels may still be biased based on sensitive attributes such as gender, race, or political affiliation. Even when sensitive attributes are not involved, the labels can be biased due to different perspectives of subjective aspects such as toxicity. In this paper, we conduct a systematic study of the bias and fairness of TD algorithms. Our findings using two existing crowd-labeled datasets, reveal that a non-trivial proportion of workers provide biased results, and using simple approaches for TD is sub-optimal. Our study also demonstrates that popular TD algorithms are not a panacea. Additionally, we quantify the impact of these unfair workers on downstream ML tasks and show that conventional methods for achieving fairness and correcting label biases are ineffective in this setting. We end the paper with a plea for the design of novel bias-aware truth discovery algorithms that can ameliorate these issues.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

2304.12573

Country: North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.79)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.71)
Information Technology > Communications > Social Media > Crowdsourcing (0.51)

Add feedback