Collaborating Authors

 Zhengyuan Zhou


Countering Feedback Delays in Multi-Agent Learning

Neural Information Processing Systems

We consider a model of game-theoretic learning based on online mirror descent (OMD) with asynchronous and delayed feedback information. Instead of focusing on specific games, we consider a broad class of continuous games defined by the general equilibrium stability notion, which we call λ-variational stability. Our first contribution is that, in this class of games, the actual sequence of play induced by OMD-based learning converges to Nash equilibria provided that the feedback delays faced by the players are synchronous and bounded. Subsequently, to tackle fully decentralized, asynchronous environments with (possibly) unbounded delays between actions and feedback, we propose a variant of OMD which we call delayed mirror descent (DMD), and which relies on the repeated leveraging of past information. With this modification, the algorithm converges to Nash equilibria with no feedback synchronicity assumptions and even when the delays grow superlinearly relative to the horizon of play.
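
To make the delay mechanics concrete, here is a minimal sketch of OMD under bounded feedback delays, using an entropic mirror map on the simplex (so each mirror step is an exponentiated-gradient update). The fixed delay, the step size eta, and all function names are illustrative assumptions, not the paper's exact construction; the DMD variant would additionally reuse past gradients rather than apply each one once.

```python
import numpy as np

def omd_delayed(grad_oracle, T, d, eta=0.1, delay=5):
    """Sketch: online mirror descent under bounded feedback delays.

    `grad_oracle(t, x)` returns the payoff gradient for the action
    played at round t; that feedback only becomes usable `delay`
    rounds later. Entropic mirror map on the simplex, so each mirror
    step is a normalized exponentiated-gradient update.
    """
    x = np.full(d, 1.0 / d)               # uniform initial mixed action
    pending = []                          # buffered (arrival_time, gradient)
    for t in range(T):
        pending.append((t + delay, grad_oracle(t, x)))
        ready = [g for (s, g) in pending if s <= t]      # delay elapsed
        pending = [(s, g) for (s, g) in pending if s > t]
        for g in ready:
            x = x * np.exp(eta * g)       # ascent step on own payoff
            x = x / x.sum()               # project back onto the simplex
    return x
```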


Learning in Games with Lossy Feedback

Neural Information Processing Systems

We consider a game-theoretic multi-agent learning problem where feedback information can be lost during the learning process and rewards are given by a broad class of games known as variationally stable games. We propose a simple variant of the classical online gradient descent algorithm, called reweighted online gradient descent (ROGD), and show that in variationally stable games, if each agent adopts ROGD, then almost sure convergence to the set of Nash equilibria is guaranteed, even when the feedback loss is asynchronous and arbitrarily correlated among agents. We then extend the framework to deal with unknown feedback loss probabilities by using an estimator (constructed from past data) in their place. Finally, we further extend the framework to accommodate both asynchronous loss and stochastic rewards and establish that multi-agent ROGD learning still converges to the set of Nash equilibria in such settings. Together, these results contribute to the broad landscape of multi-agent online learning by significantly relaxing the feedback information required to achieve desirable outcomes.
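
As a rough illustration of the reweighting idea, the sketch below importance-weights each received gradient by the inverse of its survival probability, so the expected update matches the lossless one; rounds with lost feedback leave the iterate unchanged. The box projection, the step size, and the assumption that the loss probability is known (the abstract's extension would substitute an estimate built from past data) are illustrative, not necessarily the paper's exact scheme.

```python
import numpy as np

def project_box(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto a box, standing in here for an
    agent's (convex, compact) action set."""
    return np.clip(x, lo, hi)

def rogd_step(x, grad, received, p_recv, eta=0.1):
    """Sketch: one round of reweighted online gradient descent (ROGD).

    When this round's feedback survives (an event of probability
    `p_recv`), the gradient is importance-weighted by 1/p_recv so the
    expected update matches the lossless update; on a lost round the
    iterate stays put.
    """
    if received:
        x = project_box(x + eta * grad / p_recv)  # ascent on own payoff
    return x
```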



Learning in Generalized Linear Contextual Bandits with Stochastic Delays

Neural Information Processing Systems

In this paper, we consider online learning in generalized linear contextual bandits where rewards are not immediately observed. Instead, rewards are available to the decision maker only after some delay, which is unknown and stochastic, even though a decision must be made at each time step for an incoming set of contexts. We study the performance of upper confidence bound (UCB) based algorithms adapted to this delayed setting. In particular, we design a delay-adaptive algorithm, which we call Delayed UCB, for generalized linear contextual bandits using UCB-style exploration, and establish regret bounds under various delay assumptions. In the important special case of linear contextual bandits, we further modify this algorithm and establish a tighter regret bound under the same delay assumptions. Our results contribute to the broad landscape of the contextual bandit literature by establishing that UCB algorithms, which are widely deployed in modern recommendation engines, can be made robust to delays.
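
A minimal sketch of the delayed-feedback bookkeeping in the linear special case: pending rewards are buffered until their delays elapse, and only arrived observations enter the ridge-regression statistics behind the UCB index. The confidence width alpha, the regularizer lam, and the class interface are illustrative assumptions, not the paper's exact Delayed UCB construction.

```python
import numpy as np

class DelayedLinUCB:
    """Sketch: a UCB rule for linear contextual bandits with delays."""

    def __init__(self, d, alpha=1.0, lam=1.0):
        self.A = lam * np.eye(d)   # regularized Gram matrix
        self.b = np.zeros(d)       # reward-weighted feature sum
        self.alpha = alpha         # confidence width (illustrative)
        self.pending = []          # buffered (arrival_time, context, reward)

    def choose(self, t, contexts):
        # absorb any feedback whose delay has elapsed by round t
        for s, x, r in [p for p in self.pending if p[0] <= t]:
            self.A += np.outer(x, x)
            self.b += r * x
        self.pending = [p for p in self.pending if p[0] > t]
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b     # ridge estimate from arrived rewards only
        ucb = [x @ theta + self.alpha * np.sqrt(x @ A_inv @ x)
               for x in contexts]
        return int(np.argmax(ucb))

    def record(self, t, delay, x, r):
        # reward observed at round t becomes available at t + delay
        self.pending.append((t + delay, x, r))
```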



Stochastic Mirror Descent in Variationally Coherent Optimization Problems

Neural Information Processing Systems

In this paper, we examine a class of non-convex stochastic optimization problems which we call variationally coherent, and which properly includes pseudo-/quasi-convex and star-convex optimization problems. To solve such problems, we focus on the widely used stochastic mirror descent (SMD) family of algorithms (which contains stochastic gradient descent as a special case), and we show that the last iterate of SMD converges to the problem's solution set with probability 1. This result contributes to the landscape of non-convex stochastic optimization by clarifying that neither pseudo-/quasi-convexity nor star-convexity is essential for (almost sure) global convergence; rather, variational coherence, a much weaker requirement, suffices. We also characterize convergence rates for the subclass of strongly variationally coherent optimization problems and present simulation results.
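
For concreteness, here is a sketch of the special case mentioned in the abstract: SMD with a Euclidean mirror map, i.e., projected stochastic gradient descent, returning the last iterate whose convergence is the object of the result. The 1/sqrt(t) step-size schedule and the identity projection default are illustrative assumptions.

```python
import numpy as np

def smd_last_iterate(stoch_grad, x0, T, eta0=0.5, project=lambda x: x):
    """Sketch: stochastic mirror descent with a Euclidean mirror map,
    i.e., projected SGD (the special case noted in the abstract).

    `stoch_grad(x)` returns an unbiased noisy gradient of the
    objective; the step size decays as eta0 / sqrt(t + 1). The last
    iterate is returned, since its convergence (not that of an
    ergodic average) is the claim in question.
    """
    x = np.asarray(x0, dtype=float)
    for t in range(T):
        x = project(x - eta0 / np.sqrt(t + 1) * stoch_grad(x))
    return x
```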