Queue Up Your Regrets: Achieving the Dynamic Capacity Region of Multiplayer Bandits
Neural Information Processing Systems
Consider N cooperative agents such that for T turns, each agent n takes an action a_n and receives a stochastic reward r_n(a_1, ..., a_N). Agents cannot observe the actions of other agents and do not know even their own reward function. The agents can communicate with their neighbors on a connected graph G with diameter d(G). We want each agent n to achieve an expected average reward of at least λ_n over time, for a given quality of service (QoS) vector λ. A QoS vector λ is not necessarily achievable.
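To make the setup concrete, here is a minimal Python sketch of the environment only, not of the paper's algorithm. It assumes Bernoulli rewards, a hypothetical collision-style mean reward (`collision_mean`), and a placeholder policy in which every agent picks actions uniformly at random; none of these choices come from the abstract. It shows how each agent's reward depends on the joint action and how an empirical average reward can be compared against a QoS vector λ.

```python
import random


def simulate(mean_reward, num_agents, num_actions, qos, turns, seed=0):
    """Play uniformly random actions (a placeholder policy, an assumption)
    and return each agent's empirical average reward plus a QoS check."""
    rng = random.Random(seed)
    totals = [0.0] * num_agents
    for _ in range(turns):
        # Joint action (a_1, ..., a_N); agents act without seeing each other.
        joint = tuple(rng.randrange(num_actions) for _ in range(num_agents))
        for n in range(num_agents):
            # Assumed Bernoulli reward whose mean depends on the joint action.
            p = mean_reward(n, joint)
            totals[n] += 1.0 if rng.random() < p else 0.0
    averages = [total / turns for total in totals]
    # Agent n meets its QoS target if its average reward is at least lambda_n.
    meets = [avg >= lam for avg, lam in zip(averages, qos)]
    return averages, meets


def collision_mean(n, joint):
    """Hypothetical reward model: mean 0.9 if agent n is alone on its
    action, 0 if another agent chose the same action."""
    return 0.9 if joint.count(joint[n]) == 1 else 0.0


if __name__ == "__main__":
    averages, meets = simulate(collision_mean, num_agents=2, num_actions=2,
                               qos=[0.3, 0.6], turns=20000)
    print(averages, meets)
```

Under these assumptions, two agents choosing between two actions at random avoid a collision half the time, so each earns about 0.45 on average. A target of 0.3 is then met while a target of 0.6 is not, which illustrates why a given QoS vector λ need not be achievable under a particular policy.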
Apr-24-2026, 09:17:13 GMT