To this end, we use kernel-based regularity assumptions to capture and exploit the structure in the opponent's response. We propose a novel algorithm for the learner when playing against an adversarial sequence of opponents.
Specifically, it has been known for more than 20 years that when all players seek to minimize their internal regret in a repeated normal-form game, the empirical frequency of play converges to a normal-form correlated equilibrium.
MARL in real scenarios is still challenging due to the same safety and efficiency concerns in single-agent setting, then it is worth conducting investigation for offline RL in multi-agent setting.