Online Episodic Convex Reinforcement Learning
Moreno, Bianca Marin, Eldowa, Khaled, Gaillard, Pierre, Brégère, Margaux, Oudjane, Nadia
We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent's policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge on the transition function. To achieve this, we use an online mirror descent algorithm with varying constraint sets and a carefully designed exploration bonus. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent's policy. We achieve a sub-linear regret bound for this more challenging problem by adapting techniques from bandit convex optimization to the MDP setting.
May-13-2025
- Country:
- North America > United States
- Massachusetts > Middlesex County > Cambridge (0.04)
- Europe
- Spain
- Canary Islands (0.04)
- Andalusia > Cádiz Province
- Cadiz (0.04)
- Italy > Lombardy
- Milan (0.04)
- France
- Île-de-France > Paris
- Paris (0.04)
- Auvergne-Rhône-Alpes > Isère
- Grenoble (0.04)
- Île-de-France > Paris
- Spain
- North America > United States
- Genre:
- Research Report (0.81)
- Workflow (0.67)
- Industry:
- Education > Educational Setting (0.34)