Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback
