The Curse of Diversity in Ensemble-Based Exploration
Lin, Zhixuan, D'Oro, Pierluca, Nikishin, Evgenii, Courville, Aaron
We uncover a surprising phenomenon in deep reinforcement learning: training a diverse ensemble of data-sharing agents (a well-established exploration strategy) can significantly impair the performance of the individual ensemble members compared to standard single-agent training. Through careful analysis, we attribute this performance degradation to the low proportion of self-generated data in the shared training data of each ensemble member, as well as the inability of the individual ensemble members to learn effectively from such highly off-policy data. We thus name this phenomenon the curse of diversity. We find that several intuitive solutions, such as a larger replay buffer or a smaller ensemble size, either fail to consistently mitigate the performance loss or undermine the advantages of ensembling. Finally, we demonstrate the potential of representation learning to counteract the curse of diversity with a novel method named Cross-Ensemble Representation Learning (CERL) in both discrete and continuous control domains. Our work offers valuable insights into an unexpected pitfall in ensemble-based exploration and raises important caveats for future applications of similar approaches.

The potential benefits of a diverse ensemble are twofold. At training time, it enables concurrent exploration with multiple distinct policies without requiring additional samples. At test time, the learned policies can be combined into a robust ensemble policy via aggregation methods such as majority voting (Osband et al., 2016) or averaging (Januszewski et al., 2021).

Despite the generally positive perception of ensemble-based exploration, we argue that this approach has a potentially negative aspect that has long been overlooked. As shown in Figure 1, for each member in a data-sharing ensemble, only a small proportion of its training data comes from its own interaction with the environment. The majority of its training data is generated by other members of the ensemble, whose policies may differ substantially from its own. This type of off-policy learning has been shown to be highly challenging in previous work (Ostrovski et al., 2021). We therefore hypothesize that similar learning difficulties can also arise in ensemble-based exploration.

We verify our hypothesis in the Arcade Learning Environment (Bellemare et al., 2012) with the Bootstrapped DQN (Osband et al., 2016) algorithm and the Gym MuJoCo benchmark (Towers et al., 2023) with an ensemble SAC (Haarnoja et al., 2018a) algorithm. We show that, in many environments, the individual members of a data-sharing ensemble significantly underperform their single-agent counterparts. Moreover, while aggregating the policies of all ensemble members via voting or averaging sometimes compensates for the degradation in individual members' performance, this is not always the case.
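To make the data-sharing setup concrete, the sketch below simulates a shared replay buffer in which one of N ensemble members acts at each step, so every member ends up training on only about 1/N self-generated data, and it includes a simple majority-voting aggregation over discrete actions. This is an illustrative toy, not the paper's implementation; the ensemble size, step count, and function names are assumptions made for the example.

```python
import random
from collections import Counter

# Illustrative sketch (not the paper's implementation) of ensemble-based
# exploration with a shared replay buffer. It shows why each member sees
# mostly data generated by the other members' policies.

ENSEMBLE_SIZE = 5      # hypothetical number of ensemble members
STEPS = 10_000         # hypothetical number of environment steps

shared_buffer = []     # stores (transition, id of the member that generated it)

for step in range(STEPS):
    # In Bootstrapped DQN-style schemes, a single member's policy is followed
    # at a time; here we simply pick the acting member uniformly at random.
    acting_member = random.randrange(ENSEMBLE_SIZE)
    transition = {"step": step}  # placeholder for an (s, a, r, s') tuple
    shared_buffer.append((transition, acting_member))

# Every member trains on the full shared buffer, so the fraction of
# self-generated data per member is only about 1 / ENSEMBLE_SIZE.
counts = Counter(member for _, member in shared_buffer)
for member in range(ENSEMBLE_SIZE):
    fraction = counts[member] / len(shared_buffer)
    print(f"member {member}: {fraction:.1%} of its training data is self-generated")

# At test time, the members can be aggregated into a single policy, e.g. by
# majority voting over discrete actions (averaging is the continuous analogue).
def majority_vote(actions):
    """Return the action proposed most often by the ensemble members."""
    return Counter(actions).most_common(1)[0][0]

print(majority_vote([2, 0, 2, 1, 2]))  # -> 2
```

With five members, each member's own interactions make up roughly 20% of the shared data, which is the highly off-policy regime the paper identifies as the source of the curse of diversity.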
arXiv.org Artificial Intelligence
May-7-2024