Direct Preference-Based Evolutionary Multi-Objective Optimization with Dueling Bandits

Neural Information Processing Systems 

The ultimate goal of multi-objective optimization (MO) is to assist human decision-makers (DMs) in identifying solutions of interest (SOI) that optimally reconcile multiple objectives according to their preferences. Yet current preference-based evolutionary MO (PBEMO) approaches are prone to being inefficient and misaligned with the DM's true aspirations, especially when they inadvertently exploit mis-calibrated reward models. The stochastic nature of human feedback exacerbates this problem. This paper proposes a novel framework that navigates MO to SOI by directly leveraging human feedback, without being restricted by a predefined reward model or cumbersome model selection. Specifically, we develop a clustering-based stochastic dueling bandits algorithm that scales well to high-dimensional dueling bandits and achieves a regret of $\mathcal{O}(K^2 \log T)$, where $K$ is the number of clusters and $T$ is the number of rounds.
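To make the dueling-bandits setting concrete, the following is a minimal sketch of a generic pairwise-comparison loop over K arms, where each arm could stand for one cluster of candidate solutions. It is not the algorithm proposed in the paper: the selection rule is a simple optimistic (RUCB-style) heuristic, the simulated DM follows a logistic preference model chosen purely for illustration, and all names (`duel`, `run_dueling_bandit`, `utility`) are hypothetical.

```python
import math
import random


def duel(i, j, utility, rng):
    """Simulated noisy DM (assumption): prefers arm i over arm j with a
    logistic probability driven by a hidden utility gap."""
    p = 1.0 / (1.0 + math.exp(-(utility[i] - utility[j])))
    return rng.random() < p


def run_dueling_bandit(utility, rounds, seed=0):
    """Run an optimistic dueling-bandit loop and return (best_arm, wins).

    wins[i][j] counts how often arm i was preferred over arm j.
    Requires at least two arms.
    """
    rng = random.Random(seed)
    k = len(utility)
    wins = [[0] * k for _ in range(k)]

    for t in range(1, rounds + 1):
        def ucb(i, j):
            # Optimistic estimate of P(arm i is preferred over arm j).
            n = wins[i][j] + wins[j][i]
            if n == 0:
                return 1.0
            return min(1.0, wins[i][j] / n + math.sqrt(2.0 * math.log(t) / n))

        # Arms that might still beat every other arm under optimism.
        candidates = [
            i for i in range(k)
            if all(ucb(i, j) >= 0.5 for j in range(k) if j != i)
        ]
        first = rng.choice(candidates) if candidates else rng.randrange(k)
        # Challenger: the arm with the best optimistic chance against `first`.
        second = max((j for j in range(k) if j != first),
                     key=lambda j: ucb(j, first))

        # One round of stochastic human feedback.
        if duel(first, second, utility, rng):
            wins[first][second] += 1
        else:
            wins[second][first] += 1

    def score(i):
        # Copeland-style score: number of opponents beaten by empirical majority.
        return sum(1 for j in range(k) if j != i and wins[i][j] > wins[j][i])

    return max(range(k), key=score), wins


if __name__ == "__main__":
    best, wins = run_dueling_bandit([0.0, 0.5, 2.0], rounds=500, seed=1)
    print("estimated preferred arm:", best)
```

Each round consumes exactly one pairwise comparison, which matches the kind of direct preference feedback the abstract describes; no reward model is fitted at any point in this sketch.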