The Majority is not always right: RL training for solution aggregation