important criteria
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Gunjal, Anisha, Wang, Anthony, Lau, Elaine, Nath, Vaskar, He, Yunzhong, Liu, Bing, Hendryx, Sean
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for complex reasoning tasks with clear correctness signals such as math and coding. However, extending it to real-world reasoning tasks is challenging, as evaluation depends on nuanced, multi-criteria judgments rather than binary correctness. Instance-specific rubrics have recently been used in evaluation benchmarks to capture such judgments, but their potential as reward signals for on-policy post-training remains underexplored. We introduce $\textbf{Rubrics as Rewards}$ (RaR), an on-policy reinforcement learning method that extends RLVR beyond verifiable domains by using rubric-based feedback. Across both medical and science domains, we evaluate multiple strategies for aggregating rubric feedback into rewards. The best RaR variant achieves relative improvements of up to $31\%$ on HealthBench and $7\%$ on GPQA-Diamond over popular LLM-as-judge baselines that rely on direct Likert-based rewards. These results demonstrate that RaR-trained policies adapt well to diverse evaluation formats, performing strongly on both rubric-based and multiple-choice tasks. Moreover, we find that using rubrics as structured reward signals yields better alignment for smaller judges and reduces performance variance across judge scales.
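The abstract does not say which strategies for aggregating rubric feedback into rewards were evaluated. As a rough illustration only, the sketch below assumes each rubric item carries a numeric weight and that a judge returns a binary verdict per item; the reward is then the weighted fraction of satisfied items. The names `RubricItem` and `rubric_reward`, and the weighting scheme itself, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RubricItem:
    """One instance-specific criterion (hypothetical structure)."""
    description: str
    weight: float  # assumed importance weight, not specified in the abstract


def rubric_reward(items: Sequence[RubricItem], satisfied: Sequence[bool]) -> float:
    """Return the weighted fraction of rubric items judged satisfied, in [0, 1]."""
    if len(items) != len(satisfied):
        raise ValueError("need exactly one verdict per rubric item")
    total = sum(item.weight for item in items)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(item.weight for item, ok in zip(items, satisfied) if ok)
    return earned / total


# Illustrative rubric; the criteria and weights are made up for this example.
example_rubric = [
    RubricItem("States the correct diagnosis", 3.0),
    RubricItem("Mentions the key contraindication", 2.0),
    RubricItem("Uses clear, non-technical language", 1.0),
]
example_reward = rubric_reward(example_rubric, [True, False, True])
```

In this sketch a scalar reward is computed per response, so it could stand in wherever a single Likert score from a judge would otherwise be used.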
Computation and Complexity of Preference Inference Based on Hierarchical Models
Wilson, Nic (University College Cork) | George, Anne-Marie (University College Cork) | O'Sullivan, Barry (University College Cork)
Preference Inference involves inferring additional user preferences from elicited or observed preferences, based on assumptions regarding the form of the user's preference relation. In this paper we consider a situation in which alternatives have an associated vector of costs, each component corresponding to a different criterion, and are compared using a kind of lexicographic order, similar to the way alternatives are compared in a Hierarchical Constraint Logic Programming model. It is assumed that the user has some (unknown) importance ordering on criteria, and that to compare two alternatives, firstly, the combined cost of each alternative with respect to the most important criteria are compared; only if these combined costs are equal, are the next most important criteria considered. The preference inference problem then consists of determining whether a preference statement can be inferred from a set of input preferences. We show that this problem is co-NP-complete, even if one restricts the cardinality of the equal-importance sets to have at most two elements, and one only considers non-strict preferences. However, it is polynomial if it is assumed that the user's ordering of criteria is a total ordering; it is also polynomial if the sets of equally important criteria are all equivalence classes of a given fixed equivalence relation. We give an efficient polynomial algorithm for these cases, which also throws light on the structure of the inference.
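The hierarchical comparison described in the abstract can be sketched as follows. This is a minimal illustration, assuming that the "combined cost" of a set of equally important criteria is the plain sum of their costs (the abstract does not fix the combination operator) and that lower cost is better. The function name `compare` and the representation of the importance ordering as a list of index groups are hypothetical.

```python
from typing import Sequence


def compare(
    a: Sequence[float],
    b: Sequence[float],
    levels: Sequence[Sequence[int]],
) -> int:
    """Compare two cost vectors under a hierarchical lexicographic order.

    `levels` lists groups of criterion indices, most important group first.
    Criteria within a group are equally important, and their costs are
    combined by summation (an assumption of this sketch).

    Returns -1 if `a` is strictly preferred (lower cost), 1 if `b` is
    strictly preferred, and 0 if the two are tied on every level.
    """
    for level in levels:
        cost_a = sum(a[i] for i in level)
        cost_b = sum(b[i] for i in level)
        if cost_a != cost_b:
            # The first level with differing combined costs decides.
            return -1 if cost_a < cost_b else 1
    return 0


alpha = (1, 3, 5)
beta = (2, 2, 0)

# Criteria 0 and 1 are equally important and tie (4 vs 4),
# so criterion 2 decides in favour of beta.
grouped = compare(alpha, beta, [[0, 1], [2]])   # 1

# Under a total ordering 0 > 1 > 2, criterion 0 alone decides for alpha.
total = compare(alpha, beta, [[0], [1], [2]])   # -1
```

The same pair of alternatives is ranked differently under the two importance orderings, which is why the inference problem must reason over every ordering consistent with the input preferences rather than over a single fixed one.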