Principled Foundations for Preference Optimization

Zhou, Wenxuan, Zhang, Shujian, Magdalou, Brice, Lambert, John, Amid, Ehsan, Nock, Richard, Hard, Andrew

arXiv.org Artificial Intelligence

This paper connects DPO to two bodies of theory in order to generalize its functional parts (Alfano et al., 2025; Azar et al., 2024; Chen et al.): Savage's theory of proper losses on one side, and elements of Doignon-Falmagne's theory of stochastic choice on the other. These design elements lead to a generalization that makes the most of the connection: we encompass all of properness on Savage's side (regardless of optional properties like symmetry), and all of the modelling power on the Krantz, Luce, Suppes and Tversky side. Notably, our level of generalization is able to support important extensions "for free". This is an important task because DPO was designed with the objective of simplifying RLHF, and getting "above" DPO is mandatory to improve results by gaining more freedom on reward shapes, trajectories and preference behaviours (Gupta et al., 2025). One perhaps unexpected pitfall comes from the "gold standard" inherited from RLHF/DPO. To preserve readability, all proofs are given in an appendix. We adopt many definitions from Rafailov et al. (2023).


Improving calibration by relating focal loss, temperature scaling, and properness

AIHub

In machine learning classification tasks, achieving high accuracy is only part of the goal; it's equally important for models to express how confident they are in their predictions – a concept known as model calibration. Well-calibrated models provide probability estimates that closely reflect the true likelihood of outcomes, which is critical in domains like healthcare, finance, and autonomous systems, where decision-making relies on trustworthy predictions. A key factor influencing both the accuracy and calibration of a model is the choice of the loss function during training. The loss function guides the model on how to learn from data by penalizing prediction errors in a certain way. In this blog post, we will explore how to choose a loss function that achieves good calibration, focusing on the recently proposed focal loss and trying to understand why it tends to produce well-calibrated models.
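As a minimal illustration of the loss the post discusses (a sketch in NumPy, not code from the post itself): the binary focal loss scales the usual log loss by a factor (1 - p_t)^gamma, where p_t is the probability assigned to the true class, so confident correct predictions are down-weighted.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss for predicted probability p of the positive class.

    With gamma = 0 this reduces to the standard log loss (cross-entropy);
    larger gamma down-weights well-classified, high-confidence examples.
    """
    p_t = np.where(y == 1, p, 1.0 - p)        # probability of the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# A confident correct prediction (p = 0.9, y = 1) is penalized far less
# than under plain log loss, since (1 - 0.9)^2 = 0.01.
p, y = 0.9, 1
log_loss = -np.log(p)
fl = focal_loss(np.array([p]), np.array([y]), gamma=2.0)[0]
```

With gamma = 0 the two losses coincide, which is one way to see focal loss as a modulated log loss rather than an entirely different objective.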


Composite Multiclass Losses

Neural Information Processing Systems

We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a "proper composite loss", which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on "classification calibration" by relating it to properness and show that the simple integral representation for binary proper losses cannot be extended to multiclass losses.
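A concrete binary instance of the decomposition the abstract describes (a sketch, not the paper's code): the logistic loss on real-valued scores is a proper composite loss, namely the proper log loss composed with the inverse of the logit link.

```python
import numpy as np

def log_loss(y, p):
    """Proper loss: penalizes predicted probability p of class 1 given label y in {0, 1}."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def sigmoid(v):
    """Inverse of the logit link psi(p) = log(p / (1 - p))."""
    return 1.0 / (1.0 + np.exp(-v))

def logistic_loss(y, v):
    """Composite loss on the real-valued score v: proper log loss composed
    with the inverse link, ell(y, v) = log_loss(y, sigmoid(v))."""
    return log_loss(y, sigmoid(v))

# The familiar margin form log(1 + exp(-v)) for y = 1 agrees with the composition.
v = 1.3
margin_form = np.log1p(np.exp(-v))
```

The same pattern, a proper loss on the simplex plus an invertible link from scores to probabilities, is what the paper characterizes in the multiclass case.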


Attention to Entropic Communication

Enßlin, Torsten, Weidinger, Carolin, Frank, Philipp

arXiv.org Machine Learning

The concept of attention, numerical weights that emphasize the importance of particular data, has proven to be very relevant in artificial intelligence. Relative entropy (RE, aka Kullback-Leibler divergence) plays a central role in communication theory. Here we combine these concepts, attention and RE. RE guides optimal encoding of messages in bandwidth-limited communication as well as optimal message decoding via the maximum entropy principle (MEP). In the coding scenario, RE can be derived from four requirements, namely being analytical, local, proper, and calibrated. Weighted RE, used for attention steering in communications, turns out to be improper. To see how proper attention communication can emerge, we analyze a scenario of a message sender who wants to ensure that the receiver of the message can perform well-informed actions. If the receiver decodes the message using the MEP, the sender only needs to know the receiver's utility function to inform optimally, but not the receiver's initial knowledge state. If only the curvatures of the utility function's maxima are known, it becomes desirable to accurately communicate an attention function, in this case a probability function weighted by this curvature and re-normalized. Entropic attention communication is proposed here as the desired generalization of entropic communication that permits weighting while being proper, thereby aiding the design of optimal communication protocols in technical applications and helping to understand human communication. For example, our analysis shows how to derive the level of cooperation expected under misaligned interests of otherwise honest communication partners.
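The improperness of weighted RE that the abstract mentions can be checked numerically (a minimal sketch using the discrete form of weighted relative entropy; the distributions and weights below are illustrative, not from the paper): among normalized distributions q, plain RE is minimized at q = p, but weighted RE is minimized at a tilted distribution q_i proportional to w_i * p_i.

```python
import numpy as np

def weighted_re(p, q, w):
    """Weighted relative entropy sum_i w_i p_i log(p_i / q_i).
    With w = 1 everywhere this is the ordinary RE / KL divergence."""
    return np.sum(w * p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])        # "true" distribution held by the sender
w = np.array([2.0, 1.0, 1.0])        # attention weights emphasizing the first outcome

# Plain RE is proper: it vanishes (and is minimal) at q = p.
# Weighted RE is improper: its minimizer over normalized q is the tilted
# distribution q_i proportional to w_i * p_i, which differs from p.
q_star = w * p / np.sum(w * p)
```

Since the honest report q = p no longer minimizes the weighted score, a receiver optimizing it is steered away from the sender's actual beliefs, which is exactly the failure of properness.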


The Geometry of Mixability

Pacheco, Armando J. Cabrera, Williamson, Robert C.

arXiv.org Artificial Intelligence

Mixable loss functions are of fundamental importance in the context of prediction with expert advice in the online setting since they characterize fast learning rates. By re-interpreting properness from the point of view of differential geometry, we provide a simple geometric characterization of mixability for the binary and multi-class cases: a proper loss function $\ell$ is $\eta$-mixable if and only if the superprediction set $\textrm{spr}(\eta \ell)$ of the scaled loss function $\eta \ell$ slides freely inside the superprediction set $\textrm{spr}(\ell_{\log})$ of the log loss $\ell_{\log}$, under fairly general assumptions on the differentiability of $\ell$. Our approach provides a way to treat some concepts concerning loss functions (like properness) in a ''coordinate-free'' manner and reconciles previous results obtained for mixable loss functions for the binary and the multi-class cases.
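To make the role of log loss concrete (a numeric sketch, not code from the paper): a loss is $\eta$-mixable if, for any weighting of expert predictions, one single prediction can match the "mix loss" $-\frac{1}{\eta}\log\sum_i w_i e^{-\eta \ell_i}$. For log loss with $\eta = 1$, the weighted-average prediction achieves this bound exactly; the experts and weights below are illustrative.

```python
import numpy as np

# Two experts predict the probability of a binary outcome y = 1.
expert_probs = np.array([0.8, 0.3])
weights = np.array([0.6, 0.4])         # prior weights over the experts
y = 1
eta = 1.0                              # log loss is 1-mixable

expert_losses = -np.log(np.where(y == 1, expert_probs, 1 - expert_probs))

# Mix loss: the benchmark an eta-mixable loss can always match with one prediction.
mix_loss = -np.log(np.sum(weights * np.exp(-eta * expert_losses))) / eta

# For log loss, merging the experts by weighted averaging attains it exactly,
# because exp(-log_loss) is just the probability assigned to the outcome.
merged = np.sum(weights * expert_probs)
merged_loss = -np.log(merged if y == 1 else 1 - merged)
```

The exact match here is special to log loss at $\eta = 1$; for other proper losses, mixability at some $\eta$ is precisely what the paper's superprediction-set characterization captures.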


Scoring rules in survival analysis

Sonabend, Raphael

arXiv.org Artificial Intelligence

Scoring rules evaluate probabilistic predictions and (attempt to) measure the overall predictive ability of a model as a combination of calibration and discrimination [Gneiting and Raftery, 2007, Murphy, 1973]. Scoring rules have been gaining popularity over the past couple of decades, since probabilistic forecasts were recognised to be superior to deterministic predictions for capturing uncertainty [Dawid, 1984, 1986]. The formalisation and development of scoring rules has primarily been due to Dawid [Dawid, 1984, 1986, Dawid and Musio, 2014] and Gneiting and Raftery [Gneiting and Raftery, 2007], though the earliest measures promoting "rational" and "honest" decision making date back to the 1950s [Brier, 1950, Good, 1952]. In classification and (probabilistic) regression [Gressmann et al., 2018] settings there are established definitions for scoring rules, and specific losses have been defined, the most popular of which are the Brier score and log loss. However, the literature has been lacking for survival analysis, with no definition of a scoring rule proposed until very recently; despite this, losses are frequently utilised in the survival literature without justification or proofs of their properties. In this paper we present a formal definition of a survival scoring rule, as well as a second definition of an 'approximate' survival scoring rule that can be utilised under specific conditions. We provide a brief review of losses in the literature and collate claims, proofs, and disproofs of properness for these losses.
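Properness, the property the paper checks for survival losses, is easy to illustrate for the Brier score in the simple binary setting (a sketch, not from the paper): when the event truly occurs with probability pi, the expected Brier score is minimized by reporting pi itself.

```python
import numpy as np

def brier(p, y):
    """Binary Brier score for predicted probability p of the event y = 1."""
    return (p - y) ** 2

def expected_brier(p, pi):
    """Expected Brier score when the event truly occurs with probability pi."""
    return pi * brier(p, 1) + (1 - pi) * brier(p, 0)

# Properness check: over a fine grid of candidate reports, the expected score
# is minimized by the honest report p = pi.
pi = 0.7
grid = np.linspace(0.0, 1.0, 1001)
best = grid[np.argmin(expected_brier(grid, pi))]
```

The survival setting is harder precisely because the analogue of this expectation involves censoring, which is why the paper's definitions and (dis)proofs of properness are needed.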