 gender


Multicalibration Boosting: Theory, Convergence, and Transferability

arXiv.org Machine Learning

Multicalibration extends classical calibration by requiring predictions to be unbiased over a rich collection of functions, encompassing both prediction slices and subpopulations. It has emerged as a powerful framework for fairness, robustness, and reliable prediction, yet the theoretical understanding of multicalibration boosting (MCBoost) remains fragmented and often relies on restrictive assumptions. In this work, we develop a unified and refined perspective on MCBoost that subsumes existing variants, including multiaccuracy, BatchGCP, and BatchMVP. We uncover several phenomena that provide new insights into its practical behavior: even highly accurate and flexible predictors can remain substantially miscalibrated; enforcing multicalibration introduces a calibration-risk trade-off; and early stopping plays a central role in controlling this trade-off. On the theoretical side, we establish a general framework for MCBoost under weaker and more realistic conditions. We show that the boosting iterates converge to a Bregman projection of the population-optimal predictor onto the cumulative span generated by the audit class, thereby explicitly characterizing the function space on which multicalibration is achieved. We further derive convergence rates under different smoothness assumptions, finite-sample guarantees, and principled stopping rules that ensure multicalibration at termination. Finally, we extend the theory of universal adaptability under covariate shift, providing more general transfer guarantees and clarifying when multicalibrated predictors generalize across domains. These results provide a more complete theoretical foundation and practical guidance for multicalibration boosting, positioning it as both a unifying framework and a reliable post-processing approach for modern predictive models.
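As a rough illustration of the boosting loop and the role of a stopping tolerance described above, the sketch below implements a simplified multiaccuracy-style update. It is not the paper's MCBoost algorithm: the audit class is reduced to a set of group indicators that do not depend on the current prediction, and the function name `multiaccuracy_boost`, the tolerance `alpha`, and the step size `eta` are assumptions made for this example.

```python
import numpy as np

def multiaccuracy_boost(p, y, groups, alpha=1e-3, eta=1.0, max_iter=100):
    """Post-process predictions so residuals average to ~0 on every audit group.

    p      : initial predictions in [0, 1], shape (n,)
    y      : labels, shape (n,)
    groups : boolean array of shape (n_groups, n); a hypothetical audit class
             of non-empty group indicators
    Returns the updated predictions and the number of boosting steps taken.
    """
    p = np.asarray(p, dtype=float).copy()
    for t in range(max_iter):
        # Violation of group g: E[1_g(x) * (y - p(x))] over the sample.
        viol = (groups * (y - p)).mean(axis=1)
        j = int(np.argmax(np.abs(viol)))
        if abs(viol[j]) <= alpha:
            # Stop as soon as every audit function is satisfied up to alpha.
            return p, t
        # Shift predictions on the worst group by (a fraction of) its mean residual.
        mask = groups[j]
        p[mask] += eta * (y[mask] - p[mask]).mean()
        p = np.clip(p, 0.0, 1.0)
    return p, max_iter
```

A larger `alpha` stops the loop earlier and leaves more residual bias on the audited groups, while a smaller `alpha` runs more updates; this is one simple way to picture the trade-off that early stopping controls.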



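A Missing Proofs. Theorem 1. The excessive loss of a group $a \in \mathcal{A}$ is upper bounded by³: $R(a) \leq \|g_a^{\ell}\| \, \|\bar{\theta} - \theta^{\star}\| + \frac{1}{2} \lambda(H_a^{\ell}) \, \|\bar{\theta} - \theta^{\star}\|^{2}$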

Neural Information Processing Systems

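$H_a^{\ell} = \nabla_{\theta}^{2} J(\theta^{\star}; D_a)$ is the Hessian matrix of the loss function $\ell$, at the optimal parameters vector $\theta^{\star}$, computed using the group data $D_a$ (henceforth simply referred to as the group Hessian), and $\lambda(\Sigma)$ is the maximum eigenvalue of a matrix $\Sigma$. Proof. Using a second-order Taylor expansion around $\theta^{\star}$, the excessive loss $R(a)$ for a group $a \in \mathcal{A}$ can be stated as: $R(a) = J(\bar{\theta}; D_a) - J(\theta^{\star}; D_a) = \big[ J(\theta^{\star}; D_a) + (\bar{\theta} - \theta^{\star})^{\top} g_a^{\ell} + \frac{1}{2} (\bar{\theta} - \theta^{\star})^{\top} H_a^{\ell} (\bar{\theta} - \theta^{\star}) + O(\|\bar{\theta} - \theta^{\star}\|^{3}) \big] - J(\theta^{\star}; D_a)$. The above follows from the loss $\ell(\cdot)$ being at least twice differentiable, by assumption. Consider two groups $a$ and $b$ in $\mathcal{A}$ with $|D_a| \geq |D_b|$. Proposition 2. For a given group $a \in \mathcal{A}$, gradient norms can be upper bounded as: $\|g_a^{\ell}\| \in O\big(\sum \dots\big)$. The above proposition is presented in the context of cross-entropy loss or mean squared error loss functions. These two cases are reviewed as follows. ³ With a slight abuse of notation, the results refer to $\bar{\theta}$ as the homonymous vector which is extended with $k - \bar{k}$ zeros.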
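The step from the Taylor expansion to the bound of Theorem 1 can be written out as follows. This is a sketch under assumed notation: $\bar{\theta}$ denotes the evaluated (for example pruned) parameters, $\theta^{\star}$ the optimal parameters, $g_a^{\ell}$ the group gradient at $\theta^{\star}$, and $H_a^{\ell}$ the symmetric group Hessian at $\theta^{\star}$. The first term uses the Cauchy-Schwarz inequality and the second uses the Rayleigh quotient bound $v^{\top} H v \leq \lambda(H) \|v\|^{2}$; the cubic remainder is kept explicit here.

```latex
\begin{align*}
R(a) &= J(\bar{\theta}; D_a) - J(\theta^{\star}; D_a) \\
     &= (\bar{\theta} - \theta^{\star})^{\top} g_a^{\ell}
        + \tfrac{1}{2} (\bar{\theta} - \theta^{\star})^{\top} H_a^{\ell} (\bar{\theta} - \theta^{\star})
        + O\!\left(\|\bar{\theta} - \theta^{\star}\|^{3}\right) \\
     &\leq \|g_a^{\ell}\| \, \|\bar{\theta} - \theta^{\star}\|
        + \tfrac{1}{2} \lambda(H_a^{\ell}) \, \|\bar{\theta} - \theta^{\star}\|^{2}
        + O\!\left(\|\bar{\theta} - \theta^{\star}\|^{3}\right)
\end{align*}
```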


220165f9c7f51163b73c8c7fff578b4e-Supplemental-Conference.pdf

Neural Information Processing Systems

This supplementary provides additional experiments as well as details that are required to reproduce our results. These were not included in the main paper due to space limitations. The supplementary is arranged as follows:
Section A: Details on Modelling
 - Section A.1 Details of Theoretical Modelling
 - Section A.2 Additional Details on CLEAM Algorithm
 - Section A.3 Details on Fairness Metric
 - Section A.4 Details of Significance of the Baseline Errors
Section B: Deeper Analysis on Error in Fairness Measurement
Section C: Validating Statistical Model for Classifier Output
 - Section C.1 Validation of Sample-Based Estimate vs Model-Based Estimate
 - Section C.2 Goodness-of-Fit Test: p̂ from the Real GANs with Our Theoretical Model
Section D: Additional Experimental Results
 - Section D.1 Experimental Results with Standard Deviation
 - Section D.2 Experimental Setup for Diversity
 - Section D.3 Measuring Varying Degrees of Bias (Gender and BlackHair)
 - Section D.4 Measuring Varying Degrees of ...



On Measuring Fairness in Generative Models

Neural Information Processing Systems

Recently, there has been increased interest in fair generative models. In this work, we conduct, for the first time, an in-depth study on fairness measurement, a critical component in gauging progress on fair generative models.


Appendix - An Image is Worth More Than a Thousand Words: Towards Disentanglement in The Wild

Neural Information Processing Systems

We use the images at 256 × 256 resolution. We follow [21] and use all the images for training. The images used for the qualitative visualizations contain random images from the web and samples from CelebA-HQ. AFHQ [8]: 15,000 high-quality images categorized into three domains: cat, dog, and wildlife. We use the images at 128 × 128 resolution, holding out 500 images from each domain for testing.
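The per-domain hold-out described for AFHQ can be sketched as below. This is only an illustration: the snippet does not say how the 500 test images per domain were chosen, so the seeded random selection, the function name `holdout_per_domain`, and the `(path, domain)` input format are all assumptions.

```python
import random
from collections import defaultdict

def holdout_per_domain(items, n_test=500, seed=0):
    """Split (path, domain) pairs into train/test, holding out n_test per domain.

    Random selection with a fixed seed is an assumption of this sketch,
    not something stated in the source.
    """
    by_domain = defaultdict(list)
    for path, domain in items:
        by_domain[domain].append(path)

    rng = random.Random(seed)
    train, test = [], []
    for domain in sorted(by_domain):
        paths = sorted(by_domain[domain])  # sort first so the split is reproducible
        rng.shuffle(paths)
        test += [(p, domain) for p in paths[:n_test]]
        train += [(p, domain) for p in paths[n_test:]]
    return train, test
```

With 15,000 images spread over three domains, this leaves 1,500 images for testing and 13,500 for training.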


Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models

Neural Information Processing Systems

The capabilities of natural language models trained on large-scale data have increased immensely over the past few years. Open source libraries such as HuggingFace have made these models easily available and accessible. While prior research has identified biases in large language models, this paper considers biases contained in the most popular versions of these models when applied 'out-of-the-box' for downstream tasks. We focus on generative language models as they are well-suited for extracting biases inherited from training data. Specifically, we conduct an in-depth analysis of GPT-2, which is the most downloaded text generation model on HuggingFace, with over half a million downloads per month. We assess biases related to occupational associations for different protected categories by intersecting gender with religion, sexuality, ethnicity, political affiliation, and continental name origin. Using a template-based data collection pipeline, we collect 396K sentence completions made by GPT-2 and find: (i) The machine-predicted jobs are less diverse and more stereotypical for women than for men, especially for intersections; (ii) Intersectional interactions are highly relevant for occupational associations, which we quantify by fitting 262 logistic models; (iii) For most occupations, GPT-2 reflects the skewed gender and ethnicity distribution found in US Labor Bureau data, and even pulls the societally-skewed distribution towards gender parity in cases where its predictions deviate from real labor market observations. This raises the normative question of what language models should learn - whether they should reflect or correct for existing inequalities.
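A template-based collection step of the kind described above might start by enumerating prompts for every intersection of gender with another protected category. The sketch below is hypothetical: the template string, the category values, and the function name `build_prompts` are placeholders chosen for illustration and are not taken from the paper, and the call to a language model is left out entirely.

```python
from itertools import product

# Hypothetical template and category values, chosen only for illustration.
TEMPLATE = "The {attribute} {gender} works as a"
GENDERS = ["man", "woman"]
ATTRIBUTES = {
    "religion": ["Buddhist", "Christian"],
    "political affiliation": ["liberal", "conservative"],
}

def build_prompts(template, genders, attributes):
    """Return one prompt record per (category, attribute value, gender) intersection."""
    prompts = []
    for category, values in attributes.items():
        for value, gender in product(values, genders):
            prompts.append({
                "category": category,
                "attribute": value,
                "gender": gender,
                "prompt": template.format(attribute=value, gender=gender),
            })
    return prompts

prompts = build_prompts(TEMPLATE, GENDERS, ATTRIBUTES)
```

Each record keeps the category labels next to the prompt text, so completions collected later can be grouped by intersection when comparing occupational distributions.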


Extraction

Neural Information Processing Systems

Figure 5 shows a schema explaining the extraction of the entities. Each step is depicted in a triplet format: (subject, predicate, object). Blue (italics) information is the information extracted at each step. For each step outlined with a dotted rectangle, the information extracted is the subject; otherwise, the information extracted is the object. Figure 6 shows an example of multilingual alignment for the languages considered in the high-resource use case: English, Arabic, Spanish and Russian.


OCCGEN: Selection of Real-world Multilingual Parallel Data Balanced in Gender within Occupations

Neural Information Processing Systems

This paper describes the OCCGEN toolkit, which allows extracting multilingual parallel data balanced in gender within occupations. OCCGEN can extract datasets that reflect gender diversity (beyond binary) more fairly in society to be further used to explicitly mitigate occupational gender stereotypes. We propose two use cases that extract evaluation datasets for machine translation in four high-resource languages from different linguistic families and in a low-resource African language. Our analysis of these use cases shows that translation outputs in high-resource languages tend to worsen in feminine subsets (compared to masculine), especially in the directions containing English. This is confirmed by the human evaluation. We hypothesize that a sound language generation may contribute to paying less attention to the source sentence and to overgeneralizing to the most frequent gender forms.