Fairness in Survival Analysis with Distributionally Robust Optimization

Hu, Shu, Chen, George H.

arXiv.org Machine Learning

We propose a general approach for encouraging fairness in survival analysis models based on minimizing a worst-case error across all subpopulations that occur with at least a user-specified probability. This approach can be used to convert many existing survival analysis models into ones that simultaneously encourage fairness, without requiring the user to specify which attributes or features to treat as sensitive in the training loss function. From a technical standpoint, our approach applies recent developments in distributionally robust optimization (DRO) to survival analysis. The complication is that existing DRO theory uses a training loss function that decomposes across contributions of individual data points, i.e., any term that shows up in the loss function depends only on a single training point. This decomposition does not hold for commonly used survival loss functions, including those of the Cox proportional hazards model, its deep neural network variants, and many other recently developed models that use loss functions involving ranking or similarity score calculations. We address this technical hurdle using a sample-splitting strategy. We demonstrate our sample-splitting DRO approach by using it to create fair versions of a diverse set of existing survival analysis models including the Cox model (and its deep variant DeepSurv), the discrete-time model DeepHit, and the neural ODE model SODEN. We also establish a finite-sample theoretical guarantee to show what our sample-splitting DRO loss converges to. For the Cox model, we further derive an exact DRO approach that does not use sample splitting. For all the models that we convert into DRO variants, we show that the DRO variants often score better on recently established fairness metrics (without incurring a significant drop in accuracy) compared to existing survival analysis fairness regularization techniques.
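
The worst-case-over-subpopulations objective described above can be written, in its standard dual form, as a conditional value-at-risk (CVaR) of the loss at level alpha. The sketch below is a hypothetical illustration of how sample splitting lets that formulation be applied to a non-decomposable survival loss such as the Cox partial likelihood; the names `survival_loss`, `alpha`, and `n_splits`, and the top-fraction approximation of the CVaR, are assumptions for illustration, not the authors' exact procedure.

```python
import math
import torch

def cvar(losses, alpha):
    # Average of the worst alpha-fraction of the given losses; this approximates
    # CVaR_alpha, i.e. the worst-case expected loss over subpopulations that
    # occur with probability at least alpha.
    k = max(1, math.ceil(alpha * losses.numel()))
    worst, _ = torch.topk(losses, k)
    return worst.mean()

def sample_splitting_dro_loss(model, x, times, events, survival_loss,
                              alpha=0.5, n_splits=8):
    # `survival_loss` is any non-decomposable survival loss (e.g. a Cox partial
    # likelihood) evaluated on a subset of the batch; all names here are
    # placeholders, not the paper's API.
    idx = torch.randperm(x.shape[0])
    subset_losses = []
    for split in idx.chunk(n_splits):  # disjoint subsets of the batch
        subset_losses.append(survival_loss(model, x[split], times[split], events[split]))
    return cvar(torch.stack(subset_losses), alpha)
```

Training would minimize this quantity in place of the average loss; the exact DRO variant for the Cox model (without sample splitting) and the finite-sample guarantee are developed in the paper itself.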


The Curse of Diversity in Ensemble-Based Exploration

Lin, Zhixuan, D'Oro, Pierluca, Nikishin, Evgenii, Courville, Aaron

arXiv.org Artificial Intelligence

We uncover a surprising phenomenon in deep reinforcement learning: training a diverse ensemble of data-sharing agents - a well-established exploration strategy - can significantly impair the performance of the individual ensemble members when compared to standard single-agent training. Through careful analysis, we attribute the degradation in performance to the low proportion of self-generated data in the shared training data for each ensemble member, as well as the individual ensemble members' difficulty in learning from such highly off-policy data. We thus name this phenomenon the curse of diversity. We find that several intuitive solutions - such as a larger replay buffer or a smaller ensemble size - either fail to consistently mitigate the performance loss or undermine the advantages of ensembling. Finally, we demonstrate the potential of representation learning to counteract the curse of diversity with a novel method named Cross-Ensemble Representation Learning (CERL) in both discrete and continuous control domains. Our work offers valuable insights into an unexpected pitfall in ensemble-based exploration and raises important caveats for future applications of similar approaches.

The potential benefits of a diverse ensemble are twofold. At training time, it enables concurrent exploration with multiple distinct policies without the need for additional samples. At test time, the learned policies can be aggregated into a robust ensemble policy via methods such as majority voting (Osband et al., 2016) or averaging (Januszewski et al., 2021). Despite the generally positive perception of ensemble-based exploration, we argue that this approach has a potentially negative aspect that has been long overlooked. As shown in Figure 1, for each member in a data-sharing ensemble, only a small proportion of its training data comes from its own interaction with the environment. The majority of its training data is generated by other members of the ensemble, whose policies might be distinct from its own policy. This type of off-policy learning has been shown to be highly challenging in previous work (Ostrovski et al., 2021). We thus hypothesize that similar learning difficulties can also occur in ensemble-based exploration. We verify our hypothesis in the Arcade Learning Environment (Bellemare et al., 2012) with the Bootstrapped DQN (Osband et al., 2016) algorithm and the Gym MuJoCo benchmark (Towers et al., 2023) with an ensemble SAC (Haarnoja et al., 2018a) algorithm. We show that, in many environments, the individual members of a data-sharing ensemble significantly underperform their single-agent counterparts. Moreover, while aggregating the policies of all ensemble members via voting or averaging sometimes compensates for the degradation in individual members' performance, this is not always the case.
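
The core quantity behind the reported degradation is the proportion of self-generated data each member sees. The toy simulation below is not the paper's experimental code; the ensemble size, episode length, and buffer parameters are arbitrary placeholders. It shows that with K members writing to and sampling from one shared replay buffer, only about 1/K of any member's training batch is produced by its own policy.

```python
import random
from collections import deque

K = 10            # ensemble size (placeholder)
EP_LEN = 200      # transitions per episode (placeholder)
buffer = deque(maxlen=100_000)   # single shared replay buffer

# Data collection: each episode, one randomly chosen member acts, and its
# transitions (tagged with that member's id) go into the shared buffer.
for episode in range(500):
    acting_member = random.randrange(K)
    buffer.extend((acting_member, None) for _ in range(EP_LEN))  # (generator_id, transition)

# Training: every member samples batches from the same shared buffer, so in
# expectation only ~1/K of its training data is self-generated; the rest is
# highly off-policy with respect to its own behaviour.
batch = random.sample(list(buffer), 256)
for member in range(K):
    frac = sum(gen == member for gen, _ in batch) / len(batch)
    print(f"member {member}: {frac:.1%} self-generated (expected ~{1 / K:.0%})")
```

The paper's CERL method targets the resulting learning difficulty through shared representation learning rather than by changing this data composition.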


What's "up" with vision-language models? Investigating their struggle with spatial reasoning

Kamath, Amita, Hessel, Jack, Chang, Kai-Wei

arXiv.org Artificial Intelligence

Recent vision-language (VL) models are powerful, but can they reliably distinguish "right" from "left"? We curate three new corpora to quantify model comprehension of such basic spatial relations. These tests isolate spatial reasoning more precisely than existing datasets like VQAv2, e.g., our What'sUp benchmark contains sets of photographs varying only the spatial relations of objects, keeping their identity fixed (see Figure 1: models must comprehend not only the usual case of a dog under a table, but also, the same dog on top of the same table). We evaluate 18 VL models, finding that all perform poorly, e.g., BLIP finetuned on VQAv2, which nears human parity on VQAv2, achieves 56% accuracy on our benchmarks vs. humans at 99%. We conclude by studying causes of this surprising behavior, finding: 1) that popular vision-language pretraining corpora like LAION-2B contain little reliable data for learning spatial relationships; and 2) that basic modeling interventions like up-weighting preposition-containing instances or fine-tuning on our corpora are not sufficient to address the challenges our benchmarks pose. We are hopeful that these corpora will facilitate further research, and we release our data and code at https://github.com/amitakamath/whatsup_vlms.
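
As a hypothetical illustration of the kind of test the benchmark poses (not the released evaluation code at the repository above), a CLIP-style model can be asked to score one photograph against captions that differ only in the spatial preposition; the checkpoint and image path below are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_under_table.jpg")  # placeholder image of a dog under a table
captions = ["a dog under a table", "a dog on a table",
            "a dog to the left of a table", "a dog to the right of a table"]

# Score the image against all caption variants and check which one the model prefers.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_captions)

best = logits.argmax(dim=-1).item()
print("model prefers:", captions[best])  # correct answer here would be index 0
```

Under the paper's findings, models of this kind frequently fail to rank the correct preposition first even on such controlled contrasts.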