Bias in Evaluation Processes: An Optimization-Based Model
Biases with respect to socially-salient attributes of individuals have been well documented in evaluation processes used in settings such as admissions and hiring. We view such an evaluation process as a transformation of a distribution of the true utility of an individual for a task to an observed distribution and model it as a solution to a loss minimization problem subject to an information constraint. Our model has two parameters that have been identified as factors leading to biases: the resource-information trade-off parameter in the information constraint and the risk-averseness parameter in the loss function. We characterize the distributions that arise from our model and study the effect of the parameters on the observed distribution. The outputs of our model enrich the class of distributions that can be used to capture variation across groups in the observed evaluations. We empirically validate our model by fitting real-world datasets and use it to study the effect of interventions in a downstream selection task. These results contribute to an understanding of the emergence of bias in evaluation processes and provide tools to guide the deployment of interventions to mitigate biases.
Supplementary for Emergence of Shape Bias in Convolutional Neural Networks through Activation Sparsity. 1 Further Results of the Impact of Sparsity on the Shape Bias Benchmark
We utilize the sparsity operation proposed in Section 3.1 for ResNet-50. For ViT, we also apply the spatial Top-K operation as described in the general response. We observe an increase in shape bias for both the ResNet-50 and ViT-B architectures, further closing the gap between humans and existing models. We generalize Section 4.2 in the main text to the ResNet-50 and ViT-B architectures (Figure 1). The sparsity definition for ResNet-50 is the same as for AlexNet and VGG. For ViT-B, we reshape the intermediate activation response from [n, h * w, d] to [n, d, h * w] and apply the Top-K selection over dimension 2 before the activation is passed through the multi-head attention. (Here h and w are the height and width of the latent tensor after reshaping it to 2D; for ViT-B with patch size 16 on 224x224 images, h = w = 14, and n denotes the batch size.)
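The ViT-B step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `spatial_top_k` is hypothetical, the "reshape" from [n, h * w, d] to [n, d, h * w] is interpreted here as an axis permutation, and the selection is assumed to keep the k largest raw activation values (not magnitudes) per channel while zeroing the rest.

```python
import numpy as np


def spatial_top_k(tokens, k):
    """Keep the k largest activations per channel over spatial positions.

    tokens: array of shape [n, h*w, d] (batch, spatial positions, channels).
    Returns an array of the same shape with all other entries set to zero.
    Assumption: selection uses raw values; ties at the threshold are all kept.
    """
    # [n, h*w, d] -> [n, d, h*w], so that axis 2 indexes spatial positions.
    x = np.transpose(tokens, (0, 2, 1))
    # k-th largest value along axis 2, kept as a broadcastable threshold.
    thresh = np.partition(x, -k, axis=2)[:, :, -k][..., None]
    sparse = np.where(x >= thresh, x, 0.0)
    # Restore the original [n, h*w, d] layout for the attention block.
    return np.transpose(sparse, (0, 2, 1))


# Example with ViT-B/16 on 224x224 inputs: h = w = 14, so h*w = 196 positions.
rng = np.random.default_rng(0)
acts = rng.normal(size=(2, 14 * 14, 8))   # n=2, d=8 chosen only for illustration
out = spatial_top_k(acts, k=20)
```

Each of the d channels then carries exactly k non-zero spatial positions per image (barring ties), which is the sparsity pattern the text describes.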
ID and OOD Performance Are Sometimes Inversely Correlated on Real-world Datasets
Several studies have compared the in-distribution (ID) and out-of-distribution (OOD) performance of models in computer vision and NLP. They report a frequent positive correlation, but surprisingly, almost never an inverse correlation that would be indicative of a necessary trade-off. Such inverse patterns are possible theoretically, and their occurrence in practice is important to determine whether ID performance can serve as a proxy for OOD generalization.
e2cfb719f58585f779d0a4f9f07bd618-Supplemental-Datasets_and_Benchmarks.pdf
A.1 Creation of the Multimodal Web Document Dataset. A.1.1 Collecting a Large Number of HTML Files. Our data collection process begins by considering the 25 most recent Common Crawl dumps available at the time of dataset creation. These dumps contain webpages spanning from February 2020 to January/February 2023. We use a modified version of readability-lxml to extract the main text from the pages, discarding any pages that contain text of excessively high perplexity. This process yields a total of 41.2 billion documents. Selection of English content: To identify non-English content, we apply the FastText classifier (Joulin et al., 2017) to the extracted text, effectively filtering out 63.6% of the documents. Early text deduplication: Often, a set of URLs is crawled repeatedly across different Common Crawl snapshots. However, the content of these websites may vary as web administrators make changes over time. Hence, at this stage, we refrain from deduplicating documents based on their URLs. Instead, we perform MinHash (Broder, 1997) deduplication with 16 hashes calculated over 5-grams. To further refine the data, we eliminate documents containing substantial proportions of repeated paragraphs and n-grams, employing the methodology described in MassiveText (Rae et al., 2022).
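The MinHash step can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline used to build the dataset: the text does not say whether the 5-grams are over words or characters (word 5-grams are assumed here), nor how signatures are compared, and the helper names `word_ngrams`, `minhash_signature`, and `estimated_jaccard` are hypothetical. Only the standard-library `hashlib` module is used.

```python
import hashlib

NUM_HASHES = 16   # number of hash functions, as stated in the text
NGRAM = 5         # n-gram size, as stated in the text (word-level is an assumption)


def word_ngrams(text, n=NGRAM):
    """Return the set of word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_hashes=NUM_HASHES):
    """One minimum per salted hash function over the document's n-grams."""
    shingles = word_ngrams(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingles
        ))
    return tuple(signature)


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "a completely different page about probabilistic integer arithmetic and tensors"
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
```

Documents whose signatures agree in most slots share most of their 5-grams, so they can be treated as near-duplicates even when their URLs differ; the threshold for "most" is a design choice the text does not specify.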
A Fast Convoluted Story: Scaling Probabilistic Inference for Integer Arithmetic
As illustrated by the success of integer linear programming, linear integer arithmetic is a powerful tool for modelling combinatorial problems. Furthermore, the probabilistic extension of linear programming has been used to formulate problems in neurosymbolic AI. However, two key problems persist that prevent the adoption of neurosymbolic techniques beyond toy problems. First, probabilistic inference is inherently hard, #P-hard to be precise. Second, the discrete nature of integers renders the construction of meaningful gradients challenging, which is problematic for learning. In order to mitigate these issues, we formulate linear arithmetic over integer-valued random variables as tensor manipulations that can be implemented in a straightforward fashion using modern deep learning libraries. At the core of our formulation lies the observation that the addition of two integer-valued random variables can be performed by adapting the fast Fourier transform to probabilities in the log-domain. By relying on tensor operations we obtain a differentiable data structure, which unlocks, virtually for free, gradient-based learning. In our experimental validation we show that tensorising probabilistic linear integer arithmetic and leveraging the fast Fourier transform allows us to push the state of the art by several orders of magnitude in terms of inference and learning times.
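The core observation, that the distribution of a sum of two independent integer-valued random variables is a convolution computable with the fast Fourier transform, can be sketched in NumPy. The abstract does not spell out how the transform is adapted to the log-domain, so the max-shift trick below is an assumed stand-in for that step, and the function name `log_pmf_of_sum` is hypothetical. Both supports are assumed to start at 0.

```python
import numpy as np


def log_pmf_of_sum(log_p, log_q):
    """Log-PMF of X + Y for independent X, Y with supports {0, 1, ...}.

    log_p[i] = log P(X = i), log_q[j] = log P(Y = j).
    Assumption: inputs are shifted by their maxima before exponentiating,
    which guards against underflow; this is not necessarily the paper's method.
    """
    log_p = np.asarray(log_p, dtype=float)
    log_q = np.asarray(log_q, dtype=float)
    max_p, max_q = log_p.max(), log_q.max()
    p = np.exp(log_p - max_p)
    q = np.exp(log_q - max_q)
    # A linear convolution needs len(p) + len(q) - 1 output entries.
    size = len(p) + len(q) - 1
    conv = np.fft.irfft(np.fft.rfft(p, size) * np.fft.rfft(q, size), size)
    conv = np.clip(conv, 0.0, None)  # round-off can leave tiny negative values
    with np.errstate(divide="ignore"):
        return np.log(conv) + max_p + max_q


# Two fair six-sided dice, with faces shifted to 0..5 so the support starts at 0.
die = np.full(6, 1.0 / 6.0)
log_sum = log_pmf_of_sum(np.log(die), np.log(die))
pmf_sum = np.exp(log_sum)
```

Because every operation is an array operation, the same computation written in an autodiff framework would be differentiable with respect to the input log-probabilities, which is the property the abstract relies on for learning.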