Audits as Evidence: Experiments, Ensembles, and Enforcement

arXiv.org Machine Learning

We develop tools for using correspondence experiments to detect illegal discrimination by individual employers. Employers violate US employment law if their propensity to contact applicants depends on protected characteristics such as race or sex. We establish identification of higher moments of the causal effects of protected characteristics on callback rates as a function of the number of fictitious applications sent to each job ad. These moments are used to bound the fraction of jobs that illegally discriminate. Applying our results to three experimental datasets, we find evidence of significant employer heterogeneity in discriminatory behavior, with the standard deviation of gaps in job-specific callback probabilities across protected groups averaging roughly twice the mean gap. In a recent experiment manipulating racially distinctive names, we estimate that at least 85% of jobs that contact both of two white applications and neither of two black applications are engaged in illegal discrimination. To assess the tradeoff between type I and type II errors presented by these patterns, we consider the performance of a series of decision rules for investigating suspicious callback behavior under a simple two-type model that rationalizes the experimental data. Although only 17% of employers are estimated to discriminate on the basis of race in our preferred specification, we find that an experiment sending 10 applications to each job would enable accurate detection of 7-10% of discriminators while falsely accusing fewer than 0.2% of non-discriminators. A minimax decision rule acknowledging partial identification of the joint distribution of callback rates yields higher error rates but more investigations than our baseline two-type model. Our results suggest illegal labor market discrimination can be reliably monitored with relatively small modifications to existing audit designs.
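The two-type detection logic described in this abstract can be illustrated with a small simulation. All parameters below (the callback probabilities, the 5+5 application split, and the flagging rule) are illustrative assumptions, not the paper's estimates; only the 17% discriminator share is taken from the abstract.

```python
import random

random.seed(1)

# Illustrative two-type model (assumed values, not the paper's estimates)
N_JOBS = 100_000
PI_DISC = 0.17         # fraction of discriminating employers (from abstract)
P_CALL = 0.10          # callback prob. for applicants who are not discriminated against
P_CALL_DISC = 0.02     # callback prob. for Black applicants at discriminating jobs
N_PER_GROUP = 5        # 5 white + 5 Black applications = 10 per job

def binom(n, p):
    """Number of callbacks out of n applications, each with probability p."""
    return sum(random.random() < p for _ in range(n))

flagged_disc = flagged_nondisc = n_disc = n_nondisc = 0
for _ in range(N_JOBS):
    disc = random.random() < PI_DISC
    w = binom(N_PER_GROUP, P_CALL)
    b = binom(N_PER_GROUP, P_CALL_DISC if disc else P_CALL)
    # Hypothetical decision rule: investigate jobs that call back at least
    # two white applicants and zero Black applicants.
    flag = w >= 2 and b == 0
    if disc:
        n_disc += 1
        flagged_disc += flag
    else:
        n_nondisc += 1
        flagged_nondisc += flag

detection_rate = flagged_disc / n_disc
false_rate = flagged_nondisc / n_nondisc
print(f"detected {detection_rate:.3f} of discriminators, "
      f"falsely flagged {false_rate:.3f} of non-discriminators")
```

Under these assumed parameters the rule detects discriminators at a higher rate than it falsely flags non-discriminators, which is the tradeoff the paper's decision rules are designed to tune.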


Integrating Case-Based and Rule-Based Reasoning: the Possibilistic Connection

arXiv.org Artificial Intelligence

Rule-based reasoning (RBR) and case-based reasoning (CBR) have emerged as two important and complementary reasoning methodologies in artificial intelligence (AI). For problem solving in complex, real-world situations, it is useful to integrate RBR and CBR. This paper presents an approach that achieves a compact and seamless integration of RBR and CBR within the base architecture of rules. The paper focuses on the possibilistic nature of the approximate reasoning methodology common to both CBR and RBR. In CBR, the concept of similarity is cast as the complement of the distance between cases. In RBR, the transitivity of similarity is the basis for approximate deductions via the generalized modus ponens. It is shown that CBR and RBR can be integrated without altering the inference engine of RBR. The integration is illustrated in the financial domain of mergers and acquisitions, and the ideas have been implemented in a prototype system called MARS.
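The possibilistic reading described here, with similarity as the complement of a distance and conclusions derived by min composition as in the generalized modus ponens, can be sketched in a few lines. The feature vectors, the normalized L1 distance, and the rule certainty below are hypothetical illustrations, not details of MARS.

```python
def similarity(x, y):
    """Similarity as the complement of a normalized L1 distance (features in [0, 1])."""
    d = sum(abs(a - b) for a, b in zip(x, y)) / len(x)
    return 1 - d

# Hypothetical case descriptions (e.g., normalized deal features)
new_case = [0.8, 0.3, 0.6]
past_case = [0.7, 0.4, 0.6]
sim = similarity(new_case, past_case)

# Generalized modus ponens, min composition: the certainty of the retrieved
# case's conclusion is capped by both the rule's certainty and the similarity.
rule_certainty = 0.9   # assumed certainty attached to the past case's outcome
conclusion_certainty = min(sim, rule_certainty)
print(sim, conclusion_certainty)
```

The max-min transitivity the abstract alludes to would then guarantee that chaining such inferences through an intermediate case never produces a certainty above that of the weakest link.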


A Simultaneous Transformation and Rounding Approach for Modeling Integer-Valued Data

arXiv.org Machine Learning

Integer-valued and count data are ubiquitous in many fields, including epidemiology (Osthus et al., 2018; Kowal, 2019), ecology (Dorazio et al., 2005), and insurance (Bening and Korolev, 2012), among others (Cameron and Trivedi, 2013). Count data also serve as an indicator of demand, such as the demand for medical services (Deb and Trivedi, 1997), emergency medical services (Matteson et al., 2011), and call center access (Shen and Huang, 2008). In these applications and many others, integer-valued data are frequently observed jointly with predictors, over time intervals, or across spatial locations. Integer-valued data also exhibit a variety of distributional features, including zero-inflation, skewness, over- or underdispersion, and in some cases may be bounded or censored. Flexible and interpretable models for integer-valued processes are therefore highly useful in practice. The most widely-used models for count data build upon the Poisson distribution. However, the limitations of the Poisson distribution are well-known: the distribution is not sufficiently flexible in practice and cannot account for zero-inflation or over- and underdispersion. A common strategy is to generalize the Poisson model by introducing additional parameters.
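The transformation-and-rounding construction in the title can be sketched directly: a latent Gaussian variable is transformed and then rounded to produce an integer count. The log transformation and the parameter values below are illustrative assumptions, chosen only to show how such a model produces zero-inflation and overdispersion that a single-parameter Poisson cannot.

```python
import math
import random
import statistics

random.seed(2)

# Sketch: y = floor(exp(z)) with latent z ~ N(mu, sigma^2).
# mu and sigma are illustrative, not fitted values.
MU, SIGMA = 0.0, 1.5
y = [math.floor(math.exp(random.gauss(MU, SIGMA))) for _ in range(20_000)]

mean_y = statistics.mean(y)
var_y = statistics.pvariance(y)
zero_frac = sum(v == 0 for v in y) / len(y)
print(f"mean={mean_y:.2f}  var={var_y:.2f}  P(y=0)={zero_frac:.2f}")
# A Poisson model would force var == mean and P(y=0) = exp(-mean);
# the rounded lognormal instead gives heavy overdispersion and many zeros.
```

Here flooring exp(z) sends every draw with z < 0 to zero, so roughly half the counts are zero while the variance far exceeds the mean, two features the abstract notes the Poisson cannot accommodate.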


Causal Knowledge Network Integration for Life Cycle Assessment

AAAI Conferences

Sustainable decision making requires integrating knowledge of environmental causes and effects from heterogeneous stakeholders into the design process. Such causes and effects have typically been captured in ontological representations, but the domain-specific character of ontologies makes it difficult to generate and integrate knowledge across multiple domains, and representing heterogeneous, domain-specific design knowledge in a standardized way remains challenging. Causal knowledge can provide the common ground needed for knowledge integration across domains. This paper therefore develops a causal knowledge integration system that builds on the authors' previous mathematical representation of causal knowledge.


A Generalized Fellegi-Sunter Framework for Multiple Record Linkage With Application to Homicide Record Systems

arXiv.org Machine Learning

We present a probabilistic method for linking multiple datafiles. This task is not trivial in the absence of unique identifiers for the individuals recorded. This is a common scenario when linking census data to coverage measurement surveys for census coverage evaluation, and in general when multiple record systems need to be integrated for subsequent analysis. Our method generalizes the Fellegi-Sunter theory for linking records from two datafiles and its modern implementations. The multiple record linkage goal is to classify the record K-tuples coming from K datafiles according to the different matching patterns. Our method incorporates the transitivity of agreement in the computation of the data used to model matching probabilities. We use a mixture model to fit matching probabilities via maximum likelihood using the EM algorithm. We present a method to decide the membership of record K-tuples in the subsets of matching patterns and we prove its optimality. We apply our method to the integration of three Colombian homicide record systems and perform a simulation study to explore the performance of the method under measurement error and different scenarios. The proposed method works well and opens some directions for future research.
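A minimal sketch of the two-datafile Fellegi-Sunter building block that the paper generalizes: binary field-agreement vectors for record pairs are modeled as a two-component Bernoulli mixture (match vs. non-match), fit by maximum likelihood via EM. The field count, agreement probabilities, and match proportion below are simulated illustrations, not values from the homicide application.

```python
import random

random.seed(0)

# --- Simulate agreement vectors for record pairs (assumed values) ---
K_FIELDS = 3
PI_MATCH = 0.2                # true fraction of matching pairs
M_PROB = [0.95, 0.9, 0.9]     # P(field agrees | pair is a match)
U_PROB = [0.10, 0.05, 0.1]    # P(field agrees | pair is a non-match)

def draw_pair():
    probs = M_PROB if random.random() < PI_MATCH else U_PROB
    return [1 if random.random() < p else 0 for p in probs]

data = [draw_pair() for _ in range(5_000)]

# --- EM for a two-component Bernoulli mixture (Fellegi-Sunter style) ---
pi = 0.5                      # initial mixing weight for the match class
m = [0.8] * K_FIELDS          # initial guess: matches tend to agree
u = [0.2] * K_FIELDS          # initial guess: non-matches tend to disagree

def lik(gamma, theta):
    """Likelihood of an agreement vector under per-field Bernoulli probs."""
    out = 1.0
    for g, t in zip(gamma, theta):
        out *= t if g else (1 - t)
    return out

for _ in range(50):
    # E-step: posterior probability that each pair is a match
    w = []
    for gamma in data:
        a = pi * lik(gamma, m)
        b = (1 - pi) * lik(gamma, u)
        w.append(a / (a + b))
    # M-step: update mixing weight and per-field agreement probabilities
    sw = sum(w)
    pi = sw / len(w)
    for k in range(K_FIELDS):
        m[k] = sum(wi * g[k] for wi, g in zip(w, data)) / sw
        u[k] = sum((1 - wi) * g[k] for wi, g in zip(w, data)) / (len(w) - sw)

print(round(pi, 3), [round(x, 2) for x in m], [round(x, 2) for x in u])
```

The paper's contribution extends this two-file setup to K-tuples across K files, enforcing transitivity of agreement when constructing the comparison data that feed the mixture model.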