Accuracy
Uncertainty in Fairness Assessment: Maintaining Stable Conclusions Despite Fluctuations
Barrainkua, Ainhize, Gordaliza, Paula, Lozano, Jose A., Quadrianto, Novi
With the current adoption of machine learning (ML) systems in social, economic, and industrial domains, concerns about the fairness of automated decisions have been added to the problem of ensuring the efficiency of algorithms in a stable and interpretative manner. Although both aspects are measured in terms of performance metrics, fairness entails the additional challenge of incorporating sensitive information in the data and new procedures need to be considered to control the stability of such outcomes. Recent ML trends are increasingly encouraging researchers to incorporate uncertainty into the evaluation of algorithm-based systems. In order to increase the transparency of algorithmic performance measures, typically for comparison purposes, some authors [3, 19] propose to treat these metrics as random variables whose posterior distributions are updated through Bayesian inference. In the fair learning setting, these kinds of considerations are also necessary, especially since fairness metrics have been proved unstable with respect to dataset composition. In particular, Ji et al. [17] or Friedler et al. [12] showed how certain fairness metrics strongly vary, respectively, in hold-out
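As a minimal sketch of what treating a performance metric as a random variable can look like, the example below places a Beta prior on a classifier's accuracy and updates it with observed counts. The uniform Beta(1, 1) prior, the counts, and the function names are illustrative assumptions, not the procedure of the cited works.

```python
import random

def accuracy_posterior(correct, total, prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters for accuracy under a Beta(prior_a, prior_b) prior."""
    return prior_a + correct, prior_b + (total - correct)

def prob_first_better(post_first, post_second, draws=20000, seed=0):
    """Monte Carlo estimate of P(accuracy_first > accuracy_second)."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(*post_first) > rng.betavariate(*post_second)
        for _ in range(draws)
    )
    return wins / draws

# Hypothetical evaluation counts for two classifiers on 100 test points each.
post_a = accuracy_posterior(correct=88, total=100)
post_b = accuracy_posterior(correct=80, total=100)

mean_a = post_a[0] / sum(post_a)          # posterior mean accuracy of A
p_a_better = prob_first_better(post_a, post_b)
print(post_a, round(mean_a, 3), round(p_a_better, 2))
```

Reporting the probability that one system outperforms the other, rather than two point estimates, makes the uncertainty from a finite test set explicit.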
Out of Context: Investigating the Bias and Fairness Concerns of "Artificial Intelligence as a Service"
Lewicki, Kornel, Lee, Michelle Seng Ah, Cobbe, Jennifer, Singh, Jatinder
"AI as a Service" (AIaaS) is a rapidly growing market, offering various plug-and-play AI services and tools. AIaaS enables its customers (users) - who may lack the expertise, data, and/or resources to develop their own systems - to easily build and integrate AI capabilities into their applications. Yet, it is known that AI systems can encapsulate biases and inequalities that can have societal impact. This paper argues that the context-sensitive nature of fairness is often incompatible with AIaaS' 'one-size-fits-all' approach, leading to issues and tensions. Specifically, we review and systematise the AIaaS space by proposing a taxonomy of AI services based on the levels of autonomy afforded to the user. We then critically examine the different categories of AIaaS, outlining how these services can lead to biases or be otherwise harmful in the context of end-user applications. In doing so, we seek to draw research attention to the challenges of this emerging area.
New AI classifier for indicating AI-written text
We're launching a classifier trained to distinguish between text written by a human and text written by AIs from a variety of providers. While it is impossible to reliably detect all AI-written text, we believe good classifiers can inform mitigations for false claims that AI-generated text was written by a human: for example, running automated misinformation campaigns, using AI tools for academic dishonesty, and positioning an AI chatbot as a human. Our classifier is not fully reliable. In our evaluations on a "challenge set" of English texts, our classifier correctly identifies 26% of AI-written text (true positives) as "likely AI-written," while incorrectly labeling human-written text as AI-written 9% of the time (false positives).
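One way to read the two rates above: the share of flagged texts that are actually AI-written depends on how common AI-written text is among the inputs, which the announcement does not state. The sketch below applies Bayes' rule to the reported 26% and 9% figures; the `ai_share` values are hypothetical assumptions, not numbers from the evaluation.

```python
def flagged_precision(tpr, fpr, ai_share):
    """Fraction of flagged texts that are AI-written, via Bayes' rule.

    tpr: P(flagged | AI-written), fpr: P(flagged | human-written),
    ai_share: assumed prior fraction of AI-written texts in the input.
    """
    flagged_ai = tpr * ai_share
    flagged_human = fpr * (1.0 - ai_share)
    return flagged_ai / (flagged_ai + flagged_human)

# Reported rates: 26% true positives, 9% false positives.
balanced = flagged_precision(0.26, 0.09, ai_share=0.5)  # about 0.74
rare = flagged_precision(0.26, 0.09, ai_share=0.1)      # about 0.24
print(round(balanced, 3), round(rare, 3))
```

Under the second assumption, roughly three of every four flagged texts would be human-written, which illustrates why the classifier is described as not fully reliable.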
Deterministic equivalent and error universality of deep random features learning
Schröder, Dominik, Cui, Hugo, Dmitriev, Daniil, Loureiro, Bruno
This manuscript considers the problem of learning a random Gaussian network function using a fully connected network with frozen intermediate layers and trainable readout layer. This problem can be seen as a natural generalization of the widely studied random features model to deeper architectures. First, we prove Gaussian universality of the test error in a ridge regression setting where the learner and target networks share the same intermediate layers, and provide a sharp asymptotic formula for it. Establishing this result requires proving a deterministic equivalent for traces of the deep random features sample covariance matrices which can be of independent interest. Second, we conjecture the asymptotic Gaussian universality of the test error in the more general setting of arbitrary convex losses and generic learner/target architectures. We provide extensive numerical evidence for this conjecture, which requires the derivation of closed-form expressions for the layer-wise post-activation population covariances. In light of our results, we investigate the interplay between architecture design and implicit regularization.
Using Machine Learning to Develop Smart Reflex Testing Protocols
McDermott, Matthew, Dighe, Anand, Szolovits, Peter, Luo, Yuan, Baron, Jason
Objective: Reflex testing protocols allow clinical laboratories to perform second line diagnostic tests on existing specimens based on the results of initially ordered tests. Reflex testing can support optimal clinical laboratory test ordering and diagnosis. In current clinical practice, reflex testing typically relies on simple "if-then" rules; however, this limits their scope since most test ordering decisions involve more complexity than a simple rule will allow. Here, using the analyte ferritin as an example, we propose an alternative machine learning-based approach to "smart" reflex testing with a wider scope and greater impact than traditional rule-based approaches. Methods: Using patient data, we developed a machine learning model to predict whether a patient getting CBC testing will also have ferritin testing ordered, consider applications of this model to "smart" reflex testing, and evaluate the model by comparing its performance to possible rule-based approaches. Results: Our underlying machine learning models performed moderately well in predicting ferritin test ordering and demonstrated greater suitability to reflex testing than rule-based approaches. Using chart review, we demonstrate that our model may improve ferritin test ordering. Finally, as a secondary goal, we demonstrate that ferritin test results are missing not at random (MNAR), a finding with implications for unbiased imputation of missing test results. Conclusions: Machine learning may provide a foundation for new types of reflex testing with enhanced benefits for clinical diagnosis and laboratory utilization management.
Learning to be Fair: A Consequentialist Approach to Equitable Decision-Making
Chohlas-Wood, Alex, Coots, Madison, Zhu, Henry, Brunskill, Emma, Goel, Sharad
In the dominant paradigm for designing equitable machine learning systems, one works to ensure that model predictions satisfy various fairness criteria, such as parity in error rates across race, gender, and other legally protected traits. That approach, however, typically ignores the downstream decisions and outcomes that predictions affect, and, as a result, can induce unexpected harms. Here we present an alternative framework for fairness that directly anticipates the consequences of decisions. Stakeholders first specify preferences over the possible outcomes of an algorithmically informed decision-making process. For example, lenders may prefer extending credit to those most likely to repay a loan, while also preferring similar lending rates across neighborhoods. One then searches the space of decision policies to maximize the specified utility. We develop and describe a method for efficiently learning these optimal policies from data for a large family of expressive utility functions, facilitating a more holistic approach to equitable decision-making.
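The lending example can be sketched as a brute-force search over a small policy space. This is only an illustration under stated assumptions: the applicants, the per-neighborhood threshold policies, the utility form (expected repayment gain minus a weighted gap in lending rates), and all function names are hypothetical, and the paper's learning method is not reproduced here.

```python
from itertools import product

# Hypothetical applicants: (neighborhood, estimated repayment probability).
applicants = [
    ("north", 0.9), ("north", 0.8), ("north", 0.6), ("north", 0.4),
    ("south", 0.7), ("south", 0.5), ("south", 0.45), ("south", 0.2),
]

def lending_rate(group, threshold):
    """Share of a neighborhood's applicants approved at this threshold."""
    members = [p for g, p in applicants if g == group]
    return sum(p >= threshold for p in members) / len(members)

def rate_gap(thresholds):
    """Absolute difference in lending rates between the two neighborhoods."""
    return abs(lending_rate("north", thresholds["north"])
               - lending_rate("south", thresholds["south"]))

def utility(thresholds, parity_weight):
    """Expected gain (+1 per repayment, -1 per default) minus a parity penalty."""
    gain = sum(2 * p - 1 for g, p in applicants if p >= thresholds[g])
    return gain - parity_weight * rate_gap(thresholds)

def best_policy(parity_weight, grid=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Exhaustively search per-neighborhood thresholds for the highest utility."""
    candidates = [{"north": a, "south": b} for a, b in product(grid, repeat=2)]
    return max(candidates, key=lambda t: utility(t, parity_weight))

repayment_only = best_policy(parity_weight=0.0)
with_parity = best_policy(parity_weight=5.0)
print(repayment_only, rate_gap(repayment_only))
print(with_parity, rate_gap(with_parity))
```

Raising `parity_weight` moves the selected policy toward equal lending rates at some cost in expected repayment, which is the kind of trade-off the stakeholder preferences are meant to encode.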
Epic-Sounds: A Large-scale Dataset of Actions That Sound
Huh, Jaesung, Chalk, Jacob, Kazakos, Evangelos, Damen, Dima, Zisserman, Andrew
We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos. We propose an annotation pipeline where annotators temporally label distinguishable audio segments and describe the action that could have caused this sound. We identify actions that can be discriminated purely from audio, through grouping these free-form descriptions of audio into classes. For actions that involve objects colliding, we collect human annotations of the materials of these objects (e.g. a glass object being placed on a wooden surface), which we verify from visual labels, discarding ambiguities. Overall, EPIC-SOUNDS includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments. We train and evaluate two state-of-the-art audio recognition models on our dataset, highlighting the importance of audio-only labels and the limitations of current models to recognise actions that sound.
Using novel data and ensemble models to improve automated labeling of Sustainable Development Goals
Wulff, Dirk U., Meier, Dominik S., Mata, Rui
A number of labeling systems based on text have been proposed to help monitor work on the United Nations (UN) Sustainable Development Goals (SDGs). Here, we present a systematic comparison of systems using a variety of text sources and show that systems differ considerably in their sensitivity (i.e., true-positive rate) and specificity (i.e., true-negative rate), have systematic biases (e.g., are more sensitive to specific SDGs relative to others), and are susceptible to the type and amount of text analyzed. We then show that an ensemble model that pools labeling systems alleviates some of these limitations, exceeding the labeling performance of all currently available systems. We conclude that researchers and policymakers should care about the choice of labeling system and that ensemble methods should be favored when drawing conclusions about the absolute and relative prevalence of work on the SDGs based on automated methods.
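To make the pooling idea concrete, the sketch below combines three hypothetical labeling systems by majority vote and compares true-positive and true-negative rates. The labels are toy data constructed for illustration, and majority voting is an assumed pooling rule; the abstract does not specify how the ensemble combines systems.

```python
def true_positive_rate(truth, predicted):
    """Share of truly positive items that were labeled positive."""
    positives = [p for t, p in zip(truth, predicted) if t == 1]
    return sum(positives) / len(positives)

def true_negative_rate(truth, predicted):
    """Share of truly negative items that were labeled negative."""
    negatives = [p for t, p in zip(truth, predicted) if t == 0]
    return sum(1 - p for p in negatives) / len(negatives)

def majority_vote(*systems):
    """Label an item positive when more than half of the systems do."""
    return [int(2 * sum(votes) > len(votes)) for votes in zip(*systems)]

# Toy labels for one SDG: 1 = text relates to the goal, 0 = it does not.
truth    = [1, 1, 1, 1, 0, 0, 0, 0]
system_a = [1, 1, 0, 0, 0, 0, 0, 1]
system_b = [1, 0, 1, 0, 0, 0, 1, 0]
system_c = [0, 1, 1, 1, 0, 1, 0, 0]

ensemble = majority_vote(system_a, system_b, system_c)
for name, labels in [("a", system_a), ("b", system_b),
                     ("c", system_c), ("ensemble", ensemble)]:
    print(name, true_positive_rate(truth, labels), true_negative_rate(truth, labels))
```

In this constructed case the individual systems make different mistakes, so the vote cancels some of them; real systems with correlated errors would benefit less.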
Convolutional Neural Network for Breast Cancer Classification
Breast cancer is the second most common cancer in women and men worldwide. In 2012, it represented about 12 percent of all new cancer cases and 25 percent of all cancers in women. Breast cancer starts when cells in the breast begin to grow out of control. These cells usually form a tumor that can often be seen on an x-ray or felt as a lump. The tumor is malignant (cancer) if the cells can grow into (invade) surrounding tissues or spread (metastasize) to distant areas of the body.
CT Study Says Deep Learning Model Could Help Differentiate Between Acute Diverticulitis and Colon Carcinoma
Noting that overlapping imaging features on contrast-enhanced computed tomography (CT) can make it challenging to differentiate between acute diverticulitis and colon cancer, researchers say an emerging deep learning model may provide enhanced sensitivity and specificity for these conditions. In a retrospective study recently published in JAMA Network Open, researchers developed and tested a three-dimensional (3D) convolutional neural network (CNN) for 585 patients (mean age of 63.2) who underwent surgery for colon cancer or acute diverticulitis between July 1, 2005 and October 1, 2020, had venous phase CT imaging within 60 days prior to surgery and had segmental wall thickening in the colon that was independent of disease stage. In comparison to mean sensitivity and specificity rates of 77.6 percent and 81.6 percent, respectively, for radiologist readers, the study authors noted an 83.3 percent sensitivity rate and an 86.6 percent specificity rate for the 3D CNN model. The combination of the deep learning model and radiologist assessment resulted in an eight percentage point increase in sensitivity (to 85.6 percent) and a 9.7 percentage point increase in specificity (to 91.3 percent) over radiologist assessments alone, according to the study findings. The study authors also noted a reduction in false-negative rates with the 3D CNN model.