Performance Analysis
Decision time: Sometimes accuracy is not your friend
Machine learning is about machines making decisions and, as we have already discussed, we can produce multiple models for any given problem and measure their accuracy. It is intuitively obvious that we would elect to use the most accurate model and, most of the time, we do. But there are times when we will actually elect to use one of the less accurate ones. The underlying reason is that the estimates we make of accuracy, whilst very useful, take no account of the cost of being right and the cost of being wrong. We might be trying to identify which of our customers on a clothing website are women and which are men so that our recommendation engine makes the appropriate clothing suggestions.
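The point about accuracy ignoring the cost of errors can be illustrated with a minimal sketch. Everything below is hypothetical: the confusion counts, the two models, and the cost figures are invented for illustration and do not come from the article.

```python
def accuracy(counts):
    """Fraction of all predictions that were correct."""
    total = counts["tp"] + counts["fp"] + counts["fn"] + counts["tn"]
    return (counts["tp"] + counts["tn"]) / total


def total_cost(counts, fp_cost, fn_cost):
    """Total cost of the errors, assuming correct predictions cost nothing."""
    return counts["fp"] * fp_cost + counts["fn"] * fn_cost


# Hypothetical confusion counts for two models on the same 1000 examples.
model_a = {"tp": 80, "fp": 10, "fn": 40, "tn": 870}   # accuracy 0.95
model_b = {"tp": 110, "fp": 70, "fn": 10, "tn": 810}  # accuracy 0.92

# Assumed costs: a false negative is ten times as expensive as a false positive.
FP_COST, FN_COST = 1.0, 10.0

for name, counts in (("A", model_a), ("B", model_b)):
    print(name, accuracy(counts), total_cost(counts, FP_COST, FN_COST))
```

Under these assumed costs, the less accurate model B is the cheaper one to deploy, because it makes far fewer of the expensive false-negative errors.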
Oracle-free Detection of Translation Issue for Neural Machine Translation
Zheng, Wujie, Wang, Wenyu, Liu, Dian, Zhang, Changrong, Zeng, Qinsong, Deng, Yuetang, Yang, Wei, Xie, Tao
Neural Machine Translation (NMT) has been widely adopted over recent years due to its advantages on various translation tasks. However, NMT systems can be error-prone due to the intractability of natural languages and the design of neural networks, introducing issues into their translations. These issues could potentially lead to information loss, wrong semantics, and low readability in translations, compromising the usefulness of NMT and leading to potential non-trivial consequences. Although there are existing approaches to quality assessment and issue detection for NMT, such as using the BLEU score, such approaches face two serious limitations. First, such solutions require oracle translations, i.e., reference translations, which are often unavailable, e.g., in production environments. Second, such approaches cannot pinpoint the issue types and locations within translations. To address such limitations, we propose a new approach aiming to precisely detect issues in translations without requiring oracle translations. Our approach focuses on the two most prominent issues in NMT translations by including two detection algorithms. Our experimental results show that our new approach could achieve high effectiveness on real-world datasets. Our successful experience in deploying the proposed algorithms in both the development and production environments of WeChat, a messenger app with over one billion monthly active users, helps eliminate numerous defects of our NMT model, monitor the effectiveness on real-world translation tasks, and collect in-house test cases, producing high industry impact.
How Many Random Seeds? Statistical Power Analysis in Deep Reinforcement Learning Experiments
Colas, Cédric, Sigaud, Olivier, Oudeyer, Pierre-Yves
Consistently checking the statistical significance of experimental results is one of the mandatory methodological steps to address the so-called "reproducibility crisis" in deep reinforcement learning. In this tutorial paper, we explain how the number of random seeds relates to the probabilities of statistical errors. For both the t-test and the bootstrap confidence interval test, we recall theoretical guidelines to determine the number of random seeds one should use to provide a statistically significant comparison of the performance of two algorithms. Finally, we discuss the influence of deviations from the assumptions usually made by statistical tests. We show that they can lead to inaccurate evaluations of statistical errors and provide guidelines to counter these negative effects. We make our code available to perform the tests.
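The link between the number of seeds and statistical errors can be sketched as follows. This is not the authors' released code. It assumes two algorithms whose per-seed performances share a standard deviation `sigma`, and it uses the textbook normal approximation for a two-sided two-sample test, which slightly underestimates the count a t-based calculation would give. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    se2_a = variance(sample_a) / n_a  # squared standard error of mean A
    se2_b = variance(sample_b) / n_b  # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, dof


def required_seeds(effect, sigma, alpha=0.05, power=0.8):
    """Approximate seeds per algorithm needed to detect a mean difference
    `effect` with type-I error `alpha` and type-II error `1 - power`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / effect) ** 2)
```

Halving the effect size one wants to detect roughly quadruples the number of seeds required, which is why small performance gaps are hard to confirm with only a handful of runs.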
Ensemble learning with Conformal Predictors: Targeting credible predictions of conversion from Mild Cognitive Impairment to Alzheimer's Disease
Pereira, Telma, Cardoso, Sandra, Silva, Dina, Guerreiro, Manuela, de Mendonça, Alexandre, Madeira, Sara C.
Most machine learning classifiers give accurate predictions for new examples, yet without indicating how trustworthy those predictions are. In the medical domain, this hampers their integration into decision support systems, which could be useful in clinical practice. We use a supervised learning approach that combines Ensemble learning with Conformal Predictors to predict conversion from Mild Cognitive Impairment to Alzheimer's Disease. Our goal is to enhance the classification performance (Ensemble learning) and complement each prediction with a measure of credibility (Conformal Predictors). Our results showed the superiority of the proposed approach over a similar ensemble framework with standard classifiers.
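How a conformal predictor attaches a credibility measure to a prediction can be sketched as below. This is a simplified illustration, not the authors' pipeline. It assumes nonconformity scores have already been computed by some underlying classifier and that one shared calibration set is used for all labels. The label names and scores are hypothetical.

```python
def conformal_p_value(calibration_scores, test_score):
    """Share of nonconformity scores at least as large as the test score,
    counting the test example itself."""
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def predict_with_credibility(calibration_scores, candidate_scores):
    """Pick the label with the largest p-value.

    candidate_scores maps each candidate label to the nonconformity score
    the new example would get if it carried that label.
    Credibility is the largest p-value; confidence is one minus the
    second-largest p-value.
    """
    p_values = {
        label: conformal_p_value(calibration_scores, score)
        for label, score in candidate_scores.items()
    }
    ranked = sorted(p_values.values(), reverse=True)
    label = max(p_values, key=p_values.get)
    credibility = ranked[0]
    confidence = 1.0 - ranked[1] if len(ranked) > 1 else 1.0
    return label, credibility, confidence


# Hypothetical calibration scores and candidate scores for one patient.
calibration = [0.1, 0.2, 0.3, 0.4]
candidates = {"converter": 0.15, "non_converter": 0.5}
print(predict_with_credibility(calibration, candidates))
```

A low credibility value signals that the new example looks unusual under every candidate label, which is the kind of warning a plain classifier does not give.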
Counterfactual Evaluation of Machine Learning Models
So I'm sure many of you know Stripe. It's a company that provides a platform for e-commerce. And one of the things that everyone encounters when conducting commerce online is, unsurprisingly, fraud. So before I get into the details of how we address fraud with machine learning, I want to talk a little bit about the fraud life cycle. So what typically happens in fraud is that you have an organized crime ring install malware on point-of-sale devices. For example, there was this famous breach at Target about five years ago. So you can actually go online, to the deep web, and buy credit card numbers that were taken from personal devices, ATM machines and so forth. What's kind of surprising and funny is that these criminals who are selling credit card numbers to smaller time criminals are quite customer service oriented. So you can say, "I want 12 credit card numbers from Wells Fargo or Citibank. I want credit card numbers that were issued in zip codes 94102 to 94105 and so forth." Some of them are in fact so customer service oriented that they guarantee you that if you are unable to commit fraud with the cards you buy, they'll give you your money back. Let's say, five years at Stripe was enough for me. I decided to leave and become a criminal, using all my knowledge.
Learning under selective labels in the presence of expert consistency
De-Arteaga, Maria, Dubrawski, Artur, Chouldechova, Alexandra
We explore the problem of learning under selective labels in the context of algorithm-assisted decision making. Selective labels is a pervasive selection bias problem that arises when historical decision making blinds us to the true outcome for certain instances. Examples of this are common in many applications, ranging from predicting recidivism using pre-trial release data to diagnosing patients. In this paper we discuss why selective labels often cannot be effectively tackled by standard methods for adjusting for sample selection bias, even if there are no unobservables. We propose a data augmentation approach that can be used either to leverage expert consistency to mitigate the partial blindness that results from selective labels, or to empirically validate whether learning under such a framework may lead to unreliable models prone to systemic discrimination.
Extracting Actionable Knowledge from Domestic Violence Discourses on Social Media
Subramani, Sudha, O'Connor, Manjula
Domestic Violence (DV) is considered a major social issue, and there is a strong relationship between DV and public health. Existing research has used social media to track and analyse real-world events such as emerging trends, natural disasters, user sentiment, political opinions, and health care. However, less attention has been given to social welfare issues like DV and their impact on public health. Recently, victims of DV have turned to social media platforms to express their feelings in posts and to seek social and emotional support, sympathetic encouragement, compassion, and empathy from the public. However, it is difficult to mine actionable knowledge from large conversational social media datasets because they are high-dimensional, short, noisy, high in volume, and high in velocity. Hence, this paper proposes a novel framework to model and discover the various themes related to DV in the public domain. The proposed framework could provide unprecedentedly valuable information to public health researchers, national family health organizations, government, and the public, with data enrichment and consolidation to improve the social welfare of the community. It thus provides actionable knowledge by monitoring and analysing continuous and rich user-generated content.
Breast Cancer Diagnosis via Classification Algorithms
In this paper, we analyze the Wisconsin Diagnostic Breast Cancer Data using Machine Learning classification techniques, such as the SVM, Bayesian Logistic Regression (Variational Approximation), and K-Nearest-Neighbors. We describe each model and compare their performance using different measures. We conclude that the SVM has the best performance among all the classifiers, closely followed by Bayesian Logistic Regression, which ranks as the second-best method for this dataset.
Big data, small lab – Physics World
The Large Hadron Collider at CERN is one of the world's largest scientific instruments. It captures 5 trillion bits of data every second, and the Geneva-based lab employs a dedicated group of experts to manage the flow. In contrast, the instrument shown here – known as a time-stretch quantitative phase imaging microscope – fits on a bench top, and is managed by a team of one. However, it is also capable of capturing an immense amount of data: 0.8 trillion bits per second. These two examples illustrate just how ubiquitous "big data" has become in physics.
A Unified Approach to Quantifying Algorithmic Unfairness: Measuring Individual & Group Unfairness via Inequality Indices
Speicher, Till, Heidari, Hoda, Grgic-Hlaca, Nina, Gummadi, Krishna P., Singla, Adish, Weller, Adrian, Zafar, Muhammad Bilal
Discrimination via algorithmic decision making has received considerable attention. Prior work largely focuses on defining conditions for fairness, but does not define satisfactory measures of algorithmic unfairness. In this paper, we focus on the following question: Given two unfair algorithms, how should we determine which of the two is more unfair? Our core idea is to use existing inequality indices from economics to measure how unequally the outcomes of an algorithm benefit different individuals or groups in a population. Our work offers a justified and general framework to compare and contrast the (un)fairness of algorithmic predictors. This unifying approach enables us to quantify unfairness both at the individual and the group level. Further, our work reveals overlooked tradeoffs between different fairness notions: using our proposed measures, the overall individual-level unfairness of an algorithm can be decomposed into a between-group and a within-group component. Earlier methods are typically designed to tackle only between-group unfairness, which may be justified for legal or other reasons. However, we demonstrate that minimizing exclusively the between-group component may, in fact, increase the within-group, and hence the overall unfairness. We characterize and illustrate the tradeoffs between our measures of (un)fairness and the prediction accuracy.
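The decomposition into between-group and within-group components can be sketched with one well-known inequality index from economics, the generalized entropy index. The abstract does not name a specific index or benefit function, so both are assumptions here: the sketch uses alpha = 2 and defines each individual's benefit as prediction minus true label plus one. The labels, predictions, and group assignments are hypothetical.

```python
def generalized_entropy(benefits, alpha=2):
    """Generalized entropy index of a benefit vector (alpha not in {0, 1})."""
    n = len(benefits)
    mu = sum(benefits) / n
    return sum((b / mu) ** alpha - 1 for b in benefits) / (n * alpha * (alpha - 1))


def decompose(benefits, groups, alpha=2):
    """Split the index into (within_group, between_group) components."""
    n = len(benefits)
    mu = sum(benefits) / n
    by_group = {}
    for b, g in zip(benefits, groups):
        by_group.setdefault(g, []).append(b)
    within = 0.0
    group_means = []  # every individual replaced by their group's mean benefit
    for members in by_group.values():
        mu_g = sum(members) / len(members)
        weight = (len(members) / n) * (mu_g / mu) ** alpha
        within += weight * generalized_entropy(members, alpha)
        group_means.extend([mu_g] * len(members))
    between = generalized_entropy(group_means, alpha)
    return within, between


# Hypothetical binary labels, predictions, and group membership.
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 1]
groups = ["a", "a", "a", "b", "b", "b"]

# Assumed benefit: 2 for a false positive, 1 if correct, 0 for a false negative.
benefits = [p - t + 1 for p, t in zip(y_pred, y_true)]

total = generalized_entropy(benefits)
within, between = decompose(benefits, groups)
print(total, within, between)
```

Because the two components sum to the total, reducing only the between-group term leaves the within-group term free to grow, which is the tradeoff the abstract describes.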