Performance Analysis
Multi-task problems are not multi-objective
Ruchte, Michael, Grabocka, Josif
Multi-objective optimization (MOO) aims at finding a set of optimal configurations for a given set of objectives. A recent line of work applies MOO methods to the typical Machine Learning (ML) setting, which becomes multi-objective if a model should optimize more than one objective, for instance in fair machine learning. These works also use Multi-Task Learning (MTL) problems to benchmark MOO algorithms treating each task as independent objective. In this work we show that MTL problems do not resemble the characteristics of MOO problems. In particular, MTL losses are not competing in case of a sufficiently expressive single model. As a consequence, a single model can perform just as well as optimizing all objectives with independent models, rendering MOO inapplicable. We provide evidence with extensive experiments on the widely used Multi-Fashion-MNIST datasets. Our results call for new benchmarks to evaluate MOO algorithms for ML. Our code is available at: https://github.com/ruchtem/moo-mtl.
Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control
Angelopoulos, Anastasios N., Bates, Stephen, Candรจs, Emmanuel J., Jordan, Michael I., Lei, Lihua
We introduce Learn then Test, a framework for calibrating machine learning models so that their predictions satisfy explicit, finite-sample statistical guarantees regardless of the underlying model and (unknown) data-generating distribution. The framework addresses, among other examples, false discovery rate control in multi-label classification, intersection-over-union control in instance segmentation, and the simultaneous control of the type-1 error of outlier detection and confidence set coverage in classification or regression. To accomplish this, we solve a key technical challenge: the control of arbitrary risks that are not necessarily monotonic. Our main insight is to reframe the risk-control problem as multiple hypothesis testing, enabling techniques and mathematical arguments different from those in the previous literature. We use our framework to provide new calibration methods for several core machine learning tasks with detailed worked examples in computer vision.
ML for Security Is Dead. Long Live ML for Security
When it comes to staying on top of security threats, machine learning, unquestionably, must be part of the equation. The volume of data is simply too great to cope without it. But as it's currently being used, ML may be doing more harm than good, particularly when it comes to alarm fatigue. Alarm fatigue is a condition that occurs when an operator is overloaded with alarms; in many of these cases, the majority of alarms turn out to be false positives. With too many alarms to investigate in a limited amount of timeโand the knowledge that most of them are false positivesโthe operator begins to ignore some alarms, which invariably leads to bad outcomes.
E-Commerce Dispute Resolution Prediction
Tsurel, David, Doron, Michael, Nus, Alexander, Dagan, Arnon, Guy, Ido, Shahaf, Dafna
E-Commerce marketplaces support millions of daily transactions, and some disagreements between buyers and sellers are unavoidable. Resolving disputes in an accurate, fast, and fair manner is of great importance for maintaining a trustworthy platform. Simple cases can be automated, but intricate cases are not sufficiently addressed by hard-coded rules, and therefore most disputes are currently resolved by people. In this work we take a first step towards automatically assisting human agents in dispute resolution at scale. We construct a large dataset of disputes from the eBay online marketplace, and identify several interesting behavioral and linguistic patterns. We then train classifiers to predict dispute outcomes with high accuracy. We explore the model and the dataset, reporting interesting correlations, important features, and insights.
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Grauman, Kristen, Westbury, Andrew, Byrne, Eugene, Chavis, Zachary, Furnari, Antonino, Girdhar, Rohit, Hamburger, Jackson, Jiang, Hao, Liu, Miao, Liu, Xingyu, Martin, Miguel, Nagarajan, Tushar, Radosavovic, Ilija, Ramakrishnan, Santhosh Kumar, Ryan, Fiona, Sharma, Jayant, Wray, Michael, Xu, Mengmeng, Xu, Eric Zhongcong, Zhao, Chen, Bansal, Siddhant, Batra, Dhruv, Cartillier, Vincent, Crane, Sean, Do, Tien, Doulaty, Morrie, Erapalli, Akshay, Feichtenhofer, Christoph, Fragomeni, Adriano, Fu, Qichen, Fuegen, Christian, Gebreselasie, Abrham, Gonzalez, Cristina, Hillis, James, Huang, Xuhua, Huang, Yifei, Jia, Wenqi, Khoo, Weslie, Kolar, Jachym, Kottur, Satwik, Kumar, Anurag, Landini, Federico, Li, Chao, Li, Yanghao, Li, Zhenqiang, Mangalam, Karttikeya, Modhugu, Raghava, Munro, Jonathan, Murrell, Tullie, Nishiyasu, Takumi, Price, Will, Puentes, Paola Ruiz, Ramazanova, Merey, Sari, Leda, Somasundaram, Kiran, Southerland, Audrey, Sugano, Yusuke, Tao, Ruijie, Vo, Minh, Wang, Yuchen, Wu, Xindi, Yagi, Takuma, Zhu, Yunyi, Arbelaez, Pablo, Crandall, David, Damen, Dima, Farinella, Giovanni Maria, Ghanem, Bernard, Ithapu, Vamsi Krishna, Jawahar, C. V., Joo, Hanbyul, Kitani, Kris, Li, Haizhou, Newcombe, Richard, Oliva, Aude, Park, Hyun Soo, Rehg, James M., Sato, Yoichi, Shi, Jianbo, Shou, Mike Zheng, Torralba, Antonio, Torresani, Lorenzo, Yan, Mingfei, Malik, Jitendra
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/
HEDP: A Method for Early Forecasting Software Defects based on Human Error Mechanisms
Huang, Fuqun, Strigini, Lorenzo
As the primary cause of software defects, human error is the key to understanding, and perhaps to predicting and avoiding them. Little research has been done to predict defects on the basis of the cognitive errors that cause them. This paper proposes an approach to predicting software defects through knowledge about the cognitive mechanisms of human errors. Our theory is that the main process behind a software defect is that an error-prone scenario triggers human error modes, which psychologists have observed to recur across diverse activities. Software defects can then be predicted by identifying such scenarios, guided by this knowledge of typical error modes. The proposed idea emphasizes predicting the exact location and form of a possible defect. We conducted two case studies to demonstrate and validate this approach, with 55 programmers in a programming competition and 5 analysts serving as the users of the approach. We found it impressive that the approach was able to predict, at the requirement phase, the exact locations and forms of 7 out of the 22 (31.8%) specific types of defects that were found in the code. The defects predicted tended to be common defects: their occurrences constituted 75.7% of the total number of defects in the 55 developed programs; each of them was introduced by at least two persons. The fraction of the defects introduced by a programmer that were predicted was on average (over all programmers) 75%. Furthermore, these predicted defects were highly persistent through the debugging process. If the prediction had been used to successfully prevent these defects, this could have saved 46.2% of the debugging iterations. This excellent capability of forecasting the exact locations and forms of possible defects at the early phases of software development recommends the approach for substantial benefits to defect prevention and early detection.
Logic Constraints to Feature Importances
Picchiotti, Nicola, Gori, Marco
In recent years, Artificial Intelligence (AI) algorithms have been proven to outperform traditional statistical methods in terms of predictivity, especially when a large amount of data was available. Nevertheless, the "black box" nature of AI models is often a limit for a reliable application in high-stakes fields like diagnostic techniques, autonomous guide, etc. Recent works have shown that an adequate level of interpretability could enforce the more general concept of model trustworthiness. The basic idea of this paper is to exploit the human prior knowledge of the features' importance for a specific task, in order to coherently aid the phase of the model's fitting. This sort of "weighted" AI is obtained by extending the empirical loss with a regularization term encouraging the importance of the features to follow predetermined constraints. This procedure relies on local methods for the feature importance computation, e.g. LRP, LIME, etc. that are the link between the model weights to be optimized and the user-defined constraints on feature importance. In the fairness area, promising experimental results have been obtained for the Adult dataset. Many other possible applications of this model agnostic theoretical framework are described.
COVID-19 rapid test national shortage mobilizes White House, leaves experts cautiously optimistic
Fox News Flash top headlines are here. Check out what's clicking on Foxnews.com. Last week's White House report reiterated President Biden's employer mandate that businesses with 100 or more employees require every worker to be fully vaccinated for COVID-19 or tested weekly. Jeffrey Zients, the White House COVID-19 response coordinator, summarized in last week's press briefing that, "We are on track to quadruple the supply of rapid, at-home tests available to Americans by December to more than 200 million a month and to increase the number of places Americans can access free testing in the United States to 30,000 community-based locations." He emphasized the president's staunch commitment in adding $1 billion of extra funding already to the recent $2 billion investment to increase supply.
Tracking the risk of a deployed model and detecting harmful distribution shifts
Podkopaev, Aleksandr, Ramdas, Aaditya
When deployed in the real world, machine learning models inevitably encounter changes in the data distribution, and certain -- but not all -- distribution shifts could result in significant performance degradation. In practice, it may make sense to ignore benign shifts, under which the performance of a deployed model does not degrade substantially, making interventions by a human expert (or model retraining) unnecessary. While several works have developed tests for distribution shifts, these typically either use non-sequential methods, or detect arbitrary shifts (benign or harmful), or both. We argue that a sensible method for firing off a warning has to both (a) detect harmful shifts while ignoring benign ones, and (b) allow continuous monitoring of model performance without increasing the false alarm rate. In this work, we design simple sequential tools for testing if the difference between source (training) and target (test) distributions leads to a significant drop in a risk function of interest, like accuracy or calibration. Recent advances in constructing time-uniform confidence sequences allow efficient aggregation of statistical evidence accumulated during the tracking process. The designed framework is applicable in settings where (some) true labels are revealed after the prediction is performed, or when batches of labels become available in a delayed fashion. We demonstrate the efficacy of the proposed framework through an extensive empirical study on a collection of simulated and real datasets.
The Terminating-Knockoff Filter: Fast High-Dimensional Variable Selection with False Discovery Rate Control
Machkour, Jasin, Muma, Michael, Palomar, Daniel P.
We propose the Terminating-Knockoff (T-Knock) filter, a fast variable selection method for high-dimensional data. The T-Knock filter controls a user-defined target false discovery rate (FDR) while maximizing the number of selected true positives. This is achieved by fusing the solutions of multiple early terminated random experiments. The experiments are conducted on a combination of the original data and multiple sets of randomly generated knockoff variables. A finite sample proof based on martingale theory for the FDR control property is provided. Numerical simulations show that the FDR is controlled at the target level while allowing for a high power. We prove under mild conditions that the knockoffs can be sampled from any univariate distribution. The computational complexity of the proposed method is derived and it is demonstrated via numerical simulations that the sequential computation time is multiple orders of magnitude lower than that of the strongest benchmark methods in sparse high-dimensional settings. The T-Knock filter outperforms state-of-the-art methods for FDR control on a simulated genome-wide association study (GWAS), while its computation time is more than two orders of magnitude lower than that of the strongest benchmark methods.