Performance Analysis
Model evaluation, model selection, and algorithm selection in machine learning
A single-PDF version of Model Evaluation parts 1-4 is available on arXiv: https://arxiv.org/abs/1811.12808
Almost every machine learning algorithm comes with a large number of settings that we, the machine learning researchers and practitioners, need to specify. These tuning knobs, the so-called hyperparameters, help us control the behavior of machine learning algorithms when optimizing for performance, finding the right balance between bias and variance. Hyperparameter tuning for performance optimization is an art in itself, and there are no hard-and-fast rules that guarantee best performance on a given dataset. In Part I and Part II, we saw different holdout and bootstrap techniques for estimating the generalization performance of a model. We learned about the bias-variance trade-off, and we computed the uncertainty of our estimates. In this third part, we will focus on different methods of cross-validation for model evaluation and model selection. We will use these cross-validation techniques to rank models from several hyperparameter configurations and estimate how well they generalize to independent datasets.
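As a rough illustration of ranking hyperparameter configurations by cross-validation, the following sketch uses only the Python standard library. The toy 1-D dataset, the small k-nearest-neighbor classifier, and the function names (`k_fold_indices`, `knn_predict`, `cv_accuracy`) are assumptions made for this example and do not come from the article.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle indices once and yield (train, test) index lists per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest 1-D training points (k odd)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0

def cv_accuracy(xs, ys, k, n_folds=5):
    """Mean held-out accuracy of k-NN across the folds."""
    scores = []
    for train, test in k_fold_indices(len(xs), n_folds):
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        correct = sum(knn_predict(tx, ty, xs[i], k) == ys[i] for i in test)
        scores.append(correct / len(test))
    return mean(scores)

# Toy data (made up): the label is 1 when x > 0.5, with about 10% label noise.
rng = random.Random(42)
xs = [rng.random() for _ in range(200)]
ys = [int(x > 0.5) ^ (rng.random() < 0.1) for x in xs]

# Rank three hyperparameter settings by their cross-validated accuracy.
ranking = sorted(((cv_accuracy(xs, ys, k), k) for k in (1, 5, 15)), reverse=True)
for accuracy, k in ranking:
    print(f"k={k:2d}  mean CV accuracy={accuracy:.3f}")
```

Every configuration is scored on the same folds because the splitter uses a fixed seed, which keeps the comparison between settings fair. The score of the winning configuration is optimistically biased as an estimate of generalization performance, so a separate test set or nested cross-validation is commonly used for the final estimate.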
Pedestrian Behavior Prediction for Automated Driving: Requirements, Metrics, and Relevant Features
Herman, Michael, Wagner, Jörg, Prabhakaran, Vishnu, Möser, Nicolas, Ziesche, Hanna, Ahmed, Waleed, Bürkle, Lutz, Kloppenburg, Ernst, Gläser, Claudius
Automated vehicles require a comprehensive understanding of traffic situations to ensure safe and anticipatory driving. In this context, the prediction of pedestrians is particularly challenging as pedestrian behavior can be influenced by multiple factors. In this paper, we thoroughly analyze the requirements on pedestrian behavior prediction for automated driving via a system-level approach. To this end we investigate real-world pedestrian-vehicle interactions with human drivers. Based on human driving behavior we then derive appropriate reaction patterns of an automated vehicle and determine requirements for the prediction of pedestrians. This includes a novel metric tailored to measure prediction performance from a system-level perspective. The proposed metric is evaluated on a large-scale dataset comprising thousands of real-world pedestrian-vehicle interactions. We furthermore conduct an ablation study to evaluate the importance of different contextual cues and compare these results to ones obtained using established performance metrics for pedestrian prediction. Our results highlight the importance of a system-level approach to pedestrian behavior prediction.
Noise-Augmented Privacy-Preserving Empirical Risk Minimization with Dual-purpose Regularizer and Privacy Budget Retrieval and Recycling
Empirical risk minimization (ERM) is a principle in statistical learning. Through ERM, we can empirically measure the performance of a family of learning algorithms on a set of observed training data, without knowing the true distribution of the data, and derive theoretical bounds on that performance. ERM is routinely applied in a wide range of learning problems such as regression, classification, and clustering. In recent years, with the increasing popularity of privacy-preserving machine learning that satisfies formal privacy guarantees such as differential privacy (DP) [10], the topic of privacy-preserving ERM has also been investigated. Generally speaking, given an ERM problem, differentially private empirical risk minimization (DP-ERM) can be realized by perturbing the output (estimation or prediction), by perturbing the objective function (input), or by perturbing iteratively during the algorithmic optimization. For output perturbation, randomization mechanisms need to be applied every time a new output is released; for iterative algorithmic perturbation, each iteration incurs a privacy loss, so careful planning and implementation of privacy accounting methods to minimize the overall privacy loss are critical. In this paper, we focus on differentially private perturbation of objective functions. Once an objective function is perturbed, the subsequent optimization does not incur additional privacy loss and all outputs generated from the optimization are also differentially private.
Population modeling with machine learning can enhance measures of mental health
Figure 1 – Figure supplement 1: Learning curves on the random split-half validation used for model building. To facilitate comparisons, we evaluated predictions of age, fluid intelligence and neuroticism from a complete set of socio-demographic variables without brain imaging using the coefficient of determination R² metric (y-axis) to compare results obtained from 100 to 3000 training samples (x-axis). The cross-validation (CV) distribution was obtained from 100 Monte Carlo splits. Across targets, performance started to plateau after around 1000 training samples with scores virtually identical to the final model used in subsequent analyses. These benchmarks suggest that inclusion of additional training samples would not have led to substantial improvements in performance.
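The learning-curve procedure described in this caption can be sketched as follows: for each training-set size, draw repeated random (Monte Carlo) train/test splits and record the coefficient of determination on the held-out part. The synthetic data, the one-variable least-squares model, and the function names are assumptions for illustration only; the study itself used socio-demographic predictors and its own models.

```python
import random
from statistics import mean

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

def monte_carlo_r2(xs, ys, n_train, n_splits=100, seed=0):
    """Held-out R^2 for n_splits random train/test splits of a fixed train size."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        y_test = [ys[i] for i in test]
        y_hat = [a + b * xs[i] for i in test]
        scores.append(r2_score(y_test, y_hat))
    return scores

# Synthetic data (made up): a noisy linear relationship.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(400)]
ys = [2.0 * x + rng.gauss(0, 3) for x in xs]

# Learning curve: mean held-out R^2 per training-set size.
curve = {n: mean(monte_carlo_r2(xs, ys, n)) for n in (10, 50, 200)}
for n, score in curve.items():
    print(f"n_train={n:3d}  mean R^2={score:.3f}")
```

A curve that flattens as the training size grows is the pattern the caption interprets as evidence that additional samples would add little.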
Covid-19: Warning over false negatives and rapid tests allowed for half-term break
New rules allowing travellers returning to England to take lateral flow tests instead of more expensive PCR tests will come into force on 24 October, in time for many families returning from half-term breaks. The government says NHS tests cannot be used for overseas travel but fully vaccinated passengers arriving in England from that date will be able to order tests from approved providers and upload photos of results for verification. Scotland, Wales and Northern Ireland have previously aligned with policy in England.
@Radiology_AI
To develop an algorithm to classify postcontrast T1-weighted MRI scans by tumor classes (high-grade glioma, low-grade glioma [LGG], brain metastasis, meningioma, pituitary adenoma, and acoustic neuroma) and a healthy tissue (HLTH) class. In this retrospective study, preoperative postcontrast T1-weighted MR scans from four publicly available datasets--the Brain Tumor Image Segmentation dataset (n = 378), the LGG-1p19q dataset (n = 145), The Cancer Genome Atlas Glioblastoma Multiforme dataset (n = 141), and The Cancer Genome Atlas Low Grade Glioma dataset (n = 68)--and an internal clinical dataset (n = 1373) were used. In all, a total of 2105 images were split into a training dataset (n = 1396), an internal test set (n = 361), and an external test dataset (n = 348). A convolutional neural network was trained to classify the tumor type and to discriminate between images depicting HLTH and images depicting tumors. The performance of the model was evaluated by using cross-validation, internal testing, and external testing.
Adversarial Attacks on ML Defense Models Competition
Dong, Yinpeng, Fu, Qi-An, Yang, Xiao, Xiang, Wenzhao, Pang, Tianyu, Su, Hang, Zhu, Jun, Tang, Jiayu, Chen, Yuefeng, Mao, XiaoFeng, He, Yuan, Xue, Hui, Li, Chao, Liu, Ye, Zhang, Qilong, Gao, Lianli, Yu, Yunrui, Gao, Xitong, Zhao, Zhe, Lin, Daquan, Lin, Jiadong, Song, Chuanbiao, Wang, Zihao, Wu, Zhennan, Guo, Yang, Cui, Jiequan, Xu, Xiaogang, Chen, Pengguang
Due to the vulnerability of deep neural networks (DNNs) to adversarial examples, a large number of defense techniques have been proposed to alleviate this problem in recent years. However, progress in building more robust models is usually hampered by incomplete or incorrect robustness evaluation. To accelerate research on reliable evaluation of the adversarial robustness of current defense models in image classification, the TSAIL group at Tsinghua University and the Alibaba Security group organized this competition along with a CVPR 2021 workshop on adversarial machine learning (https://aisecure-workshop.github.io/amlcvpr2021/). The purpose of this competition is to motivate novel attack algorithms to evaluate adversarial robustness more effectively and reliably. The participants were encouraged to develop stronger white-box attack algorithms to find the worst-case robustness of different defenses. This competition was conducted on an adversarial robustness evaluation platform, ARES (https://github.com/thu-ml/ares), and was held on the TianChi platform (https://tianchi.aliyun.com/competition/entrance/531847/introduction) as part of the AI Security Challengers Program series. After the competition, we summarized the results and established a new adversarial robustness benchmark at https://ml.cs.tsinghua.edu.cn/ares-bench/, which allows users to upload adversarial attack algorithms and defense models for evaluation.
Online Control of the False Discovery Rate under "Decision Deadlines"
Scientific discoveries form an ongoing, ever-evolving process. Each new experiment offers an opportunity to suggest new hypotheses based on results that have come before. Traditionally, the hypotheses researchers plan to test in an experiment are prespecified before any data from the experiment is visible, as this facilitates control of either the false discovery rate (FDR; Benjamini and Hochberg, 1995) or the probability of producing any false positives (the familywise error rate, or FWER; see, for example, Efron and Hastie, 2016) within that experiment. In contrast to fully prespecified procedures, online procedures test hypotheses sequentially, and allow the results of preliminary tests to inform choices about which hypotheses to focus on in future tests (Foster and Stine, 2008). These procedures typically require that error rates be controlled at every stage of the sequence (e.g., Javanmard and Montanari, 2015; Ramdas et al., 2017).
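For reference, the classical prespecified (offline) setting mentioned above is usually handled with the Benjamini-Hochberg step-up procedure. The sketch below implements that offline procedure only; it is not one of the online procedures the abstract goes on to discuss, and the function name and example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Offline Benjamini-Hochberg step-up procedure.

    Returns a list of booleans aligned with p_values: True where the
    corresponding hypothesis is rejected at FDR level alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    # Reject every hypothesis whose p-value is among the k_max smallest.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

# Hypothetical p-values from ten prespecified tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values))
```

The procedure needs all p-values at once to sort them, which is exactly what is unavailable when hypotheses arrive sequentially; that constraint is what motivates the online procedures.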
NLP Methods for Extraction of Symptoms from Unstructured Data for Use in Prognostic COVID-19 Analytic Models
Silverman, Greg M. | Sahoo, Himanshu S. (NLP/IE Program, Department of Electrical and Computer Engineering, University of Minnesota) | Ingraham, Nicholas E. (Division of Pulmonary, Allergy, Critical Care, and Sleep Medicine, University of Minnesota) | Lupei, Monica (Division of Critical Care, Department of Anesthesiology, University of Minnesota) | Puskarich, Michael A. (Department of Emergency Medicine, University of Minnesota) | Usher, Michael (Department of Medicine, University of Minnesota) | Dries, James (University of Minnesota) | Finzel, Raymond L. (NLP/IE Program, College of Pharmacy, University of Minnesota) | Murray, Eric (Information Technology, M Health Fairview) | Sartori, John (Department of Electrical and Computer Engineering, University of Minnesota) | Simon, Gyorgy (Institute for Health Informatics, University of Minnesota) | Zhang, Rui | Melton, Genevieve B. (NLP/IE Program, Department of Surgery, and Institute for Health Informatics, University of Minnesota, Fairview Health Services, Information Technology) | Tignanelli, Christopher J. (NLP/IE Program, Department of Surgery, University of Minnesota) | Pakhomov, Serguei VS (NLP/IE Program, College of Pharmacy, University of Minnesota)
Statistical modeling of outcomes based on a patient's presenting symptoms (symptomatology) can help deliver high quality care and allocate essential resources, which is especially important during the COVID-19 pandemic. Patient symptoms are typically found in unstructured notes, and thus not readily available for clinical decision making. In an attempt to fill this gap, this study compared two methods for symptom extraction from Emergency Department (ED) admission notes. Both methods utilized a lexicon derived by expanding the Centers for Disease Control and Prevention's (CDC) Symptoms of Coronavirus list. The first method utilized a word2vec model to expand the lexicon using a dictionary mapping to the Unified Medical Language System (UMLS). The second method utilized the expanded lexicon as a rule-based gazetteer and the UMLS. These methods were evaluated against a manually annotated reference (F1-score of 0.87 for UMLS-based ensemble; and 0.85 for rule-based gazetteer with UMLS). Through analyses of associations of extracted symptoms used as features against various outcomes, salient risks among the population of COVID-19 patients, including increased risk of in-hospital mortality (OR 1.85, p-value < 0.001), were identified for patients presenting with dyspnea. Disparities between English and non-English speaking patients were also identified, the most salient being a concerning finding of opposing risk signals between fatigue and in-hospital mortality (non-English: OR 1.95, p-value = 0.02; English: OR 0.63, p-value = 0.01). While use of symptomatology for modeling of outcomes is not unique, unlike previous studies this study showed that models built using symptoms with the outcome of in-hospital mortality were not significantly different from models using data collected during an in-patient encounter (AUC of 0.9 with 95% CI of [0.88, 0.91] using only vital signs; AUC of 0.87 with 95% CI of [0.85, 0.88] using only symptoms).
These findings indicate that prognostic models based on symptomatology could aid in extending COVID-19 patient care through telemedicine, replacing the need for in-person options. The methods presented in this study have potential for use in development of symptomatology-based models for other diseases, including for the study of Post-Acute Sequelae of COVID-19 (PASC).
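To make the reported F1-scores concrete, the sketch below shows one common way to score an extraction method against a manually annotated reference: treat each (note, symptom) pair as an item and compute precision, recall, and their harmonic mean. The note identifiers, symptom pairs, and function name are hypothetical, and the study's exact matching criteria may differ.

```python
def precision_recall_f1(predicted, reference):
    """Precision, recall and F1 for two sets of (note_id, symptom) pairs."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical manual annotations and system output for two ED notes.
reference = {
    ("note1", "dyspnea"), ("note1", "fatigue"),
    ("note2", "cough"), ("note2", "fever"),
}
predicted = {
    ("note1", "dyspnea"),
    ("note2", "cough"), ("note2", "fever"), ("note2", "fatigue"),
}

precision, recall, f1 = precision_recall_f1(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Here three of the four predicted pairs are correct and three of the four annotated pairs are found, so precision, recall, and F1 are all 0.75.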
Continuous Authentication Using Mouse Movements, Machine Learning, and Minecraft
Siddiqui, Nyle, Dave, Rushit, Seliya, Naeem
Mouse dynamics has grown in popularity as a novel irreproducible behavioral biometric. Datasets that contain general, unrestricted mouse movements from users are scarce in the current literature. The Balabit mouse dynamics dataset, produced in 2016 for a data science competition, is considered to be the first publicly available mouse dynamics dataset despite some of its shortcomings. Collecting mouse movements in a dull administrative manner, as Balabit does, may unintentionally homogenize data and is not representative of real-world application scenarios. This paper presents a novel mouse dynamics dataset collected while 10 users played the video game Minecraft on a desktop computer. Binary Random Forest (RF) classifiers are created for each user to detect differences between a specific user's movements and an imposter's movements. Two evaluation scenarios are proposed to evaluate the performance of these classifiers; one scenario outperformed previous works in all evaluation metrics, reaching average accuracy rates of 92%, while the other scenario successfully reported reduced instances of false authentications of imposters.