Accuracy
Sequential algorithmic modification with test data reuse
Feng, Jean, Pennello, Gene, Petrick, Nicholas, Sahiner, Berkman, Pirracchio, Romain, Gossmann, Alexej
After initial release of a machine learning algorithm, the model can be fine-tuned by retraining on subsequently gathered data, adding newly discovered features, or more. Each modification introduces a risk of deteriorating performance and must be validated on a test dataset. It may not always be practical to assemble a new dataset for testing each modification, especially when most modifications are minor or are implemented in rapid succession. Recent works have shown how one can repeatedly test modifications on the same dataset and protect against overfitting by (i) discretizing test results along a grid and (ii) applying a Bonferroni correction to adjust for the total number of modifications considered by an adaptive developer. However, the standard Bonferroni correction is overly conservative when most modifications are beneficial and/or highly correlated. This work investigates more powerful approaches using alpha-recycling and sequentially-rejective graphical procedures (SRGPs). We introduce novel extensions that account for correlation between adaptively chosen algorithmic modifications. In empirical analyses, the SRGPs control the error rate of approving unacceptable modifications and approve a substantially higher number of beneficial modifications than previous approaches.
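To make the contrast in the abstract above concrete, here is a minimal sketch that compares a standard Bonferroni correction with a fixed-sequence test, one of the simplest alpha-recycling procedures (the full significance level passes to the next hypothesis after each approval). This is not the authors' SRGP implementation: it ignores the discretization of test results, adaptivity, and correlation, and the function names and p-values are hypothetical.

```python
def bonferroni_approve(p_values, alpha=0.05):
    """Approve modification i when p_i <= alpha / m (standard Bonferroni)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def fixed_sequence_approve(p_values, alpha=0.05):
    """Test modifications in order at the full alpha level.

    Each approval recycles the whole alpha to the next modification;
    testing stops at the first modification that fails.
    """
    approved = []
    still_testing = True
    for p in p_values:
        ok = still_testing and p <= alpha
        approved.append(ok)
        if not ok:
            still_testing = False
    return approved


# Hypothetical p-values for four successive modifications, each from a
# one-sided test that the modification is acceptable.
p_values = [0.01, 0.02, 0.03, 0.20]
bonf = bonferroni_approve(p_values)
seq = fixed_sequence_approve(p_values)
print(bonf)
print(seq)
```

With mostly beneficial modifications, the fixed-sequence rule approves more of them at the same overall error level, which is the intuition behind preferring alpha-recycling to Bonferroni in this setting.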
Benchmarking emergency department triage prediction models with machine learning and large public electronic health records
Xie, Feng, Zhou, Jun, Lee, Jin Wee, Tan, Mingrui, Li, Siqi, Rajnthern, Logasan S/O, Chee, Marcel Lucas, Chakraborty, Bibhas, Wong, An-Kwok Ian, Dagan, Alon, Ong, Marcus Eng Hock, Gao, Fei, Liu, Nan
The demand for emergency department (ED) services is increasing across the globe, particularly during the current COVID-19 pandemic. Clinical triage and risk assessment have become increasingly challenging due to the shortage of medical resources and the strain on hospital infrastructure caused by the pandemic. As a result of the widespread use of electronic health records (EHRs), we now have access to a vast amount of clinical data, which allows us to develop predictive models and decision support systems to address these challenges. To date, however, there are no widely accepted benchmark ED triage prediction models based on large-scale public EHR data. An open-source benchmarking platform would streamline research workflows by eliminating cumbersome data preprocessing, and facilitate comparisons among different studies and methodologies. In this paper, based on the Medical Information Mart for Intensive Care IV Emergency Department (MIMIC-IV-ED) database, we developed a publicly available benchmark suite for ED triage predictive models and created a benchmark dataset that contains over 400,000 ED visits from 2011 to 2019. We introduced three ED-based outcomes (hospitalization, critical outcomes, and 72-hour ED reattendance) and implemented a variety of popular methodologies, ranging from machine learning methods to clinical scoring systems. We evaluated and compared the performance of these methods on the benchmark tasks. Our code is open-source, allowing anyone with MIMIC-IV-ED data access to perform the same steps in data processing, benchmark model building, and experiments. This study provides future researchers with insights, suggestions, and protocols for managing raw data and developing risk triaging tools for emergency care.
How I'm using Machine Learning to Trade in the Stock Market
Disclaimer: This article is about a simple strategy that I have used to create a trading bot. While back-testing shows that the trading bot is profitable, it is not capable of handling "black swan" events such as market crashes. Also, I am neither a financial advisor nor a professional trader. I am simply sharing this for entertainment purposes. So trade & read at your own risk.
Machine Learning for Encrypted Malicious Traffic Detection: Approaches, Datasets and Comparative Study
Wang, Zihao, Fok, Kar-Wai, Thing, Vrizlynn L. L.
As people's demand for personal privacy and data security becomes a priority, encrypted traffic has become mainstream in the cyber world. However, traffic encryption also shields malicious and illegal traffic introduced by adversaries from being detected. This is especially so in the post-COVID-19 environment, where malicious traffic encryption is growing rapidly. Common security solutions that rely on plain payload content analysis, such as deep packet inspection, are rendered useless. Thus, machine learning based approaches have become an important direction for encrypted malicious traffic detection. In this paper, we formulate a universal framework of machine learning based encrypted malicious traffic detection techniques and provide a systematic review. Furthermore, current research adopts different datasets to train models due to the lack of well-recognized datasets and feature sets. As a result, model performance cannot be compared and analyzed reliably. Therefore, in this paper, we analyse, process and combine datasets from 5 different sources to generate a comprehensive and fair dataset to aid future research in this field. On this basis, we also implement and compare 10 encrypted malicious traffic detection algorithms. We then discuss challenges and propose future directions of research.
GAM(L)A: An econometric model for interpretable Machine Learning
Flachaire, Emmanuel, Hacheme, Gilles, Hué, Sullivan, Laurent, Sébastien
Despite their high predictive performance, random forest and gradient boosting are often considered black boxes or uninterpretable models, which has raised concerns from practitioners and regulators. As an alternative, we propose in this paper to use partial linear models that are inherently interpretable. Specifically, this article introduces GAM-lasso (GAMLA) and GAM-autometrics (GAMA), denoted as GAM(L)A in short. GAM(L)A combines parametric and non-parametric functions to accurately capture linearities and non-linearities prevailing between dependent and explanatory variables, and a variable selection procedure to control for overfitting issues. Estimation relies on a two-step procedure building upon the double residual method. We illustrate the predictive performance and interpretability of GAM(L)A on a regression and a classification problem. The results show that GAM(L)A outperforms parametric models augmented by quadratic, cubic and interaction effects. Moreover, the results also suggest that the performance of GAM(L)A is not significantly different from that of random forest and gradient boosting.
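The double residual idea mentioned in the abstract above can be illustrated with a small sketch for a partial linear model y = beta * x + g(z) + noise. Both y and x are first smoothed against z, and beta is then estimated by least squares on the two sets of residuals. This is only an illustration under stated assumptions, not the GAM(L)A estimator: the Gaussian-kernel smoother, the bandwidth, the simulated data, and all function names are hypothetical, and no variable selection step is included.

```python
import math
import random


def kernel_smooth(z, values, bandwidth):
    """Nadaraya-Watson estimate of E[values | z] at each observed z (Gaussian kernel)."""
    fitted = []
    for zi in z:
        weights = [math.exp(-0.5 * ((zi - zj) / bandwidth) ** 2) for zj in z]
        total = sum(weights)
        fitted.append(sum(w * v for w, v in zip(weights, values)) / total)
    return fitted


def double_residual_beta(y, x, z, bandwidth=0.2):
    """Estimate beta by regressing the y-residuals on the x-residuals."""
    y_fit = kernel_smooth(z, y, bandwidth)
    x_fit = kernel_smooth(z, x, bandwidth)
    ry = [a - b for a, b in zip(y, y_fit)]  # y with the z-effect removed
    rx = [a - b for a, b in zip(x, x_fit)]  # x with the z-effect removed
    return sum(a * b for a, b in zip(rx, ry)) / sum(a * a for a in rx)


def simulate(n=300, beta=2.0, seed=0):
    """Simulated data (an assumption for illustration): g(z) = z**2, x depends on z."""
    rng = random.Random(seed)
    z = [rng.uniform(-2, 2) for _ in range(n)]
    x = [math.sin(zi) + rng.gauss(0, 1) for zi in z]
    y = [beta * xi + zi ** 2 + rng.gauss(0, 0.1) for xi, zi in zip(x, z)]
    return y, x, z


y, x, z = simulate()
beta_hat = double_residual_beta(y, x, z)
print(round(beta_hat, 3))
```

Once beta is estimated, the non-parametric part g can be recovered by smoothing y - beta_hat * x against z, which keeps the linear coefficient directly interpretable.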
Learning Distributionally Robust Models at Scale via Composite Optimization
Haddadpour, Farzin, Kamani, Mohammad Mahdi, Mahdavi, Mehrdad, Karbasi, Amin
To train machine learning models that are robust to distribution shifts in the data, distributionally robust optimization (DRO) has proven very effective. However, the existing approaches to learning a distributionally robust model either require solving complex optimization problems such as semidefinite programs or rely on first-order methods whose convergence scales linearly with the number of data samples, which hinders their scalability to large datasets. In this paper, we show how different variants of DRO are simply instances of a finite-sum composite optimization for which we provide scalable methods. We also provide empirical results that demonstrate the effectiveness of our proposed algorithm with respect to the prior art in order to learn robust models from very large datasets.
'False Positive': These Exoplanets Are Actually Stars
Thousands of exoplanets have been discovered, but researchers have now learned that several of them aren't actually planets. Using updated measurement methods, they found that the objects are actually stars. Exoplanets are planets outside our solar system, whether free-floating or orbiting a star. Thousands of them have been discovered since they were first spotted in the 1990s. So far, almost 5,000 exoplanets have been confirmed, while 5,000 more are planetary candidates, meaning objects that may be planets but haven't been confirmed, the Massachusetts Institute of Technology (MIT) noted.
A Continual Learning Framework for Adaptive Defect Classification and Inspection
Sun, Wenbo, Kontar, Raed Al, Jin, Judy, Chang, Tzyy-Shuh
Recent development of advanced sensing and high-performance computing technologies has enabled the wide adoption of machine vision to automatically inspect products' dimensional quality for efficient process control and reduced manual inspection cost. The process control procedure requires effective data analysis methods to provide reliable inspection results. In this paper, we consider a high-volume manufacturing system that uses machine vision at the quality inspection station for automatic classification of product defects. Here classification implies both identifying a defect and classifying its corresponding type. As a motivating example, we consider the scenario where batches of three-dimensional (3D) point cloud data are independently collected from a manufacturing process. The 3D point cloud data is obtained by measuring the 3D location of points on the product surface using a 3D scanner. The location measurements can then be used for fast classification of surface defects, and thus provide timely feedback for process control. Figure 1 (right) shows some exemplar surface defects on a wood product and the corresponding 3D point cloud measurements. The 3D point cloud measurements have a set of defining characteristics that should be considered in the development of defect classification techniques.
High dimensional change-point detection: a complete graph approach
Sun, Yang-Wen, Papagiannouli, Katerina, Spokoiny, Vladimir
The aim of online change-point detection is the accurate, timely discovery of structural breaks. As the data dimension outgrows the number of observations, online detection becomes challenging. Existing methods typically test only for a change of mean, which omits the practically relevant change of variance. We propose a complete graph-based change-point detection algorithm to detect changes of mean and variance in low- to high-dimensional online data with a variable scanning window. Inspired by the complete graph structure, we introduce graph-spanning ratios to map high-dimensional data into metrics, and then test statistically whether a change of mean or a change of variance occurs. Theoretical study shows that our approach has the desirable pivotal property and is powerful with prescribed error probabilities. We demonstrate that this framework outperforms other methods in terms of detection power. Our approach has high detection power with small and multiple scanning windows, which allows timely detection of change-points in the online setting. Finally, we applied the method to financial data to detect change-points in S&P 500 stocks.
Machine Learning Reimagines the Building Blocks of Computing
Like tiny gears inside a watch, algorithms execute well-defined tasks within more complicated programs. They're ubiquitous, and in part because of this, they've been painstakingly optimized over time. When a programmer needs to sort a list, for example, they'll reach for a standard "sort" algorithm that's been used for decades. Now researchers are taking a fresh look at traditional algorithms, using the branch of artificial intelligence known as machine learning. Their approach, called algorithms with predictions, takes advantage of the insights machine learning tools can provide into the data that traditional algorithms handle.
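As a rough illustration of the idea described above, here is a sketch of a sorted-list search that starts from a predicted position instead of the middle of the list. It expands outward with doubling steps and finishes with a standard binary search, so a good prediction means few steps while a poor one still costs only logarithmic time. The function name is hypothetical, and the prediction is supplied by hand here; in practice it would come from a learned model.

```python
from bisect import bisect_left


def search_with_prediction(sorted_items, target, predicted_index):
    """Return the index of target in sorted_items, or -1 if it is absent.

    The search starts at predicted_index (clamped to the valid range),
    brackets the target by doubling the step size, then binary-searches
    inside the bracket.
    """
    n = len(sorted_items)
    if n == 0:
        return -1
    pos = min(max(predicted_index, 0), n - 1)
    step = 1
    if sorted_items[pos] < target:
        # Target lies to the right of the prediction: grow the bracket rightward.
        lo, hi = pos, pos + step
        while hi < n and sorted_items[hi] < target:
            lo = hi
            step *= 2
            hi = lo + step
        hi = min(hi, n - 1)
    else:
        # Target lies at or to the left of the prediction: grow the bracket leftward.
        hi, lo = pos, pos - step
        while lo >= 0 and sorted_items[lo] >= target:
            hi = lo
            step *= 2
            lo = hi - step
        lo = max(lo, 0)
    i = bisect_left(sorted_items, target, lo, hi + 1)
    return i if i < n and sorted_items[i] == target else -1


items = list(range(0, 200, 2))  # 0, 2, 4, ..., 198
near = search_with_prediction(items, 84, 40)   # prediction close to the true index
far = search_with_prediction(items, 84, 0)     # poor prediction, still correct
missing = search_with_prediction(items, 85, 50)
print(near, far, missing)
```

The number of doubling steps grows with the logarithm of the prediction error, so the quality of the prediction, rather than the list length, drives the running time.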