Performance Analysis
Addressing Missing Sources with Adversarial Support-Matching
Kehrenberg, Thomas, Bartlett, Myles, Sharmanska, Viktoriia, Quadrianto, Novi
When trained on diverse labeled data, machine learning models have proven themselves to be a powerful tool in all facets of society. However, due to budget limitations, deliberate or non-deliberate censorship, and other problems during data collection and curation, the labeled training set might exhibit a systematic shortage of data for certain groups. We investigate a scenario in which the absence of certain data is linked to the second level of a two-level hierarchy in the data. Inspired by the idea of protected groups from algorithmic fairness, we refer to the partitions carved by this second level as "subgroups"; we refer to combinations of subgroups and classes, or leaves of the hierarchy, as "sources". To characterize the problem, we introduce the concept of classes with incomplete subgroup support. The representational bias in the training set can give rise to spurious correlations between the classes and the subgroups which render standard classification models ungeneralizable to unseen sources. To overcome this bias, we make use of an additional, diverse but unlabeled dataset, called the "deployment set", to learn a representation that is invariant to subgroup. This is done by adversarially matching the support of the training and deployment sets in representation space. In order to learn the desired invariance, it is paramount that the sets of samples observed by the discriminator are balanced by class; this is easily achieved for the training set, but requires using semi-supervised clustering for the deployment set. We demonstrate the effectiveness of our method with experiments on several datasets and variants of the problem.
All About Logistic Regression
Originally published on Towards AI the World's Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses. Logistic Regression is a Supervised Machine Learning algorithm that is used in classification problems where we have to distinguish the dependent variable between two or more categories or classes by using the independent variables.
Sequential algorithmic modification with test data reuse
Feng, Jean, Pennello, Gene, Petrick, Nicholas, Sahiner, Berkman, Pirracchio, Romain, Gossmann, Alexej
After initial release of a machine learning algorithm, the model can be fine-tuned by retraining on subsequently gathered data, adding newly discovered features, or more. Each modification introduces a risk of deteriorating performance and must be validated on a test dataset. It may not always be practical to assemble a new dataset for testing each modification, especially when most modifications are minor or are implemented in rapid succession. Recent works have shown how one can repeatedly test modifications on the same dataset and protect against overfitting by (i) discretizing test results along a grid and (ii) applying a Bonferroni correction to adjust for the total number of modifications considered by an adaptive developer. However, the standard Bonferroni correction is overly conservative when most modifications are beneficial and/or highly correlated. This work investigates more powerful approaches using alpha-recycling and sequentially-rejective graphical procedures (SRGPs). We introduce novel extensions that account for correlation between adaptively chosen algorithmic modifications. In empirical analyses, the SRGPs control the error rate of approving unacceptable modifications and approve a substantially higher number of beneficial modifications than previous approaches.
Benchmarking emergency department triage prediction models with machine learning and large public electronic health records
Xie, Feng, Zhou, Jun, Lee, Jin Wee, Tan, Mingrui, Li, Siqi, Rajnthern, Logasan S/O, Chee, Marcel Lucas, Chakraborty, Bibhas, Wong, An-Kwok Ian, Dagan, Alon, Ong, Marcus Eng Hock, Gao, Fei, Liu, Nan
The demand for emergency department (ED) services is increasing across the globe, particularly during the current COVID-19 pandemic. Clinical triage and risk assessment have become increasingly challenging due to the shortage of medical resources and the strain on hospital infrastructure caused by the pandemic. As a result of the widespread use of electronic health records (EHRs), we now have access to a vast amount of clinical data, which allows us to develop predictive models and decision support systems to address these challenges. To date, however, there are no widely accepted benchmark ED triage prediction models based on large-scale public EHR data. An open-source benchmarking platform would streamline research workflows by eliminating cumbersome data preprocessing, and facilitate comparisons among different studies and methodologies. In this paper, based on the Medical Information Mart for Intensive Care IV Emergency Department (MIMIC-IV-ED) database, we developed a publicly available benchmark suite for ED triage predictive models and created a benchmark dataset that contains over 400,000 ED visits from 2011 to 2019. We introduced three ED-based outcomes (hospitalization, critical outcomes, and 72-hour ED reattendance) and implemented a variety of popular methodologies, ranging from machine learning methods to clinical scoring systems. We evaluated and compared the performance of these methods against benchmark tasks. Our codes are open-source, allowing anyone with MIMIC-IV-ED data access to perform the same steps in data processing, benchmark model building, and experiments. This study provides future researchers with insights, suggestions, and protocols for managing raw data and developing risk triaging tools for emergency care.
How I'm using Machine Learning to Trade in the Stock Market
Disclaimer: This article is about a simple strategy that I have used to create a trading bot. While back-testing shows that the trading bot is profitable, the trading bot is not capable of handling "black swan" events such as market crashes. Also I am not a financial advisor nor a professional trader. I am simply sharing this for entertainment purposes. So trade & read at your own risk.
Machine Learning for Encrypted Malicious Traffic Detection: Approaches, Datasets and Comparative Study
Wang, Zihao, Fok, Kar-Wai, Thing, Vrizlynn L. L.
As people's demand for personal privacy and data security becomes a priority, encrypted traffic has become mainstream in the cyber world. However, traffic encryption is also shielding malicious and illegal traffic introduced by adversaries, from being detected. This is especially so in the post-COVID-19 environment where malicious traffic encryption is growing rapidly. Common security solutions that rely on plain payload content analysis such as deep packet inspection are rendered useless. Thus, machine learning based approaches have become an important direction for encrypted malicious traffic detection. In this paper, we formulate a universal framework of machine learning based encrypted malicious traffic detection techniques and provided a systematic review. Furthermore, current research adopts different datasets to train their models due to the lack of well-recognized datasets and feature sets. As a result, their model performance cannot be compared and analyzed reliably. Therefore, in this paper, we analyse, process and combine datasets from 5 different sources to generate a comprehensive and fair dataset to aid future research in this field. On this basis, we also implement and compare 10 encrypted malicious traffic detection algorithms. We then discuss challenges and propose future directions of research.
GAM(L)A: An econometric model for interpretable Machine Learning
Flachaire, Emmanuel, Hacheme, Gilles, Huรฉ, Sullivan, Laurent, Sรฉbastien
Despite their high predictive performance, random forest and gradient boosting are often considered as black boxes or uninterpretable models which has raised concerns from practitioners and regulators. As an alternative, we propose in this paper to use partial linear models that are inherently interpretable. Specifically, this article introduces GAM-lasso (GAMLA) and GAM-autometrics (GAMA), denoted as GAM(L)A in short. GAM(L)A combines parametric and non-parametric functions to accurately capture linearities and non-linearities prevailing between dependent and explanatory variables, and a variable selection procedure to control for overfitting issues. Estimation relies on a two-step procedure building upon the double residual method. We illustrate the predictive performance and interpretability of GAM(L)A on a regression and a classification problem. The results show that GAM(L)A outperforms parametric models augmented by quadratic, cubic and interaction effects. Moreover, the results also suggest that the performance of GAM(L)A is not significantly different from that of random forest and gradient boosting.
Learning Distributionally Robust Models at Scale via Composite Optimization
Haddadpour, Farzin, Kamani, Mohammad Mahdi, Mahdavi, Mehrdad, Karbasi, Amin
To train machine learning models that are robust to distribution shifts in the data, distributionally robust optimization (DRO) has been proven very effective. However, the existing approaches to learning a distributionally robust model either require solving complex optimization problems such as semidefinite programming or a first-order method whose convergence scales linearly with the number of data samples -- which hinders their scalability to large datasets. In this paper, we show how different variants of DRO are simply instances of a finite-sum composite optimization for which we provide scalable methods. We also provide empirical results that demonstrate the effectiveness of our proposed algorithm with respect to the prior art in order to learn robust models from very large datasets.
'False Positive': These Exoplanets Are Actually Stars
Thousands of exoplanets have been discovered, but researchers now learned that several of them aren't actually planets. Using updated measurement methods, they found that the objects are actually stars. Exoplanets are the planets outside our solar system, whether free-floating or orbiting a star. Thousands of them have been discovered since they were first spotted in the 1990s. So far, almost 5,000 exoplanets have been confirmed, while 5,000 more are planetary candidates, or the ones that may be planets but haven't been confirmed, the Massachusetts Institute of Technology (MIT) noted.
A Continual Learning Framework for Adaptive Defect Classification and Inspection
Sun, Wenbo, Kontar, Raed Al, Jin, Judy, Chang, Tzyy-Shuh
Recent development of advanced sensing and high computing technologies has enabled the wide adoption of machine vision to automatically inspect products' dimensional quality for efficient process control and reducing the manual inspection cost. The process control procedure requires effective data analysis methods to provide reliable inspection results. In this paper, we consider a high-volume manufacturing system that uses machine vision at the quality inspection station for automatic classification of product defects. Here classification implies both; identifying a defect and classifying its corresponding type. As a motivating example, we consider the scenario where batches of three-dimensional (3D) point cloud data are independently collected from a manufacturing process. The 3D point cloud data is obtained by measuring the 3D location of points on the product surface using a 3D scanner. The location measurements can then be used for fast classification of surface defects, and thus provide timely feedback for process control. Figure 1 (right) shows some exemplar surface defects on a wood product and the corresponding 3D point cloud measurements. The 3D point cloud measurements have a set of defining characteristics that should be considered in the development of defect classification techniques.