Performance Analysis
Boosting and Bagging: How To Develop A Robust Machine Learning Algorithm
Machine learning and data science require more than just throwing data into a Python library and using whatever comes out. Data scientists need to understand the data and the processes behind it to implement a successful system. One key to implementation is knowing when a model might benefit from bootstrapping and related resampling methods, which underlie many ensemble models. Some examples of ensemble models are AdaBoost and Stochastic Gradient Boosting.
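To make the bootstrapping idea concrete, here is a minimal sketch of bagging (bootstrap aggregating) in plain Python. It is an illustration under stated assumptions, not code from the article: the decision-stump base learner, the toy data, and the names fit_stump, bagging_fit and bagging_predict are all hypothetical. Boosting methods such as AdaBoost differ in that they train models sequentially and reweight hard examples rather than voting over independent bootstrap samples.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-feature threshold classifier (decision stump) to (x, label) pairs."""
    best = None
    for threshold, _ in points:
        # Try both orientations: (label left of threshold, label right of it).
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if x <= threshold else right) != label for x, label in points
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def bagging_fit(points, n_models=25, seed=0):
    """Train each stump on a bootstrap sample, i.e. one drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models


def bagging_predict(models, x):
    """Combine the individual stumps by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: the label is 1 when x > 5, plus one noisy point.
data = [(x, int(x > 5)) for x in range(11)]
data[2] = (2, 1)  # deliberate label noise
ensemble = bagging_fit(data)
print(bagging_predict(ensemble, 1), bagging_predict(ensemble, 9))
```

Because each stump sees a different resample, the noisy point influences only some of them, and the vote tends to smooth its effect out.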
Predicting Sequences of Traversed Nodes in Graphs using Network Models with Multiple Higher Orders
Gote, Christoph, Casiraghi, Giona, Schweitzer, Frank, Scholtes, Ingo
We propose a novel sequence prediction method for sequential data capturing node traversals in graphs. Our method builds on a statistical modelling framework that combines multiple higher-order network models into a single multi-order model. We develop a technique to fit such multi-order models in empirical sequential data and to select the optimal maximum order. Our framework facilitates both next-element and full sequence prediction given a sequence-prefix of any length. We evaluate our model based on six empirical data sets containing sequences from website navigation as well as public transport systems. The results show that our method out-performs state-of-the-art algorithms for next-element prediction. We further demonstrate the accuracy of our method during out-of-sample sequence prediction and validate that our method can scale to data sets with millions of sequences.
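As a rough illustration of why higher-order context matters for next-element prediction, the sketch below counts transitions for every context length up to a maximum order and predicts from the longest context it has seen. This is a simplified back-off heuristic, not the multi-order model or the order-selection procedure of the paper; the function names and the toy paths are hypothetical.

```python
from collections import Counter, defaultdict


def fit_counts(sequences, max_order):
    """Count next-node frequencies for every context of length 0..max_order."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, node in enumerate(seq):
            for k in range(min(i, max_order) + 1):
                context = tuple(seq[i - k:i])
                counts[context][node] += 1
    return counts


def predict_next(counts, prefix, max_order):
    """Predict from the longest previously seen context that ends the prefix."""
    for k in range(min(len(prefix), max_order), -1, -1):
        context = tuple(prefix[len(prefix) - k:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None


# Hypothetical paths: what follows "c" depends on where the path started.
paths = [["a", "c", "d"]] * 2 + [["b", "c", "e"]] * 3
counts = fit_counts(paths, max_order=2)

first_order = predict_next(counts, ["a", "c"], max_order=1)
second_order = predict_next(counts, ["a", "c"], max_order=2)
print(first_order, second_order)
```

With only first-order context the predictor follows the overall majority after "c", while the second-order context recovers the path-dependent continuation.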
Exploiting Uncertainties from Ensemble Learners to Improve Decision-Making in Healthcare AI
Tan, Yingshui, Jin, Baihong, Yue, Xiangyu, Chen, Yuxin, Vincentelli, Alberto Sangiovanni
Ensemble learning is widely applied in Machine Learning (ML) to improve model performance and to mitigate decision risks. In this approach, predictions from a diverse set of learners are combined to obtain a joint decision. Recently, various methods have been explored in the literature for estimating decision uncertainties using ensemble learning; however, determining which metrics are a better fit for certain decision-making applications remains a challenging task. In this paper, we study the following key research question in the selection of uncertainty metrics: when does an uncertainty metric outperform another? We answer this question via a rigorous analysis of two commonly used uncertainty metrics in ensemble learning, namely ensemble mean and ensemble variance. We show that, under mild assumptions on the ensemble learners, ensemble mean is preferable to ensemble variance as an uncertainty metric for decision making.
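The two metrics named above can be computed directly from the learners' outputs. The sketch below assumes each learner emits a probability for the positive class of a binary task; the example values and function names are hypothetical, and how the paper turns each metric into a decision rule is not reproduced here.

```python
from statistics import fmean, pvariance


def ensemble_mean(probabilities):
    """Average of the learners' predicted probabilities for one input."""
    return fmean(probabilities)


def ensemble_variance(probabilities):
    """Population variance of the learners' predicted probabilities."""
    return pvariance(probabilities)


# Hypothetical outputs of four learners for two different inputs.
agreeing = [0.90, 0.92, 0.88, 0.91]
disagreeing = [0.10, 0.90, 0.20, 0.80]

for name, probs in (("agreeing", agreeing), ("disagreeing", disagreeing)):
    print(name, round(ensemble_mean(probs), 4), round(ensemble_variance(probs), 4))
```

In this toy case the disagreeing learners produce a mean close to 0.5 and a much larger variance, so both metrics flag the second input as the less certain one.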
The impact of machine learning and AI on the UK economy
A recent virtual event addressed another such issue: the potential impact machines, imbued with artificial intelligence, may have on the economy and the financial system. The event was organised by the Bank of England, in collaboration with CEPR and the Brevan Howard Centre for Financial Analysis at Imperial College. What follows is a summary of some of the recorded presentations. The full catalogue of videos is available on the Bank of England's website. In his presentation, Stuart Russell (University of California, Berkeley), author of the leading textbook on artificial intelligence (AI), gives a broad historical overview of the field since its emergence in the 1950s, followed by insight into more recent developments.
On Improving Hotspot Detection Through Synthetic Pattern-Based Database Enhancement
Reddy, Gaurav Rajavendra, Xanthopoulos, Constantinos, Makris, Yiorgos
Continuous technology scaling and the introduction of advanced technology nodes in Integrated Circuit (IC) fabrication are constantly exposing new manufacturability issues. One such issue, stemming from complex interaction between design and process, is the problem of design hotspots. Such hotspots are known to vary from design to design and, ideally, should be predicted early and corrected in the design stage itself, as opposed to relying on the foundry to develop process fixes for every hotspot, which would be intractable. In the past, various efforts have been made to address this issue by using a known database of hotspots as the source of information. The majority of these efforts use either Machine Learning (ML) or Pattern Matching (PM) techniques to identify and predict hotspots in new incoming designs. However, almost all of them suffer from high false-alarm rates, mainly because they are oblivious to the root causes of hotspots. In this work, we seek to address this limitation by using a novel database enhancement approach through synthetic pattern generation based on carefully crafted Design of Experiments (DOEs). The effectiveness of the proposed method against the state-of-the-art is evaluated on a 45nm process using industry-standard tools and designs.
Deep Contextual Clinical Prediction with Reverse Distillation
Kodialam, Rohan S., Boiarsky, Rebecca, Sontag, David
Healthcare providers are increasingly using learned methods to predict and understand long-term patient outcomes in order to make meaningful interventions. However, despite innovations in this area, deep learning models often struggle to match the performance of shallow linear models in predicting these outcomes, making it difficult to leverage such techniques in practice. In this work, motivated by the task of clinical prediction from insurance claims, we present a new technique called reverse distillation, which pretrains deep models by using high-performing linear models for initialization. We make use of the longitudinal structure of insurance claims datasets to develop Self Attention with Reverse Distillation, or SARD, an architecture that utilizes a combination of contextual embedding, temporal embedding and self-attention mechanisms and, most critically, is trained via reverse distillation. SARD outperforms state-of-the-art methods on multiple clinical prediction outcomes, with ablation studies revealing that reverse distillation is a primary driver of these improvements.
Contrastive Training for Improved Out-of-Distribution Detection
Winkens, Jim, Bunel, Rudy, Roy, Abhijit Guha, Stanforth, Robert, Natarajan, Vivek, Ledsam, Joseph R., MacWilliams, Patricia, Kohli, Pushmeet, Karthikesalingam, Alan, Kohl, Simon, Cemgil, Taylan, Eslami, S. M. Ali, Ronneberger, Olaf
Reliable detection of out-of-distribution (OOD) inputs is increasingly understood to be a precondition for deployment of machine learning systems. This paper proposes and investigates the use of contrastive training to boost OOD detection performance. Unlike leading methods for OOD detection, our approach does not require access to examples labeled explicitly as OOD, which can be difficult to collect in practice. We show in extensive experiments that contrastive training significantly helps OOD detection performance on a number of common benchmarks. By introducing and employing the Confusion Log Probability (CLP) score, which quantifies the difficulty of the OOD detection task by capturing the similarity of inlier and outlier datasets, we show that our method especially improves performance in the 'near OOD' classes -- a particularly challenging setting for previous methods.
Predicting Illegal Fishing on the Patagonia Shelf from Oceanographic Seascapes
Woodill, A. John, Kavanaugh, Maria, Harte, Michael, Watson, James R.
Many of the world's most important fisheries are experiencing increases in illegal fishing, undermining efforts to sustainably conserve and manage fish stocks. A major challenge to ending illegal, unreported, and unregulated (IUU) fishing is improving our ability to identify whether a vessel is fishing illegally and where illegal fishing is likely to occur in the ocean. However, monitoring the oceans is costly and time-consuming, and patrolling them is logistically challenging for maritime authorities. To address this problem, we use vessel tracking data and machine learning to predict illegal fishing on the Patagonian Shelf, one of the world's most productive regions for fisheries. Specifically, we focus on Chinese fishing vessels, which have consistently fished illegally in this region. We combine vessel location data with oceanographic seascapes -- classes of oceanic areas based on oceanographic variables -- as well as other remotely sensed oceanographic variables to train a series of machine learning models of varying levels of complexity. These models are able to predict whether a Chinese vessel is operating illegally with 69-96% confidence, depending on the year and predictor variables used. These results offer a promising step towards preempting illegal activities, rather than reacting to them forensically.
Reactive Soft Prototype Computing for Concept Drift Streams
Raab, Christoph, Heusinger, Moritz, Schleif, Frank-Michael
The amount of real-time communication between agents in an information system has increased rapidly since the beginning of the decade. This is because the use of these systems, e.g. social media, has become commonplace in today's society. This requires analytical algorithms to learn and predict this stream of information in real-time. The nature of these systems is non-static and can be explained, among other things, by the fast pace of trends. This creates an environment in which algorithms must recognize changes and adapt. Recent work shows that this is a vital research field, but existing approaches mostly lack stable performance during model adaptation. In this work, a concept drift detection strategy followed by a prototype-based adaptation strategy is proposed. Validated through experimental results on a variety of typical non-static data, our solution provides stable and quick adjustments in times of change.
Solving Constrained CASH Problems with ADMM
Ram, Parikshit, Liu, Sijia, Vijaykeerthi, Deepak, Wang, Dakuo, Bouneffouf, Djallel, Bramble, Greg, Samulowitz, Horst, Gray, Alexander G.
The CASH (Combined Algorithm Selection and Hyperparameter optimization) problem has been widely studied in the context of automated configuration of machine learning (ML) pipelines, and various solvers and toolkits are available. However, CASH solvers do not directly handle black-box constraints such as fairness, robustness or other domain-specific custom constraints. We present our recent approach [Liu, et al., 2020] that leverages the ADMM (alternating direction method of multipliers) optimization framework to decompose CASH into multiple small problems, and demonstrate how ADMM facilitates incorporation of black-box constraints.