 Performance Analysis


Filter Methods for Feature Selection in Supervised Machine Learning Applications -- Review and Benchmark

arXiv.org Machine Learning

The amount of data for machine learning (ML) applications is constantly growing. With ongoing digitization, not only the number of observations but especially the number of measured variables (features) is increasing. Selecting the most appropriate features for predictive modeling is an important lever for the success of ML applications in business and research. Many feature selection methods (FSMs) that are independent of a particular ML algorithm, so-called filter methods, have been proposed, but little guidance exists to help researchers and quantitative modelers choose appropriate approaches for typical ML problems. This review synthesizes the substantial literature on feature selection benchmarking and evaluates the performance of 58 methods in the widely used R environment. For concrete guidance, we consider four typical dataset scenarios that are challenging for ML models (noisy, redundant, and imbalanced data, and cases with more features than observations). Drawing on the experience of earlier benchmarks, which considered far fewer FSMs, we compare the performance of the methods according to four criteria (predictive performance, number of relevant features selected, stability of the feature sets, and runtime). We found that methods relying on the random forest approach, the double input symmetrical relevance filter (DISR), and the joint impurity filter (JIM) were well-performing candidate methods for the given dataset scenarios.
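To make the idea of a filter method concrete, here is a minimal sketch in Python. It scores each feature by its absolute Pearson correlation with the target and keeps the top k, without consulting any downstream ML model. This is only an illustration of the filter principle; it is not claimed to be one of the 58 methods benchmarked in the review, and the function names and toy data are assumptions made up for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target, k):
    """Return the names of the k features with the largest |correlation| to the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data (made up): "signal" tracks the target, "noise" barely does,
# and "constant" never varies, so it carries no information.
columns = {
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "noise": [3.0, 1.0, 3.0, 1.0, 3.0, 1.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],
}
target = [0, 0, 0, 1, 1, 1]
selected = rank_features(columns, target, k=1)
```

Because the score depends only on the data, the same ranking can be reused with any learner, which is what separates filter methods from wrapper or embedded approaches.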


Exploration of Dark Chemical Genomics Space via Portal Learning: Applied to Targeting the Undruggable Genome and COVID-19 Anti-Infective Polypharmacology

arXiv.org Artificial Intelligence

Advances in biomedicine are largely fueled by exploring uncharted territories of human biology. Machine learning can both enable and accelerate discovery, but faces a fundamental hurdle when applied to unseen data with distributions that differ from previously observed ones -- a common dilemma in scientific inquiry. We have developed a new deep learning framework, called Portal Learning, to explore dark chemical and biological space. Three key, novel components of our approach include: (i) end-to-end, step-wise transfer learning, in recognition of biology's sequence-structure-function paradigm, (ii) out-of-cluster meta-learning, and (iii) stress model selection. Portal Learning provides a practical solution to the out-of-distribution (OOD) problem in statistical machine learning. Here, we have implemented Portal Learning to predict chemical-protein interactions on a genome-wide scale. Systematic studies demonstrate that Portal Learning can effectively assign ligands to unexplored gene families (unknown functions), versus existing state-of-the-art methods, thereby allowing us to target previously "undruggable" proteins and design novel polypharmacological agents for disrupting interactions between SARS-CoV-2 and human proteins. Portal Learning is general-purpose and can be further applied to other areas of scientific inquiry.


Personalized Cancer Diagnosis Using Machine Learning

#artificialintelligence

This is a case study on the personalized cancer diagnosis problem. Before diving deep into the issue, let us understand the challenges of cancer diagnosis and how machine learning can help mitigate them. Note: This problem is taken from the NIPS 2017 Competition, and the details can be found using this link. Let us go through the current process first. To identify whether a person has cancer, a specialist first creates a list of genetic variations that need to be analyzed. He/she then searches for all the relevant evidence, such as published journals etc.


Bridging the reality gap in quantum devices with physics-aware machine learning

arXiv.org Artificial Intelligence

We use transport measurements of an electrostatically-defined quantum dot device in an AlGaAs/GaAs heterostructure to inform and verify our approach. Differences between theory and experiment pervade all of science, and are one of the driving forces of human discovery. Simulations often require fewer resources than real experiments but rarely capture the full complexity of a system, limiting their practical application. Narrowing the gap between a model and the real world is key for the control of complex systems using machine learning, especially when a machine learning model is trained on a simulation before being applied to real systems [1, 2]. The reality gap is widened further when ... To infer the disorder potential we use a combination of transport measurements and predictions from a physical model. The physical model is an electrostatic simulation from which transport features can be estimated. Many simulations with different parameter settings are required to compare this physical model with transport measurements. To accommodate this need without extreme computation times, we develop a fast approximation of the model using deep learning.


Flexible Bayesian Nonlinear Model Configuration

Journal of Artificial Intelligence Research

Regression models are used in a wide range of applications providing a powerful scientific tool for researchers from different fields. Linear, or simple parametric, models are often not sufficient to describe complex relationships between input variables and a response. Such relationships can be better described through flexible approaches such as neural networks, but this results in less interpretable models and potential overfitting. Alternatively, specific parametric nonlinear functions can be used, but the specification of such functions is in general complicated. In this paper, we introduce a flexible approach for the construction and selection of highly flexible nonlinear parametric regression models. Nonlinear features are generated hierarchically, similarly to deep learning, but have additional flexibility on the possible types of features to be considered. This flexibility, combined with variable selection, allows us to find a small set of important features and thereby more interpretable models. Within the space of possible functions, a Bayesian approach, introducing priors for functions based on their complexity, is considered. A genetically modified mode jumping Markov chain Monte Carlo algorithm is adopted to perform Bayesian inference and estimate posterior probabilities for model averaging. In various applications, we illustrate how our approach is used to obtain meaningful nonlinear models. Additionally, we compare its predictive performance with several machine learning algorithms.


Machine unlearning via GAN

arXiv.org Artificial Intelligence

Machine learning models, especially deep models, may unintentionally remember information about their training data. Malicious attackers can thus pilfer some property about training data by attacking the model via membership inference attack or model inversion attack. Some regulations, such as the EU's GDPR, have enacted "The Right to Be Forgotten" to protect users' data privacy, enhancing individuals' sovereignty over their data. Therefore, removing training data information from a trained model has become a critical issue. In this paper, we present a GAN-based algorithm to delete data in deep models, which significantly improves deleting speed compared to retraining from scratch, especially in complicated scenarios. We have experimented on five commonly used datasets, and the experimental results show the efficiency of our method.


Distinguishing Engagement Facets: An Essential Component for AI-based Healthcare

arXiv.org Artificial Intelligence

Engagement in Human-Machine Interaction is the process by which entities participating in the interaction establish, maintain, and end their perceived connection. It is essential to monitor the engagement state of patients in various AI-based healthcare paradigms. This includes medical conditions that alter social behavior such as Autism Spectrum Disorder (ASD) or Attention-Deficit/Hyperactivity Disorder (ADHD). Engagement is a multifaceted construct which is composed of behavioral, emotional, and mental components. Previous research has neglected the multifaceted nature of engagement. In this paper, a system is presented to distinguish these facets using contextual and relational features. This can facilitate further fine-grained analysis. Several machine learning classifiers including traditional and deep learning models are compared for this task. The highest accuracy, 74.57% with an F-score of 0.74 and a mean absolute error of 0.23, was obtained with neural network-based classification on a balanced dataset of 22242 instances.


Uncertainty-Aware Multiple Instance Learning from Large-Scale Long Time Series Data

arXiv.org Artificial Intelligence

We propose a novel framework to classify large-scale time series data with long duration. Long time series classification (L-TSC) is a challenging problem because the data often contains a large amount of irrelevant information to the classification target. The irrelevant period degrades the classification performance while the relevance is unknown to the system. This paper proposes an uncertainty-aware multiple instance learning (MIL) framework to identify the most relevant period automatically. The predictive uncertainty enables designing an attention mechanism that forces the MIL model to learn from the possibly discriminant period. Moreover, the predicted uncertainty yields a principled estimator to identify whether a prediction is trustworthy or not. We further incorporate another modality to accommodate unreliable predictions by training a separate model based on its availability and conduct uncertainty-aware fusion to produce the final prediction. Systematic evaluation is conducted on the Automatic Identification System (AIS) data, which is collected to identify and track real-world vessels. Empirical results demonstrate that the proposed method can effectively detect the types of vessels based on the trajectory, and the uncertainty-aware fusion with other available data modality (Synthetic-Aperture Radar or SAR imagery is used in our experiments) can further improve the detection accuracy.
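As a rough illustration of how predictive uncertainty can drive attention in a MIL setting, the sketch below pools per-segment probabilities into one bag-level probability, giving confident (low-entropy) segments more weight. This is a simplified stand-in written for this listing, not the architecture from the paper: the entropy-based weighting, the function names, and the toy probabilities are all assumptions.

```python
import math

def entropy(p):
    """Binary entropy (in nats) of a predicted probability p; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def attention_weights(probs):
    """Softmax over negative entropy: more confident segments get larger weights."""
    scores = [-entropy(p) for p in probs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bag_prediction(probs):
    """Attention-weighted average of the segment-level probabilities."""
    weights = attention_weights(probs)
    return sum(w * p for w, p in zip(weights, probs))

# Toy bag (made up): four segments of one long series; only the third is confident.
segment_probs = [0.5, 0.55, 0.95, 0.5]
weights = attention_weights(segment_probs)
bag_prob = bag_prediction(segment_probs)
```

Compared with a plain average, the weighted pooling lets the single informative segment dominate the bag-level prediction, which is the intuition behind attending to the most relevant period.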


Survivor Series 2021: What to know about the WWE PPV

FOX News

One of WWE's premier events is back in front of a live audience after the coronavirus pandemic forced fans to stay home and watch from the comfort of their own homes. Survivor Series will take place at the Barclays Center in Brooklyn on Sunday with some of the top stars on both the Raw and SmackDown brands in action against each other. Big E is the RAW WWE champion.


Few-Shot Machine Learning Explained: Examples, Applications, Research

#artificialintelligence

Data is what powers machine learning solutions. Quality datasets enable training models with the needed detection and classification accuracy, though accumulating sufficient and applicable training data to feed into the model is sometimes a complex challenge. For instance, to create data-intensive apps, human annotators are required to label a huge number of samples, which results in management complexity and high costs for businesses. In addition, data acquisition can be difficult because of safety regulations, privacy, or ethical concerns. When we have a limited dataset including only a small number of samples per class, few-shot learning may be useful.
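One simple way to picture few-shot classification is a nearest-prototype rule: average the handful of labeled examples per class into a prototype, then assign a new sample to the closest prototype. The sketch below is only one possible illustration, not the specific approach of the article; the two-dimensional feature vectors, class names, and function names are made up for the example.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def fit_prototypes(support):
    """Map each class label to the centroid of its few labeled examples."""
    return {label: centroid(examples) for label, examples in support.items()}

def predict(prototypes, query):
    """Assign the query to the class whose prototype is nearest (Euclidean distance)."""
    return min(prototypes, key=lambda label: math.dist(prototypes[label], query))

# Two-shot support set (made up): two labeled examples per class.
support = {
    "cat": [[1.0, 1.0], [1.2, 0.8]],
    "dog": [[4.0, 4.2], [3.8, 4.0]],
}
prototypes = fit_prototypes(support)
label = predict(prototypes, [1.1, 1.3])
```

In practice the raw features would usually be replaced by embeddings from a pretrained model, so that distances between samples are meaningful even with very few examples per class.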