Accuracy
Predictive capacity of AI models
Predictive machine-learning models based on neural networks are extremely powerful when judging large data sets. But understanding them is notoriously difficult. Neural networks are trained using labeled data sets. How well they perform is validated using a labeled test set. This is where model accuracy, confusion matrices, ROCs, etc. come in handy.
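As an illustration of the metrics mentioned above, the following minimal sketch computes a binary confusion matrix, accuracy, and ROC AUC in plain Python. The function names, the toy labels and scores, and the 0.5 decision threshold are assumptions made for this example and do not come from any of the works listed here.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    counts = Counter(zip(y_true, y_pred))
    return counts[(1, 1)], counts[(0, 1)], counts[(1, 0)], counts[(0, 0)]


def accuracy(y_true, y_pred):
    """Fraction of test examples whose predicted label matches the true label."""
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labeled test set and model scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(confusion_matrix(y_true, y_pred))  # (tp, fp, fn, tn)
print(accuracy(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason both are usually reported together.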
Ontology-Based Skill Description Learning for Flexible Production Systems
Himmelhuber, Anna, Grimm, Stephan, Runkler, Thomas, Zillner, Sonja
The increasing importance of resource-efficient production entails that manufacturing companies have to create a more dynamic production environment, with flexible manufacturing machines and processes. To fully utilize this potential of dynamic manufacturing through automatic production planning, formal skill descriptions of the machines are essential. However, generating those skill descriptions in a manual fashion is labor-intensive and requires extensive domain-knowledge. In this contribution an ontology-based semi-automatic skill description system that utilizes production logs and industrial ontologies through inductive logic programming is introduced and benefits and drawbacks of the proposed solution are evaluated.
Machine Learning for Real-Time, Automatic, and Early Diagnosis of Parkinson's Disease by Extracting Signs of Micrographia from Handwriting Images
Tyagi, Riya, Tyagi, Tanish, Wang, Ming, Zhang, Lujin
Parkinson's disease (PD) is debilitating, progressive, and clinically marked by motor symptoms. As the second most common neurodegenerative disease in the world, it affects over 10 million lives globally. Existing diagnostic methods have limitations, such as the expense of visiting doctors and the challenge of automated early detection, considering that behavioral differences between patients and healthy individuals are often indistinguishable in the early stages. However, micrographia, a handwriting disorder that leads to abnormally small handwriting, tremors, dystonia, and slow movement in the hands and fingers, is commonly observed in the early stages of PD. In this work, we apply machine learning techniques to extract signs of micrographia from drawing samples gathered from two open-source datasets and achieve a predictive accuracy of 94%. This work also sets the foundations for a publicly available and user-friendly web portal that anyone with access to a pen, printer, and phone can use for early PD detection.
Fairness for AUC via Feature Augmentation
Fong, Hortense, Kumar, Vineet, Mehrotra, Anay, Vishnoi, Nisheeth K.
We study fairness in the context of classification where the performance is measured by the area under the curve (AUC) of the receiver operating characteristic. AUC is commonly used when both Type I (false positive) and Type II (false negative) errors are important. However, the same classifier can have significantly varying AUCs for different protected groups and, in real-world applications, it is often desirable to reduce such cross-group differences. We address the problem of how to select additional features to most greatly improve AUC for the disadvantaged group. Our results establish that the unconditional variance of features does not inform us about AUC fairness but class-conditional variance does. Using this connection, we develop a novel approach, fairAUC, based on feature augmentation (adding features) to mitigate bias between identifiable groups. We evaluate fairAUC on synthetic and real-world (COMPAS) datasets and find that it significantly improves AUC for the disadvantaged group relative to benchmarks maximizing overall AUC and minimizing bias between groups.
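The cross-group AUC difference described in this abstract can be measured with a short sketch like the one below. It only computes per-group AUC and the resulting gap; it is not the fairAUC procedure itself, and the helper names, group labels, and toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("each group needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(labels, scores, groups):
    """Compute the AUC separately for every value of the protected attribute."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        result[g] = roc_auc([labels[i] for i in idx], [scores[i] for i in idx])
    return result


# Hypothetical classifier scores for two protected groups "a" and "b".
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.3, 0.6, 0.7, 0.4, 0.5]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_group = auc_by_group(labels, scores, groups)
auc_gap = max(per_group.values()) - min(per_group.values())
print(per_group, auc_gap)
```

In this toy data the same scores rank group "a" perfectly but rank group "b" poorly, which is the kind of gap that adding features for the disadvantaged group is meant to reduce.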
Towards Inter-class and Intra-class Imbalance in Class-imbalanced Learning
Liu, Zhining, Wei, Pengfei, Wei, Zhepei, Yu, Boyang, Jiang, Jing, Cao, Wei, Bian, Jiang, Chang, Yi
Imbalanced Learning (IL) is an important problem that widely exists in data mining applications. Typical IL methods utilize intuitive class-wise resampling or reweighting to directly balance the training set. However, some recent research efforts in specific domains show that class-imbalanced learning can be achieved without class-wise manipulation. This prompts us to think about the relationship between the two different IL strategies and the nature of the class imbalance. Fundamentally, they correspond to two essential imbalances that exist in IL: the difference in quantity between examples from different classes as well as between easy and hard examples within a single class, i.e., inter-class and intra-class imbalance. Existing works fail to explicitly take both imbalances into account and thus suffer from suboptimal performance. In light of this, we present Duple-Balanced Ensemble, namely DUBE, a versatile ensemble learning framework. Unlike prevailing methods, DUBE directly performs inter-class and intra-class balancing without relying on heavy distance-based computation, which allows it to achieve competitive performance while being computationally efficient. We also present a detailed discussion and analysis about the pros and cons of different inter/intra-class balancing strategies based on DUBE. Extensive experiments validate the effectiveness of the proposed method. Code and examples are available at https://github.com/ICDE2022Sub/duplebalance.
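For readers unfamiliar with the class-wise resampling that this abstract calls the typical IL approach, here is a minimal sketch of random oversampling, which addresses inter-class imbalance only. It is a generic baseline and not the DUBE method; the function name, seed, and toy data are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority examples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    target = max(len(items) for items in by_class.values())
    X_out, y_out = [], []
    for label, items in by_class.items():
        # Draw the missing examples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for xi in items + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out


# Hypothetical training set with a 5:1 class ratio.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Because this only equalizes class counts, it does nothing about the difference between easy and hard examples within a class, which is the intra-class imbalance the abstract highlights.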
Causal Regularization Using Domain Priors
Reddy, Abbavaram Gowtham, Kancheti, Sai Srinivas, Balasubramanian, Vineeth N, Sharma, Amit
Neural networks leverage both causal and correlation-based relationships in data to learn models that optimize a given performance criterion, such as classification accuracy. This results in learned models that may not necessarily reflect the true causal relationships between input and output. When domain priors of causal relationships are available at the time of training, it is essential that a neural network model maintains these relationships as causal, even as it learns to optimize the performance criterion. We propose a causal regularization method that can incorporate such causal domain priors into the network and which supports both direct and total causal effects. We show that this approach can generalize to various kinds of specifications of causal priors, including monotonicity of causal effect of a given input feature or removing a certain influence for purposes of fairness. Our experiments on eleven benchmark datasets show the usefulness of this approach in regularizing a learned neural network model to maintain desired causal effects. On most datasets, domain-prior consistent models can be obtained without compromising on accuracy.
Filter Methods for Feature Selection in Supervised Machine Learning Applications -- Review and Benchmark
Hopf, Konstantin, Reifenrath, Sascha
The amount of data for machine learning (ML) applications is constantly growing. With ongoing digitization, not only the number of observations but especially the number of measured variables (features) increases. Selecting the most appropriate features for predictive modeling is an important lever for the success of ML applications in business and research. Feature selection methods (FSM) that are independent of a certain ML algorithm - so-called filter methods - have been suggested in large numbers, but little guidance exists for researchers and quantitative modelers to choose appropriate approaches for typical ML problems. This review synthesizes the substantial literature on feature selection benchmarking and evaluates the performance of 58 methods in the widely used R environment. For concrete guidance, we consider four typical dataset scenarios that are challenging for ML models (noisy, redundant, imbalanced data and cases with more features than observations). Drawing on the experience of earlier benchmarks, which have considered far fewer FSMs, we compare the performance of the methods according to four criteria (predictive performance, number of relevant features selected, stability of the feature sets and runtime). We found that methods relying on the random forest approach, the double input symmetrical relevance filter (DISR) and the joint impurity filter (JIM) were well-performing candidate methods for the given dataset scenarios.
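To make the idea of a filter method concrete, the sketch below scores each feature by its absolute Pearson correlation with the target and keeps the top k, without consulting any learning algorithm. This is a deliberately simple filter chosen for illustration, written in Python rather than R; it is not one of the 58 benchmarked methods unless the review happens to include it, and all names and data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def rank_features(columns, target, k):
    """Return the names of the k features most correlated (in absolute value) with the target."""
    relevance = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(relevance, key=relevance.get, reverse=True)[:k]


# Hypothetical dataset: one informative, one weaker, and one constant feature.
columns = {
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "weak": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}
target = [0, 0, 0, 1, 1, 1]

print(rank_features(columns, target, 2))
```

A univariate score like this ignores redundancy between features, which is one of the dataset scenarios the review identifies as challenging.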
Exploration of Dark Chemical Genomics Space via Portal Learning: Applied to Targeting the Undruggable Genome and COVID-19 Anti-Infective Polypharmacology
Cai, Tian, Xie, Li, Chen, Muge, Liu, Yang, He, Di, Zhang, Shuo, Mura, Cameron, Bourne, Philip E., Xie, Lei
Advances in biomedicine are largely fueled by exploring uncharted territories of human biology. Machine learning can both enable and accelerate discovery, but faces a fundamental hurdle when applied to unseen data with distributions that differ from previously observed ones -- a common dilemma in scientific inquiry. We have developed a new deep learning framework, called Portal Learning, to explore dark chemical and biological space. Three key, novel components of our approach include: (i) end-to-end, step-wise transfer learning, in recognition of biology's sequence-structure-function paradigm, (ii) out-of-cluster meta-learning, and (iii) stress model selection. Portal Learning provides a practical solution to the out-of-distribution (OOD) problem in statistical machine learning. Here, we have implemented Portal Learning to predict chemical-protein interactions on a genome-wide scale. Systematic studies demonstrate that Portal Learning can effectively assign ligands to unexplored gene families (unknown functions), versus existing state-of-the-art methods, thereby allowing us to target previously "undruggable" proteins and design novel polypharmacological agents for disrupting interactions between SARS-CoV-2 and human proteins. Portal Learning is general-purpose and can be further applied to other areas of scientific inquiry.
Personalized Cancer Diagnosis Using Machine Learning
This is a case study on the personalized cancer diagnosis problem. Before diving deep into the issue, let us understand what the challenges with cancer diagnosis are and how machine learning can help in mitigating them. Note: This problem is taken from the NIPS 2017 Competition. Let us go through the current process first. In order to identify whether a person has cancer, a specialist first creates a list of genetic variations that need to be analyzed. He/she then searches for all the relevant evidence, such as published journals.
Bridging the reality gap in quantum devices with physics-aware machine learning
Craig, D. L., Moon, H., Fedele, F., Lennon, D. T., Van Straaten, B., Vigneau, F., Camenzind, L. C., Zumbühl, D. M., Briggs, G. A. D., Osborne, M. A., Sejdinovic, D., Ares, N.
Differences between theory and experiment pervade all of science, and are one of the driving forces of human discovery. Simulations often require fewer resources than real experiments but rarely capture the full complexity of a system, limiting their practical application. Narrowing the gap between a model and the real world is key for the control of complex systems using machine learning, especially when a machine learning model is trained on a simulation before being applied to real systems [1, 2]. The reality gap is widened further when [...] We use transport measurements of an electrostatically-defined quantum dot device in an AlGaAs/GaAs heterostructure to inform and verify our approach. To infer the disorder potential we use a combination of transport measurements and predictions from a physical model. The physical model is an electrostatic simulation from which transport features can be estimated. Many simulations with different parameter settings are required to compare this physical model with transport measurements. To accommodate this need without extreme computation times, we develop a fast approximation of the model using deep learning.