Accuracy
A Review of and Roadmap for Data Science and Machine Learning for the Neuropsychiatric Phenotype of Autism
Washington, Peter, Wall, Dennis P.
Autism Spectrum Disorder (autism) is a neurodevelopmental delay which affects at least 1 in 44 children. Like many neurological disorder phenotypes, the diagnostic features are observable, can be tracked over time, and can be managed or even eliminated through proper therapy and treatments. Yet, there are major bottlenecks in the diagnostic, therapeutic, and longitudinal tracking pipelines for autism and related delays, creating an opportunity for novel data science solutions to augment and transform existing workflows and provide access to services for more affected families. Several prior efforts conducted by a multitude of research labs have spawned great progress towards improved digital diagnostics and digital therapies for children with autism. We review the literature of digital health methods for autism behavior quantification using data science. We describe both case-control studies and classification systems for digital phenotyping. We then discuss digital diagnostics and therapeutics which integrate machine learning models of autism-related behaviors, including the factors which must be addressed for translational use. Finally, we describe ongoing challenges and potent opportunities for the field of autism data science. Given the heterogeneous nature of autism and the complexities of the relevant behaviors, this review contains insights which are relevant to neurological behavior analysis and digital psychiatry more broadly.
Benchmark of Data Preprocessing Methods for Imbalanced Classification
Haluška, Radovan, Brabec, Jan, Komárek, Tomáš
Severe class imbalance is one of the main conditions that make machine learning in cybersecurity difficult. A variety of dataset preprocessing methods have been introduced over the years. These methods modify the training dataset by oversampling, undersampling, or a combination of both to improve the predictive performance of classifiers trained on this dataset. Although these methods are occasionally used in cybersecurity, a comprehensive, unbiased benchmark comparing their performance over a variety of cybersecurity problems is missing. This paper presents a benchmark of 16 preprocessing methods on six cybersecurity datasets together with 17 public imbalanced datasets from other domains. We test the methods under multiple hyperparameter configurations and use an AutoML system to train classifiers on the preprocessed datasets, which reduces potential bias from specific hyperparameter or classifier choices. Special consideration is also given to evaluating the methods using appropriate performance measures that are good proxies for practical performance in real-world cybersecurity systems. The main findings of our study are: 1) Most of the time, a data preprocessing method that improves classification performance exists. 2) The baseline approach of doing nothing outperformed a large portion of the methods in the benchmark. 3) Oversampling methods generally outperform undersampling methods. 4) The most significant performance gains are brought by the standard SMOTE algorithm, and more complicated methods provide mainly incremental improvements, often at the cost of worse computational performance.
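To make the oversampling idea concrete, the following is a minimal sketch of SMOTE-style interpolation, assuming numeric feature vectors and Euclidean distance. It is an illustration only, not the benchmark's code: the function name `smote`, the parameters `k` and `seed`, and the toy data are hypothetical.

```python
import random
from math import dist


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points, each interpolated between a
    minority sample and one of its k nearest minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding itself
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy minority class (hypothetical data) with four 2-D samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already covered by the minority class. In practice, a maintained implementation such as the one in the imbalanced-learn package would normally be preferred over a hand-written version.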
A Topological Distance Measure between Multi-Fields for Classification and Analysis of Shapes and Data
Ramamurthi, Yashwanth, Chattopadhyay, Amit
Distance measures play an important role in shape classification and data analysis problems. Topological distances based on Reeb graphs and persistence diagrams have been employed to obtain effective algorithms in shape matching and scalar data analysis. In the current paper, we propose an improved distance measure between two multi-fields by computing, for each, a multi-dimensional Reeb graph (MDRG) that captures the topology of the multi-field through a hierarchy of Reeb graphs in different dimensions. A hierarchy of persistence diagrams is then constructed by computing a persistence diagram corresponding to each Reeb graph of the MDRG. Based on this representation, we propose a novel distance measure between two MDRGs by extending the bottleneck distance between two Reeb graphs. We show that the proposed measure satisfies the pseudo-metric and stability properties. We examine the effectiveness of the proposed multi-field topology-based measure on two different applications: (1) shape classification and (2) detection of topological features in time-varying multi-field data. In the shape classification problem, the performance of the proposed measure is compared with well-known topology-based measures in shape matching. In the second application, we consider time-varying volumetric multi-field data from the field of computational chemistry, where the goal is to detect the site of stable bond formation between Pt and CO molecules. We demonstrate the ability of the proposed distance to classify each of the sites as occurring before or after the bond stabilization.
Bit Error and Block Error Rate Training for ML-Assisted Communication
Wiesmayr, Reinhard, Marti, Gian, Dick, Chris, Song, Haochuan, Studer, Christoph
Even though machine learning (ML) techniques are being widely used in communications, the question of how to train communication systems has received surprisingly little attention. In this paper, we show that the commonly used binary cross-entropy (BCE) loss is a sensible choice in uncoded systems, e.g., for training ML-assisted data detectors, but may not be optimal in coded systems. We propose new loss functions targeted at minimizing the block error rate and SNR deweighting, a novel method that trains communication systems for optimal performance over a range of signal-to-noise ratios. The utility of the proposed loss functions as well as of SNR deweighting is shown through simulations in NVIDIA Sionna.
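The gap between bit-level and block-level objectives can be illustrated with a small sketch. It is not the paper's proposed loss and does not use Sionna; the helper names `bce` and `error_rates` and the toy probabilities are hypothetical. Two sets of soft bit estimates have the same BCE and the same bit error rate, yet very different block error rates, depending on whether the wrong bits are spread over all blocks or concentrated in one.

```python
from math import log


def bce(bits, probs, eps=1e-12):
    """Mean binary cross-entropy between true bits and predicted P(bit = 1)."""
    total = 0.0
    for b, p in zip(bits, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= b * log(p) + (1 - b) * log(1.0 - p)
    return total / len(bits)


def error_rates(true_blocks, prob_blocks):
    """Return (bit error rate, block error rate) after thresholding at 0.5."""
    bit_errors = block_errors = n_bits = 0
    for bits, probs in zip(true_blocks, prob_blocks):
        errors = sum(int(p > 0.5) != b for b, p in zip(bits, probs))
        bit_errors += errors
        block_errors += errors > 0  # a block fails if any bit is wrong
        n_bits += len(bits)
    return bit_errors / n_bits, block_errors / len(true_blocks)


def flatten(blocks):
    return [x for block in blocks for x in block]


# Four all-zero blocks of four bits each (toy data).
true_blocks = [[0, 0, 0, 0]] * 4
spread = [[0.9, 0.1, 0.1, 0.1]] * 4               # one wrong bit per block
burst = [[0.9, 0.9, 0.9, 0.9]] + [[0.1] * 4] * 3  # all wrong bits in one block

bce_spread = bce(flatten(true_blocks), flatten(spread))
bce_burst = bce(flatten(true_blocks), flatten(burst))
rates_spread = error_rates(true_blocks, spread)   # BER 0.25, BLER 1.0
rates_burst = error_rates(true_blocks, burst)     # BER 0.25, BLER 0.25
```

In this toy case a bit-wise loss cannot tell the two situations apart, which is one intuition for why a loss aimed at block errors may behave differently in coded systems.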
Implementation of a noisy hyperlink removal system: A semantic and relatedness approach
Taghandiki, Kazem, Ehsan, Elnaz Rezaei
As the volume of data on the web grows, the web structure graph, which is a graph representation of the web, continues to evolve. The structure of this graph has gradually shifted from content-based to non-content-based. Furthermore, spam data, such as noisy hyperlinks, in the web structure graph adversely affect the speed and efficiency of information retrieval and link mining algorithms. Previous works in this area have focused on removing noisy hyperlinks using structural and string approaches. However, these approaches may incorrectly remove useful links or be unable to detect noisy hyperlinks in certain circumstances. In this paper, a data collection of hyperlinks is initially constructed using an interactive crawler. The semantic and relatedness structure of the hyperlinks is then studied through semantic web approaches and tools such as the DBpedia ontology. Finally, the removal process of noisy hyperlinks is carried out using a reasoner on the DBpedia ontology. Our experiments demonstrate the accuracy and ability of semantic web technologies to remove noisy hyperlinks.
Same Same, But Different: Conditional Multi-Task Learning for Demographic-Specific Toxicity Detection
Gupta, Soumyajit, Lee, Sooyong, De-Arteaga, Maria, Lease, Matthew
In developing natural language processing (NLP) models to detect toxic language (Arango et al., 2019; Schmidt and Wiegand, 2017; Vaidya et al., 2020), we typically assume that toxic language manifests in similar forms across different targeted groups. For example, HateCheck (Röttger et al., 2021) enumerates templatic patterns such as "I hate [GROUP]" that we expect detection models to handle robustly across groups. Moreover, we typically pool data across different demographic targets in model training in order to learn general patterns of linguistic toxicity across diverse demographic targets. However, the nature and form of toxic language used to target different demographic groups can vary quite markedly. Furthermore, an imbalanced distribution of different demographic groups in toxic language datasets risks over-fitting forms of toxic language most relevant to the majority group(s), potentially at the expense of systematically weaker model performance on minority group(s). For this reason, a "one-size-fits-all" modeling approach may yield sub-optimal performance and more specifically raise concerns of algorithmic fairness (Arango et al., 2019; Park et al., 2018; Sap et al., 2019). At the same time, radically siloing off datasets for each different demographic target group would prevent models from learning broader linguistic patterns of toxicity across different demographic groups targeted.
Learning Object Manipulation With Under-Actuated Impulse Generator Arrays
Kong, Chuizheng, Yerazunis, William, Nikovski, Daniel
For more than half a century, vibratory bowl feeders have been the standard in automated assembly for singulation, orientation, and manipulation of small parts. Unfortunately, these feeders are expensive, noisy, and highly specialized to a single part design. We consider an alternative device and learning control method for singulation, orientation, and manipulation by means of seven fixed-position variable-energy solenoid impulse actuators located beneath a semi-rigid part-supporting surface. Using computer vision to provide part pose information, we tested various machine learning (ML) algorithms to generate a control policy that selects the optimal actuator and actuation energy. Our manipulation test object is a 6-sided craps-style die. Using the most suitable ML algorithm, we were able to flip the die to any desired face 30.4% of the time with a single impulse, and 51.3% with two chosen impulses, versus a random policy succeeding 5.1% of the time (that is, a randomly chosen impulse delivered by a randomly chosen solenoid).
Students Parrot Their Teachers: Membership Inference on Model Distillation
Jagielski, Matthew, Nasr, Milad, Choquette-Choo, Christopher, Lee, Katherine, Carlini, Nicholas
Model distillation (Hinton et al., 2015) is a common framework for knowledge transfer, where knowledge learned by a "teacher model" is transferred to a "student model" via the teacher's predictions. Distillation is helpful because the teacher's predictions are a more useful guide for the student model than hard labels; this phenomenon has been explained by the teacher's predictions containing some useful "dark knowledge". Variants of model distillation have been proposed for, e.g., model compression (Hinton et al., 2015; Ba & Caruana, 2014; Polino et al., 2018; Kim et al., 2018; Sun et al., 2019) or training more accurate models (Zagoruyko & Komodakis, 2016; Xie et al., 2020). Within the privacy-preserving machine learning community, distillation has been adapted to protect the privacy of a training dataset (Papernot et al., 2016; Tang et al., 2022; Shejwalkar & Houmansadr, 2021; Mazzone et al., 2022). Many of these approaches rely on the intuition that distilling the teacher model serves as a privacy barrier that protects the teacher's training data. Informally, restricting the student to learn only from the teacher's predictions is a form of data minimization, which should result in less private information being fed into, and memorized by, the student. This privacy barrier around the teacher also allows the teacher model to be trained with strong, non-private, training approaches, improving both the teacher model's and student model's accuracy. Because model distillation does not provide a rigorous privacy guarantee (such as those offered by differential privacy (Dwork et al., 2006)), in our work we evaluate the empirical privacy provided by these
Non-Parametric Outlier Synthesis
Tao, Leitian, Du, Xuefeng, Zhu, Xiaojin, Li, Yixuan
Out-of-distribution (OOD) detection is indispensable for safely deploying machine learning models in the wild. One of the key challenges is that models lack supervision signals from unknown data, and as a result, can produce overconfident predictions on OOD data. Recent work on outlier synthesis modeled the feature space as parametric Gaussian distribution, a strong and restrictive assumption that might not hold in reality. In this paper, we propose a novel framework, Non-Parametric Outlier Synthesis (NPOS), which generates artificial OOD training data and facilitates learning a reliable decision boundary between ID and OOD data. Importantly, our proposed synthesis approach does not make any distributional assumption on the ID embeddings, thereby offering strong flexibility and generality. We show that our synthesis approach can be mathematically interpreted as a rejection sampling framework. Extensive experiments show that NPOS can achieve superior OOD detection performance, outperforming the competitive rivals by a significant margin. Code is publicly available at https://github.com/deeplearning-wisc/npos.
Deep Age-Invariant Fingerprint Segmentation System
Murshed, M. G. Sarwar, Bahmani, Keivan, Schuckers, Stephanie, Hussain, Faraz
Fingerprint-based identification systems achieve higher accuracy when a slap containing multiple fingerprints of a subject is used instead of a single fingerprint. However, segmenting or auto-localizing all fingerprints in a slap image is a challenging task due to the different orientations of fingerprints, noisy backgrounds, and the smaller size of fingertip components. The presence of slap images in a real-world dataset where one or more fingerprints are rotated makes it challenging for a biometric recognition system to localize and label the fingerprints automatically. Improper fingerprint localization and finger labeling errors lead to poor matching performance. In this paper, we introduce a method to generate arbitrarily angled bounding boxes using a deep learning-based algorithm that precisely localizes and labels fingerprints from both axis-aligned and over-rotated slap images. We built a fingerprint segmentation model named CRFSEG (Clarkson Rotated Fingerprint segmentation Model) by updating the previously proposed CFSEG model, which was based on the traditional Faster R-CNN architecture [21]. CRFSEG improves upon the Faster R-CNN algorithm with arbitrarily angled bounding boxes that allow CRFSEG to perform better on challenging slap images. After training the CRFSEG algorithm on a new dataset containing slap images collected from both adult and child subjects, our results suggest that the CRFSEG model is invariant across different age groups and can handle over-rotated slap images successfully. On the Combined dataset containing both normal and rotated images of adult and child subjects, we achieved a matching accuracy of 97.17%, which outperformed the state-of-the-art VeriFinger (94.25%) and NFSEG (80.58%) segmentation systems.
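As a geometric illustration of an arbitrarily angled bounding box, the sketch below converts a (center, width, height, angle) description into corner points and compares it with the smallest axis-aligned box enclosing the same corners. The parameterization and the function names are assumptions made for illustration; the abstract does not specify how CRFSEG represents its boxes.

```python
from math import cos, sin, radians


def rotated_box_corners(cx, cy, w, h, angle_deg):
    """Corners of a w-by-h box centered at (cx, cy), rotated
    counter-clockwise by angle_deg about its center."""
    a = radians(angle_deg)
    c, s = cos(a), sin(a)
    half = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]


def axis_aligned_hull(corners):
    """Smallest axis-aligned box (xmin, ymin, xmax, ymax) around the corners."""
    xs, ys = zip(*corners)
    return min(xs), min(ys), max(xs), max(ys)


def hull_area(corners):
    xmin, ymin, xmax, ymax = axis_aligned_hull(corners)
    return (xmax - xmin) * (ymax - ymin)


# A tall, narrow box (roughly fingertip-shaped) at three rotations.
upright = rotated_box_corners(0.0, 0.0, 2.0, 6.0, 0.0)
tilted = rotated_box_corners(0.0, 0.0, 2.0, 6.0, 45.0)
sideways = rotated_box_corners(0.0, 0.0, 2.0, 6.0, 90.0)

upright_area = hull_area(upright)  # equals the box area, 12
tilted_area = hull_area(tilted)    # larger: the hull includes background
```

For a tilted fingerprint, the axis-aligned hull covers noticeably more area than the rotated box itself, which suggests why an angled box can localize a rotated fingertip more tightly.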