Goto

Collaborating Authors

 Performance Analysis


Domain Generalization via Domain-based Covariance Minimization

arXiv.org Machine Learning

Researchers have been facing a difficult problem that data generation mechanisms could be influenced by internal or external factors leading to the training and test data with quite different distributions, consequently traditional classification or regression from the training set is unable to achieve satisfying results on test data. In this paper, we address this nontrivial domain generalization problem by finding a central subspace in which domain-based covariance is minimized while the functional relationship is simultaneously maximally preserved. We propose a novel variance measurement for multiple domains so as to minimize the difference between conditional distributions across domains with solid theoretical demonstration and supports, meanwhile, the algorithm preserves the functional relationship via maximizing the variance of conditional expectations given output. Furthermore, we also provide a fast implementation that requires much less computation and smaller memory for large-scale matrix operations, suitable for not only domain generalization but also other kernel-based eigenvalue decompositions. To show the practicality of the proposed method, we compare our methods against some well-known dimension reduction and domain generalization techniques on both synthetic data and real-world applications. We show that for small-scale datasets, we are able to achieve better quantitative results indicating better generalization performance over unseen test datasets. For large-scale problems, the proposed fast implementation maintains the quantitative performance but at a substantially lower computational cost.


Gated Information Bottleneck for Generalization in Sequential Environments

arXiv.org Machine Learning

Deep neural networks suffer from poor generalization to unseen environments when the underlying data distribution is different from that in the training set. By learning minimum sufficient representations from training data, the information bottleneck (IB) approach has demonstrated its effectiveness to improve generalization in different AI applications. In this work, we propose a new neural network-based IB approach, termed gated information bottleneck (GIB), that dynamically drops spurious correlations and progressively selects the most task-relevant features across different environments by a trainable soft mask (on raw features). GIB enjoys a simple and tractable objective, without any variational approximation or distributional assumption. We empirically demonstrate the superiority of GIB over other popular neural network-based IB approaches in adversarial robustness and out-of-distribution (OOD) detection. Meanwhile, we also establish the connection between IB theory and invariant causal representation learning, and observed that GIB demonstrates appealing performance when different environments arrive sequentially, a more practical scenario where invariant risk minimization (IRM) fails. Code of GIB is available at https://github.com/falesiani/GIB


On the Self-Penalization Phenomenon in Feature Selection

arXiv.org Machine Learning

We describe an implicit sparsity-inducing mechanism based on minimization over a family of kernels: \begin{equation*} \min_{\beta, f}~\widehat{\mathbb{E}}[L(Y, f(\beta^{1/q} \odot X)] + \lambda_n \|f\|_{\mathcal{H}_q}^2~~\text{subject to}~~\beta \ge 0, \end{equation*} where $L$ is the loss, $\odot$ is coordinate-wise multiplication and $\mathcal{H}_q$ is the reproducing kernel Hilbert space based on the kernel $k_q(x, x') = h(\|x-x'\|_q^q)$, where $\|\cdot\|_q$ is the $\ell_q$ norm. Using gradient descent to optimize this objective with respect to $\beta$ leads to exactly sparse stationary points with high probability. The sparsity is achieved without using any of the well-known explicit sparsification techniques such as penalization (e.g., $\ell_1$), early stopping or post-processing (e.g., clipping). As an application, we use this sparsity-inducing mechanism to build algorithms consistent for feature selection.


Information Theoretic Structured Generative Modeling

arXiv.org Machine Learning

R\'enyi's information provides a theoretical foundation for tractable and data-efficient non-parametric density estimation, based on pair-wise evaluations in a reproducing kernel Hilbert space (RKHS). This paper extends this framework to parametric probabilistic modeling, motivated by the fact that R\'enyi's information can be estimated in closed-form for Gaussian mixtures. Based on this special connection, a novel generative model framework called the structured generative model (SGM) is proposed that makes straightforward optimization possible, because costs are scale-invariant, avoiding high gradient variance while imposing less restrictions on absolute continuity, which is a huge advantage in parametric information theoretic optimization. The implementation employs a single neural network driven by an orthonormal input appended to a single white noise source adapted to learn an infinite Gaussian mixture model (IMoG), which provides an empirically tractable model distribution in low dimensions. To train SGM, we provide three novel variational cost functions, based on R\'enyi's second-order entropy and divergence, to implement minimization of cross-entropy, minimization of variational representations of $f$-divergence, and maximization of the evidence lower bound (conditional probability). We test the framework for estimation of mutual information and compare the results with the mutual information neural estimation (MINE), for density estimation, for conditional probability estimation in Markov models as well as for training adversarial networks. Our preliminary results show that SGM significantly improves MINE estimation in terms of data efficiency and variance, conventional and variational Gaussian mixture models, as well as the performance of generative adversarial networks.


Quantifying With Only Positive Training Data

arXiv.org Machine Learning

Quantification is the research field that studies methods for counting the number of data points that belong to each class in an unlabeled sample. Traditionally, researchers in this field assume the availability of labelled observations for all classes to induce a quantification model. However, we often face situations where the number of classes is large or even unknown, or we have reliable data for a single class. When inducing a multi-class quantifier is infeasible, we are often concerned with estimates for a specific class of interest. In this context, we have proposed a novel setting known as One-class Quantification (OCQ). In contrast, Positive and Unlabeled Learning (PUL), another branch of Machine Learning, has offered solutions to OCQ, despite quantification not being the focal point of PUL. This article closes the gap between PUL and OCQ and brings both areas together under a unified view. We compare our method, Passive Aggressive Threshold (PAT), against PUL methods and show that PAT generally is the fastest and most accurate algorithm. PAT induces quantification models that can be reused to quantify different samples of data. We additionally introduce Exhaustive TIcE (ExTIcE), an improved version of the PUL algorithm Tree Induction for c Estimation (TIcE). We show that ExTIcE quantifies more accurately than PAT and the other assessed algorithms in scenarios where several negative observations are identical to the positive ones.


C3PU: Cross-Coupling Capacitor Processing Unit Using Analog-Mixed Signal In-Memory Computing for AI Inference

arXiv.org Artificial Intelligence

This paper presents a novel cross-coupling capacitor processing unit (C3PU) that supports analog-mixed signal in memory computing to perform multiply-and-accumulate (MAC) operations. The C3PU consists of a capacitive unit, a CMOS transistor, and a voltage-to-time converter (VTC). The capacitive unit serves as a computational element that holds the multiplier operand and performs multiplication once the multiplicand is applied at the terminal. The multiplicand is the input voltage that is converted to a pulse width signal using a low power VTC. The transistor transfers this multiplication where a voltage level is generated. A demonstrator of 5x4 C3PU array that is capable of implementing 4 MAC units is presented. The design has been verified using Monte Carlo simulation in 65 nm technology. The 5x4 C3PU consumed energy of 66.4 fJ/MAC at 0.3 V voltage supply with an error of 5.7%. The proposed unit achieves lower energy and occupies a smaller area by 3.4x and 3.6x, respectively, with similar error value when compared to a digital-based 8x4-bit fixed point MAC unit. The C3PU has been utilized through an iris fower classification utilizing an artificial neural network which achieved a 90% classification accuracy compared to ideal accuracy of 96.67% using MATLAB.


Accurate and Generalizable Quantitative Scoring of Liver Steatosis from Ultrasound Images via Scalable Deep Learning

arXiv.org Artificial Intelligence

Background & Aims: Hepatic steatosis is a major cause of chronic liver disease. 2D ultrasound is the most widely used non-invasive tool for screening and monitoring, but associated diagnoses are highly subjective. We developed a scalable deep learning (DL) algorithm for quantitative scoring of liver steatosis from 2D ultrasound images. Approach & Results: Using retrospectively collected multi-view ultrasound data from 3,310 patients, 19,513 studies, and 228,075 images, we trained a DL algorithm to diagnose steatosis stages (healthy, mild, moderate, or severe) from ultrasound diagnoses. Performance was validated on two multi-scanner unblinded and blinded (initially to DL developer) histology-proven cohorts (147 and 112 patients) with histopathology fatty cell percentage diagnoses, and a subset with FibroScan diagnoses. We also quantified reliability across scanners and viewpoints. Results were evaluated using Bland-Altman and receiver operating characteristic (ROC) analysis. The DL algorithm demonstrates repeatable measurements with a moderate number of images (3 for each viewpoint) and high agreement across 3 premium ultrasound scanners. High diagnostic performance was observed across all viewpoints: area under the curves of the ROC to classify >=mild, >=moderate, =severe steatosis grades were 0.85, 0.90, and 0.93, respectively. The DL algorithm outperformed or performed at least comparably to FibroScan with statistically significant improvements for all levels on the unblinded histology-proven cohort, and for =severe steatosis on the blinded histology-proven cohort. Conclusions: The DL algorithm provides a reliable quantitative steatosis assessment across view and scanners on two multi-scanner cohorts. Diagnostic performance was high with comparable or better performance than FibroScan.


Offensive Language Detection with BERT-based models, By Customizing Attention Probabilities

arXiv.org Artificial Intelligence

This paper describes a novel study on using `Attention Mask' input in transformers and using this approach for detecting offensive content in both English and Persian languages. The paper's principal focus is to suggest a methodology to enhance the performance of the BERT-based models on the `Offensive Language Detection' task. Therefore, we customize attention probabilities by changing the `Attention Mask' input to create more efficacious word embeddings. To do this, we firstly tokenize the training set of the exploited datasets (by BERT tokenizer). Then, we apply Multinomial Naive Bayes to map these tokens to two probabilities. These probabilities indicate the likelihood of making a text non-offensive or offensive, provided that it contains that token. Afterwards, we use these probabilities to define a new term, namely Offensive Score. Next, we create two separate (because of the differences in the types of the employed datasets) equations based on Offensive Scores for each language to re-distribute the `Attention Mask' input for paying more attention to more offensive phrases. Eventually, we put the F1-macro score as our evaluation metric and fine-tune several combinations of BERT with ANNs, CNNs and RNNs to examine the effect of using this methodology on various combinations. The results indicate that all models will enhance with this methodology. The most improvement was 2% and 10% for English and Persian languages, respectively.


Bayesian Regularization for Functional Graphical Models

arXiv.org Machine Learning

Graphical models, used to express conditional dependence between random variables observed at various nodes, are used extensively in many fields such as genetics, neuroscience, and social network analysis. While most current statistical methods for estimating graphical models focus on scalar data, there is interest in estimating analogous dependence structures when the data observed at each node are functional, such as signals or images. In this paper, we propose a fully Bayesian regularization scheme for estimating functional graphical models. We first consider a direct Bayesian analog of the functional graphical lasso proposed by Qiao et al. (2019). We then propose a regularization strategy via the graphical horseshoe. We compare these approaches via simulation study and apply our proposed functional graphical horseshoe to two motivating applications, electroencephalography data for comparing brain activation between an alcoholic group and controls, as well as changes in structural connectivity in the presence of traumatic brain injury (TBI). Our results yield insight into how the brain attempts to compensate for disconnected networks after injury.


Multiway sparse distance weighted discrimination

arXiv.org Machine Learning

Modern data often take the form of a multiway array. However, most classification methods are designed for vectors, i.e., 1-way arrays. Distance weighted discrimination (DWD) is a popular high-dimensional classification method that has been extended to the multiway context, with dramatic improvements in performance when data have multiway structure. However, the previous implementation of multiway DWD was restricted to classification of matrices, and did not account for sparsity. In this paper, we develop a general framework for multiway classification which is applicable to any number of dimensions and any degree of sparsity. We conducted extensive simulation studies, showing that our model is robust to the degree of sparsity and improves classification accuracy when the data have multiway structure. For our motivating application, magnetic resonance spectroscopy (MRS) was used to measure the abundance of several metabolites across multiple neurological regions and across multiple time points in a mouse model of Friedreich's ataxia, yielding a four-way data array. Our method reveals a robust and interpretable multi-region metabolomic signal that discriminates the groups of interest. We also successfully apply our method to gene expression time course data for multiple sclerosis treatment. An R implementation is available in the package MultiwayClassification at http://github.com/lockEF/MultiwayClassification .