Collaborating Authors Machine Learning

A learning perspective on the emergence of abstractions: the curious case of phonemes Machine Learning

In the present paper we use a range of modeling techniques to investigate whether an abstract phone could emerge from exposure to speech sounds. We test two opposing principles regarding the development of language knowledge in linguistically untrained language users: Memory-Based Learning (MBL) and Error-Correction Learning (ECL). A process of generalization underlies the abstractions linguists operate with, and we probed whether MBL and ECL could give rise to a type of language knowledge that resembles linguistic abstractions. Each model was presented with a significant amount of pre-processed speech produced by one speaker. We assessed the consistency or stability of what the models have learned and their ability to give rise to abstract categories. Both types of models fare differently with regard to these tests. We show that ECL learning models can learn abstractions and that at least part of the phone inventory can be reliably identified from the input.

Maximum Entropy competes with Maximum Likelihood Machine Learning

Maximum entropy (MAXENT) method has a large number of applications in theoretical and applied machine learning, since it provides a convenient non-parametric tool for estimating unknown probabilities. The method is a major contribution of statistical physics to probabilistic inference. However, a systematic approach towards its validity limits is currently missing. Here we study MAXENT in a Bayesian decision theory set-up, i.e. assuming that there exists a well-defined prior Dirichlet density for unknown probabilities, and that the average Kullback-Leibler (KL) distance can be employed for deciding on the quality and applicability of various estimators. These allow to evaluate the relevance of various MAXENT constraints, check its general applicability, and compare MAXENT with estimators having various degrees of dependence on the prior, viz. the regularized maximum likelihood (ML) and the Bayesian estimators. We show that MAXENT applies in sparse data regimes, but needs specific types of prior information. In particular, MAXENT can outperform the optimally regularized ML provided that there are prior rank correlations between the estimated random quantity and its probabilities.

Rank-One Measurements of Low-Rank PSD Matrices Have Small Feasible Sets Machine Learning

We study the role of the constraint set in determining the solution to low-rank, positive semidefinite (PSD) matrix sensing problems. The setting we consider involves rank-one sensing matrices: In particular, given a set of rank-one projections of an approximately low-rank PSD matrix, we characterize the radius of the set of PSD matrices that satisfy the measurements. This result yields a sampling rate to guarantee singleton solution sets when the true matrix is exactly low-rank, such that the choice of the objective function or the algorithm to be used is inconsequential in its recovery. We discuss applications of this contribution and compare it to recent literature regarding implicit regularization for similar problems. We demonstrate practical implications of this result by applying conic projection methods for PSD matrix recovery without incorporating low-rank regularization.

Stochastic Gradient Descent with Large Learning Rate Machine Learning

As a simple and efficient optimization method in deep learning, stochastic gradient descent (SGD) has attracted tremendous attention. In the vanishing learning rate regime, SGD is now relatively well understood, and the majority of theoretical approaches to SGD set their assumptions in the continuous-time limit. However, the continuous-time predictions are unlikely to reflect the experimental observations well because the practice often runs in the large learning rate regime, where the training is faster and the generalization of models are often better. In this paper, we propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime. The focus is on deriving exactly solvable results and relating them to experimental observations. The main contributions of this work are to derive the stable distribution for discrete-time SGD in a quadratic loss function with and without momentum. Examples of applications of the proposed theory considered in this work include the approximation error of variants of SGD, the effect of mini-batch noise, the escape rate from a sharp minimum, and and the stationary distribution of a few second order methods.

Hardness of Learning Halfspaces with Massart Noise Machine Learning

We study the complexity of PAC learning halfspaces in the presence of Massart (bounded) noise. Specifically, given labeled examples $(x, y)$ from a distribution $D$ on $\mathbb{R}^{n} \times \{ \pm 1\}$ such that the marginal distribution on $x$ is arbitrary and the labels are generated by an unknown halfspace corrupted with Massart noise at rate $\eta<1/2$, we want to compute a hypothesis with small misclassification error. Characterizing the efficient learnability of halfspaces in the Massart model has remained a longstanding open problem in learning theory. Recent work gave a polynomial-time learning algorithm for this problem with error $\eta+\epsilon$. This error upper bound can be far from the information-theoretically optimal bound of $\mathrm{OPT}+\epsilon$. More recent work showed that {\em exact learning}, i.e., achieving error $\mathrm{OPT}+\epsilon$, is hard in the Statistical Query (SQ) model. In this work, we show that there is an exponential gap between the information-theoretically optimal error and the best error that can be achieved by a polynomial-time SQ algorithm. In particular, our lower bound implies that no efficient SQ algorithm can approximate the optimal error within any polynomial factor.

The Variational Method of Moments Machine Learning

The conditional moment problem is a powerful formulation for describing structural causal parameters in terms of observables, a prominent example being instrumental variable regression. A standard approach is to reduce the problem to a finite set of marginal moment conditions and apply the optimally weighted generalized method of moments (OWGMM), but this requires we know a finite set of identifying moments, can still be inefficient even if identifying, or can be unwieldy and impractical if we use a growing sieve of moments. Motivated by a variational minimax reformulation of OWGMM, we define a very general class of estimators for the conditional moment problem, which we term the variational method of moments (VMM) and which naturally enables controlling infinitely-many moments. We provide a detailed theoretical analysis of multiple VMM estimators, including based on kernel methods and neural networks, and provide appropriate conditions under which these estimators are consistent, asymptotically normal, and semiparametrically efficient in the full conditional moment model. This is in contrast to other recently proposed methods for solving conditional moment problems based on adversarial machine learning, which do not incorporate optimal weighting, do not establish asymptotic normality, and are not semiparametrically efficient.

Sparse encoding for more-interpretable feature-selecting representations in probabilistic matrix factorization Machine Learning

Dimensionality reduction methods for count data are critical to a wide range of applications in medical informatics and other fields where model interpretability is paramount. For such data, hierarchical Poisson matrix factorization (HPF) and other sparse probabilistic non-negative matrix factorization (NMF) methods are considered to be interpretable generative models. They consist of sparse transformations for decoding their learned representations into predictions. However, sparsity in representation decoding does not necessarily imply sparsity in the encoding of representations from the original data features. HPF is often incorrectly interpreted in the literature as if it possesses encoder sparsity. The distinction between decoder sparsity and encoder sparsity is subtle but important. Due to the lack of encoder sparsity, HPF does not possess the column-clustering property of classical NMF -- the factor loading matrix does not sufficiently define how each factor is formed from the original features. We address this deficiency by self-consistently enforcing encoder sparsity, using a generalized additive model (GAM), thereby allowing one to relate each representation coordinate to a subset of the original data features. In doing so, the method also gains the ability to perform feature selection. We demonstrate our method on simulated data and give an example of how encoder sparsity is of practical use in a concrete application of representing inpatient comorbidities in Medicare patients.

Inference for High-dimensional Maximin Effects in Heterogeneous Regression Models Using a Sampling Approach Machine Learning

Heterogeneity is an important feature of modern data sets and a central task is to extract information from large-scale and heterogeneous data. In this paper, we consider multiple high-dimensional linear models and adopt the definition of maximin effect (Meinshausen, B{\"u}hlmann, AoS, 43(4), 1801--1830) to summarize the information contained in this heterogeneous model. We define the maximin effect for a targeted population whose covariate distribution is possibly different from that of the observed data. We further introduce a ridge-type maximin effect to simultaneously account for reward optimality and statistical stability. To identify the high-dimensional maximin effect, we estimate the regression covariance matrix by a debiased estimator and use it to construct the aggregation weights for the maximin effect. A main challenge for statistical inference is that the estimated weights might have a mixture distribution and the resulted maximin effect estimator is not necessarily asymptotic normal. To address this, we devise a novel sampling approach to construct the confidence interval for any linear contrast of high-dimensional maximin effects. The coverage and precision properties of the proposed confidence interval are studied. The proposed method is demonstrated over simulations and a genetic data set on yeast colony growth under different environments.

Estimating mixed-memberships using the Symmetric Laplacian Inverse Matrix Machine Learning

Community detection has been well studied in network analysis, and one popular technique is spectral clustering which is fast and statistically analyzable for detect-ing clusters for given networks. But the more realistic case of mixed membership community detection remains a challenge. In this paper, we propose a new spectral clustering method Mixed-SLIM for mixed membership community detection. Mixed-SLIM is designed based on the symmetrized Laplacian inverse matrix (SLIM) (Jing et al. 2021) under the degree-corrected mixed membership (DCMM) model. We show that this algorithm and its regularized version Mixed-SLIM {\tau} are asymptotically consistent under mild conditions. Meanwhile, we provide Mixed-SLIM appro and its regularized version Mixed-SLIM {\tau}appro by approximating the SLIM matrix when dealing with large networks in practice. These four Mixed-SLIM methods outperform state-of-art methods in simulations and substantial empirical datasets for both community detection and mixed membership community detection problems.

Learning Graphons via Structured Gromov-Wasserstein Barycenters Machine Learning

We propose a novel and principled method to learn a nonparametric graph model called graphon, which is defined in an infinite-dimensional space and represents arbitrary-size graphs. Based on the weak regularity lemma from the theory of graphons, we leverage a step function to approximate a graphon. We show that the cut distance of graphons can be relaxed to the Gromov-Wasserstein distance of their step functions. Accordingly, given a set of graphs generated by an underlying graphon, we learn the corresponding step function as the Gromov-Wasserstein barycenter of the given graphs. Furthermore, we develop several enhancements and extensions of the basic algorithm, $e.g.$, the smoothed Gromov-Wasserstein barycenter for guaranteeing the continuity of the learned graphons and the mixed Gromov-Wasserstein barycenters for learning multiple structured graphons. The proposed approach overcomes drawbacks of prior state-of-the-art methods, and outperforms them on both synthetic and real-world data. The code is available at