Goto

Collaborating Authors

 mutual coherence


Understanding Counting in Small Transformers: The Interplay between Attention and Feed-Forward Layers

Behrens, Freya, Biggio, Luca, Zdeborová, Lenka

arXiv.org Artificial Intelligence

We provide a comprehensive analysis of simple transformer models trained on the histogram task, where the goal is to count the occurrences of each item in the input sequence from a fixed alphabet. Despite its apparent simplicity, this task exhibits a rich phenomenology that allows us to characterize how different architectural components contribute towards the emergence of distinct algorithmic solutions. In particular, we showcase the existence of two qualitatively different mechanisms that implement a solution, relation- and inventory-based counting. Which solution a model can implement depends non-trivially on the precise choice of the attention mechanism, activation function, memorization capacity and the presence of a beginning-of-sequence token. By introspecting learned models on the counting task, we find evidence for the formation of both mechanisms. From a broader perspective, our analysis offers a framework to understand how the interaction of different architectural components of transformer models shapes diverse algorithmic solutions and approximations.


Variational Learning ISTA

Massoli, Fabio Valerio, Louizos, Christos, Behboodi, Arash

arXiv.org Machine Learning

Compressed sensing combines the power of convex optimization techniques with a sparsity-inducing prior on the signal space to solve an underdetermined system of equations. For many problems, the sparsifying dictionary is not directly given, nor its existence can be assumed. Besides, the sensing matrix can change across different scenarios. Addressing these issues requires solving a sparse representation learning problem, namely dictionary learning, taking into account the epistemic uncertainty of the learned dictionaries and, finally, jointly learning sparse representations and reconstructions under varying sensing matrix conditions. We address both concerns by proposing a variant of the LISTA architecture. First, we introduce Augmented Dictionary Learning ISTA (A-DLISTA), which incorporates an augmentation module to adapt parameters to the current measurement setup. Then, we propose to learn a distribution over dictionaries via a variational approach, dubbed Variational Learning ISTA (VLISTA). VLISTA exploits A-DLISTA as the likelihood model and approximates a posterior distribution over the dictionaries as part of an unfolded LISTA-based recovery algorithm. As a result, VLISTA provides a probabilistic way to jointly learn the dictionary distribution and the reconstruction algorithm with varying sensing matrices. We provide theoretical and experimental support for our architecture and show that our model learns calibrated uncertainties.


CPRL - An Extension of Compressive Sensing to the Phase Retrieval Problem

Neural Information Processing Systems

While compressive sensing (CS) has been one of the most vibrant research fields in the past few years, most development only applies to linear models. This limits its application in many areas where CS could make a difference. This paper presents a novel extension of CS to the phase retrieval problem, where intensity measurements of a linear system are used to recover a complex sparse signal. We propose a novel solution using a lifting technique - CPRL, which relaxes the NP-hard problem to a nonsmooth semidefinite program. Our analysis shows that CPRL inherits many desirable properties from CS, such as guarantees for exact recovery. We further provide scalable numerical solvers to accelerate its implementation.


Revisiting RIP guarantees for sketching operators on mixture models

Belhadji, Ayoub, Gribonval, Rémi

arXiv.org Machine Learning

In the context of sketching for compressive mixture modeling, we revisit existing proofs of the Restricted Isometry Property of sketching operators with respect to certain mixtures models. After examining the shortcomings of existing guarantees, we propose an alternative analysis that circumvents the need to assume importance sampling when drawing random Fourier features to build random sketching operators. Our analysis is based on new deterministic bounds on the restricted isometry constant that depend solely on the set of frequencies used to define the sketching operator; then we leverage these bounds to establish concentration inequalities for random sketching operators that lead to the desired RIP guarantees. Our analysis also opens the door to theoretical guarantees for structured sketching with frequencies associated to fast random linear operators.


Sparse Models for Machine Learning

Lin, Jianyi

arXiv.org Artificial Intelligence

The sparse modeling is an evident manifestation capturing the parsimony principle just described, and sparse models are widespread in statistics, physics, information sciences, neuroscience, computational mathematics, and so on. In statistics the many applications of sparse modeling span regression, classification tasks, graphical model selection, sparse M-estimators and sparse dimensionality reduction. It is also particularly effective in many statistical and machine learning areas where the primary goal is to discover predictive patterns from data which would enhance our understanding and control of underlying physical, biological, and other natural processes, beyond just building accurate outcome black-box predictors. Common examples include selecting biomarkers in biological procedures, finding relevant brain activity locations which are predictive about brain states and processes based on fMRI data, and identifying network bottlenecks best explaining end-to-end performance. Moreover, the research and applications of efficient recovery of high-dimensional sparse signals from a relatively small number of observations, which is the main focus of compressed sensing or compressive sensing, have rapidly grown and became an extremely intense area of study beyond classical signal processing. Likewise interestingly, sparse modeling is directly related to various artificial vision tasks, such as image denoising, segmentation, restoration and superresolution, object or face detection and recognition in visual scenes, and action recognition. In this manuscript, we provide a brief introduction of the basic theory underlying sparse representation and compressive sensing, and then discuss some methods for recovering sparse solutions to optimization problems in effective way, together with some applications of sparse recovery in a machine learning problem known as sparse dictionary learning.


Heuristic Algorithms for the Approximation of Mutual Coherence

Betz, Gregor, Chekan, Vera, Mchedlidze, Tamara

arXiv.org Artificial Intelligence

Mutual coherence is a measure of similarity between two opinions. Although the notion comes from philosophy, it is essential for a wide range of technologies, e.g., the Wahl-O-Mat system. In Germany, this system helps voters to find candidates that are the closest to their political preferences. The exact computation of mutual coherence is highly time-consuming due to the iteration over all subsets of an opinion. Moreover, for every subset, an instance of the SAT model counting problem has to be solved which is known to be a hard problem in computer science. This work is the first study to accelerate this computation. We model the distribution of the so-called confirmation values as a mixture of three Gaussians and present efficient heuristics to estimate its model parameters. The mutual coherence is then approximated with the expected value of the distribution. Some of the presented algorithms are fully polynomial-time, others only require solving a small number of instances of the SAT model counting problem. The average squared error of our best algorithm lies below 0.0035 which is insignificant if the efficiency is taken into account. Furthermore, the accuracy is precise enough to be used in Wahl-O-Mat-like systems.


Neural Optimization Kernel: Towards Robust Deep Learning

Lyu, Yueming, Tsang, Ivor

arXiv.org Machine Learning

Recent studies show a close connection between neural networks (NN) and kernel methods. However, most of these analyses (e.g., NTK) focus on the influence of (infinite) width instead of the depth of NN models. There remains a gap between theory and practical network designs that benefit from the depth. This paper first proposes a novel kernel family named Neural Optimization Kernel (NOK). Our kernel is defined as the inner product between two $T$-step updated functionals in RKHS w.r.t. a regularized optimization problem. Theoretically, we proved the monotonic descent property of our update rule for both convex and non-convex problems, and a $O(1/T)$ convergence rate of our updates for convex problems. Moreover, we propose a data-dependent structured approximation of our NOK, which builds the connection between training deep NNs and kernel methods associated with NOK. The resultant computational graph is a ResNet-type finite width NN. Our structured approximation preserved the monotonic descent property and $O(1/T)$ convergence rate. Namely, a $T$-layer NN performs $T$-step monotonic descent updates. Notably, we show our $T$-layered structured NN with ReLU maintains a $O(1/T)$ convergence rate w.r.t. a convex regularized problem, which explains the success of ReLU on training deep NN from a NN architecture optimization perspective. For the unsupervised learning and the shared parameter case, we show the equivalence of training structured NN with GD and performing functional gradient descent in RKHS associated with a fixed (data-dependent) NOK at an infinity-width regime. For finite NOKs, we prove generalization bounds. Remarkably, we show that overparameterized deep NN (NOK) can increase the expressive power to reduce empirical risk and reduce the generalization bound at the same time. Extensive experiments verify the robustness of our structured NOK blocks.


Dataless Model Selection with the Deep Frame Potential

Murdock, Calvin, Lucey, Simon

arXiv.org Machine Learning

Choosing a deep neural network architecture is a fundamental problem in applications that require balancing performance and parameter efficiency. Standard approaches rely on ad-hoc engineering or computationally expensive validation on a specific dataset. We instead attempt to quantify networks by their intrinsic capacity for unique and robust representations, enabling efficient architecture comparisons without requiring any data. Building upon theoretical connections between deep learning and sparse approximation, we propose the deep frame potential: a measure of coherence that is approximately related to representation stability but has minimizers that depend only on network structure. This provides a framework for jointly quantifying the contributions of architectural hyper-parameters such as depth, width, and skip connections. We validate its use as a criterion for model selection and demonstrate correlation with generalization error on a variety of common residual and densely connected network architectures.


Convolutional Neural Networks Analyzed via Inverse Problem Theory and Sparse Representations

Tarhan, Cem, Akar, Gozde Bozdagi

arXiv.org Machine Learning

Inverse problems in imaging such as denoising, deblurring, superresolution (SR) have been addressed for many decades. In recent years, convolutional neural networks (CNNs) have been widely used for many inverse problem areas. Although their indisputable success, CNNs are not mathematically validated as to how and what they learn. In this paper, we prove that during training, CNN elements solve for inverse problems which are optimum solutions stored as CNN neuron filters. We discuss the necessity of mutual coherence between CNN layer elements in order for a network to converge to the optimum solution. We prove that required mutual coherence can be provided by the usage of residual learning and skip connections. We have set rules over training sets and depth of networks for better convergence, i.e. performance.


Optimized projections for compressed sensing via rank-constrained nearest correlation matrix

Cleju, Nicolae

arXiv.org Machine Learning

Optimizing the acquisition matrix is useful for compressed sensing of signals that are sparse in overcomplete dictionaries, because the acquisition matrix can be adapted to the particular correlations of the dictionary atoms. In this paper a novel formulation of the optimization problem is proposed, in the form of a rank-constrained nearest correlation matrix problem. Furthermore, improvements for three existing optimization algorithms are introduced, which are shown to be particular instances of the proposed formulation. Simulation results show notable improvements and superior robustness in sparse signal recovery. Keywords: acquisition, compressed sensing, nearest correlation matrix, optimization 1. Introduction Compressed Sensing (CS) [1] studies the possibility of acquiring a signal x that is a priori known to be sparse in some dictionary D with fewer linear measurements than required by the traditional sampling theorem. In many cases the dictionary D is an orthogonal basis, but we consider here the general case of an overcomplete dictionary.