Banff
A Framework for Overparameterized Learning
Terjék, Dávid, González-Sánchez, Diego
A candidate explanation of the good empirical performance of deep neural networks is the implicit regularization effect of first-order optimization methods. Inspired by this, we prove a convergence theorem for nonconvex composite optimization, and apply it to a general learning problem covering many machine learning applications, including supervised learning. We then present a deep multilayer perceptron model and prove that, when sufficiently wide, it $(i)$ leads to the convergence of gradient descent to a global optimum with a linear rate, $(ii)$ benefits from the implicit regularization effect of gradient descent, $(iii)$ is subject to novel bounds on the generalization error, $(iv)$ exhibits the lazy training phenomenon and $(v)$ enjoys learning rate transfer across different widths. The corresponding coefficients, such as the convergence rate, improve as width is further increased, and depend on the even-order moments of the data generating distribution up to an order depending on the number of layers. The only non-mild assumption we make is the concentration of the smallest eigenvalue of the neural tangent kernel at initialization away from zero, which has been shown to hold for a number of less general models in contemporary works. We present empirical evidence supporting this assumption as well as our theoretical claims.
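As a rough illustration of why a kernel eigenvalue bounded away from zero yields a linear rate, the sketch below runs gradient descent on an overparameterized linear least-squares problem. This is a toy stand-in, not the multilayer perceptron analyzed in the paper: for a linear model the tangent kernel is simply $K = XX^\top$, and the residual contracts by at most $1 - \eta\lambda_{\min}(K)$ per step. The problem sizes, the random data, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 50                      # fewer samples than parameters (overparameterized)
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.normal(size=n)

# For the linear model f(w) = X @ w, the tangent kernel is the Gram matrix X X^T.
K = X @ X.T
eigs = np.linalg.eigvalsh(K)      # eigenvalues in ascending order
lam_min, lam_max = eigs[0], eigs[-1]

eta = 1.0 / lam_max               # step size small enough for monotone contraction
rate = 1.0 - eta * lam_min        # predicted per-step contraction of the residual norm

w = np.zeros(d)
residual_norms = []
for _ in range(200):
    r = X @ w - y                 # residual on the training set
    residual_norms.append(np.linalg.norm(r))
    w -= eta * (X.T @ r)          # gradient step on 0.5 * ||X w - y||^2

# Compare the observed residual with the geometric bound rate**t * ||r_0||.
for t in (0, 10, 50):
    print(t, residual_norms[t], rate**t * residual_norms[0])
```

The residual obeys $r_{t+1} = (I - \eta K) r_t$, so the printed residual norms should stay below the geometric bound whenever $\lambda_{\min}(K) > 0$.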
An Upper Bound for the Distribution Overlap Index and Its Applications
Fu, Hao, Krishnamurthy, Prashanth, Garg, Siddharth, Khorrami, Farshad
This paper proposes an easy-to-compute upper bound for the overlap index between two probability distributions without requiring any knowledge of the distribution models. The bound is efficient in both time and memory to compute and requires only finite samples. The proposed bound shows its value in one-class classification and domain shift analysis. Specifically, in one-class classification, we build a novel one-class classifier by converting the bound into a confidence score function. Unlike most one-class classifiers, ours requires no training process. Additionally, the experimental results show that our classifier can be accurate with only a small number of in-class samples and outperform many state-of-the-art methods on various datasets in different one-class classification scenarios. In domain shift analysis, we propose a theorem based on our bound. The theorem is useful in detecting the existence of domain shift and inferring data information. The detection and inference processes are both computation-efficient and memory-efficient. Our work shows significant promise toward broadening the applications of overlap-based metrics.
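For orientation, the overlap index between densities $p$ and $q$ is commonly defined as $\int \min(p(x), q(x))\,dx$. The sketch below is a naive histogram plug-in estimate of that quantity from one-dimensional samples. It is not the upper bound proposed in the paper, whose construction the abstract does not spell out; the function name and the bin count are assumptions made for the example.

```python
import numpy as np

def histogram_overlap(x, y, bins=30):
    """Naive plug-in estimate of the overlap index from two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Shared bin edges so both histograms live on the same grid.
    edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=bins)
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(y, bins=edges)
    p = p / p.sum()               # empirical bin probabilities
    q = q / q.sum()
    return float(np.minimum(p, q).sum())

rng = np.random.default_rng(0)
same = histogram_overlap(rng.normal(size=2000), rng.normal(size=2000))
shifted = histogram_overlap(rng.normal(size=2000), rng.normal(loc=3.0, size=2000))
print(same, shifted)              # the shifted pair should overlap far less
```

A histogram estimate like this depends on the binning and degrades quickly with dimension, which is one reason a sample-based bound that avoids density modelling is attractive.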
Span-based Named Entity Recognition by Generating and Compressing Information
Nguyen, Nhung T. H., Miwa, Makoto, Ananiadou, Sophia
The information bottleneck (IB) principle has been proven effective in various NLP applications. Existing work, however, has used either generative or information compression models alone to improve the performance of the target task. In this paper, we propose to combine the two types of IB models into one system to enhance Named Entity Recognition (NER). For one type of IB model, we incorporate two unsupervised generative components, span reconstruction and synonym generation, into a span-based NER system. The span reconstruction ensures that the contextualised span representation keeps the span information, while the synonym generation makes synonyms have similar representations even in different contexts. For the other type of IB model, we add a supervised IB layer that performs information compression into the system to preserve useful features for NER in the resulting span representations. Experiments on five different corpora indicate that jointly training both generative and information compression models can enhance the performance of the baseline span-based NER system. Our source code is publicly available at https://github.com/nguyennth/joint-ib-models.
Bayesian MRI Reconstruction with Joint Uncertainty Estimation using Diffusion Models
Luo, Guanxiong, Blumenthal, Moritz, Heide, Martin, Uecker, Martin
We introduce a framework that enables efficient sampling from learned probability distributions for MRI reconstruction. Different from conventional deep learning-based MRI reconstruction techniques, samples are drawn from the posterior distribution given the measured k-space using the Markov chain Monte Carlo (MCMC) method. In addition to the maximum a posteriori (MAP) estimate for the image, which can be obtained with conventional methods, the minimum mean square error (MMSE) estimate and uncertainty maps can also be computed. The data-driven Markov chains are constructed from the generative model learned from a given image database and are independent of the forward operator that is used to model the k-space measurement. This provides flexibility because the method can be applied to k-space acquired with different sampling schemes or receive coils using the same pre-trained models. Furthermore, we use a framework based on a reverse diffusion process to be able to utilize advanced generative models. The performance of the method is evaluated on an open dataset using 10-fold undersampling in k-space.
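Once posterior samples are available, the MMSE estimate and an uncertainty map follow from simple per-pixel statistics: the MMSE estimate is the posterior mean, and the pixelwise standard deviation is one common choice of uncertainty map. The sketch below assumes the samples are already stacked in an array of shape (num_samples, height, width); random draws stand in for MCMC output, and the function name is hypothetical.

```python
import numpy as np

def summarize_posterior_samples(samples):
    """Return (MMSE estimate, uncertainty map) from stacked posterior samples.

    samples: array of shape (num_samples, height, width), real or complex.
    """
    samples = np.asarray(samples)
    mmse = samples.mean(axis=0)          # posterior mean = MMSE estimate
    # np.std takes absolute values for complex input, so the map is real-valued.
    uncertainty = samples.std(axis=0)
    return mmse, uncertainty

# Stand-in for MCMC output: noisy complex-valued draws around a fixed image.
rng = np.random.default_rng(0)
truth = np.ones((4, 4), dtype=complex)
noise = rng.normal(size=(200, 4, 4)) + 1j * rng.normal(size=(200, 4, 4))
draws = truth + 0.1 * noise
mmse, uncertainty = summarize_posterior_samples(draws)
print(mmse.shape, uncertainty.shape)
```

The MAP estimate, by contrast, is the mode of the posterior and cannot be read off from sample averages in this way.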
CQnet: convex-geometric interpretation and constraining neural-network trajectories
We introduce CQnet, a neural network with origins in the CQ algorithm for solving convex split-feasibility problems and forward-backward splitting. CQnet's trajectories are interpretable as particles that are tracking a changing constraint set via its point-to-set distance function while being elements of another constraint set at every layer. More than just a convex-geometric interpretation, CQnet accommodates learned and deterministic constraints that may be sample or data-specific and are satisfied by every layer and the output. Furthermore, the states in CQnet progress toward another constraint set at every layer. We provide proof of stability/nonexpansiveness with minimal assumptions. The combination of constraint handling and stability put forward CQnet as a candidate for various tasks where prior knowledge exists on the network states or output.
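For readers unfamiliar with the CQ algorithm that CQnet builds on, here is a minimal sketch of the classical iteration for the split-feasibility problem: find $x \in C$ such that $Ax \in Q$, using $x_{k+1} = P_C\big(x_k - \gamma A^\top (Ax_k - P_Q(Ax_k))\big)$ with $0 < \gamma < 2/\|A\|^2$. This shows only the underlying algorithm, not the CQnet architecture; the box constraints, the matrix, and the function name are assumptions chosen for the example.

```python
import numpy as np

def cq_algorithm(A, project_C, project_Q, x0, n_iter=500):
    """Classical CQ iteration for: find x in C with A @ x in Q."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # step size inside (0, 2 / ||A||^2)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        Ax = A @ x
        residual = Ax - project_Q(Ax)         # how far A x is from Q
        # Gradient step on 0.5 * dist(A x, Q)^2, then project back onto C.
        x = project_C(x - gamma * (A.T @ residual))
    return x

A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
project_C = lambda x: np.clip(x, 0.0, 1.0)    # C = [0, 1]^2
project_Q = lambda y: np.clip(y, 0.5, 2.5)    # Q = [0.5, 2.5]^2

x_cq = cq_algorithm(A, project_C, project_Q, x0=np.zeros(2))
print(x_cq, A @ x_cq)
```

Every iterate lies in $C$ by construction, while the distance of $Ax_k$ to $Q$ shrinks over iterations; this mirrors the description of states that belong to one constraint set at every layer while progressing toward another.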
Structure-informed Language Models Are Protein Designers
Zheng, Zaixiang, Deng, Yifan, Xue, Dongyu, Zhou, Yi, YE, Fei, Gu, Quanquan
This paper demonstrates that language models are strong structure-based protein designers. We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs), which have learned massive sequential evolutionary knowledge from the universe of natural protein sequences, to acquire an immediate capability to design preferable protein sequences for given folds. We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows them with structural awareness. During inference, iterative refinement is performed to effectively optimize the generated protein sequences. Experiments show that LM-Design improves the state-of-the-art results by a large margin, with accuracy gains of 4% to 12% in sequence recovery (e.g., 55.65%/56.63% on CATH 4.2/4.3 single-chain benchmarks, and >60% when designing protein complexes). We provide extensive and in-depth analyses, which verify that LM-Design can (1) indeed leverage both structural and sequential knowledge to accurately handle structurally non-deterministic regions, (2) benefit from scaling data and model size, and (3) generalize to other proteins (e.g., antibodies and de novo proteins).
Less is More: Understanding Word-level Textual Adversarial Attack via n-gram Frequency Descend
Lu, Ning, Liu, Shengcai, Zhang, Zhirui, Wang, Qi, Liu, Haifeng, Tang, Ke
Word-level textual adversarial attacks have achieved striking performance in fooling natural language processing models. However, the fundamental questions of why these attacks are effective, and the intrinsic properties of the adversarial examples (AEs), are still not well understood. This work attempts to interpret textual attacks through the lens of $n$-gram frequency. Specifically, it is revealed that existing word-level attacks exhibit a strong tendency toward generation of examples with $n$-gram frequency descend ($n$-FD). Intuitively, this finding suggests a natural way to improve model robustness by training the model on the $n$-FD examples. To verify this idea, we devise a model-agnostic and gradient-free AE generation approach that relies solely on the $n$-gram frequency information, and further integrate it into the recently proposed convex hull framework for adversarial training. Surprisingly, the resultant method performs quite similarly to the original gradient-based method in terms of model robustness. These findings provide a human-understandable perspective for interpreting word-level textual adversarial attacks, and a new direction to improve model robustness.
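To make the frequency comparison concrete, the sketch below counts $n$-grams in a small reference corpus and checks whether a word substitution lowers the total $n$-gram frequency of a sentence. This is a simplified reading of $n$-FD for illustration only; the paper's exact definition may differ, and the toy corpus, whitespace tokenization, and function names are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(corpus, n):
    """Count n-gram occurrences over a list of whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(ngrams(sentence.split(), n))
    return counts

def total_frequency(sentence, counts, n):
    """Sum of corpus counts of the sentence's n-grams (unseen n-grams count 0)."""
    return sum(counts[g] for g in ngrams(sentence.split(), n))

def is_frequency_descend(original, perturbed, counts, n):
    """True if the perturbed sentence has lower total n-gram frequency."""
    return total_frequency(perturbed, counts, n) < total_frequency(original, counts, n)

corpus = ["the movie was good", "the movie was bad", "the film was good"]
counts = ngram_counts(corpus, 1)
original = "the movie was good"
perturbed = "the flick was good"   # rare synonym replaces a frequent word
print(is_frequency_descend(original, perturbed, counts, 1))
```

Because the check needs only corpus counts, it uses no model gradients, which is in the spirit of the model-agnostic, gradient-free generation approach the abstract describes.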
Learning disentangled representations for explainable chest X-ray classification using Dirichlet VAEs
Harkness, Rachael, Frangi, Alejandro F, Zucker, Kieran, Ravikumar, Nishant
This study explores the use of the Dirichlet Variational Autoencoder (DirVAE) for learning disentangled latent representations of chest X-ray (CXR) images. Our working hypothesis is that distributional sparsity, as facilitated by the Dirichlet prior, will encourage disentangled feature learning for the complex task of multi-label classification of CXR images. The DirVAE is trained using CXR images from the CheXpert database, and the predictive capacity of multi-modal latent representations learned by DirVAE models is investigated through implementation of an auxiliary multi-label classification task, with a view to enforcing separation of latent factors according to class-specific features. The predictive performance and explainability of the latent space learned using the DirVAE were quantitatively and qualitatively assessed, respectively, and compared with a standard Gaussian prior-VAE (GVAE). We introduce a new approach for explainable multi-label classification in which we conduct gradient-guided latent traversals for each class of interest. Study findings indicate that the DirVAE is able to disentangle latent factors into class-specific visual features, a property not afforded by the GVAE, and achieve a marginal increase in predictive performance relative to GVAE. We generate visual examples to show that our explainability method, when applied to the trained DirVAE, is able to highlight regions in CXR images that are clinically relevant to the class(es) of interest and, additionally, can identify cases where classification relies on spurious feature correlations.
Flow-matching -- efficient coarse-graining of molecular dynamics without forces
Köhler, Jonas, Chen, Yaoyi, Krämer, Andreas, Clementi, Cecilia, Noé, Frank
Coarse-grained (CG) molecular simulations have become a standard tool to study molecular processes on time- and length-scales inaccessible to all-atom simulations. Parameterizing CG force fields to match all-atom simulations has mainly relied on force-matching or relative entropy minimization, which require many samples from costly simulations with all-atom or CG resolutions, respectively. Here we present flow-matching, a new training method for CG force fields that combines the advantages of both methods by leveraging normalizing flows, a generative deep learning method. Flow-matching first trains a normalizing flow to represent the CG probability density, which is equivalent to minimizing the relative entropy without requiring iterative CG simulations. Subsequently, the flow generates samples and forces according to the learned distribution in order to train the desired CG free energy model via force matching. Even without requiring forces from the all-atom simulations, flow-matching outperforms classical force-matching by an order of magnitude in terms of data efficiency, and produces CG models that can capture the folding and unfolding transitions of small proteins.
Robust Empirical Risk Minimization with Tolerance
Bhattacharjee, Robi, Hopkins, Max, Kumar, Akash, Yu, Hantao, Chaudhuri, Kamalika
Developing simple, sample-efficient learning algorithms for robust classification is a pressing issue in today's tech-dominated world, and current theoretical techniques requiring exponential sample complexity and complicated improper learning rules fall far from answering the need. In this work we study the fundamental paradigm of (robust) $\textit{empirical risk minimization}$ (RERM), a simple process in which the learner outputs any hypothesis minimizing its training error. RERM famously fails to robustly learn VC classes (Montasser et al., 2019a), a bound we show extends even to `nice' settings such as (bounded) halfspaces. As such, we study a recent relaxation of the robust model called $\textit{tolerant}$ robust learning (Ashtiani et al., 2022) where the output classifier is compared to the best achievable error over slightly larger perturbation sets. We show that under geometric niceness conditions, a natural tolerant variant of RERM is indeed sufficient for $\gamma$-tolerant robust learning of VC classes over $\mathbb{R}^d$, and requires only $\tilde{O}\left( \frac{VC(H)d\log \frac{D}{\gamma\delta}}{\epsilon^2}\right)$ samples for robustness regions of (maximum) diameter $D$.