Goto

Collaborating Authors

 affine transform


On Uniformly Scaling Flows: A Density-Aligned Approach to Deep One-Class Classification

Zaid, Faried Abu, Katzke, Tim, Müller, Emmanuel, Neider, Daniel

arXiv.org Artificial Intelligence

Unsupervised anomaly detection is often framed around two widely studied paradigms. Deep one-class classification, exemplified by Deep SVDD, learns compact latent representations of normality, while density estimators realized by normalizing flows directly model the likelihood of nominal data. In this work, we show that uniformly scaling flows (USFs), normalizing flows with a constant Jacobian determinant, precisely connect these approaches. Specifically, we prove how training a USF via maximum-likelihood reduces to a Deep SVDD objective with a unique regularization that inherently prevents representational collapse. This theoretical bridge implies that USFs inherit both the density faithfulness of flows and the distance-based reasoning of one-class methods. We further demonstrate that USFs induce a tighter alignment between negative log-likelihood and latent norm than either Deep SVDD or non-USFs, and how recent hybrid approaches combining one-class objectives with VAEs can be naturally extended to USFs. Consequently, we advocate using USFs as a drop-in replacement for non-USFs in modern anomaly detection architectures. Empirically, this substitution yields consistent performance gains and substantially improved training stability across multiple benchmarks and model backbones for both image-level and pixel-level detection. These results unify two major anomaly detection paradigms, advancing both theoretical understanding and practical performance.


Weakly Supervised Representation Learning with Sparse Perturbations Kartik Ahuja Jason Hartford

Neural Information Processing Systems

The theory of representation learning aims to build methods that provably invert the data generating process with minimal domain knowledge or any source of supervision. Most prior approaches require strong distributional assumptions on the latent variables and weak supervision (auxiliary information such as timestamps) to provide provable identification guarantees. In this work, we show that if one has weak supervision from observations generated by sparse perturbations of the latent variables-e.g.


Reviews: Unsupervised Depth Estimation, 3D Face Rotation and Replacement

Neural Information Processing Systems

In particular, the method estimates the depth of the 2D keypoints of the source images using information from both images, and the method estimates the 3D-to-2D affine transform from the source to the target. With this transformation, a traditional keypoint-based face warping (implemented in OpenGL) algorithm and CycleGAN are used to map the source image to the target image. Note that the estimation of the depth and affine transform can either depends on only the 2D keypoints or both the keypoints and images.


Transformers are Universal In-context Learners

Furuya, Takashi, de Hoop, Maarten V., Peyré, Gabriel

arXiv.org Machine Learning

Transformers are deep architectures that define "in-context mappings" which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for vision transformers). This work studies in particular the ability of these architectures to handle an arbitrarily large number of context tokens. To mathematically and uniformly address the expressivity of these architectures, we consider the case that the mappings are conditioned on a context represented by a probability distribution of tokens (discrete for a finite number of tokens). The related notion of smoothness corresponds to continuity in terms of the Wasserstein distance between these contexts. We demonstrate that deep transformers are universal and can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains. A key aspect of our results, compared to existing findings, is that for a fixed precision, a single transformer can operate on an arbitrary (even infinite) number of tokens. Additionally, it operates with a fixed embedding dimension of tokens (this dimension does not increase with precision) and a fixed number of heads (proportional to the dimension). The use of MLP layers between multi-head attention layers is also explicitly controlled.


Anatomy-aware and acquisition-agnostic joint registration with SynthMorph

Hoffmann, Malte, Hoopes, Andrew, Greve, Douglas N., Fischl, Bruce, Dalca, Adrian V.

arXiv.org Artificial Intelligence

Affine image registration is a cornerstone of medical-image analysis. While classical algorithms can achieve excellent accuracy, they solve a time-consuming optimization for every image pair. Deep-learning (DL) methods learn a function that maps an image pair to an output transform. Evaluating the function is fast, but capturing large transforms can be challenging, and networks tend to struggle if a test-image characteristic shifts from the training domain, such as resolution. Most affine methods are agnostic to anatomy, meaning the registration will be inaccurate if algorithms consider all structures in the image. We address these shortcomings with SynthMorph, an easy-to-use DL tool for joint affine-deformable registration of any brain image without preprocessing, right off the MRI scanner. First, we leverage a strategy to train networks with wildly varying images synthesized from label maps, yielding robust performance across acquisition specifics unseen at training. Second, we optimize the spatial overlap of select anatomical labels. This enables networks to distinguish anatomy of interest from irrelevant structures, removing the need for preprocessing that excludes content which would impinge on anatomy-specific registration. Third, we combine the affine model with a deformable hypernetwork that lets users choose the optimal deformation-field regularity for their specific data, at registration time, in a fraction of the time required by classical methods. We rigorously analyze how competing architectures learn affine transforms and compare state-of-the-art registration tools across an extremely diverse set of neuroimaging data, aiming to truly capture the behavior of methods in the real world. SynthMorph demonstrates consistent and improved accuracy. It is available at https://w3id.org/synthmorph, as a single complete end-to-end solution for registration of brain MRI.


Transformation-Based Models of Video Sequences

van Amersfoort, Joost, Kannan, Anitha, Ranzato, Marc'Aurelio, Szlam, Arthur, Tran, Du, Chintala, Soumith

arXiv.org Artificial Intelligence

In this work we propose a simple unsupervised approach for next frame prediction in video. Instead of directly predicting the pixels in a frame given past frames, we predict the transformations needed for generating the next frame in a sequence, given the transformations of the past frames. This leads to sharper results, while using a smaller prediction model. In order to enable a fair comparison between different video frame prediction models, we also propose a new evaluation protocol. We use generated frames as input to a classifier trained with ground truth sequences. This criterion guarantees that models scoring high are those producing sequences which preserve discriminative features, as opposed to merely penalizing any deviation, plausible or not, from the ground truth. Our proposed approach compares favourably against more sophisticated ones on the UCF-101 data set, while also being more efficient in terms of the number of parameters and computational cost. There has been an increased interest in unsupervised learning of representations from video sequences (Mathieu et al., 2016; Srivastava et al., 2015; Vondrick et al., 2016). A popular formulation of the task is to learn to predict a small number of future frames given the previous K frames; the motivation being that predicting future frames requires understanding how objects interact and what plausible sequences of motion are. These methods directly aim to predict pixel values, with either MSE loss or adversarial loss.


On an Interpretation of ResNets via Solution Constructions

Huang, Changcun

arXiv.org Artificial Intelligence

He et al. (2016a) introduced a type of shortcut connection in the architecture of a feedforward neural network, which has been proved effective in the learning of particularly deep neural networks. The modified architecture is called residual network (ResNet), which is widely applied and nearly becomes a standard component of network architectures, such as in Transformer (Vaswani et al., 2017). Note that in He et al. (2016b), another proposed shortcut connection is slightly different from that of He et al. (2016a) in whether or not a ReLU is used after an addition operation. Both of the above two versions are called ResNet and this paper will study the former one.


Universal Solutions of Feedforward ReLU Networks for Interpolations

Huang, Changcun

arXiv.org Artificial Intelligence

This paper provides a theoretical framework on the solution of feedforward ReLU networks for interpolations, in terms of what is called an interpolation matrix, which is the summary, extension and generalization of our three preceding works, with the expectation that the solution of engineering could be included in this framework and finally understood. To three-layer networks, we classify different kinds of solutions and model them in a normalized form; the solution finding is investigated by three dimensions, including data, networks and the training; the mechanism of a type of overparameterization solution is interpreted. To deep-layer networks, we present a general result called sparse-matrix principle, which could describe some basic behavior of deep layers and explain the phenomenon of the sparse-activation mode that appears in engineering applications associated with brain science; an advantage of deep layers compared to shallower ones is manifested in this principle. As applications, a general solution of deep neural networks for classifications is constructed by that principle; and we also use the principle to study the data-disentangling property of encoders. Analogous to the three-layer case, the solution of deep layers is also explored through several dimensions. The mechanism of multi-output neural networks is explained from the perspective of interpolation matrices.


Theoretical Exploration of Solutions of Feedforward ReLU networks

Huang, Changcun

arXiv.org Artificial Intelligence

This paper aims to interpret the mechanism of feedforward ReLU networks by exploring their solutions for piecewise linear functions through basic rules. The constructed solutions should be universal enough to explain the network architectures of engineering. In order for that, we borrow the methodology of theoretical physics to develop the theories. Some of the consequences of our theories include: Under geometric backgrounds, the solutions of both three-layer networks and deep-layer networks are presented, and the solution universality is ensured by several ways; We give clear and intuitive interpretations of each component of network architectures, such as the parameter-sharing mechanism for multi-output, the function of each layer, the advantage of deep layers, the redundancy of parameters, and so on. We explain three typical network architectures: the subnetwork of last three layers of convolutional networks, multi-layer feedforward networks, and the decoder of autoencoders. This paper is expected to provide a basic foundation of theories of feedforward ReLU networks for further investigations.


Convolutional Normalization: Improving Deep Convolutional Network Robustness and Training

Liu, Sheng, Li, Xiao, Zhai, Yuexiang, You, Chong, Zhu, Zhihui, Fernandez-Granda, Carlos, Qu, Qing

arXiv.org Artificial Intelligence

Normalization techniques have become a basic component in modern convolutional neural networks (ConvNets). In particular, many recent works demonstrate that promoting the orthogonality of the weights helps train deep models and improve robustness. For ConvNets, most existing methods are based on penalizing or normalizing weight matrices derived from concatenating or flattening the convolutional kernels. These methods often destroy or ignore the benign convolutional structure of the kernels; therefore, they are often expensive or impractical for deep ConvNets. In contrast, we introduce a simple and efficient ``convolutional normalization'' method that can fully exploit the convolutional structure in the Fourier domain and serve as a simple plug-and-play module to be conveniently incorporated into any ConvNets. Our method is inspired by recent work on preconditioning methods for convolutional sparse coding and can effectively promote each layer's channel-wise isometry. Furthermore, we show that convolutional normalization can reduce the layerwise spectral norm of the weight matrices and hence improve the Lipschitzness of the network, leading to easier training and improved robustness for deep ConvNets. Applied to classification under noise corruptions and generative adversarial network (GAN), we show that convolutional normalization improves the robustness of common ConvNets such as ResNet and the performance of GAN. We verify our findings via extensive numerical experiments on CIFAR-10, CIFAR-100, and ImageNet.