Goto

Collaborating Authors

 Miyato, Takeru


Artificial Kuramoto Oscillatory Neurons

arXiv.org Machine Learning

We build a new neural network architecture with iterative modules that update N-dimensional oscillatory neurons via a generalization of the Kuramoto model (Kuramoto, 1984), a well-known non-linear dynamical model. The Kuramoto model describes the synchronization of oscillators; each Kuramoto update applies forces to connected oscillators, encouraging them to become aligned or anti-aligned. This process is similar to binding in neuroscience and can be understood as distributed, continuous clustering. Networks with this mechanism therefore tend to compress their representations via synchronization. We incorporate the Kuramoto model into an artificial neural network by applying its governing differential equation to each individual neuron. The resulting artificial Kuramoto oscillatory neurons (AKOrN) can be combined with layer architectures such as fully connected layers, convolutions, and attention mechanisms. We explore the capabilities of AKOrN and find that its neuronal mechanism drastically changes the behavior of the network. AKOrN strongly binds object features, performing competitively with slot-based models in object discovery; enhances the reasoning capability of self-attention; and increases robustness against random, adversarial, and natural perturbations, with surprisingly good calibration.
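The Kuramoto-style update described in the abstract can be illustrated with a minimal sketch: each neuron is a unit vector on the N-sphere, a coupling matrix applies forces between neurons, and only the tangential component of the force is used so that neurons stay on the sphere. Everything here (the function name, the uniform attractive coupling `J`) is an illustrative toy, not the paper's actual AKOrN implementation:

```python
import numpy as np

def kuramoto_step(x, J, c, dt=0.1):
    """One simplified Kuramoto-style update (illustrative sketch).

    x : (M, N) array of M oscillatory neurons on the unit N-sphere.
    J : (M, M) coupling matrix between neurons.
    c : (M, N) per-neuron conditioning term (a "natural frequency" analogue).
    """
    force = J @ x + c                                    # net force on each oscillator
    # Keep only the component tangential to the sphere at each x_i.
    radial = np.sum(force * x, axis=1, keepdims=True) * x
    x_new = x + dt * (force - radial)
    # Re-project onto the unit sphere.
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
x /= np.linalg.norm(x, axis=1, keepdims=True)
J = np.full((8, 8), 0.5)                                 # uniform attractive coupling
for _ in range(200):
    x = kuramoto_step(x, J, np.zeros_like(x))
# Under attractive coupling the oscillators synchronize (align with each other),
# which is the "binding as clustering" behavior the abstract refers to.
```

With mixed positive and negative couplings the same update instead pushes groups of neurons to anti-align, which is how the mechanism can act as a distributed clustering of features.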


GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers

arXiv.org Machine Learning

As transformers are equivariant to the permutation of input tokens, encoding the positional information of tokens is necessary for many tasks. However, since existing positional encoding schemes were initially designed for NLP tasks, their suitability for vision tasks, which typically exhibit different structural properties in their data, is questionable. We argue that existing positional encoding schemes are suboptimal for 3D vision tasks, as they do not respect the underlying 3D geometric structure. Based on this hypothesis, we propose a geometry-aware attention mechanism that encodes the geometric structure of tokens as a relative transformation determined by the geometric relationship between queries and key-value pairs. By evaluating on multiple novel view synthesis (NVS) datasets in the sparse wide-baseline multi-view setting, we show that our attention mechanism, called Geometric Transform Attention (GTA), improves the learning efficiency and performance of state-of-the-art transformer-based NVS models without any additional learned parameters and with only minor computational overhead. The transformer model (Vaswani et al., 2017), which is composed of a stack of permutation-symmetric layers, processes input tokens as a set and lacks direct awareness of the tokens' structural information. Consequently, transformer models cannot by themselves perceive the structure of the input tokens, such as word order in NLP or the 2D positions of image pixels or patches in image processing. A common way to make transformers position-aware is through vector embeddings: in NLP, a typical approach is to transform the position values of the word tokens into embedding vectors that are added to the input tokens or the attention weights (Vaswani et al., 2017; Shaw et al., 2018). While initially designed for NLP, these positional encoding techniques are widely used for 2D and 3D vision tasks today (Wang et al., 2018; Dosovitskiy et al., 2021; Sajjadi et al., 2022b; Du et al., 2023).
Here, a natural question arises: are existing encoding schemes suitable for tasks with very different geometric structures? Consider, for example, 3D vision tasks using multi-view images paired with camera transformations. The 3D Euclidean symmetry behind multi-view images is a more intricate structure than a 1D sequence of words. With the typical vector embedding approach, the model is tasked with uncovering the useful camera-pose information embedded in the tokens and consequently struggles to understand the effect of non-commutative Euclidean transformations.
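The relative-transform idea can be made concrete with a toy sketch: attach an orthogonal pose matrix to each token and let attention scores depend only on the relative transform between query and key poses, which makes the output invariant to a global rotation applied to all poses. This is a simplification (the paper works with camera transformations and structured group representations), and all names here are hypothetical:

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def geometric_attention(q, k, v, R):
    """Toy geometry-aware attention.

    q, k, v : (T, d) token features.
    R       : (T, d, d) orthogonal pose matrix per token.
    Scores equal q_i^T (R_i^T R_j) k_j, i.e. they depend only on the
    relative transform between the poses of query i and key j.
    """
    qg = np.einsum('tij,tj->ti', R, q)          # rotate each token by its pose
    kg = np.einsum('tij,tj->ti', R, k)
    vg = np.einsum('tij,tj->ti', R, v)
    att = softmax(qg @ kg.T / np.sqrt(q.shape[1]))
    out = att @ vg
    return np.einsum('tji,tj->ti', R, out)      # back into each query's frame

rng = np.random.default_rng(0)
T, d = 5, 4
q, k, v = rng.normal(size=(3, T, d))
R = np.stack([np.linalg.qr(rng.normal(size=(d, d)))[0] for _ in range(T)])
G = np.linalg.qr(rng.normal(size=(d, d)))[0]    # a global rotation of all poses
out = geometric_attention(q, k, v, R)
out_g = geometric_attention(q, k, v, np.einsum('ij,tjk->tik', G, R))
# out and out_g agree: only the relative poses matter, never the global frame.
```

This invariance is the property a vector positional embedding added to tokens does not give the model for free, which is the gap the abstract points at.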


Neural Fourier Transform: A General Approach to Equivariant Representation Learning

arXiv.org Artificial Intelligence

Symmetry learning has proven to be an effective approach for extracting the hidden structure of data, with the concept of equivariance playing the central role. However, most current studies build on architectural theory and corresponding assumptions about the form of the data. We propose Neural Fourier Transform (NFT), a general framework for learning the latent linear action of a group without assuming explicit knowledge of how the group acts on the data. We present the theoretical foundations of NFT and show that the existence of a linear equivariant feature, which has been assumed ubiquitously in equivariance learning, is equivalent to the existence of a group-invariant kernel on the data space. We also provide experimental results to demonstrate the application of NFT in typical scenarios with varying levels of knowledge about the acting group.


Contrastive Representation Learning with Trainable Augmentation Channel

arXiv.org Machine Learning

In contrastive representation learning, a data representation is trained so that it can identify image instances even when the images are altered by augmentations. However, depending on the dataset, some augmentations can damage the information in the images beyond recognition, and such augmentations can result in collapsed representations. We present a partial solution to this problem by formalizing a stochastic encoding process in which there exists a tug-of-war between the data corruption introduced by the augmentations and the information preserved by the encoder. We show that, with the InfoMax objective based on this framework, we can learn a data-dependent distribution of augmentations to avoid the collapse of the representation.


Robustness to Adversarial Perturbations in Learning from Incomplete Data

arXiv.org Machine Learning

Robustness to adversarial perturbations has become an essential feature in the design of modern classifiers, in particular of deep neural networks. This line of work originates from several empirical observations, such as [1] and [2], which show that deep networks are vulnerable to adversarial attacks in the input space. So far, plenty of novel methodologies have been introduced to compensate for this shortcoming: Adversarial Training (AT) [3], Virtual AT [4], and Distillation [5] are just a few of the promising methods in this area. The majority of these approaches seek an effective defense against a point-wise adversary, who shifts each input data point toward an adversarial direction independently. However, as shown by [6], a distributional adversary who can shift the data distribution instead of individual data points is provably more detrimental to learning. This suggests that one can greatly improve the robustness of a classifier by improving its defense against a distributional adversary rather than a point-wise one. This motivation has led to the development of Distributionally Robust Learning (DRL) [7], which has attracted intensive research interest over the last few years [8, 9, 10, 11]. Despite all the advancements in supervised and unsupervised DRL, research tackling this problem from a semi-supervised angle is slim to none [12].


Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning

arXiv.org Machine Learning

We propose a new regularization method based on virtual adversarial loss: a new measure of local smoothness of the conditional label distribution given input. Virtual adversarial loss is defined as the robustness of the conditional label distribution around each input data point against local perturbation. Unlike adversarial training, our method defines the adversarial direction without label information and is hence applicable to semi-supervised learning. Because the directions in which we smooth the model are only "virtually" adversarial, we call our method virtual adversarial training (VAT). The computational cost of VAT is relatively low. For neural networks, the approximated gradient of virtual adversarial loss can be computed with no more than two pairs of forward- and back-propagations. In our experiments, we applied VAT to supervised and semi-supervised learning tasks on multiple benchmark datasets. With a simple enhancement of the algorithm based on the entropy minimization principle, our VAT achieves state-of-the-art performance for semi-supervised learning tasks on SVHN and CIFAR-10.
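The power-iteration approximation mentioned above can be sketched on a toy linear-softmax classifier. In this sketch, numerical finite differences stand in for the backpropagation the method actually uses, and `xi` is taken larger than usual so the finite differences are well conditioned; all names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def vat_perturbation(x, W, xi=0.1, eps=0.5, n_power=1, h=1e-4):
    """One power-iteration step of VAT for the toy model p(y|x) = softmax(W @ x).

    Returns r_vadv, the direction (scaled to norm eps) that most increases
    the KL divergence between the outputs at x and at x + r. No labels are used.
    """
    p = softmax(W @ x)                       # current prediction, no label needed
    d = np.random.default_rng(0).normal(size=x.shape)
    d /= np.linalg.norm(d)
    for _ in range(n_power):
        r = xi * d
        g = np.zeros_like(d)
        for i in range(x.size):              # finite-difference gradient of the KL
            rp, rm = r.copy(), r.copy()
            rp[i] += h
            rm[i] -= h
            g[i] = (kl(p, softmax(W @ (x + rp)))
                    - kl(p, softmax(W @ (x + rm)))) / (2 * h)
        d = g / (np.linalg.norm(g) + 1e-12)  # power-iteration normalization
    return eps * d                           # r_vadv

W = np.array([[1.0, -0.5, 0.2],
              [0.3,  0.8, -1.0]])
x = np.array([0.5, -1.0, 2.0])
r_vadv = vat_perturbation(x, W)
```

Because the adversarial direction is defined from the model's own predictions rather than from labels, the same perturbation can be computed for unlabeled points, which is what makes the method applicable to semi-supervised learning.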


Neural Multi-scale Image Compression

arXiv.org Machine Learning

This study presents a new lossy image compression method that utilizes the multi-scale features of natural images. Our model consists of two networks: a multi-scale lossy autoencoder and a parallel multi-scale lossless coder. The multi-scale lossy autoencoder extracts multi-scale image features into quantized variables, and the parallel multi-scale lossless coder enables rapid and accurate lossless coding of the quantized variables by encoding/decoding them in parallel. Our proposed model achieves performance comparable to the state-of-the-art model on the Kodak and RAISE-1k datasets; it encodes a PNG image of size $768 \times 512$ in 70 ms with a single GPU and a single CPU process and decodes it into a high-fidelity image in approximately 200 ms.


Spectral Normalization for Generative Adversarial Networks

arXiv.org Machine Learning

One of the challenges in the study of generative adversarial networks is the instability of their training. In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. Our new normalization technique is computationally light and easy to incorporate into existing implementations. We tested the efficacy of spectral normalization on the CIFAR-10, STL-10, and ILSVRC2012 datasets, and we experimentally confirmed that spectrally normalized GANs (SN-GANs) are capable of generating images of better or equal quality relative to previous training stabilization techniques.
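The core of the technique is dividing each weight matrix by its largest singular value, estimated cheaply by power iteration. A minimal sketch (in practice a single iteration per training step is amortized by reusing the estimate `u`; here we iterate until convergence for clarity):

```python
import numpy as np

def spectral_normalize(W, n_iter=50):
    """Normalize W by its spectral norm, estimated via power iteration."""
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v          # estimate of the largest singular value
    return W / sigma

W = np.random.default_rng(1).normal(size=(6, 4))
W_sn = spectral_normalize(W)
# W_sn has spectral norm ~1, so the layer is (approximately) 1-Lipschitz.
```

Constraining every discriminator layer this way bounds the Lipschitz constant of the whole discriminator, which is the source of the training stability the abstract reports.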


cGANs with Projection Discriminator

arXiv.org Machine Learning

We propose a novel, projection-based way to incorporate conditional information into the discriminator of GANs that respects the role of the conditional information in the underlying probabilistic model. This approach is in contrast with most frameworks of conditional GANs used in applications today, which use the conditional information by concatenating the (embedded) conditional vector to the feature vectors. With this modification, we were able to significantly improve the quality of class-conditional image generation on the ILSVRC2012 (ImageNet) 1000-class image dataset over the current state-of-the-art result, and we achieved this with a single pair of a discriminator and a generator. We were also able to extend the application to super-resolution and succeeded in producing highly discriminative super-resolution images. The new structure also enabled high-quality category transformation based on parametric functional transformation of the conditional batch normalization layers in the generator.
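The projection formulation can be written in a few lines: the discriminator's output is an unconditional term plus an inner product between a learned class embedding and the shared features, f(x, y) = v_y^T phi(x) + psi(phi(x)), rather than a network applied to a concatenation of features and label embedding. A minimal sketch with hypothetical names:

```python
import numpy as np

def projection_head(phi_x, y, V, w, b):
    """Projection discriminator output f(x, y) = v_y . phi(x) + psi(phi(x)).

    phi_x : (d,) feature vector from the shared discriminator body.
    y     : integer class label.
    V     : (num_classes, d) learned class embedding matrix.
    w, b  : parameters of a linear unconditional head psi.
    """
    return float(V[y] @ phi_x + w @ phi_x + b)

rng = np.random.default_rng(0)
d, num_classes = 8, 10
phi_x = rng.normal(size=d)
V = rng.normal(size=(num_classes, d))
w, b = rng.normal(size=d), 0.1
scores = [projection_head(phi_x, y, V, w, b) for y in range(num_classes)]
```

The class label enters only through a linear projection of the features, so the difference between the scores of two classes is exactly (v_y - v_y')^T phi(x), mirroring the log-density-ratio structure of the underlying probabilistic model that motivates the design.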


Learning Discrete Representations via Information Maximizing Self-Augmented Training

arXiv.org Machine Learning

Learning discrete representations of data is a central machine learning task because of the compactness of the representations and their ease of interpretation. The task includes clustering and hash learning as special cases. Deep neural networks are promising for this task because they can model the non-linearity of data and scale to large datasets. However, their model complexity is huge, so the networks must be carefully regularized in order to learn useful representations that exhibit the invariance intended for the application of interest. To this end, we propose a method called Information Maximizing Self-Augmented Training (IMSAT). In IMSAT, we use data augmentation to impose invariance on the discrete representations: we encourage the predicted representations of augmented data points to be close to those of the original data points, in an end-to-end fashion. At the same time, we maximize the information-theoretic dependency between the data and their predicted discrete representations. Extensive experiments on benchmark datasets show that IMSAT produces state-of-the-art results for both clustering and unsupervised hash learning.
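The two terms of the objective (mutual-information maximization and invariance to augmentation) can be computed directly from predicted class probabilities. This is an illustrative sketch with hypothetical names, not the paper's exact formulation:

```python
import numpy as np

def entropy(p, axis=-1):
    return -np.sum(p * np.log(p + 1e-12), axis=axis)

def imsat_loss(p, p_aug, lam=0.1):
    """Sketch of an IMSAT-style objective (to be minimized).

    p, p_aug : (B, K) predicted cluster probabilities for a batch of
               original and augmented data points.
    """
    # Mutual information I(X; Y) = H(marginal) - mean conditional entropy:
    # high when clusters are balanced AND individual predictions are confident.
    marginal = p.mean(axis=0)
    mutual_info = entropy(marginal) - entropy(p, axis=1).mean()
    # Self-augmented training: predictions should be invariant to augmentation.
    sat = np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(p_aug + 1e-12)), axis=1))
    return sat - lam * mutual_info

p_confident = np.eye(4)            # one-hot, balanced over 4 clusters
p_uniform = np.full((4, 4), 0.25)  # maximally uncertain predictions
loss_conf = imsat_loss(p_confident, p_confident)
loss_unif = imsat_loss(p_uniform, p_uniform)
# Confident, balanced, augmentation-invariant assignments score lower (better).
```

Minimizing this loss therefore pushes the network toward confident, balanced cluster assignments that do not change under the chosen augmentations, which is the combination the abstract describes.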