Kato, Keizo
Rethinking VLMs and LLMs for Image Classification
Cooper, Avi, Kato, Keizo, Shih, Chia-Hsien, Yamane, Hiroaki, Vinken, Kasper, Takemoto, Kentaro, Sunagawa, Taro, Yeh, Hao-Wei, Yamanaka, Jin, Mason, Ian, Boix, Xavier
Visual Language Models (VLMs) are increasingly being merged with Large Language Models (LLMs) to enable new capabilities, particularly improved interactivity and open-ended responsiveness. While these capabilities are remarkable, the contribution of LLMs to the longstanding core problem of classifying an image among a set of choices remains unclear. Through extensive experiments involving seven models, ten visual understanding datasets, and multiple prompt variations per dataset, we find that, for object and scene recognition, VLMs that do not leverage LLMs can achieve better performance than VLMs that do. At the same time, leveraging LLMs can improve performance on tasks requiring reasoning and outside knowledge. In response, we propose a pragmatic solution: a lightweight fix in which a relatively small LLM efficiently routes each visual task to the model best suited to it. The LLM router is trained on a dataset constructed from more than 2.5 million examples of (visual task, model accuracy) pairs. Our results reveal that this lightweight fix matches or surpasses the accuracy of state-of-the-art alternatives, including GPT-4V and HuggingGPT, while improving cost-effectiveness.
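A minimal sketch of the routing idea follows (illustrative only; the names ACCURACY_RECORDS, best_model_per_task, and route_task, and the keyword heuristic, are hypothetical stand-ins, not the paper's implementation): per-model accuracy records are reduced to task-to-best-model labels, and a new task description is routed to the model expected to perform best.

from collections import defaultdict

# Toy stand-in for the (visual task, model, accuracy) records described in the abstract.
ACCURACY_RECORDS = [
    {"task": "classify the dog breed in the photo", "model": "contrastive-vlm", "accuracy": 0.91},
    {"task": "classify the dog breed in the photo", "model": "llm-based-vlm", "accuracy": 0.84},
    {"task": "explain why the man in the image is laughing", "model": "contrastive-vlm", "accuracy": 0.42},
    {"task": "explain why the man in the image is laughing", "model": "llm-based-vlm", "accuracy": 0.77},
]

def best_model_per_task(records):
    """Reduce accuracy records to task -> best-model labels (the router's training targets)."""
    best = defaultdict(lambda: (None, -1.0))
    for r in records:
        if r["accuracy"] > best[r["task"]][1]:
            best[r["task"]] = (r["model"], r["accuracy"])
    return {task: model for task, (model, _) in best.items()}

def route_task(task_description, labels):
    """Route a task: reuse a known label if available, otherwise fall back to a crude
    keyword heuristic standing in for the small fine-tuned LLM router."""
    if task_description in labels:
        return labels[task_description]
    reasoning_cues = ("why", "how", "explain", "count", "read")
    needs_reasoning = any(cue in task_description.lower() for cue in reasoning_cues)
    return "llm-based-vlm" if needs_reasoning else "contrastive-vlm"

if __name__ == "__main__":
    labels = best_model_per_task(ACCURACY_RECORDS)
    print(route_task("classify the scene in this photo", labels))      # recognition -> contrastive-vlm
    print(route_task("explain why the child is crying here", labels))  # reasoning -> llm-based-vlm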
Quantitative Understanding of VAE by Interpreting ELBO as Rate Distortion Cost of Transform Coding
Nakagawa, Akira, Kato, Keizo
Variational autoencoder (VAE) estimates the posterior parameters (mean and variance) of the latent variables corresponding to each input data point. While it is used for many tasks, the transparency of the model remains an open issue. This paper provides a quantitative understanding of VAE properties by interpreting VAE as a non-linearly scaled isometric embedding. According to rate-distortion theory, the optimal transform coding is achieved by a PCA-like orthonormal transform in which the transform space is isometric to the input. From this analogy, we show theoretically and experimentally that VAE can be mapped to an implicit isometric embedding with a scale factor derived from the posterior parameter. As a result, we can estimate the data probabilities in the input space from the prior, the loss metrics, and the corresponding posterior parameters. In addition, the quantitative importance of each latent variable can be evaluated in the same way as the eigenvalues of PCA.

Variational autoencoder (VAE) (Kingma & Welling, 2014) is one of the most successful generative models, estimating posterior parameters of latent variables for each input data point. In VAE, the latent representation is obtained by maximizing an evidence lower bound (ELBO). A number of studies (Higgins et al., 2017; Kim & Mnih, 2018; Lopez et al., 2018; Chen et al., 2018; Locatello et al., 2019; Rolínek et al., 2019) have tried to reveal the properties of the latent variables. Alemi et al. (2018) analysed the rate-distortion (RD) tradeoff that arises when maximizing ELBO. However, the quantitative behavior of the latent space at the optimal RD tradeoff is still not well understood. RD theory (Berger, 1971), which has been successfully applied to image compression, shows that a PCA-like orthonormal transform with uniform coding noise optimizes the RD tradeoff.
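For reference, the ELBO maximized here can be written in standard notation (a textbook form, not the paper's specific expression) as

ELBO(x) = E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z)) <= log p(x),

where, in the rate-distortion reading of Alemi et al. (2018), the negative reconstruction term plays the role of distortion and the KL term plays the role of rate.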
Rate-Distortion Optimization Guided Autoencoder for Generative Approach with quantitatively measurable latent space
Kato, Keizo, Zhou, Jing, Nakagawa, Akira
In the generative-model approach to machine learning, it is essential to acquire an accurate probabilistic model and to compress the data dimension for easy handling. However, in conventional deep-autoencoder-based generative models such as VAE, the probability in the real space cannot be obtained correctly from that in the latent space, because the scaling between the two spaces is not controlled. This has also been an obstacle to quantifying the impact of latent-variable variations on the data. In this paper, we propose a Rate-Distortion Optimization guided autoencoder, in which the Jacobian matrix from the real space to the latent space is orthonormal. It is proved theoretically and verified experimentally that (i) the probability distribution in the latent space obtained by this model is proportional to the probability distribution in the real space, because the Jacobian between the two spaces is constant; and (ii) our model behaves as a nonlinear PCA, in which the energy of the acquired latent space is concentrated in a few principal components and the influence of each component can be evaluated quantitatively. Furthermore, to verify its usefulness in a practical application, we evaluate its performance on unsupervised anomaly detection, where it outperforms current state-of-the-art methods.

1 INTRODUCTION
Capturing the inherent features of high-dimensional and complex data is an essential issue in machine learning. The generative-model approach learns the probability distribution of data, aiming at data generation by probabilistic sampling, unsupervised/weakly supervised learning, and acquisition of meta-priors (general assumptions about how data can be summarized naturally, such as disentanglement, clustering, and hierarchical structure (Bengio et al., 2013; Tschannen et al., 2019)). It is generally difficult to directly estimate the probability density function (PDF) P_x(x) of real data x. Accordingly, one promising approach is to map the data to a latent space z of reduced dimension and capture the PDF P_z(z). In recent years, deep-autoencoder-based methods have made it possible to compress dimensions and derive latent variables. While there has been remarkable progress in these areas (van den Oord et al., 2017; Kingma et al., 2014; Jiang et al., 2016), the relation between x and z in current deep generative models is still not clear. VAE (Kingma & Welling, 2014) is one of the most successful generative models for capturing latent representations. In VAE, a lower bound on the log-likelihood of P_x(x) is introduced as the ELBO, and the latent variables are obtained by maximizing it.
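Claim (i) can be sketched with the standard change-of-variables relation (generic notation, not the paper's): for a decoder x = g(z) with Jacobian J_g = dx/dz,

P_x(x) = P_z(z) / sqrt(det(J_g^T J_g)),

so if the columns of J_g are orthonormal up to a constant scale c, the denominator equals the constant c^m for latent dimension m, and hence P_x(x) is proportional to P_z(z).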