A Note on Shumailov et al. (2024): `AI Models Collapse When Trained on Recursively Generated Data'
Borji, Ali
The study conducted by Shumailov et al. (2024) demonstrates that repeatedly training a generative model on synthetic data leads to model collapse. This finding has generated considerable interest and debate, particularly given that current models have nearly exhausted the available data. In this work, we investigate what happens when a distribution (via Kernel Density Estimation, or KDE) or a model is fitted to data and then repeatedly sampled from. Our objective is to develop a theoretical understanding of the phenomenon observed by Shumailov et al. (2024). Our results indicate that the reported collapse is a statistical phenomenon and may be unavoidable.
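As a rough illustration of the setup described above (not the paper's exact experimental protocol), the following Python sketch fits a Gaussian KDE to data, resamples from it, refits, and prints how the sample statistics drift across generations; the data size, number of generations, and the Gaussian source distribution are arbitrary choices for illustration.

```python
# Minimal sketch: repeatedly fit a KDE to samples drawn from the previous
# generation's KDE and track how the estimated distribution drifts.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=1000)  # generation 0: "real" data

n_generations = 10
for gen in range(1, n_generations + 1):
    kde = gaussian_kde(data)            # fit a distribution to the current data
    data = kde.resample(1000)[0]        # next generation = samples from that fit
    print(f"generation {gen}: mean={data.mean():+.3f}, std={data.std():.3f}")
```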
Addressing a fundamental limitation in deep vision models: lack of spatial attention
Borji, Ali
The primary aim of this manuscript is to underscore a significant limitation of current deep learning models, particularly vision models. Unlike human vision, which efficiently selects only the essential visual regions for further processing, achieving high speed and low energy consumption, deep vision models process the entire image. In this work, we examine this issue from a broader perspective and propose a solution that could pave the way for the next generation of more efficient vision models. In essence, convolution and pooling operations are applied selectively to regions that have changed, and a change map is passed to subsequent layers to indicate which computations need to be repeated. The code is available at https://github.com/aliborji/spatial_attention.
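A minimal PyTorch sketch of the idea, assuming a layer that caches its previous output and receives a binary change map; the function name, the blending strategy, and the max-pool propagation of the change map are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of the general idea (not the repository's code): given a cached
# activation from the previous frame and a binary change map, keep the new
# convolution output only where the input actually changed.
import torch
import torch.nn.functional as F

def gated_conv(prev_out, new_input, change_map, weight, bias=None):
    """Recompute the conv output where change_map is 1; reuse prev_out elsewhere.

    prev_out:   cached layer output for the previous frame, (N, C_out, H, W)
    new_input:  current input to the layer,                 (N, C_in,  H, W)
    change_map: binary mask of altered pixels,              (N, 1,     H, W)
    weight:     conv kernel with odd size,                  (C_out, C_in, k, k)
    """
    k = weight.shape[-1]
    new_out = F.conv2d(new_input, weight, bias, padding=k // 2)
    # A real implementation would restrict the convolution itself to the changed
    # regions; here we recompute densely and blend, which only illustrates the
    # data flow, not the claimed savings.
    out = change_map * new_out + (1 - change_map) * prev_out
    # Downstream layers need to know which of their inputs changed.
    next_change_map = (F.max_pool2d(change_map, kernel_size=k,
                                    stride=1, padding=k // 2) > 0).float()
    return out, next_change_map
```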
Qualitative Failures of Image Generation Models and Their Application in Detecting Deepfakes
Borji, Ali
The ability of image and video generation models to create photorealistic images has reached unprecedented heights, making it difficult to distinguish between real and fake images in many cases. However, despite this progress, a gap remains between the quality of generated images and those found in the real world. To address this, we have reviewed a vast body of literature, from both academic publications and social media, to identify qualitative shortcomings in image generation models, which we have classified into five categories. By understanding these failures, we can identify areas where these models need improvement, as well as develop strategies for detecting deepfakes. The prevalence of deepfakes in today's society is a serious concern, and our findings can help mitigate their negative impact.
FLORIDA: Fake-looking Real Images Dataset
Borji, Ali
Although extensive research has been carried out to evaluate the effectiveness of AI tools and models in detecting deepfakes, it remains unclear whether these models can accurately identify genuine images that appear artificial. In this study, as an initial step towards addressing this issue, we have curated a dataset of 510 genuine images that exhibit a fake appearance and assessed two AI models on it. We show that both models exhibit subpar performance when applied to our dataset. Additionally, our dataset can serve as a valuable tool for assessing the ability of deep learning models to comprehend complex visual stimuli. We anticipate that this research will stimulate further discussions and investigations in this area. Our dataset is accessible at https://github.com/aliborji/FLORIDA.
Key-Value Transformer
Borji, Ali
Transformers have emerged as the prevailing standard solution for various AI tasks, including computer vision and natural language processing. The widely adopted Query, Key, and Value (QKV) formulation has played a significant role in this. Nevertheless, no research has examined whether all three components are essential to transformer performance. Therefore, we evaluated the key-value (KV) formulation, which generates symmetric attention maps, along with an asymmetric version that incorporates a 2D positional encoding into the attention matrix. Remarkably, this transformer requires fewer parameters and less computation than the original one. Through experiments encompassing three task types -- synthetic (such as reversing or sorting a list), vision (MNIST or CIFAR classification), and NLP (character generation and translation) -- we found that the KV transformer occasionally outperforms the QKV transformer. However, it also underperforms QKV in some cases, making it difficult to draw a definitive conclusion. Nonetheless, we consider the reported results encouraging and anticipate that they may pave the way for more efficient transformers in the future.
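A hedged PyTorch sketch of key-value (KV) attention as described above: dropping the queries makes the attention logits K K^T symmetric, and an optional learned 2D positional bias added to the logits makes them asymmetric. The class name, dimensions, and the exact placement of the bias are assumptions, not the paper's code.

```python
import math
import torch
import torch.nn as nn

class KVAttention(nn.Module):
    def __init__(self, dim, max_len=512, asymmetric=False):
        super().__init__()
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = 1.0 / math.sqrt(dim)
        # learned additive bias over (query position, key position) pairs
        self.pos_bias = nn.Parameter(torch.zeros(max_len, max_len)) if asymmetric else None

    def forward(self, x):                                # x: (batch, seq_len, dim)
        k, v = self.key(x), self.value(x)
        logits = k @ k.transpose(-2, -1) * self.scale    # symmetric across positions
        if self.pos_bias is not None:
            n = x.shape[1]
            logits = logits + self.pos_bias[:n, :n]      # breaks the symmetry
        attn = logits.softmax(dim=-1)
        return attn @ v
```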
A Categorical Archive of ChatGPT Failures
Borji, Ali
Large language models have been demonstrated to be valuable in different fields. ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation by comprehending context and generating appropriate responses. It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries, with fluent and comprehensive answers surpassing prior public chatbots in both security and usefulness. However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study. Eleven categories of failures, including reasoning, factual errors, math, coding, and bias, are presented and discussed. The risks, limitations, and societal implications of ChatGPT are also highlighted. The goal of this study is to assist researchers and developers in enhancing future language models and chatbots.
Diverse, Difficult, and Odd Instances (D2O): A New Test Set for Object Classification
Borji, Ali
Test sets are an integral part of evaluating models and gauging progress in object recognition, and more broadly in computer vision and AI. Existing test sets for object recognition, however, suffer from shortcomings such as bias towards ImageNet characteristics and idiosyncrasies (e.g., ImageNet-V2), being limited to certain types of stimuli (e.g., indoor scenes in ObjectNet), and underestimating model performance (e.g., ImageNet-A). To mitigate these problems, we introduce a new test set, called D2O, which is sufficiently different from existing test sets. Images are a mix of generated images and images crawled from the web. They are diverse, unmodified, and representative of real-world scenarios, and they cause state-of-the-art models to misclassify them with high confidence. To emphasize generalization, our dataset by design does not come paired with a training set. It contains 8,060 images spread across 36 categories, 29 of which appear in ImageNet. The best Top-1 accuracy on our dataset is around 60%, much lower than the best Top-1 accuracy of 91% on ImageNet. We find that popular vision APIs perform very poorly in detecting objects over D2O categories such as ``faces'', ``cars'', and ``cats''. Our dataset also comes with a ``miscellaneous'' category, over which we test image tagging models. Overall, our investigations demonstrate that the D2O test set contains a mix of images with varied levels of difficulty and is predictive of the average-case performance of models. It can challenge object recognition models for years to come and can spur more research in this fundamental area.
BinaryVQA: A Versatile Test Set to Evaluate the Out-of-Distribution Generalization of VQA Models
Borji, Ali
We introduce a new test set for visual question answering (VQA) called BinaryVQA to push the limits of VQA models. Our dataset includes 7,800 questions across 1,024 images and covers a wide variety of objects, topics, and concepts. For easy model evaluation, we only consider binary questions. Questions and answers are formulated and verified carefully and manually. Around 63% of the questions have positive answers. The median number of questions per image and the median question length are 7 and 5, respectively. The state-of-the-art OFA model achieves 75% accuracy on the BinaryVQA dataset, significantly lower than its performance on the VQA v2 test-dev dataset (94.7%). We also analyze model behavior along several dimensions, including: a) performance over different categories such as text, counting, and gaze direction, b) model interpretability, c) the effect of question length on accuracy, d) bias of models towards positive answers and the introduction of a new score called ShuffleAcc, and e) sensitivity to spelling and grammar errors. Our investigation demonstrates the difficulty of our dataset and shows that it can challenge VQA models for the next few years. Data and code are publicly available at: DATA and CODE.
SplitMixer: Fat Trimmed From MLP-like Models
Borji, Ali, Lin, Sikun
We present SplitMixer, a simple and lightweight isotropic MLP-like architecture for visual recognition. It contains two types of interleaving convolutional operations to mix information across spatial locations (spatial mixing) and channels (channel mixing). The first sequentially applies two depthwise 1D kernels, instead of a single 2D kernel, to mix spatial information. The second splits the channels into overlapping or non-overlapping segments, with or without shared parameters, and applies our proposed channel mixing approaches or 3D convolution to mix channel information. Depending on design choices, a number of SplitMixer variants can be constructed to balance accuracy, number of parameters, and speed. We show, both theoretically and experimentally, that SplitMixer performs on par with state-of-the-art MLP-like models while having significantly fewer parameters and FLOPs. For example, without strong data augmentation and optimization, SplitMixer achieves around 94% accuracy on CIFAR-10 with only 0.28M parameters, while ConvMixer achieves the same accuracy with about 0.6M parameters. The well-known MLP-Mixer achieves 85.45% with 17.1M parameters. On the CIFAR-100 dataset, SplitMixer achieves around 73% accuracy, on par with ConvMixer, but with about 52% fewer parameters and FLOPs. We hope that our results spark further research towards finding more efficient vision architectures and facilitate the development of MLP-like models. Code is available at https://github.com/aliborji/splitmixer.
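A rough PyTorch sketch of the two ingredients described above, written from the abstract rather than from the released repository: (1) spatial mixing with two depthwise 1D convolutions instead of one depthwise 2D convolution, and (2) channel mixing applied separately to non-overlapping channel segments. The kernel size, segment count, residual placement, and class name are illustrative choices; the actual SplitMixer variants differ in normalization, segment overlap, and parameter sharing.

```python
import torch
import torch.nn as nn

class SplitMixerBlockSketch(nn.Module):
    def __init__(self, dim, kernel_size=5, num_segments=2):
        super().__init__()
        # spatial mixing: depthwise 1xK followed by depthwise Kx1
        self.spatial_h = nn.Conv2d(dim, dim, (1, kernel_size),
                                   padding=(0, kernel_size // 2), groups=dim)
        self.spatial_v = nn.Conv2d(dim, dim, (kernel_size, 1),
                                   padding=(kernel_size // 2, 0), groups=dim)
        # channel mixing: a separate 1x1 convolution per channel segment
        # (assumes dim is divisible by num_segments)
        self.seg = dim // num_segments
        self.channel_mixers = nn.ModuleList(
            [nn.Conv2d(self.seg, self.seg, kernel_size=1) for _ in range(num_segments)]
        )
        self.act = nn.GELU()

    def forward(self, x):                                     # x: (N, dim, H, W)
        x = x + self.act(self.spatial_v(self.spatial_h(x)))   # spatial mixing
        chunks = torch.split(x, self.seg, dim=1)
        mixed = torch.cat([m(c) for m, c in zip(self.channel_mixers, chunks)], dim=1)
        return x + self.act(mixed)                             # channel mixing
```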
Pros and Cons of GAN Evaluation Measures: New Developments
Borji, Ali
This work is an update of a previous paper on the same topic published a few years ago. With the dramatic progress in generative modeling, a suite of new quantitative and qualitative techniques to evaluate models has emerged. Although some measures such as Inception Score, Fr\'echet Inception Distance, Precision-Recall, and Perceptual Path Length are relatively more popular, GAN evaluation is not a settled issue and there is still room for improvement. For example, in addition to quality and diversity of synthesized images, generative models should be evaluated in terms of bias and fairness. I describe new dimensions that are becoming important in assessing models, and discuss the connection between GAN evaluation and deepfakes.