Goto

Collaborating Authors

 Banff


Generative Visual Communication in the Era of Vision-Language Models

arXiv.org Artificial Intelligence

Visual communication, dating back to prehistoric cave paintings, is the use of visual elements to convey ideas and information. In today's visually saturated world, effective design demands an understanding of graphic design principles, visual storytelling, human psychology, and the ability to distill complex information into clear visuals. This dissertation explores how recent advancements in vision-language models (VLMs) can be leveraged to automate the creation of effective visual communication designs. Although generative models have made great progress in generating images from text, they still struggle to simplify complex ideas into clear, abstract visuals and are constrained by pixel-based outputs, which lack flexibility for many design tasks. To address these challenges, we constrain the models' operational space and introduce task-specific regularizations. We explore various aspects of visual communication, namely, sketches and visual abstraction, typography, animation, and visual inspiration.


Explainable AI Approach using Near Misses Analysis

arXiv.org Artificial Intelligence

This paper introduces a novel XAI approach based on near-misses analysis (NMA). This approach reveals a hierarchy of logical 'concepts' inferred from the latent decision-making process of a Neural Network (NN) without delving into its explicit structure. We examined our proposed XAI approach on different network architectures that vary in size and shape (e.g., ResNet, VGG, EfficientNet, MobileNet) on several datasets (ImageNet and CIFAR100). The results demonstrate its usability to reflect NNs latent process of concepts generation. We generated a new metric for explainability. Moreover, our experiments suggest that efficient architectures, which achieve a similar accuracy level with much less neurons may still pay the price of explainability and robustness in terms of concepts generation. We, thus, pave a promising new path for XAI research to follow.


Finding Structure in Language Models

arXiv.org Artificial Intelligence

When we speak, write or listen, we continuously make predictions based on our knowledge of a language's grammar. Remarkably, children acquire this grammatical knowledge within just a few years, enabling them to understand and generalise to novel constructions that have never been uttered before. Language models are powerful tools that create representations of language by incrementally predicting the next word in a sentence, and they have had a tremendous societal impact in recent years. The central research question of this thesis is whether these models possess a deep understanding of grammatical structure similar to that of humans. This question lies at the intersection of natural language processing, linguistics, and interpretability. To address it, we will develop novel interpretability techniques that enhance our understanding of the complex nature of large-scale language models. We approach our research question from three directions. First, we explore the presence of abstract linguistic information through structural priming, a key paradigm in psycholinguistics for uncovering grammatical structure in human language processing. Next, we examine various linguistic phenomena, such as adjective order and negative polarity items, and connect a model's comprehension of these phenomena to the data distribution on which it was trained. Finally, we introduce a controlled testbed for studying hierarchical structure in language models using various synthetic languages of increasing complexity and examine the role of feature interactions in modelling this structure. Our findings offer a detailed account of the grammatical knowledge embedded in language model representations and provide several directions for investigating fundamental linguistic questions using computational methods.


A Computational Method for Measuring "Open Codes" in Qualitative Analysis

arXiv.org Artificial Intelligence

Qualitative analysis is critical to understanding human datasets in many social science disciplines. Open coding is an inductive qualitative process that identifies and interprets "open codes" from datasets. Yet, meeting methodological expectations (such as "as exhaustive as possible") can be challenging. While many machine learning (ML)/generative AI (GAI) studies have attempted to support open coding, few have systematically measured or evaluated GAI outcomes, increasing potential bias risks. Building on Grounded Theory and Thematic Analysis theories, we present a computational method to measure and identify potential biases from "open codes" systematically. Instead of operationalizing human expert results as the "ground truth," our method is built upon a team-based approach between human and machine coders. We experiment with two HCI datasets to establish this method's reliability by 1) comparing it with human analysis, and 2) analyzing its output stability. We present evidence-based suggestions and example workflows for ML/GAI to support open coding.


Fusion Matters: Learning Fusion in Deep Click-through Rate Prediction Models

arXiv.org Artificial Intelligence

The evolution of previous Click-Through Rate (CTR) models has mainly been driven by proposing complex components, whether shallow or deep, that are adept at modeling feature interactions. However, there has been less focus on improving fusion design. Instead, two naive solutions, stacked and parallel fusion, are commonly used. Both solutions rely on pre-determined fusion connections and fixed fusion operations. It has been repetitively observed that changes in fusion design may result in different performances, highlighting the critical role that fusion plays in CTR models. While there have been attempts to refine these basic fusion strategies, these efforts have often been constrained to specific settings or dependent on specific components. Neural architecture search has also been introduced to partially deal with fusion design, but it comes with limitations. The complexity of the search space can lead to inefficient and ineffective results. To bridge this gap, we introduce OptFusion, a method that automates the learning of fusion, encompassing both the connection learning and the operation selection. We have proposed a one-shot learning algorithm tackling these tasks concurrently. Our experiments are conducted over three large-scale datasets. Extensive experiments prove both the effectiveness and efficiency of OptFusion in improving CTR model performance. Our code implementation is available here\url{https://github.com/kexin-kxzhang/OptFusion}.


Reliable Evaluation of Attribution Maps in CNNs: A Perturbation-Based Approach

arXiv.org Artificial Intelligence

In this paper, we present an approach for evaluating attribution maps, which play a central role in interpreting the predictions of convolutional neural networks (CNNs). We show that the widely used insertion/deletion metrics are susceptible to distribution shifts that affect the reliability of the ranking. Our method proposes to replace pixel modifications with adversarial perturbations, which provides a more robust evaluation framework. By using smoothness and monotonicity measures, we illustrate the effectiveness of our approach in correcting distribution shifts. In addition, we conduct the most comprehensive quantitative and qualitative assessment of attribution maps to date. Introducing baseline attribution maps as sanity checks, we find that our metric is the only contender to pass all checks. Using Kendall's $\tau$ rank correlation coefficient, we show the increased consistency of our metric across 15 dataset-architecture combinations. Of the 16 attribution maps tested, our results clearly show SmoothGrad to be the best map currently available. This research makes an important contribution to the development of attribution maps by providing a reliable and consistent evaluation framework. To ensure reproducibility, we will provide the code along with our results.


Delving into the Reversal Curse: How Far Can Large Language Models Generalize?

arXiv.org Artificial Intelligence

While large language models (LLMs) showcase unprecedented capabilities, they also exhibit certain inherent limitations when facing seemingly trivial tasks. A prime example is the recently debated "reversal curse", which surfaces when models, having been trained on the fact "A is B", struggle to generalize this knowledge to infer that "B is A". In this paper, we examine the manifestation of the reversal curse across various tasks and delve into both the generalization abilities and the problem-solving mechanisms of LLMs. This investigation leads to a series of significant insights: (1) LLMs are able to generalize to "B is A" when both A and B are presented in the context as in the case of a multiple-choice question. (2) This generalization ability is highly correlated to the structure of the fact "A is B" in the training documents. For example, this generalization only applies to biographies structured in "[Name] is [Description]" but not to "[Description] is [Name]". (3) We propose and verify the hypothesis that LLMs possess an inherent bias in fact recalling during knowledge application, which explains and underscores the importance of the document structure to successful learning. (4) The negative impact of this bias on the downstream performance of LLMs can hardly be mitigated through training alone. These findings offer a novel perspective on interpreting LLMs' generalization through their intrinsic mechanisms and provide insights for developing more effective learning methods. Our code and data are available at https://github.com/alibaba/thinking_bias.git.


Proportional infinite-width infinite-depth limit for deep linear neural networks

arXiv.org Machine Learning

We study the distributional properties of linear neural networks with random parameters in the context of large networks, where the number of layers diverges in proportion to the number of neurons per layer. Prior works have shown that in the infinite-width regime, where the number of neurons per layer grows to infinity while the depth remains fixed, neural networks converge to a Gaussian process, known as the Neural Network Gaussian Process. However, this Gaussian limit sacrifices descriptive power, as it lacks the ability to learn dependent features and produce output correlations that reflect observed labels. Motivated by these limitations, we explore the joint proportional limit in which both depth and width diverge but maintain a constant ratio, yielding a non-Gaussian distribution that retains correlations between outputs. Our contribution extends previous works by rigorously characterizing, for linear activation functions, the limiting distribution as a nontrivial mixture of Gaussians.


ADOPT: Modified Adam Can Converge with Any $\beta_2$ with the Optimal Rate

arXiv.org Machine Learning

Adam is one of the most popular optimization algorithms in deep learning. However, it is known that Adam does not converge in theory unless choosing a hyperparameter, i.e., $\beta_2$, in a problem-dependent manner. There have been many attempts to fix the non-convergence (e.g., AMSGrad), but they require an impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O} ( 1 / \sqrt{T} )$ with any choice of $\beta_2$ without depending on the bounded noise assumption. ADOPT addresses the non-convergence issue of Adam by removing the current gradient from the second moment estimate and changing the order of the momentum update and the normalization by the second moment estimate. We also conduct intensive numerical experiments, and verify that our ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning. The implementation is available at https://github.com/iShohei220/adopt.


Gradient-based optimization for variational empirical Bayes multiple regression

arXiv.org Machine Learning

Multiple linear regression provides a simple, but widely used, method to find associations between outcomes (responses) and a set of predictors (explanatory variables). It has been actively studied over more than a century, and there is a rich and vast literature on the subject [1]. In practical situations the number of predictor variables is often large, and it becomes desirable to induce sparsity in the regression coefficients to avoid overfitting [2, 3]. Sparse linear regression also serves as the foundation for non-linear techniques, such as trendfiltering [4, 5], which can estimate an underlying non-linear trend from time series data. Applications of sparse multiple linear regression and trendfiltering arise in a wide range of applications in modern science and engineering, including astronomy [6], atmospheric sciences [7], biology [8], economics [9, 10], genetics [11-15], geophysics [16], medical sciences [17, 18], social sciences [19] and text analysis [20]. Approaches to sparse linear regression can be broadly classified into two groups: (a) penalized linear regressions (PLR), which add a penalty term to the likelihood to penalize the magnitude of its parameters [21-23], and (b) Bayesian approaches [11-14, 24-29], which use a prior probability distribution on the model parameters to induce sparsity.