Shi, Javen Qinfeng
Analytic DAG Constraints for Differentiable DAG Learning
Zhang, Zhen, Ng, Ignavier, Gong, Dong, Liu, Yuhang, Gong, Mingming, Huang, Biwei, Zhang, Kun, Hengel, Anton van den, Shi, Javen Qinfeng
Recovering the underlying Directed Acyclic Graph (DAG) structures from observational data presents a formidable challenge, partly due to the combinatorial nature of the DAG-constrained optimization problem. Recently, researchers have identified gradient vanishing as one of the primary obstacles in differentiable DAG learning and have proposed several DAG constraints to mitigate this issue. By developing the necessary theory to establish a connection between analytic functions and DAG constraints, we demonstrate that analytic functions from the set $\{f(x) = c_0 + \sum_{i=1}^{\infty}c_ix^i | \forall i > 0, c_i > 0; r = \lim_{i\rightarrow \infty}c_{i}/c_{i+1} > 0\}$ can be employed to formulate effective DAG constraints. Furthermore, we establish that this set of functions is closed under several functional operators, including differentiation, summation, and multiplication. Consequently, these operators can be leveraged to create novel DAG constraints based on existing ones. Using these properties, we design a series of DAG constraints and develop an efficient algorithm to evaluate them. Experiments in various settings demonstrate that our DAG constraints outperform previous state-of-the-art comparators. Our implementation is available at https://github.com/zzhang1987/AnalyticDAGLearning.
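For concreteness, here is a minimal sketch (ours, not the paper's released implementation) of one member of this family: $f(x) = 1/(1-x) = 1 + x + x^2 + \cdots$, i.e., $c_0 = 1$, $c_i = 1$ for all $i > 0$, and $r = 1$. For a $d$-node weighted adjacency matrix $W$, $h(W) = \mathrm{tr}(f(W \circ W)) - d \cdot c_0$ is non-negative and vanishes exactly when the graph is acyclic, valid while the spectral radius of $W \circ W$ stays below $r$:

```python
# Hedged sketch: instantiate one analytic function from the family and
# evaluate the induced DAG constraint h(W) = tr(f(W o W)) - d * c_0.
import numpy as np

def analytic_dag_constraint(W: np.ndarray) -> float:
    d = W.shape[0]
    A = W * W                                   # non-negative surrogate adjacency
    # f(A) = (I - A)^{-1} = I + A + A^2 + ...; tr(A^k) counts weighted
    # length-k cycles, so the constraint is 0 iff the graph has no cycles.
    return float(np.trace(np.linalg.inv(np.eye(d) - A)) - d)

acyclic = np.array([[0.0, 0.5], [0.0, 0.0]])    # edge 0 -> 1 only
cyclic  = np.array([[0.0, 0.5], [0.5, 0.0]])    # 2-cycle between 0 and 1
print(analytic_dag_constraint(acyclic))         # ~0.0
print(analytic_dag_constraint(cyclic))          # > 0
```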
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
Liu, Yuhang, Gong, Dong, Gao, Erdun, Zhang, Zhen, Huang, Biwei, Gong, Mingming, Hengel, Anton van den, Shi, Javen Qinfeng
The remarkable achievements of large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. This view contrasts with explanations that attribute their capabilities to relatively simple manipulations of vast volumes of data. To illuminate the distinction between these explanations, we introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables. Under mild conditions, even when the mapping from the latent space to the observed space is non-invertible, we establish an identifiability result: the representations learned by LLMs through next-token prediction can be approximately modeled as the logarithm of the posterior probabilities of these latent discrete concepts, up to an invertible linear transformation. This theoretical finding not only provides evidence that LLMs capture underlying generative factors, but also strongly reinforces the linear representation hypothesis, which posits that LLMs learn linear representations of human-interpretable concepts. Empirically, we validate our theoretical results through evaluations on both simulation data and the Pythia, Llama, and DeepSeek model families.
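As a toy check of the form of this result (our construction, not the paper's evaluation): if representations really are an invertible linear transformation of the log posterior over latent discrete concepts, ordinary least squares recovers that transformation exactly from samples:

```python
# Toy illustration of rep(x) ~ M . log p(z|x) for an invertible M.
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 8                                      # contexts, latent concepts
logits = rng.normal(size=(n, k))
log_post = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
M = rng.normal(size=(k, k))                         # invertible w.h.p.
reps = log_post @ M.T                               # stand-in for LLM hidden states
M_hat, *_ = np.linalg.lstsq(log_post, reps, rcond=None)
print(np.allclose(M_hat.T, M, atol=1e-8))           # True: linear map identified
```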
MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models
Cheng, Hongrong, Zhang, Miao, Shi, Javen Qinfeng
As Large Language Models (LLMs) grow dramatically in size, there is an increasing trend toward compressing and speeding up these models. Previous studies have highlighted the usefulness of gradients for importance scoring in neural network compression, especially in pruning medium-sized networks. However, the substantial memory requirements involved in calculating gradients with backpropagation impede the use of gradients to guide LLM pruning. As a result, most pruning strategies for LLMs rely on gradient-free criteria, such as weight magnitudes or a mix of magnitudes and activations. In this paper, we devise a hybrid pruning criterion, which appropriately integrates magnitude, activation, and gradient to capitalize on feature map sensitivity for pruning LLMs. To overcome the memory barrier, we estimate gradients using only forward passes. Based on this, we propose a Memory-effIcieNt structured prunIng procedure for LLMs (MINI-LLM) to remove non-critical channels and attention heads. Experimental results demonstrate the superior performance of MINI-LLM over existing gradient-free methods on three LLMs: LLaMA, BLOOM, and OPT, across various downstream tasks (classification, multiple-choice, and generation), while MINI-LLM maintains a GPU memory footprint akin to that of gradient-free methods.
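A hedged sketch of the two ingredients the abstract names: a forward-only (SPSA-style two-point) gradient estimate and a hybrid magnitude-activation-gradient score. Both function names and the score formula below are our assumptions, not the paper's exact criterion:

```python
import torch

def spsa_grad(loss_fn, w: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Two forward passes, no backprop graph:
    g_i ~ [L(w + eps*u) - L(w - eps*u)] / (2*eps) * u_i, u Rademacher."""
    u = torch.randint_like(w, low=0, high=2) * 2 - 1    # entries in {-1, +1}
    with torch.no_grad():
        delta = loss_fn(w + eps * u) - loss_fn(w - eps * u)
    return delta / (2 * eps) * u

def hybrid_importance(weight, activation, grad_est):
    # Illustrative hybrid score (assumption, not the paper's formula):
    # per-output-channel magnitude x gradient, scaled by activation strength;
    # low-scoring channels would be pruned first.
    return (weight.abs() * grad_est.abs()).sum(dim=1) * activation.abs()

w = torch.randn(5)
print(spsa_grad(lambda v: (v ** 2).sum(), w))  # ~2*w in expectation
```

A single two-point estimate is noisy; averaging over several perturbations trades extra forward passes for variance, still without storing any backpropagation state.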
Augmented Commonsense Knowledge for Remote Object Grounding
Mohammadi, Bahram, Hong, Yicong, Qi, Yuankai, Wu, Qi, Pan, Shirui, Shi, Javen Qinfeng
The vision-and-language navigation (VLN) task requires an agent to perceive its surroundings, follow natural language instructions, and act in photo-realistic unseen environments. Most existing methods employ entire-image or object features to represent navigable viewpoints. However, these representations are insufficient for proper action prediction, especially for the REVERIE task, which uses concise high-level instructions such as ''Bring me the blue cushion in the master bedroom''. To enhance these representations, we propose an augmented commonsense knowledge model (ACK) that leverages commonsense information as a spatio-temporal knowledge graph to improve agent navigation. Specifically, the proposed approach constructs a knowledge base by retrieving commonsense information from ConceptNet, followed by a refinement module that removes noisy and irrelevant knowledge. ACK consists of knowledge graph-aware cross-modal and concept aggregation modules that enhance visual representation and visual-textual data alignment by integrating visible objects, commonsense knowledge, and concept history, which includes object and knowledge temporal information. Moreover, we add a new pipeline for the commonsense-based decision-making process, which leads to more accurate local action prediction. Experimental results demonstrate that our proposed model noticeably outperforms the baseline and achieves state-of-the-art results on the REVERIE benchmark.
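The retrieval step can be sketched against ConceptNet's public web API (a minimal sketch of ours; ACK's actual knowledge-base construction, relation set, and refinement thresholds are described in the paper, and the helper below is hypothetical):

```python
# Hedged sketch: fetch commonsense edges for a detected object term.
import requests

def conceptnet_edges(term: str, rel: str = "/r/AtLocation", limit: int = 10):
    resp = requests.get(
        "https://api.conceptnet.io/query",
        params={"node": f"/c/en/{term}", "rel": rel, "limit": limit},
        timeout=10,
    )
    resp.raise_for_status()
    # each edge carries start/end concepts and a confidence weight,
    # which a refinement module could threshold to drop noisy knowledge
    return [(e["start"]["label"], e["end"]["label"], e["weight"])
            for e in resp.json()["edges"]]

# e.g. conceptnet_edges("cushion") -> [("cushion", "sofa", 2.0), ...]
```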
Identifiable Latent Neural Causal Models
Liu, Yuhang, Zhang, Zhen, Gong, Dong, Gong, Mingming, Huang, Biwei, Hengel, Anton van den, Zhang, Kun, Shi, Javen Qinfeng
Causal representation learning seeks to uncover latent, high-level causal representations from low-level observed data. It is particularly well suited to predictions under unseen distribution shifts, because such shifts can generally be interpreted as consequences of interventions. Leveraging seen distribution shifts therefore becomes a natural strategy to help identify causal representations, which in turn benefit predictions where distributions were previously unseen. Determining the types (or conditions) of such distribution shifts that contribute to the identifiability of causal representations is critical. This work establishes a sufficient and necessary condition characterizing the types of distribution shifts that yield identifiability in the context of latent additive noise models. Furthermore, we present partial identifiability results when only a portion of the distribution shifts meets the condition. In addition, we extend our findings to latent post-nonlinear causal models. We translate our findings into a practical algorithm, allowing for the acquisition of reliable latent causal representations. Our algorithm, guided by the underlying theory, demonstrates outstanding performance across a diverse range of synthetic and real-world datasets. The empirical observations align closely with the theoretical findings, affirming the robustness and effectiveness of our approach.
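As a toy illustration of the setting (our construction, not the paper's algorithm): a two-variable latent additive noise model whose causal weight changes across environments, observed through a fixed nonlinear mixing, so every shift in the distribution of $x$ stems from an intervention on the latent causal influence:

```python
import numpy as np

rng = np.random.default_rng(0)
MIX = rng.normal(size=(2, 4))                  # fixed mixing g: z -> x

def sample_env(w_e: float, n: int = 1000) -> np.ndarray:
    z1 = rng.normal(size=n)
    z2 = w_e * z1 + 0.1 * rng.normal(size=n)   # latent additive noise model
    z = np.stack([z1, z2], axis=1)
    return np.tanh(z @ MIX)                    # observed low-level data x

# three environments whose only difference is the causal weight w_e
envs = [sample_env(w_e) for w_e in (0.0, 0.5, 1.5)]
```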
Revealing Multimodal Contrastive Representation Learning through Latent Partial Causal Models
Liu, Yuhang, Zhang, Zhen, Gong, Dong, Huang, Biwei, Gong, Mingming, Hengel, Anton van den, Zhang, Kun, Shi, Javen Qinfeng
Multimodal contrastive representation learning methods have proven successful across a range of domains, partly due to their ability to generate meaningful shared representations of complex phenomena. To enhance the depth of analysis and understanding of these acquired representations, we introduce a unified causal model specifically designed for multimodal data. By examining this model, we show that multimodal contrastive representation learning excels at identifying latent coupled variables within the proposed unified model, up to linear or permutation transformations resulting from different assumptions. Our findings illuminate the potential of pre-trained multimodal models, e.g., CLIP, in learning disentangled representations.

One promising strategy in this context is to use data from one modality, e.g., text data, as a supervision signal in the interpretation of another, e.g., image data (Mori et al., 1999; Wang et al., 2009; Ramanathan et al., 2013; He & Peng, 2017; Radford et al., 2021). The primary approach for achieving this is known as multimodal contrastive representation learning, which focuses on optimizing a symmetric contrastive loss (Zhang et al., 2022; Radford et al., 2021), e.g., a symmetric adaptation of the standard contrastive loss (Wu et al., 2018; Tian et al., 2020; He et al., 2020; Chen et al., 2020). The learned representations, guided by the symmetric contrastive loss, have been applied in a variety of applications, including zero/few-shot learning (Radford et al., 2021; Zhou et al., 2022a), domain generalization (Zhou et al., 2022a;b), and robustness to adversarial examples (Ban & Dong, 2022).
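The symmetric contrastive loss referred to above can be sketched in a few lines (a CLIP-style minimal version of ours, not the paper's code):

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(img_emb, txt_emb, tau: float = 0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau                # pairwise cosine similarities
    labels = torch.arange(img.shape[0])         # matched pairs on the diagonal
    # image-to-text and text-to-image cross-entropies, averaged symmetrically
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

img, txt = torch.randn(32, 512), torch.randn(32, 512)
print(symmetric_contrastive_loss(img, txt))
```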
Bayesian Learning with Information Gain Provably Bounds Risk for a Robust Adversarial Defense
Doan, Bao Gia, Abbasnejad, Ehsan, Shi, Javen Qinfeng, Ranasinghe, Damith C.
We present a new algorithm to learn a deep neural network model robust against adversarial attacks. Previous algorithms demonstrate that an adversarially trained Bayesian Neural Network (BNN) provides improved robustness. We recognize that the adversarial learning approach for approximating the multi-modal posterior distribution of a Bayesian model can lead to mode collapse; consequently, the model's achievements in robustness and performance are sub-optimal. Instead, we first propose preventing mode collapse to better approximate the multi-modal posterior distribution. Second, based on the intuition that a robust model should ignore perturbations and only consider the informative content of the input, we conceptualize and formulate an information gain objective to measure and force the information learned from both benign and adversarial training instances to be similar. Importantly, we prove and demonstrate that minimizing the information gain objective allows the adversarial risk to approach the conventional empirical risk. We believe our efforts provide a step toward a principled method for adversarially training BNNs. Our model demonstrates significantly improved robustness, up to 20%, compared with adversarial training and Adv-BNN under PGD attacks with 0.035 distortion on both the CIFAR-10 and STL-10 datasets.
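As a hedged sketch of what such an objective can look like (the paper defines the precise formulation; the BALD-style estimator below is our assumption), information gain can be estimated from Monte Carlo posterior samples as $H(\mathbb{E}_w[p(y|x,w)]) - \mathbb{E}_w[H(p(y|x,w))]$, and the benign/adversarial gap penalized:

```python
import torch

def info_gain(probs_mc: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # probs_mc: (S, B, C) class probabilities from S posterior weight samples
    mean = probs_mc.mean(dim=0)
    h_mean = -(mean * (mean + eps).log()).sum(dim=-1)              # H(E[p])
    mean_h = -(probs_mc * (probs_mc + eps).log()).sum(-1).mean(0)  # E[H(p)]
    return h_mean - mean_h                                         # per example

def ig_penalty(probs_benign: torch.Tensor, probs_adv: torch.Tensor):
    # force benign and adversarial inputs to yield similar information gain
    return (info_gain(probs_adv) - info_gain(probs_benign)).abs().mean()
```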
Identifiable Latent Polynomial Causal Models Through the Lens of Change
Liu, Yuhang, Zhang, Zhen, Gong, Dong, Gong, Mingming, Huang, Biwei, Hengel, Anton van den, Zhang, Kun, Shi, Javen Qinfeng
Causal representation learning aims to unveil latent high-level causal representations from observed low-level data. One of its primary tasks is to provide reliable assurance of identifying these latent causal models, known as identifiability. A recent breakthrough explores identifiability by leveraging the change of causal influences among latent causal variables across multiple environments (Liu et al., 2022). However, this progress rests on the assumption that the causal relationships among latent causal variables adhere strictly to linear Gaussian models. In this paper, we extend the scope of latent causal models to nonlinear causal relationships, represented by polynomial models, and to general noise distributions conforming to the exponential family. Additionally, we investigate the necessity of imposing changes on all causal parameters and present partial identifiability results when only some of them change. Further, we propose a novel empirical estimation method, grounded in our theoretical findings, that enables learning consistent latent causal representations. Our experimental results, obtained from both synthetic and real-world data, validate our theoretical contributions concerning identifiability and consistency.

Causal representation learning, which aims to discover high-level latent causal variables and the causal structures among them from unstructured observed data, provides a prospective way to compensate for drawbacks in traditional machine learning paradigms, e.g., the fundamental requirement that the data driving and promoting machine learning methods be independent and identically distributed (i.i.d.) (Schölkopf, 2015). From the perspective of causal representations, changes in data distribution, arising from various real-world data collection pipelines (Karahan et al., 2016; Frumkin, 2016; Pearl et al., 2016; Chandrasekaran et al., 2021), can be attributed to changes in the causal influences among causal variables (Schölkopf et al., 2021). These changes are observable across a multitude of fields. For instance, they may appear in the analysis of imaging data of cells, where the contexts involve batches of cells exposed to various small-molecule compounds.
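A toy generator for the extended model class (our illustration; variable names and coefficient choices are arbitrary): a polynomial causal relation between latents with gamma noise, a member of the exponential family, whose coefficients change across environments:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_latents(coeffs, n: int = 1000) -> np.ndarray:
    """z2 = c0 + c1*z1 + c2*z1**2 + gamma noise (exponential-family member)."""
    z1 = rng.gamma(shape=2.0, scale=1.0, size=n)
    z2 = np.polyval(coeffs[::-1], z1) + rng.gamma(shape=2.0, scale=0.5, size=n)
    return np.stack([z1, z2], axis=1)

# two environments realized by changing the polynomial coefficients (c0, c1, c2)
envs = [sample_latents(c) for c in ([0.0, 1.0, 0.5], [1.0, 0.2, 0.9])]
```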
A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations
Cheng, Hongrong, Zhang, Miao, Shi, Javen Qinfeng
Modern deep neural networks, particularly recent large language models, come with massive model sizes that require significant computational and storage resources. To enable the deployment of modern models in resource-constrained environments and to accelerate inference, researchers have increasingly explored pruning techniques as a popular research direction in neural network compression. More than a thousand pruning papers were published each year from 2020 to 2022. However, there is a dearth of up-to-date comprehensive review papers on pruning. To address this issue, in this survey we provide a comprehensive review of existing research on deep neural network pruning in a taxonomy of 1) universal/specific speedup, 2) when to prune, 3) how to prune, and 4) fusion of pruning with other compression techniques. We then provide a thorough comparative analysis of seven pairs of contrast settings for pruning (e.g., unstructured/structured, one-shot/iterative, data-free/data-driven, initialized/pretrained weights, etc.) and explore several emerging topics, including post-training pruning and different levels of supervision for pruning, to shed light on the commonalities and differences of existing methods and lay the foundation for further method development. Finally, we offer recommendations on selecting pruning methods and prospect several promising research directions for neural network pruning. To facilitate future research on deep neural network pruning, we summarize broad pruning applications (e.g., adversarial robustness, natural language understanding, etc.) and build a curated collection of datasets, networks, and evaluations across different applications. We will keep updating this repository to include the latest advancements in the field.
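To make the first contrast setting concrete, here is a toy sketch (ours, purely illustrative) of unstructured versus structured magnitude pruning on a single weight matrix:

```python
import torch

def unstructured_prune(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero the smallest-magnitude weights individually (shape unchanged)."""
    k = max(1, int(w.numel() * sparsity))
    thresh = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > thresh)

def structured_prune(w: torch.Tensor, n_keep: int) -> torch.Tensor:
    """Drop whole output channels by L2 norm (yields a smaller dense layer)."""
    keep = w.norm(dim=1).topk(n_keep).indices
    return w[keep]

w = torch.randn(8, 16)
print(unstructured_prune(w, 0.5).eq(0).float().mean())  # ~0.5 sparsity
print(structured_prune(w, 4).shape)                     # torch.Size([4, 16])
```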
Factor Graph Neural Networks
Zhang, Zhen, Dupty, Mohammed Haroon, Wu, Fan, Shi, Javen Qinfeng, Lee, Wee Sun
In recent years, we have witnessed a surge of Graph Neural Networks (GNNs), most of which can learn powerful representations in an end-to-end fashion, with great success in many real-world applications. They bear a resemblance to Probabilistic Graphical Models (PGMs) but break free from some of their limitations. By aiming to provide expressive methods for representation learning instead of computing marginals or most likely configurations, GNNs provide flexibility in the choice of information-flow rules while maintaining good performance. Despite their success, GNNs lack efficient ways to represent and learn higher-order relations among variables/nodes. More expressive higher-order GNNs, which operate on k-tuples of nodes, need increased computational resources to process higher-order tensors. We propose Factor Graph Neural Networks (FGNNs) to effectively capture higher-order relations for inference and learning. To do so, we first derive an efficient approximate Sum-Product loopy belief propagation inference algorithm for discrete higher-order PGMs. We then neuralize the novel message-passing scheme into a Factor Graph Neural Network (FGNN) module by allowing richer representations of the message update rules; this facilitates both efficient inference and powerful end-to-end learning. We further show that, with a suitable choice of message aggregation operators, our FGNN can also represent Max-Product belief propagation, providing a single family of architectures that can represent both Max- and Sum-Product loopy belief propagation. Our extensive experimental evaluation on synthetic as well as real datasets demonstrates the potential of the proposed model.
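A hedged sketch of the message-passing pattern FGNN neuralizes (our simplification; the layer name and residual updates are assumptions, and the paper's module allows much richer update rules):

```python
import torch
import torch.nn as nn

class FactorGraphLayer(nn.Module):
    """One round of variable <-> factor message passing with learned updates."""
    def __init__(self, dim: int):
        super().__init__()
        self.var_to_fac = nn.Linear(dim, dim)
        self.fac_to_var = nn.Linear(dim, dim)

    def forward(self, h_var, h_fac, edges):
        # edges: (E, 2) long tensor of (variable_idx, factor_idx) pairs
        v, f = edges[:, 0], edges[:, 1]
        # sum aggregation echoes Sum-Product; a max-style aggregation in its
        # place gives the Max-Product analogue noted in the abstract
        h_fac = h_fac + torch.zeros_like(h_fac).index_add(0, f, self.var_to_fac(h_var)[v])
        h_var = h_var + torch.zeros_like(h_var).index_add(0, v, self.fac_to_var(h_fac)[f])
        return h_var, h_fac

layer = FactorGraphLayer(16)
h_var, h_fac = torch.randn(4, 16), torch.randn(2, 16)
edges = torch.tensor([[0, 0], [1, 0], [2, 1], [3, 1]])  # two pairwise factors
h_var, h_fac = layer(h_var, h_fac, edges)
```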