
Collaborating Authors

 Yu, Shujian


Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence

arXiv.org Artificial Intelligence

Multimodal alignment is crucial for various downstream tasks such as cross-modal generation and retrieval. Previous multimodal approaches like CLIP maximize mutual information mainly by aligning pairwise samples across modalities while overlooking distributional differences, leading to suboptimal alignment and persistent modality gaps. In this paper, to overcome this limitation, we propose CS-Aligner, a novel and straightforward framework that performs distributional vision-language alignment by integrating the Cauchy-Schwarz (CS) divergence with mutual information. In the proposed framework, we find that the CS divergence and mutual information serve complementary roles in multimodal alignment, capturing both the global distribution information of each modality and the pairwise semantic relationships, yielding tighter and more precise alignment. Moreover, CS-Aligner can incorporate additional information from unpaired data and token-level representations, which supports flexible and fine-grained alignment in practice. Experiments on text-to-image generation and cross-modal retrieval tasks demonstrate the effectiveness of our method for vision-language alignment.
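For reference, the Cauchy-Schwarz divergence at the core of this framework is commonly defined for two densities $p$ and $q$ as below; this is background only, and the specific distributional alignment objective of CS-Aligner is not reproduced here:

\[ D_{\mathrm{CS}}(p \,\|\, q) \;=\; -\log \frac{\left(\int p(x)\,q(x)\,dx\right)^{2}}{\int p(x)^{2}\,dx \,\int q(x)^{2}\,dx}, \]

which is nonnegative by the Cauchy-Schwarz inequality and equals zero only when $p = q$ almost everywhere.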


Deep Dynamic Probabilistic Canonical Correlation Analysis

arXiv.org Machine Learning

This paper presents Deep Dynamic Probabilistic Canonical Correlation Analysis (D2PCCA), a model that integrates deep learning with probabilistic modeling to analyze nonlinear dynamical systems. Building on the probabilistic extensions of Canonical Correlation Analysis (CCA), D2PCCA captures nonlinear latent dynamics and supports enhancements such as KL annealing for improved convergence and normalizing flows for a more flexible posterior approximation. D2PCCA naturally extends to multiple observed variables, making it a versatile tool for encoding prior knowledge about sequential datasets and providing a probabilistic understanding of the system's dynamics. Experimental validation on real financial datasets demonstrates the effectiveness of D2PCCA and its extensions in capturing latent dynamics.
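As a hedged illustration of the KL-annealing idea mentioned above (a generic linear warm-up schedule, not necessarily the one used in D2PCCA), the weight on the KL term of a variational objective can be increased gradually during training:

    # Generic linear KL annealing for a variational objective (illustrative sketch only).
    def kl_weight(step, warmup_steps=10_000):
        """Linearly increase the KL weight from 0 to 1 over `warmup_steps`."""
        return min(1.0, step / warmup_steps)

    # elbo = reconstruction_log_likelihood - kl_weight(step) * kl_divergence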


Towards the Generalization of Multi-view Learning: An Information-theoretical Analysis

arXiv.org Machine Learning

In most scientific data analysis scenarios, data collected from diverse domains and different sensors exhibit heterogeneous properties while preserving underlying connections. For example, (1) a piece of text can express the same semantics and sentiment in multiple different languages; (2) a user's interests can be reflected in the text they post, the images they upload, and the videos they view; (3) animals perceive potential dangers in their surroundings through various senses such as sight, hearing, and smell. All of these reflect different perspectives of the data, collectively referred to as multi-view data. Extracting consensus and complementarity information from multiple views to achieve a comprehensive representation of multi-view data has stimulated research interest across various fields and led to the development of multi-view learning Hamdi et al. (2021); Fan et al. (2022); Fu et al. (2022); Hong et al. (2023). While various methodologies have emerged in multi-view learning, predominantly encompassing canonical correlation analysis (CCA)-based approaches Gao et al. (2020); Chen et al. (2022); Shu et al. (2022) and engineering-driven techniques Xu et al. (2021); Bai et al. (2023), these methods suffer from a critical limitation. Specifically, their emphasis on maximizing cross-view consensus information often comes at the expense of view-specific, task-relevant information, thereby potentially compromising downstream performance Liang et al. (2024). Significant recent efforts have been dedicated to leveraging diverse information-theoretic techniques to precisely capture both view-common and view-unique components from multiple views Wang et al. (2019); Federici et al. (2020); Wang et al. (2023); Cui et al. (2024); Zhang et al. (2024), thereby yielding maximally disentangled representations and improving generalization ability. For instance, Kleinman et al. (2024) and Zhang et al. (2024) introduce the notion of Gács-Körner common information (Gács et al., 1973) and utilize the total correlation between consensus and complementarity information to extract mutually independent cross-view common and unique components.


ELEMENT: Episodic and Lifelong Exploration via Maximum Entropy

arXiv.org Artificial Intelligence

This paper proposes \emph{Episodic and Lifelong Exploration via Maximum ENTropy} (ELEMENT), a novel, multiscale, intrinsically motivated reinforcement learning (RL) framework that can explore environments without any extrinsic reward and effectively transfer the learned skills to downstream tasks. We advance the state of the art in three ways. First, we propose a multiscale entropy optimization to address the fact that previous maximum state entropy objectives, when applied to lifelong exploration with millions of state observations, suffer from vanishing rewards and become computationally very expensive across iterations; we therefore add an episodic maximum entropy term over each episode to further speed up the search. Second, we propose a novel intrinsic reward for episodic entropy maximization, named \emph{average episodic state entropy}, which provides the optimal solution for a theoretical upper bound of the episodic state entropy objective. Third, to accelerate the lifelong entropy maximization, we propose a $k$ nearest neighbors ($k$NN) graph that organizes the entropy estimation and its updates, substantially reducing computation. ELEMENT significantly outperforms state-of-the-art intrinsic rewards in both episodic and lifelong setups. Moreover, it can be exploited for task-agnostic pre-training, collecting data for offline reinforcement learning, and related uses.
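A minimal sketch of the standard particle-based ($k$NN) state-entropy intrinsic reward that this line of work builds on; the average episodic state entropy reward and the $k$NN-graph bookkeeping proposed in ELEMENT are not reproduced here:

    import numpy as np

    def knn_entropy_reward(states, k=5):
        """Intrinsic reward proportional to the log distance of each visited state
        to its k-th nearest neighbor (a standard non-parametric entropy proxy)."""
        states = np.asarray(states, dtype=np.float64)              # (N, d)
        dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
        knn_dist = np.sort(dists, axis=1)[:, k]                    # index 0 is the self-distance
        return np.log(knn_dist + 1.0)                              # one reward per visited state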


Discovering Common Information in Multi-view Data

arXiv.org Artificial Intelligence

The advent of diverse and heterogeneous data due to recent technological advancements has spurred increasing interest in multi-view learning [1, 2, 3]. This field relies on two principles: the consensus principle, which seeks consensus information across different views, and the complementary principle, which recognizes the unique, valuable information each view offers [4, 5, 6]. For instance, consider the case of an animal's binocular vision. Each eye captures a different yet highly correlated perspective of an object, extracting consensus information.

Another method grounded in the consensus principle draws on mutual information from information theory. The authors in [17] posit that each view contains identical task-relevant information, a classic hypothesis suggesting that effective representations model view-invariant factors. They develop robust representations by maximizing the mutual information between representations from different views. A similar approach is used in [18], where information about high-level factors that span multiple views is captured by maximizing the mutual information between the extracted features.
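A hedged sketch of the cross-view mutual-information maximization idea described above, written as a generic InfoNCE-style contrastive loss between two view encoders; this illustrates the principle only and is not the exact objective of [17] or [18]:

    import torch
    import torch.nn.functional as F

    def cross_view_infonce(z1, z2, temperature=0.1):
        """InfoNCE loss whose minimization maximizes a lower bound on the mutual
        information between paired view representations z1, z2 of shape (N, d)."""
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature                   # (N, N) cross-view similarities
        labels = torch.arange(z1.size(0), device=z1.device)  # matching pairs lie on the diagonal
        return F.cross_entropy(logits, labels)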


Generalized Cauchy-Schwarz Divergence and Its Deep Learning Applications

arXiv.org Artificial Intelligence

Divergence measures play a central and increasingly essential role in deep learning, yet efficient measures for multiple (more than two) distributions are rarely explored. This becomes particularly crucial in areas where the simultaneous management of multiple distributions is both inevitable and essential, such as clustering, multi-source domain adaptation or generalization, and multi-view learning. While computing the mean of pairwise distances between any two distributions is a prevalent way to quantify the total divergence among multiple distributions, this approach is not straightforward and requires significant computational resources. In this study, we introduce a new divergence measure tailored to multiple distributions, named the generalized Cauchy-Schwarz divergence (GCSD). Additionally, we furnish a kernel-based closed-form sample estimator, making it convenient and straightforward to use in various machine-learning applications. Finally, we explore its implications for deep learning by applying it to two carefully chosen machine-learning tasks: deep clustering and multi-source domain adaptation. Our extensive experiments confirm the robustness and effectiveness of GCSD in both scenarios. The findings also underscore the potential of GCSD to propel machine-learning methodologies that require quantifying divergence across multiple distributions.
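For context, here is a minimal sketch of the widely used kernel-based empirical estimator of the ordinary two-distribution Cauchy-Schwarz divergence; the generalized multi-distribution estimator (GCSD) introduced in the paper is not reproduced, and the kernel width below is an illustrative choice:

    import numpy as np

    def gaussian_gram(a, b, sigma=1.0):
        """Gaussian kernel Gram matrix between sample sets a (m, d) and b (n, d)."""
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def cs_divergence(x, y, sigma=1.0):
        """Empirical Cauchy-Schwarz divergence between samples x ~ p and y ~ q."""
        kxx = gaussian_gram(x, x, sigma)
        kyy = gaussian_gram(y, y, sigma)
        kxy = gaussian_gram(x, y, sigma)
        return np.log(kxx.mean()) + np.log(kyy.mean()) - 2 * np.log(kxy.mean())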


Domain Adaptation with Cauchy-Schwarz Divergence

arXiv.org Machine Learning

Domain adaptation aims to use training data from one or multiple source domains to learn a hypothesis that can be generalized to a different, but related, target domain. As such, having a reliable measure for evaluating the discrepancy of both marginal and conditional distributions is crucial. We introduce the Cauchy-Schwarz (CS) divergence to the problem of unsupervised domain adaptation (UDA). The CS divergence offers a theoretically tighter generalization error bound than the popular Kullback-Leibler divergence. This holds for the general case of supervised learning, including multi-class classification and regression. Furthermore, we illustrate that the CS divergence enables a simple estimator of the discrepancy of both marginal and conditional distributions between source and target domains in the representation space, without requiring any distributional assumptions. We provide multiple examples to illustrate how the CS divergence can be conveniently used in both distance metric-based and adversarial training-based UDA frameworks, resulting in compelling performance.
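One way the stated ingredients compose into a training objective (a hedged sketch under generic notation, not necessarily the paper's exact loss), with $\mathbf{z}$ the learned representation, $\mathcal{L}_{\mathrm{cls}}$ the source classification loss, and $\lambda$ a trade-off weight:

\[ \min_{\theta}\; \mathcal{L}_{\mathrm{cls}} \;+\; \lambda \Big[ D_{\mathrm{CS}}\big(p_s(\mathbf{z}) \,\|\, p_t(\mathbf{z})\big) \;+\; D_{\mathrm{CS}}\big(p_s(y \mid \mathbf{z}) \,\|\, p_t(y \mid \mathbf{z})\big) \Big]. \]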


BAN: Detecting Backdoors Activated by Adversarial Neuron Noise

arXiv.org Artificial Intelligence

Backdoor attacks on deep learning represent a recent threat that has gained significant attention in the research community. Backdoor defenses are mainly based on backdoor inversion, which has been shown to be generic, model-agnostic, and applicable to practical threat scenarios. State-of-the-art backdoor inversion recovers a mask in the feature space to locate prominent backdoor features, where benign and backdoor features can be disentangled. However, it suffers from high computational overhead, and we also find that it overly relies on prominent backdoor features that are highly distinguishable from benign features. To tackle these shortcomings, this paper improves backdoor feature inversion for backdoor detection by incorporating extra neuron activation information. In particular, we adversarially increase the loss of backdoored models with respect to the weights to activate the backdoor effect, based on which we can easily differentiate backdoored and clean models. Experimental results demonstrate that our defense, BAN, is 1.37$\times$ (on CIFAR-10) and 5.11$\times$ (on ImageNet200) more efficient, with a 9.99% higher detection success rate, than the state-of-the-art defense BTI-DBF. Our code and trained models are publicly available at \url{https://anonymous.4open.science/r/ban-4B32}.
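A heavily hedged schematic of the stated idea of adversarially perturbing the weights to expose a backdoor effect; this illustrates only that one sentence, not BAN's actual feature-mask inversion or detection criterion:

    import torch
    import torch.nn.functional as F

    def weight_noise_loss_gap(model, x, y, eps=0.01):
        """Take one signed-gradient ascent step on the loss in weight space and
        return the resulting loss increase on clean data (a rough indicator that
        is expected to differ between backdoored and clean models)."""
        params = [p for p in model.parameters() if p.requires_grad]
        base = F.cross_entropy(model(x), y)
        grads = torch.autograd.grad(base, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p.add_(eps * g.sign())             # perturb weights to increase the loss
            perturbed = F.cross_entropy(model(x), y)
            for p, g in zip(params, grads):
                p.sub_(eps * g.sign())             # restore the original weights
        return (perturbed - base).item()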


Jacobian Regularizer-based Neural Granger Causality

arXiv.org Artificial Intelligence

With the advancement of neural networks, diverse methods for neural Granger causality have emerged that demonstrate proficiency in handling complex data and nonlinear relationships. However, the existing framework of neural Granger causality has several limitations: it requires a separate predictive model for each target variable, and the inferred relationships depend on the sparsity of the first-layer weights, which makes it challenging to model complex relationships between variables effectively and leads to unsatisfactory estimation accuracy of Granger causality. Moreover, most existing methods cannot capture full-time Granger causality. To address these drawbacks, we propose Jacobian Regularizer-based Neural Granger Causality (JRNGC), a straightforward yet highly effective approach for learning multivariate summary Granger causality and full-time Granger causality by constructing a single model for all target variables. Specifically, our method removes the sparsity constraints on the weights by leveraging an input-output Jacobian matrix regularizer, which can subsequently be interpreted as a weighted causal matrix in the post-hoc analysis. Extensive experiments show that our approach achieves performance competitive with state-of-the-art methods for learning summary and full-time Granger causality while maintaining lower model complexity and high scalability.
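A hedged sketch of an input-output Jacobian penalty of the kind described above, under a simplified, hypothetical setup in which `model` maps a lagged-input vector of shape (d_in,) to one-step predictions of shape (d_out,); the paper's post-hoc causal-matrix analysis is not shown:

    import torch

    def jacobian_l1_penalty(model, x):
        """L1 penalty on the input-output Jacobian of `model` at input x.
        Entries driven toward zero correspond to absent (Granger) causal links."""
        jac = torch.autograd.functional.jacobian(model, x)   # shape (d_out, d_in)
        return jac.abs().sum()

    # hypothetical training loss: prediction_error + lam * jacobian_l1_penalty(model, x_t)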


Cauchy-Schwarz Divergence Information Bottleneck for Regression

arXiv.org Machine Learning

The information bottleneck (IB) approach is popular for improving the generalization, robustness, and explainability of deep neural networks. Essentially, it aims to find a minimum sufficient representation t by striking a trade-off between a compression term I(x; t) and a prediction term I(y; t), where I(·;·) denotes mutual information (MI). For the IB, MI is for the most part expressed in terms of the Kullback-Leibler (KL) divergence, which in the regression case corresponds to prediction based on a mean squared error (MSE) loss under a Gaussian assumption, with compression approximated by variational inference. In this paper, we study the IB principle for the regression problem and develop a new way to parameterize the IB with deep neural networks by exploiting favorable properties of the Cauchy-Schwarz (CS) divergence. By doing so, we move away from MSE-based regression and ease estimation by avoiding variational approximations or distributional assumptions. We investigate the improved generalization ability of our proposed CS-IB and demonstrate strong adversarial robustness guarantees. We demonstrate its superior performance on six real-world regression tasks over other popular deep IB approaches. We additionally observe that the solutions discovered by CS-IB always achieve the best trade-off between prediction accuracy and compression ratio in the information plane. The code is available at https://github.com

The information bottleneck (IB) principle was proposed by Tishby et al. (1999) as an information-theoretic framework for representation learning. It considers extracting information about a target variable y through a correlated variable x. The extracted information is characterized by another variable t, which is a (possibly randomized) function of x. Formally, the IB objective is to learn a representation t that maximizes its predictive power for y subject to a constraint on the amount of information it carries about x: max I(y; t) s.t. I(x; t) ≤ R.
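For reference, the unconstrained (Lagrangian) form of this objective is commonly written as below; CS-IB replaces the KL-based mutual-information terms with Cauchy-Schwarz counterparts, and the exact parameterization is given in the paper:

\[ \max_{p(t \mid x)} \; I(y;t) \;-\; \beta\, I(x;t), \qquad \beta \ge 0. \]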