Information Technology
How Diffusion Models Learn to Factorize and Compose
Diffusion models are capable of generating photo-realistic images that combine elements which likely do not appear together in the training set, demonstrating an ability to compositionally generalize. Nonetheless, the precise mechanism of compositionality, and how it is acquired through training, remain elusive. Inspired by cognitive neuroscientific approaches, we consider a highly reduced setting to examine whether and when diffusion models learn semantically meaningful and factorized representations of composable features. We performed extensive controlled experiments on conditional Denoising Diffusion Probabilistic Models (DDPMs) trained to generate various forms of 2D Gaussian bump images. We found that the models learn factorized but not fully continuous manifold representations for encoding the continuous features of variation underlying the data. With such representations, models demonstrate superior feature compositionality but a limited ability to interpolate over unseen values of a given feature. Our experimental results further demonstrate that by training with independent factors of variation, diffusion models can attain compositionality with few compositional examples, suggesting a more efficient way to train DDPMs. Finally, we connect manifold formation in diffusion models to percolation theory in physics, offering insight into the sudden onset of factorized representation learning. Our controlled toy experiments thus contribute to a deeper understanding of how diffusion models capture compositional structure in data.
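The data in this reduced setting can be pictured concretely. The following is an illustrative rendering of a single 2D Gaussian bump conditioned on its center coordinates; the grid size and bump width here are placeholder choices, not necessarily the paper's exact configuration.

```python
import numpy as np

def gaussian_bump(size=32, cx=16.0, cy=16.0, sigma=2.0):
    """Render a 2D Gaussian bump centered at (cx, cy) on a size x size grid.

    The (cx, cy) pair plays the role of the continuous, composable
    factors of variation that condition the DDPM.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    img = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return img / img.max()  # normalize the peak to 1

# One training image with the bump placed at column 10, row 20.
img = gaussian_bump(cx=10.0, cy=20.0)
```

Varying `cx` and `cy` independently over the grid produces the family of images whose two underlying factors the model must learn to factorize.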
Entity Alignment with Noisy Annotations from Large Language Models
Entity alignment (EA) aims to merge two knowledge graphs (KGs) by identifying equivalent entity pairs. While existing methods rely heavily on human-generated labels, it is prohibitively expensive to incorporate cross-domain experts for annotation in real-world scenarios. The advent of Large Language Models (LLMs) presents new avenues for automating EA with annotations, given their strong capability to process semantic information. However, it is nontrivial to apply LLMs directly to EA, since the annotation space in real-world KGs is large and LLMs can generate noisy labels that may mislead the alignment.
ReLIZO: Sample Reusable Linear Interpolation-based Zeroth-order Optimization
Xiaoxing Wang
Gradient estimation is critical in zeroth-order optimization methods, which aim to obtain a descent direction by sampling update directions and querying function evaluations. Extensive research has been conducted, including smoothing and linear interpolation approaches. The former smooth the objective function, causing biased gradient estimates, while the latter often enjoy more accurate estimates, at the cost of large numbers of samples and queries at each iteration. This paper adopts the linear interpolation strategy and proposes to reduce the complexity of gradient estimation by reusing queries from prior iterations while keeping the sample size unchanged. Specifically, we model gradient estimation as a quadratically constrained linear program and derive its analytical solution. This innovatively decouples the required sample size from the variable dimension without requiring extra conditions, making it possible to leverage queries from prior iterations. Moreover, some of the intermediate variables that contribute to the gradient estimate can be directly indexed, significantly reducing the computational complexity.
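As a rough illustration of the baseline linear interpolation strategy the abstract builds on (not ReLIZO's reusable, quadratically constrained formulation), a zeroth-order gradient estimate can be obtained by sampling directions and solving a least-squares system over the resulting function-value differences:

```python
import numpy as np

def li_zo_gradient(f, x, n_samples=None, delta=1e-4, rng=None):
    """Estimate grad f(x) by linear interpolation: sample directions U and
    solve the least-squares system  delta * U @ g ~= f(x + delta*U) - f(x).

    Note the classical dependence on the dimension: n_samples >= len(x)
    queries are needed per estimate, which is the cost ReLIZO targets.
    """
    rng = np.random.default_rng(rng)
    d = x.size
    n = n_samples or d
    U = rng.standard_normal((n, d))                    # sampled update directions
    fx = f(x)
    df = np.array([f(x + delta * u) - fx for u in U])  # finite differences
    g, *_ = np.linalg.lstsq(delta * U, df, rcond=None)
    return g

# Sanity check on a quadratic f(x) = x^T A x with known gradient 2 A x.
A = np.diag([1.0, 3.0])
f = lambda x: x @ A @ x
x0 = np.array([1.0, -2.0])
g = li_zo_gradient(f, x0, n_samples=4, rng=0)
```

On the quadratic above the estimate matches the true gradient [2, -12] up to an O(delta) bias from the second-order term.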
CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
Data selection has emerged as a core issue for large-scale visual-language model pretraining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. First, instead of the classical CLIP score, which only considers the alignment between the two modalities of a single sample, we introduce negCLIPLoss, a method inspired by the CLIP training loss that adds the alignment between one sample and its contrastive pairs as an extra normalization term to CLIPScore for better quality measurement.
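The contrast between the two quality measures can be sketched on toy embeddings. The normalization below is a plain CLIP-loss-style log-sum-exp over in-batch contrastive pairs, used here only to illustrate the idea; it is not the paper's exact negCLIPLoss formula, and `tau` is an assumed temperature.

```python
import numpy as np

def _logsumexp(a, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def clip_scores(img_emb, txt_emb, tau=0.07):
    """Per-pair CLIPScore, plus a negCLIPLoss-style variant that normalizes
    each matched-pair alignment by contrastive terms over the batch."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T                    # cosine similarities, batch x batch
    clip_score = np.diag(sim)            # alignment of matched pairs only
    lse_rows = tau * _logsumexp(sim / tau, axis=1)  # image vs. all texts
    lse_cols = tau * _logsumexp(sim / tau, axis=0)  # text vs. all images
    neg_clip = clip_score - 0.5 * (lse_rows + lse_cols)
    return clip_score, neg_clip

# Toy batch: pair 0 is well aligned, pair 1's caption is mismatched.
img = np.array([[1.0, 0.0], [0.0, 1.0]])
txt = np.array([[1.0, 0.0], [1.0, 0.0]])
cs, ncs = clip_scores(img, txt)
```

In this toy batch both measures rank the well-aligned pair above the mismatched one; the normalization term matters when a sample's alignment is inflated for reasons shared with its contrastive pairs.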
BLAST: Block-Level Adaptive Structured Matrices for Efficient Deep Neural Network Inference
To address these challenges, we introduce the Block-Level Adaptive STructured (BLAST) matrix, designed to learn and leverage efficient structures prevalent in the weight matrices of linear layers within deep learning models. Compared to existing structured matrices, the BLAST matrix offers substantial flexibility, as it can represent various types of structures that are either learned from data or computed from pre-existing weight matrices.
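To illustrate why block-level structure reduces inference cost, here is a generic block low-rank weight matrix, where every block is a rank-r product. This is only one simple instance of the structures such a matrix can represent, not BLAST's actual parameterization (which is learned and more flexible).

```python
import numpy as np

class BlockLowRank:
    """A weight matrix stored as a p x q grid of rank-r blocks:
    block (i, j) is U[i][j] @ V[i][j].T with r much smaller than the
    block size b. Illustrative only; not the BLAST parameterization.
    """
    def __init__(self, p, q, b, r, rng=None):
        rng = np.random.default_rng(rng)
        self.p, self.q, self.b = p, q, b
        self.U = [[rng.standard_normal((b, r)) for _ in range(q)] for _ in range(p)]
        self.V = [[rng.standard_normal((b, r)) for _ in range(q)] for _ in range(p)]

    def matvec(self, x):
        """y = W x at cost O(p*q*b*r) instead of the dense O(p*q*b^2)."""
        y = np.zeros(self.p * self.b)
        for i in range(self.p):
            for j in range(self.q):
                xj = x[j * self.b:(j + 1) * self.b]
                y[i * self.b:(i + 1) * self.b] += self.U[i][j] @ (self.V[i][j].T @ xj)
        return y

    def dense(self):
        """Materialize the full matrix (for verification only)."""
        return np.block([[self.U[i][j] @ self.V[i][j].T for j in range(self.q)]
                         for i in range(self.p)])

W = BlockLowRank(p=2, q=2, b=4, r=2, rng=0)
x = np.arange(8.0)
y = W.matvec(x)
```

The structured multiply touches only the factors, never the dense blocks, which is where the inference savings come from.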
Natural Counterfactuals With Necessary Backtracking
Guang-Yuan Hao, Hao Wang
Counterfactual reasoning is pivotal in human cognition and especially important for providing explanations and making decisions. While Judea Pearl's influential approach is theoretically elegant, its generation of a counterfactual scenario often requires too much deviation from the observed scenarios to be feasible, as we show using simple examples. To mitigate this difficulty, we propose a framework of natural counterfactuals and a method for generating counterfactuals that are more feasible with respect to the actual data distribution. Our methodology incorporates a certain amount of backtracking when needed, allowing changes in causally preceding variables to minimize deviations from realistic scenarios. Specifically, we introduce a novel optimization framework that permits but also controls the extent of backtracking with a "naturalness" criterion. Empirical experiments demonstrate the effectiveness of our method.
Towards Universal Mesh Movement Networks
Chunyang Wang, Stephan Kramer, Joseph G. Wallwork
Solving complex Partial Differential Equations (PDEs) accurately and efficiently is an essential and challenging problem in all scientific and engineering disciplines. Mesh movement methods provide the capability to improve the accuracy of the numerical solution without increasing the overall mesh degree of freedom count. Conventional sophisticated mesh movement methods are extremely expensive and struggle to handle scenarios with complex boundary geometries. Learning-based methods offer a faster alternative, but existing ones require re-training from scratch for each different PDE type or boundary geometry, which limits their applicability, and they also often suffer from robustness issues in the form of inverted elements. In this paper, we introduce the Universal Mesh Movement Network (UM2N), which - once trained - can be applied in a non-intrusive, zero-shot manner to move meshes with different size distributions and structures, for solvers applicable to different PDE types and boundary geometries.
Claude's AI voice mode is finally rolling out - for free. Here's what you can do with it
Chatting with your favorite AI is often livelier, more convenient, and easier when you can carry on an actual voice conversation instead of typing at a prompt. Now Claude AI is joining the likes of ChatGPT, Google Gemini, and Microsoft Copilot with a voice mode all its own. On Tuesday, Anthropic announced that voice mode is now rolling out in beta to the Claude iOS and Android apps. Available in English, the feature will land on all Claude AI plans, even the freebie, over the next few weeks. Yes, this means you'll be able to kick off a voice conversation with the AI and then continue with your back-and-forth banter.
OVT-B: A New Large-Scale Benchmark for Open-Vocabulary Multi-Object Tracking
Supplementary Material
School of Software Technology, Zhejiang University
Motivation

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled?

For the current task of open-vocabulary multi-object tracking (OVMOT), there is only one benchmark available, and the task lacks high-quality, large-scale datasets. The existing dataset suffers from several limitations, including insufficient categories, limited video data, and a significant imbalance between base classes and novel classes. These deficiencies make it inadequate for supporting the evaluation of new OVMOT models. Our proposed dataset aims to provide a more comprehensive evaluation platform for the OVMOT task.

Who created this dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

This dataset was constructed by collecting and extracting data from seven other datasets and applying unified annotations. This work was completed by Haiji Liang and Ruize Han.

Who funded the creation of the dataset?