class label
DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling
Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant, predominantly capturing local patterns, which highlights the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity.
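As a rough illustration of what a channel attention gate does, the sketch below pools each channel to a single number, mixes the pooled values with a small linear map, and rescales each channel by a sigmoid gate. This is a generic squeeze-and-gate pattern, not the DiCo implementation: the function name `channel_attention`, the `weight` and `bias` parameters, and the nested-list tensor layout are all assumptions made for the example.

```python
import math


def channel_attention(features, weight, bias):
    """Rescale each channel of a feature map by a learned gate in (0, 1).

    features: list of C channels, each a list of H rows of W floats.
    weight:   C x C matrix (hypothetical learned channel-mixing projection).
    bias:     list of C floats (hypothetical learned bias).
    Returns (scaled_features, gates).
    """
    num_channels = len(features)

    # Squeeze: global average pooling gives one descriptor per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]

    # Gate: linear mixing across channels followed by a sigmoid.
    gates = []
    for i in range(num_channels):
        z = bias[i] + sum(weight[i][j] * pooled[j] for j in range(num_channels))
        gates.append(1.0 / (1.0 + math.exp(-z)))

    # Rescale every spatial position of channel i by its gate.
    scaled = [
        [[value * gates[i] for value in row] for row in channel]
        for i, channel in enumerate(features)
    ]
    return scaled, gates


# Toy input: 2 channels of size 2 x 2, with all-zero (untrained) parameters.
toy_features = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
zero_weight = [[0.0, 0.0], [0.0, 0.0]]
zero_bias = [0.0, 0.0]
toy_scaled, toy_gates = channel_attention(toy_features, zero_weight, zero_bias)
```

With all-zero parameters every gate equals sigmoid(0) = 0.5, so each channel is simply halved. Trained parameters would instead give different channels different gates.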
Blockwise Flow Matching: Improving Flow Matching Models For Efficient High-Quality Generation
Recently, Flow Matching models have pushed the boundaries of high-fidelity data generation across a wide range of domains. They typically employ a single large network to learn the entire generative trajectory from noise to data. Despite its effectiveness, this design struggles to capture distinct signal characteristics across timesteps simultaneously and incurs substantial inference costs due to the iterative evaluation of the entire model. To address these limitations, we propose Blockwise Flow Matching (BFM), a novel framework that partitions the generative trajectory into multiple temporal segments, each modeled by a smaller but specialized velocity block. This blockwise design enables each block to specialize effectively in its designated interval, improving inference efficiency and sample quality. To further enhance generation fidelity, we introduce a Semantic Feature Guidance module that explicitly conditions velocity blocks on semantically rich features aligned with pretrained representations. Additionally, we propose a lightweight Feature Residual Approximation strategy that preserves semantic quality while significantly reducing inference cost. Extensive experiments on ImageNet 256×256 demonstrate that BFM establishes a substantially improved Pareto frontier over existing Flow Matching methods, achieving 2.1× to 4.9× accelerations in inference complexity at comparable generation performance. Code is available at https://github.com/mlvlab/BFM.
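The blockwise idea can be sketched as a sampler that, at every integration step, calls only the velocity block responsible for the current time segment. This is a minimal sketch under stated assumptions, not the BFM code: the names `select_block`, `sample`, and `toy_block` are hypothetical, the segments are assumed to be equal-width, the integrator is plain Euler, and the "blocks" are closed-form toy velocities on scalars rather than learned networks.

```python
def select_block(t, num_blocks):
    """Map a time t in [0, 1) to the index of the equal-width segment containing it."""
    return min(int(t * num_blocks), num_blocks - 1)


def sample(x0, blocks, num_steps):
    """Euler-integrate dx/dt = v(x, t) from t = 0 to t = 1.

    Only one block is evaluated per step, so the cost of a step is that of a
    single small block rather than of one large network.
    Returns the final sample and the number of calls made to each block.
    """
    x = x0
    dt = 1.0 / num_steps
    calls = [0] * len(blocks)
    for i in range(num_steps):
        t = i * dt
        k = select_block(t, len(blocks))
        calls[k] += 1
        x = x + dt * blocks[k](x, t)
    return x, calls


def toy_block(target):
    """Toy stand-in for a learned velocity block: points straight at `target`."""
    return lambda x, t: (target - x) / (1.0 - t)


# Four segments, eight Euler steps: each block handles two steps.
toy_blocks = [toy_block(3.0) for _ in range(4)]
final_x, block_calls = sample(0.0, toy_blocks, num_steps=8)
```

In this toy setting all four blocks share the same closed-form velocity, so the sketch only shows the routing. In a learned model each block would be trained on its own time interval.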
TROVE: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models
Vision-language models (VLMs) have made great strides in addressing temporal understanding tasks, which involve characterizing visual changes across a sequence of images. However, recent works have suggested that when making predictions, VLMs may rely on static feature biases, such as background or object features, rather than dynamic visual changes. Static feature biases are a type of shortcut and can contribute to systematic prediction errors on downstream tasks; as a result, identifying and characterizing error-inducing static feature biases is critical prior to real-world model deployment. Existing approaches for identifying such systematic failure modes in trained models (i) are typically designed for non-temporal settings and (ii) are challenging to evaluate in temporal settings due to the lack of quantitative evaluation frameworks. In this work, we address these challenges by introducing TROVE, an automated approach for discovering error-inducing static feature biases learned by temporal VLMs. Given a trained VLM and an annotated validation dataset associated with a downstream classification task, TROVE extracts candidate static features from the dataset and scores each feature by (i) the effect of the feature on classification errors as well as (ii) the extent to which the VLM relies on the feature when making predictions. In order to quantitatively evaluate TROVE, we introduce an evaluation framework consisting of 101 trained temporal VLMs paired with ground-truth annotations for learned static feature biases. We use this framework to demonstrate that TROVE can accurately identify error-inducing static feature biases in VLMs, achieving a 28.6% improvement over the closest baseline. Finally, we apply TROVE to 7 off-the-shelf VLMs and 2 temporal understanding tasks, surfacing previously unknown static feature biases and demonstrating that knowledge of learned biases can aid in improving model performance at test time.
Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only a single additional token for denoising (<0.5% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256×256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving 63× and 23× faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively.
Principled Long-Tailed Generative Modeling via Diffusion Models
Deep generative models, particularly diffusion models, have achieved remarkable success but face significant challenges when trained on real-world, long-tailed datasets, where a few head classes dominate and many tail classes are underrepresented. This paper develops a theoretical framework for long-tailed learning via diffusion models through the lens of deep mutual learning. We introduce a novel regularized training objective that combines the standard diffusion loss with a mutual learning term, enabling balanced performance across all class labels, including the underrepresented tails. Our approach to learning via the proposed regularized objective is to formulate it as a multi-player game, with Nash equilibrium serving as the solution concept. We derive a non-asymptotic first-order convergence result for the individual gradient descent algorithm to find the Nash equilibrium. We show that the Nash gap of the score network obtained from the algorithm is upper bounded by $\mathcal{O}(\frac{1}{\sqrt{T_{train}}}+\beta)$, where $\beta$ is the regularizing parameter and $T_{train}$ is the number of iterations of the training algorithm. Furthermore, we theoretically establish hyper-parameters for the training and sampling algorithms that ensure that we find conditional score networks (under our model) with a worst-case sampling error $\mathcal{O}(\epsilon+1), \forall \epsilon> 0$ across all class labels. Our results offer insights and guarantees for training diffusion models on imbalanced, long-tailed data, with implications for fairness, privacy, and generalization in real-world generative modeling scenarios.
Formal Models of Active Learning from Contrastive Examples
Machine learning can greatly benefit from providing learning algorithms with pairs of contrastive training examples, typically pairs of instances that differ only slightly, yet have different class labels. Intuitively, the difference in the instances helps explain the difference in the class labels. This paper proposes a theoretical framework in which the effect of various types of contrastive examples on active learners is studied formally. The focus is on the sample complexity of learning concept classes and how it is influenced by the choice of contrastive examples. We illustrate our results with geometric concept classes and classes of Boolean functions. Interestingly, we reveal a connection between learning from contrastive examples and the classical model of self-directed learning.
Supplementary Material Hardware Resilience Properties of Text-Guided Image Classifiers
This section contains supplementary material that provides additional details for the main paper and further experimental analysis. In this section, we provide detailed hyperparameters (Table 4) used to train each of the architectures on which results are reported in the main paper. Note that if the batch size is reduced, the learning rate should be linearly scaled accordingly. Note that for error injection experiments, we perform single-bit flips only in the convolutional and linear layers of the neural network, in line with other work in this field. The primary motivation is that these two layer types are the most computationally intensive, consuming 90% to 95% of a DNN's computations.
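The two notes above, linear learning-rate scaling and single-bit flips, can be made concrete with a short sketch. It is illustrative only and not the authors' tooling: `scale_learning_rate` and `flip_bit` are hypothetical helper names, the learning rate and batch sizes are made-up values rather than entries from Table 4, and weights are assumed to be stored as IEEE 754 float32.

```python
import struct


def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: the learning rate changes in proportion to batch size."""
    return base_lr * new_batch_size / base_batch_size


def flip_bit(value, bit_index):
    """Flip one bit of a float32 value (bit 0 = least significant, bit 31 = sign)."""
    if not 0 <= bit_index <= 31:
        raise ValueError("bit_index must be in [0, 31] for float32")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << bit_index
    # Reinterpret the modified integer as a float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits))
    return flipped


# Hypothetical values: halving the batch size halves the learning rate.
halved_lr = scale_learning_rate(0.1, base_batch_size=256, new_batch_size=128)

# A sign-bit flip negates the weight; an exponent-bit flip can change it drastically.
sign_flipped = flip_bit(1.0, 31)
exponent_flipped = flip_bit(1.0, 30)
```

Flipping the sign bit of 1.0 gives -1.0, while flipping the top exponent bit turns 1.0 into infinity. This shows why the position of the flipped bit matters far more than the fact that only one bit changed.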
5812f92450ccaf17275500841c70924a-Supplemental.pdf
We present a brief proof of the local optimality of one-hot encodings in the decision-theoretic framework presented in Section 3.2. We seek to prove that, under the assumptions of an identity reward matrix, tokens constrained to a unit hypercube, and additive Gaussian noise, one-hot tokens are an optimally robust communication strategy. We only seek to prove local optimality, as one may trivially generate multiple, equally optimal tokens by, for example, flipping all bits. The following derivation uses Karush-Kuhn-Tucker (KKT) conditions, a generalization of Lagrange multipliers [17]. We maximize the function subject to constraints (Equations 13 and 14); the resulting stationarity condition combines terms in $T_j^\top T_i$ and $\|T_j\|^2$ with the multipliers $\vec{\mu}_i$ and $\vec{\lambda}_i$ and is set equal to $\vec{0}$. We seek to show that one-hot vectors are an optimum, so we now show that one-hot vectors indeed respect the constraints and set the derivatives to zero.
When Domain Experts
Human Intelligence (HI) excels at combining basic skills to solve complex tasks. This capability is vital for Artificial Intelligence (AI) and should be embedded in comprehensive AI Agents, enabling them to harness expert models for complex task-solving towards Artificial General Intelligence (AGI). Large Language Models (LLMs) show promising learning and reasoning abilities, and can effectively use external models, tools, plugins, or APIs to tackle complex problems. In this work, we introduce OpenAGI, an open-source AGI research and development platform designed for solving multi-step, real-world tasks. Specifically, OpenAGI uses a dual strategy, integrating standard benchmark tasks for benchmarking and evaluation, and open-ended tasks including more expandable models, tools, plugins, or APIs for creative problem-solving. Tasks are presented as natural language queries to the LLM, which then selects and executes appropriate models. We also propose a Reinforcement Learning from Task Feedback (RLTF) mechanism that uses task results to improve the LLM's task-solving ability, which creates a self-improving AI feedback loop. While we acknowledge that AGI is a broad and multifaceted research challenge with no singularly defined solution path, the integration of LLMs with domain-specific expert models, inspired by mirroring the blend of general and specialized intelligence in humans, offers a promising approach towards AGI.