Deng, Zhiwei
ICONS: Influence Consensus for Vision-Language Data Selection
Wu, Xindi, Xia, Mengzhou, Shao, Rulin, Deng, Zhiwei, Koh, Pang Wei, Russakovsky, Olga
Visual Instruction Tuning typically requires a large amount of vision-language training data. This data often contains redundant information that increases computational costs without proportional performance gains. In this work, we introduce ICONS, a gradient-driven Influence CONsensus approach for vision-language data Selection that selects a compact training dataset for efficient multi-task training. The key element of our approach is cross-task influence consensus, which uses majority voting across task-specific influence matrices to identify samples that are consistently valuable across multiple tasks, allowing us to effectively prioritize data that optimizes for overall performance. Experiments show that models trained on our selected data (20% of LLaVA-665K) achieve 98.6% of the relative performance obtained with the full dataset. Additionally, we release this subset, LLaVA-ICONS-133K, a compact yet highly informative subset of the LLaVA-665K visual instruction tuning data that preserves high-impact training data for efficient vision-language model development.
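A minimal sketch of the cross-task consensus idea described above, assuming a precomputed array `influence` of shape (num_train, num_tasks) holding gradient-based influence estimates; the voting scheme and parameter names are illustrative, not the paper's released implementation.

```python
# Hypothetical sketch of cross-task influence consensus via majority voting.
# influence[i, t] is assumed to approximate the influence of training sample i
# on validation task t.
import numpy as np

def consensus_select(influence: np.ndarray, budget: int, top_frac: float = 0.2) -> np.ndarray:
    num_train, num_tasks = influence.shape
    k = int(top_frac * num_train)
    votes = np.zeros(num_train, dtype=int)
    for t in range(num_tasks):
        # A sample receives one vote per task where it ranks in that task's top-k.
        top_idx = np.argsort(-influence[:, t])[:k]
        votes[top_idx] += 1
    # Keep the samples that are consistently valuable across tasks.
    return np.argsort(-votes)[:budget]

# Usage (hypothetical): selected = consensus_select(influence, budget=133_000)
```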
Influential Language Data Selection via Gradient Trajectory Pursuit
Deng, Zhiwei, Li, Tao, Li, Yang
Curating a desirable dataset for training has been the core of building highly capable large language models (Touvron et al., 2023; Achiam et al., 2023; Team et al., 2024). Gradient influence scores (Pruthi et al., 2020; Xia et al., 2024) have been shown to correlate with model performance and are commonly used as the criterion for data selection. However, existing methods are built upon either individual sample rankings or inefficient matching processes, leading to suboptimal performance or scaling issues. In this paper, we propose Gradient Trajectory Pursuit (GTP), an algorithm that pursues gradient trajectories by jointly selecting data points under an L0-norm regularized objective. The proposed algorithm highlights: (1) joint selection instead of independent top-k selection, which automatically de-duplicates samples; and (2) higher efficiency through compressive sampling processes, which can be further sped up using a distributed framework. In experiments, we demonstrate the algorithm on both in-domain and target-domain selection benchmarks and show that it consistently outperforms top-k selection and competitive algorithms; for example, our algorithm selects as little as 0.5% of the data to achieve full performance on the targeted instruction tuning tasks. Large language models encompass an enormous amount of knowledge acquired from their pretraining corpora (Brown et al., 2020; Jiang et al., 2023; Touvron et al., 2023). A crucial component that makes language models practically useful is the post-training phase, which empowers the model with a variety of extra abilities and skills, such as aligning language models to follow human instructions (Ouyang et al., 2022; Taori et al., 2023; Wang et al., 2023b; Xu et al., 2023), teaching models tool use (Schick et al., 2023), or exploring environments (Carta et al., 2023; Lin et al., 2023).
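To make the contrast with independent top-k ranking concrete, here is a matching-pursuit-style sketch of joint selection over per-sample gradients; it illustrates why residual updates de-duplicate redundant samples, but it is a simplification under assumed inputs, not the paper's exact GTP procedure.

```python
# Greedy pursuit sketch: pick training-sample gradients whose combination best
# reconstructs the target gradient, updating a residual so near-duplicate
# samples stop being attractive once one of them is selected.
import numpy as np

def gradient_pursuit(train_grads: np.ndarray, target_grad: np.ndarray, budget: int) -> list:
    # train_grads: (num_train, d) per-sample (projected) gradients
    # target_grad: (d,) gradient of the target/validation objective
    residual = target_grad.copy()
    selected = []
    for _ in range(budget):
        scores = train_grads @ residual            # correlation with the current residual
        scores[selected] = -np.inf                 # joint selection: no repeats
        i = int(np.argmax(scores))
        selected.append(i)
        # Least-squares refit on the selected subset, then update the residual.
        A = train_grads[selected].T                # (d, |S|)
        w, *_ = np.linalg.lstsq(A, target_grad, rcond=None)
        residual = target_grad - A @ w
    return selected
```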
A Label is Worth a Thousand Images in Dataset Distillation
Qin, Tian, Deng, Zhiwei, Alvarez-Melis, David
Data $\textit{quality}$ is a crucial factor in the performance of machine learning models, a principle that dataset distillation methods exploit by compressing training datasets into much smaller counterparts that maintain similar downstream performance. Understanding how and why data distillation methods work is vital not only for improving these methods but also for revealing fundamental characteristics of "good" training data. However, a major challenge in achieving this goal is the observation that distillation approaches, which rely on sophisticated but mostly disparate methods to generate synthetic data, have little in common with each other. In this work, we highlight a largely overlooked aspect common to most of these methods: the use of soft (probabilistic) labels. Through a series of ablation experiments, we study the role of soft labels in depth. Our results reveal that the main factor explaining the performance of state-of-the-art distillation methods is not the specific techniques used to generate synthetic data but rather the use of soft labels. Furthermore, we demonstrate that not all soft labels are created equal; they must contain $\textit{structured information}$ to be beneficial. We also provide empirical scaling laws that characterize the effectiveness of soft labels as a function of images-per-class in the distilled dataset and establish an empirical Pareto frontier for data-efficient learning. Combined, our findings challenge conventional wisdom in dataset distillation, underscore the importance of soft labels in learning, and suggest new directions for improving distillation methods. Code for all experiments is available at https://github.com/sunnytqin/no-distillation.
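The ablations isolate soft (probabilistic) labels as the key ingredient; a hedged sketch of the corresponding student update is below, where `student`, `images`, and `soft_labels` are placeholder names and the KL-divergence form is one common choice rather than the paper's prescribed loss.

```python
# Training a student on distilled images with soft labels (e.g., teacher
# probabilities) via KL divergence instead of one-hot cross-entropy.
import torch
import torch.nn.functional as F

def soft_label_step(student, optimizer, images, soft_labels, temperature=1.0):
    optimizer.zero_grad()
    logits = student(images)
    log_probs = F.log_softmax(logits / temperature, dim=-1)
    # soft_labels: (batch, num_classes) probabilities carrying structured information
    loss = F.kl_div(log_probs, soft_labels, reduction="batchmean") * temperature ** 2
    loss.backward()
    optimizer.step()
    return loss.item()
```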
What is Dataset Distillation Learning?
Yang, William, Zhu, Ye, Deng, Zhiwei, Russakovsky, Olga
Dataset distillation has emerged as a strategy to overcome the hurdles associated with large datasets by learning a compact set of synthetic data that retains essential information from the original dataset. While distilled data can be used to train high-performing models, little is understood about how the information is stored. In this study, we posit and answer three questions about the behavior, representativeness, and point-wise information content of distilled data. We reveal that distilled data cannot serve as a substitute for real data during training outside the standard evaluation setting for dataset distillation. Additionally, the distillation process retains high task performance by compressing information related to the early training dynamics of real models. Finally, we provide a framework for interpreting distilled data and reveal that individual distilled data points contain meaningful semantic information. This investigation sheds light on the intricate nature of distilled data, providing a better understanding of how they can be effectively utilized.
Distributional Dataset Distillation with Subtask Decomposition
Qin, Tian, Deng, Zhiwei, Alvarez-Melis, David
What does a neural network learn when training from a task-specific dataset? Synthesizing this knowledge is the central idea behind Dataset Distillation, which recent work has shown can be used to compress large datasets into a small set of input-label pairs ($\textit{prototypes}$) that capture essential aspects of the original dataset. In this paper, we make the key observation that existing methods distilling into explicit prototypes are very often suboptimal, incurring unexpected storage costs from distilled labels. In response, we propose $\textit{Distributional Dataset Distillation}$ (D3), which encodes the data using minimal sufficient per-class statistics paired with a decoder, distilling the dataset into a compact distributional representation that is more memory-efficient than prototype-based methods. To scale up the process of learning these representations, we propose $\textit{Federated distillation}$, which decomposes the dataset into subsets, distills them in parallel using sub-task experts, and then re-aggregates them. We thoroughly evaluate our algorithm on a three-dimensional metric and show that our method achieves state-of-the-art results on TinyImageNet and ImageNet-1K. Specifically, we outperform the prior art by $6.9\%$ on ImageNet-1K under a storage budget of 2 images per class.
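An illustrative sketch of what a distributional representation could look like, assuming a diagonal Gaussian per class in a latent space and a shared decoder; module and parameter names are assumptions, not the paper's code.

```python
# Each class is stored as compact latent statistics; training data is
# re-synthesized on demand by sampling latents and decoding them.
import torch
import torch.nn as nn

class DistributionalClassStore(nn.Module):
    def __init__(self, num_classes: int, latent_dim: int, decoder: nn.Module):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_classes, latent_dim))
        self.log_sigma = nn.Parameter(torch.zeros(num_classes, latent_dim))
        self.decoder = decoder  # shared across classes

    def sample(self, labels: torch.Tensor) -> torch.Tensor:
        # Draw latents from each class distribution and decode them into images.
        mu, sigma = self.mu[labels], self.log_sigma[labels].exp()
        z = mu + sigma * torch.randn_like(mu)
        return self.decoder(z)
```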
Perceptual Group Tokenizer: Building Perception with Iterative Grouping
Deng, Zhiwei, Chen, Ting, Li, Yang
The human visual recognition system shows an astonishing capability to compress visual information into a set of tokens containing rich representations without label supervision. One critical driving principle behind it is perceptual grouping (Palmer, 2002; Wagemans et al., 2012; Herzog, 2018). Despite being widely used in computer vision in the early 2010s, it remains a mystery whether perceptual grouping can be leveraged to derive a neural visual recognition backbone that generates equally powerful representations. In this paper, we propose the Perceptual Group Tokenizer, a model that relies entirely on grouping operations to extract visual features and perform self-supervised representation learning, where a series of grouping operations iteratively hypothesize the context for pixels or superpixels to refine feature representations. We show that the proposed model achieves competitive performance compared to state-of-the-art vision architectures and inherits desirable properties including adaptive computation without re-training and interpretability. Specifically, the Perceptual Group Tokenizer achieves 80.3% on the ImageNet-1K self-supervised learning benchmark with linear-probe evaluation, establishing a new milestone for this paradigm. Ever since the surge of deep learning, feature detection has predominated the vision field, becoming the main principle behind representation-learning backbone designs and driving impressive progress (Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Chen et al., 2017; Tan & Le, 2019; Qi et al., 2020; Liu et al., 2022b). The success of this paradigm, although striking, raises the question of whether perceptual grouping can also be used as the driving principle to construct a visual recognition model. Different from detecting and selecting distinctive features, perceptual grouping emphasizes learning a feature space in which the similarity of all pixels can be effectively measured (Uijlings et al., 2013; Arbeláez et al., 2014). With such a feature space, semantically meaningful objects and regions can be easily discovered with a simple grouping algorithm and used as a compact set to represent an image (Uijlings et al., 2013; Arbeláez et al., 2014; Locatello et al., 2020).
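A compact sketch of one iterative-grouping round in the spirit described above (soft assignment of pixel/superpixel features to group tokens, followed by a token update); it is written from the abstract under assumed tensor shapes, not taken from the released model.

```python
# One grouping iteration: group tokens "hypothesize context", features are
# softly assigned to them, and tokens are refined by weighted pooling.
import torch
import torch.nn.functional as F

def grouping_iteration(tokens: torch.Tensor, features: torch.Tensor, tau: float = 1.0):
    # tokens:   (B, G, D) current group tokens
    # features: (B, N, D) pixel/superpixel features
    sim = torch.einsum("bgd,bnd->bgn", tokens, features) / tau
    assign = F.softmax(sim, dim=1)                                # each feature competes over groups
    assign = assign / assign.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    new_tokens = torch.einsum("bgn,bnd->bgd", assign, features)   # refine tokens by weighted pooling
    return new_tokens, assign
```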
A Zero-Shot Language Agent for Computer Control with Structured Reflection
Li, Tao, Li, Gang, Deng, Zhiwei, Wang, Bryan, Li, Yang
Large language models (LLMs) have shown increasing capability at planning and executing a high-level goal in a live computer environment (e.g., MiniWoB++). To perform a task, recent works often require a model to learn from trace examples of the task via either supervised learning or few/many-shot prompting. Without these trace examples, it remains a challenge for an agent to autonomously learn and improve its control of a computer, which limits its ability to perform new tasks. We approach this problem with a zero-shot agent that requires no given expert traces. Our agent plans executable actions in a partially observed environment and iteratively progresses on a task by identifying and learning from its mistakes via self-reflection and structured thought management. On the easy tasks of MiniWoB++, we show that our zero-shot agent often outperforms recent state-of-the-art methods with more efficient reasoning. For more complex tasks, our reflective agent performs on par with prior best models, even though previous works had the advantage of access to expert traces or additional screen information.
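A high-level sketch of the plan/act/reflect loop the abstract describes; `llm_plan`, `llm_reflect`, and the environment interface are hypothetical stand-ins for LLM calls and a MiniWoB++-style wrapper, not the paper's API.

```python
# Zero-shot agent loop with structured reflection: no expert traces are used;
# failed attempts are summarized into reflections that condition the next plan.
def run_episode(task_goal, env, max_attempts=3):
    reflections = []                       # structured notes on past mistakes
    for attempt in range(max_attempts):
        obs = env.reset()
        done, success = False, False
        while not done:
            action = llm_plan(task_goal, obs, reflections)  # hypothetical LLM planning call
            obs, success, done = env.step(action)
        if success:
            return True
        # Convert the failed trajectory into a structured reflection for the next try.
        reflections.append(llm_reflect(task_goal, env.trajectory()))
    return False
```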
Unseen Image Synthesis with Diffusion Models
Zhu, Ye, Wu, Yu, Deng, Zhiwei, Russakovsky, Olga, Yan, Yan
While the current trend in the generative field is scaling up towards larger models and more training data for generalized domain representations, we go in the opposite direction in this work by synthesizing unseen-domain images without additional training. We do so via latent sampling and geometric optimization using pretrained and frozen Denoising Diffusion Probabilistic Models (DDPMs) trained on single-domain datasets. Our key observation is that DDPMs pre-trained even just on single-domain images are already equipped with sufficient representation ability to reconstruct arbitrary images from the inverted latent encoding, following bidirectional deterministic diffusion and denoising trajectories. This motivates us to investigate the statistical and geometric behaviors of Out-Of-Distribution (OOD) samples from unseen image domains in the latent spaces along the denoising chain. Notably, we theoretically and empirically show that the inverted OOD samples also form Gaussians that are distinguishable from the original In-Domain (ID) samples in the intermediate latent spaces, which allows us to sample from them directly. Geometric domain-specific and model-dependent information about the unseen subspace (e.g., sample-wise distances and angles) is used to further optimize the sampled OOD latent encodings from the estimated Gaussian prior. We conduct extensive analysis and experiments using pre-trained diffusion models (DDPM, iDDPM) on different datasets (AFHQ, CelebA-HQ, LSUN-Church, and LSUN-Bedroom), proving the effectiveness of this novel perspective for exploring and re-thinking diffusion models' data synthesis generalization ability. "Seek inward in the face of difficulties." The current research trend focuses on leveraging larger models and more training data to facilitate improved generalization. The popularity of recent large-scale models such as DALLE-2 (Ramesh et al., 2022) and Imagen (Ho et al., 2022a) has demonstrated their impressive and promising representation ability when trained on an enormous amount of data. This work includes abundant supporting analysis, qualitative examples, and discussions in the supplementary material.
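A sketch of the latent-sampling idea under stated assumptions: invert a handful of unseen-domain images with a deterministic (DDIM-style) forward pass of the frozen single-domain model, fit a Gaussian to the inverted latents, then sample new latents and run the deterministic reverse pass. `ddim_invert` and `ddim_denoise` are assumed wrappers around a pretrained diffusion model, and the geometric refinement step is omitted.

```python
# Estimate a Gaussian prior for the unseen (OOD) subspace in the inverted
# latent space, then sample from it and denoise deterministically.
import torch

def synthesize_unseen(model, ood_images: torch.Tensor, num_samples: int):
    latents = ddim_invert(model, ood_images)          # (n, C, H, W) inverted encodings
    flat = latents.flatten(1)
    mu, std = flat.mean(0), flat.std(0)               # estimated Gaussian of the OOD latents
    z = mu + std * torch.randn(num_samples, flat.shape[1])
    z = z.view(num_samples, *latents.shape[1:])
    return ddim_denoise(model, z)                     # deterministic denoising trajectory
```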
Remember the Past: Distilling Datasets into Addressable Memories for Neural Networks
Deng, Zhiwei, Russakovsky, Olga
We propose an algorithm that compresses the critical information of a large dataset into compact addressable memories. These memories can then be recalled to quickly re-train a neural network and recover the performance (instead of storing and re-training on the full original dataset). Building upon the dataset distillation framework, we make a key observation that a shared common representation allows for more efficient and effective distillation. Concretely, we learn a set of bases (aka ``memories'') which are shared between classes and combined through learned flexible addressing functions to generate a diverse set of training examples. This leads to several benefits: 1) the size of the compressed data does not necessarily grow linearly with the number of classes; 2) an overall higher compression rate with more effective distillation is achieved; and 3) more generalized queries are allowed beyond recalling the original classes. We demonstrate state-of-the-art results on the dataset distillation task across six benchmarks, including retained-accuracy improvements of up to 16.5% and 9.7% when distilling CIFAR10 and CIFAR100, respectively. We then leverage our framework to perform continual learning, achieving state-of-the-art results on four benchmarks, with a 23.2% accuracy improvement on MANY. The code is released on our project webpage https://github.com/princetonvisualai/RememberThePast-DatasetDistillation.
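A minimal sketch of the shared-memory idea: a bank of learned bases shared across classes, combined through learned addressing coefficients to recall synthetic training examples. Dimensions and names are illustrative; see the released code above for the actual method.

```python
# Shared bases ("memories") plus learned addressing functions; recalled
# examples are linear combinations of the bases, so storage does not need to
# grow linearly with the number of classes.
import torch
import torch.nn as nn

class AddressableMemory(nn.Module):
    def __init__(self, num_bases: int, image_dim: int, num_queries: int):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_bases, image_dim))        # shared memories
        self.addresses = nn.Parameter(torch.randn(num_queries, num_bases))  # addressing coefficients

    def recall(self) -> torch.Tensor:
        # Each recalled training example addresses the shared memory bank.
        return self.addresses @ self.bases  # (num_queries, image_dim)
```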
Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation
Deng, Zhiwei, Narasimhan, Karthik, Russakovsky, Olga
The ability to perform effective planning is crucial for building an instruction-following agent. When navigating through a new environment, an agent is challenged with (1) connecting the natural language instructions with its progressively growing knowledge of the world; and (2) performing long-range planning and decision making in the form of effective exploration and error correction. Current methods are still limited on both fronts despite extensive efforts. In this paper, we introduce the Evolving Graphical Planner (EGP), a model that performs global planning for navigation based on raw sensory input. The model dynamically constructs a graphical representation, generalizes the action space to allow for more flexible decision making, and performs efficient planning on a proxy graph representation. We evaluate our model on a challenging Vision-and-Language Navigation (VLN) task with photorealistic images and achieve superior performance compared to previous navigation architectures. For instance, we achieve a 53% success rate on the test split of the Room-to-Room navigation task through pure imitation learning, outperforming previous navigation architectures by up to 5%.
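A schematic sketch (not the released EGP code) of the evolving-graph idea: newly observed navigable viewpoints are added as nodes as the agent moves, and planning scores frontier nodes globally rather than choosing only among local actions. The frontier heuristic and `score_fn` are assumptions for illustration.

```python
# Grow a navigation graph from raw observations and plan over it globally.
import networkx as nx

class EvolvingGraph:
    def __init__(self):
        self.g = nx.Graph()

    def expand(self, current_node, observed_neighbors):
        # Add newly perceived candidate viewpoints and connect them to the current node.
        for node, feature in observed_neighbors:
            self.g.add_node(node, feature=feature)
            self.g.add_edge(current_node, node)

    def plan(self, current_node, score_fn):
        # Global planning: pick the highest-scoring frontier node anywhere in the
        # graph, then navigate to it along the shortest known path.
        frontier = [n for n in self.g.nodes if self.g.degree[n] == 1 and n != current_node]
        if not frontier:
            return [current_node]
        target = max(frontier, key=score_fn)
        return nx.shortest_path(self.g, current_node, target)
```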