A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Vision large language models (LLMs) have recently made remarkable progress, yet they still face challenges on the path to becoming multimodal generalists, such as coarse-grained instance-level understanding, a lack of unified support for both images and videos, and insufficient coverage across diverse vision tasks.
Author Feedback
We thank the reviewers for their detailed comments. By "generic" we mean that the model can be applied to different tasks that share the same problem structure. We are happy to revise the terminology and highlight which applications TGSL is appropriate for, namely those where the input and output show a certain resemblance. We mentioned in Line 269 that SA+MM cannot achieve reasonable performance. To the best of our knowledge, we are the first to work in this direction.
Single-Loop Stochastic Algorithms for Difference of Max-Structured Weakly Convex Functions Qi Qi, Zhaosong Lu, Tianbao Yang
The problem class we study covers two families of problems that have been studied but lack single-loop stochastic algorithms: differences of weakly convex functions, and weakly-convex strongly-concave min-max problems. We propose a stochastic Moreau envelope approximate gradient method, dubbed SMAG, the first single-loop algorithm for solving these problems, and provide a state-of-the-art non-asymptotic convergence rate. The key design idea is to compute an approximate gradient of the Moreau envelopes of Φ and Ψ using only one stochastic gradient step on the primal and dual variables per iteration. Empirically, we conduct experiments on positive-unlabeled (PU) learning and partial area under the ROC curve (pAUC) optimization with an adversarial fairness regularizer to validate the effectiveness of the proposed algorithm.
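To make the single-loop idea concrete, below is a minimal sketch of an SMAG-style update on a hypothetical toy instance (the problem, step sizes, and Moreau parameter ρ are all assumptions, not values from the paper): the prox point of each Moreau subproblem is tracked with one stochastic gradient step per iteration, and the envelope gradient is read off as (x − prox)/ρ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: minimize phi(x) = Phi(x) - Psi(x), both weakly convex.
# Here Phi(x) = ||x - a||^2 and Psi(x) = ||x - b|| (hypothetical choice).
a, b = np.array([2.0, -1.0]), np.array([0.5, 0.5])

def grad_Phi(z):  # stochastic gradient of Phi (noise simulates sampling)
    return 2 * (z - a) + 0.1 * rng.standard_normal(z.shape)

def grad_Psi(z):  # stochastic (sub)gradient of Psi
    d = z - b
    return d / (np.linalg.norm(d) + 1e-12) + 0.1 * rng.standard_normal(z.shape)

rho, eta, gamma = 0.5, 0.05, 0.05  # Moreau parameter and step sizes (assumed)
x = np.zeros(2)
z1, z2 = x.copy(), x.copy()        # running estimates of the two prox points

for t in range(2000):
    # One SGD step on each Moreau subproblem min_z f(z) + ||z - x||^2/(2*rho)
    z1 -= eta * (grad_Phi(z1) + (z1 - x) / rho)
    z2 -= eta * (grad_Psi(z2) + (z2 - x) / rho)
    # Approximate gradient of Phi_rho(x) - Psi_rho(x) via (x - prox)/rho
    g = (x - z1) / rho - (x - z2) / rho
    x -= gamma * g

print("final iterate:", x)
```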
Deep Homomorphism Networks
Many real-world graphs are large and have characteristic subgraph patterns, such as triangles in social networks, cliques in web graphs, and cycles in molecular networks. Detecting such subgraph patterns is important in many applications; there is therefore strong demand for graph neural networks (GNNs) that can detect such patterns and run fast on large graphs. In this study, we propose a new GNN layer, named the graph homomorphism layer.
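For intuition, here is a minimal, hypothetical sketch of what a homomorphism-based layer could look like: it enriches node features with rooted homomorphism counts of a fixed pattern (the 3-cycle here), exploiting the fact that hom(C3, G) rooted at node i equals the number of closed walks of length 3 at i. The layer name and aggregation are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def rooted_c3_homs(adj):
    """Rooted homomorphism counts of the 3-cycle: hom(C3, G) rooted at node i
    is the number of closed walks of length 3 at i, i.e., diag(A^3)."""
    return np.diag(adj @ adj @ adj)

class HomomorphismCountLayer:
    """Toy layer: concatenate rooted pattern-homomorphism counts (here, the
    3-cycle) onto the node features."""
    def __call__(self, adj, feats):
        counts = rooted_c3_homs(adj).reshape(-1, 1).astype(feats.dtype)
        return np.concatenate([feats, counts], axis=1)

# Usage: 4-node graph with one triangle (0-1-2) and a pendant node 3.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(HomomorphismCountLayer()(adj, np.eye(4)))
```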
Supplementary Material: Consistency Regularization for Certified Robustness of Smoothed Classifiers
A.1 Training details We train every model via stochastic gradient descent (SGD) with Nesterov momentum of weight 0.9 and no dampening. When SmoothAdv is used, we adopt the warm-up strategy on the attack radius ε [8]: ε is initially set to zero and linearly increased to a pre-defined hyperparameter during the first 10 epochs. When training on MNIST, we perform no pre-processing other than normalizing each pixel from the range 0-255 to 0-1; each image is labeled from 0 to 9, i.e., there are 10 classes. The full dataset can be downloaded at http://yann.lecun.com/
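For concreteness, the optimizer setup and ε warm-up might look like the following in PyTorch; the model and learning rate are placeholders, not values from the paper.

```python
import torch

model = torch.nn.Linear(784, 10)  # stand-in for the actual classifier

# SGD with Nesterov momentum 0.9; PyTorch's dampening defaults to 0.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,  # lr is a placeholder
                            momentum=0.9, nesterov=True)

def attack_radius(epoch, eps_max, warmup_epochs=10):
    """Linear warm-up of the SmoothAdv attack radius over the first epochs."""
    return eps_max * min(1.0, epoch / warmup_epochs)

print([attack_radius(e, eps_max=1.0) for e in range(12)])
```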
Consistency Regularization for Certified Robustness of Smoothed Classifiers Jinwoo Shin
In this paradigm, one should rethink the notion of adversarial robustness in terms of the generalization ability of a classifier under noisy observations. We find that the trade-off between accuracy and certified robustness of smoothed classifiers can be greatly controlled simply by regularizing the prediction consistency over noise. This relationship allows us to design a robust training objective without approximating a non-existing smoothed classifier, e.g., via soft smoothing. Our experiments across various deep neural network architectures and datasets show that the "certified" ℓ2-robustness can be significantly improved by the proposed consistency regularization.
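As a rough illustration of the idea, one plausible form of such a regularizer penalizes disagreement between predictions on independently noised copies of an input. The sketch below (the loss shape, λ, number of copies m, and noise level σ) is an assumption for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x, y, sigma=0.25, m=2, lam=1.0):
    """Cross-entropy on noisy copies plus a consistency term pulling each
    noisy prediction toward their average (illustrative sketch)."""
    logps = [F.log_softmax(model(x + sigma * torch.randn_like(x)), dim=1)
             for _ in range(m)]
    ce = sum(F.nll_loss(lp, y) for lp in logps) / m
    avg_p = torch.stack([lp.exp() for lp in logps]).mean(0)
    consist = sum(F.kl_div(lp, avg_p, reduction="batchmean") for lp in logps) / m
    return ce + lam * consist

# Toy usage on random data with a stand-in classifier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))
x, y = torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,))
print(consistency_loss(model, x, y).item())
```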
Author Feedback
Thank you for your thoughtful feedback. We will first discuss common themes and then address specific reviewer comments. Significance: even though ExpO is "simple" (in that it connects existing concepts, albeit in a novel way), we believe it is significant, and we will add a discussion as outlined below. "Adversarial Robustness..." by Qin et al. does not consider interpretability at all. Several methods rely on domain knowledge, e.g., "Learning credible..." by Du et al. and "Learning Deep..." by Weinberger et al.; we will compare to RRR in this manner.
MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer
Transferring visual-language knowledge from large-scale foundation models to video recognition has proved effective. To bridge the domain gap, additional parametric modules are added to capture temporal information. However, zero-shot generalization diminishes as the number of specialized parameters grows, forcing existing works to trade off between zero-shot and closed-set performance. In this paper, we present MoTE, a novel framework that balances generalization and specialization in one unified model. Our approach tunes a mixture of temporal experts to learn multiple task views with varying degrees of data fitting. To maximally preserve the knowledge of each expert, we propose Weight Merging Regularization, which regularizes the merging process of the experts in weight space.
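A minimal sketch of one way such a regularizer could look: experts are merged by weight averaging (a uniform merge is assumed here), and each expert is penalized for drifting far from the merged weights in weight space. The function name and the uniform merge are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def weight_merging_regularizer(experts):
    """Penalize each expert's squared distance to the merged (averaged)
    weights. `experts` is a list of modules with identical architectures."""
    reg = 0.0
    params = [list(e.parameters()) for e in experts]
    for group in zip(*params):               # matching tensors across experts
        merged = torch.stack(group).mean(0)  # uniform weight merging (assumed)
        reg += sum((p - merged).pow(2).sum() for p in group)
    return reg / len(experts)

# Toy usage with stand-in "temporal experts".
experts = [torch.nn.Linear(8, 8) for _ in range(4)]
print(weight_merging_regularizer(experts).item())
```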
Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model Mingyang Yi
Recently, the strong latent Diffusion Probabilistic Model (DPM) has been applied to high-quality Text-to-Image (T2I) generation (e.g., Stable Diffusion) by injecting the encoded target text prompt into the gradually denoising diffusion image generator. Despite the success of DPM in practice, the mechanism behind it remains to be explored. To fill this gap, we begin by examining the intermediate states during the gradual denoising generation process in DPM. Our empirical observations indicate that the shape of the image is reconstructed within the first few denoising steps, and the image is then filled in with details (e.g., texture). This phenomenon arises because the low-frequency (shape-relevant) signal of the noisy image is not corrupted until the final stage of the noise-adding forward process in DPM, which corresponds to the initial stage of generation.
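This low-/high-frequency claim can be probed numerically: under the forward process x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε, Gaussian noise is white (flat across frequencies), while natural images concentrate energy at low frequencies, so the low band keeps a usable signal-to-noise ratio much deeper into the forward process. The sketch below is an illustrative experiment on a synthetic 1/f image, not the paper's code; the cutoff and schedule values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "natural" image with a 1/f power spectrum, so most energy is
# concentrated at low frequencies, as in real photographs.
n = 64
f = np.fft.fftfreq(n)
fx, fy = np.meshgrid(f, f)
radius = np.hypot(fx, fy)
spectrum = 1.0 / np.maximum(radius, 1.0 / n)
x0 = np.fft.ifft2(spectrum * np.exp(2j * np.pi * rng.random((n, n)))).real
x0 /= x0.std()

def band_snr(x0, alpha_bar, low=True, cutoff=0.1):
    """Expected per-bin SNR of a frequency band of x_t = sqrt(ab)*x0 +
    sqrt(1-ab)*eps; white Gaussian noise has flat unit power per bin."""
    px = np.abs(np.fft.fft2(x0)) ** 2 / x0.size  # signal power spectrum
    mask = radius < cutoff if low else radius >= cutoff
    return alpha_bar * px[mask].mean() / (1 - alpha_bar)

for ab in [0.9, 0.5, 0.1, 0.01]:  # alpha_bar shrinks as t grows
    print(f"alpha_bar={ab:.2f}  low-band SNR={band_snr(x0, ab):9.2f}  "
          f"high-band SNR={band_snr(x0, ab, low=False):7.3f}")
```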
ViSioNS: Visual Search in Natural Scenes Benchmark Gonzalo Ruarte, Juan E. Kamienkowski
Visual search is an essential part of almost any everyday human interaction with the visual environment [1, 2]. Several algorithms can now predict gaze positions during free viewing, but few models attempt to simulate human behavior during visual search in natural scenes. Furthermore, these models vary widely in design and differ in the datasets and metrics with which they were evaluated. There is thus a need for a reference point against which each model can be tested and from which potential improvements can be derived. In this study, we select publicly available state-of-the-art visual search models and natural-scene datasets, and provide a common framework for their evaluation. To this end, we apply a unified format and criteria, bridging the gaps between them, and we estimate the models' efficiency and similarity to humans using a specific set of metrics. This integration has allowed us to enhance the Ideal Bayesian Searcher by combining it with a neural-network-based visual search model, enabling it to generalize to other datasets. The present work sheds light on the limitations of current models and shows how integrating different approaches under unified criteria can lead to better algorithms. Moreover, it is a step toward meeting the urgent need for benchmark data and metrics that support the development of more general computational models of human visual search.
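As a concrete example of the kind of metric involved, a common measure of search efficiency is the cumulative fraction of trials in which the target has been fixated within the first k fixations. The sketch below is a generic illustration under assumed data formats (fixation coordinates and target bounding boxes), not the benchmark's exact implementation.

```python
import numpy as np

def cumulative_search_performance(scanpaths, targets, max_fixations=10):
    """Fraction of trials whose target was fixated within the first k
    fixations, for k = 1..max_fixations. A fixation is a hit if it lands
    inside the target bounding box (x, y, w, h)."""
    found_at = []
    for path, (tx, ty, tw, th) in zip(scanpaths, targets):
        hit = next((i + 1 for i, (x, y) in enumerate(path)
                    if tx <= x < tx + tw and ty <= y < ty + th), None)
        found_at.append(hit)
    return [np.mean([h is not None and h <= k for h in found_at])
            for k in range(1, max_fixations + 1)]

# Toy usage: target found on fixation 2 in trial 1, never in trial 2.
paths = [[(10, 10), (52, 48)], [(0, 0), (5, 5), (9, 9)]]
targets = [(50, 45, 10, 10), (80, 80, 10, 10)]
print(cumulative_search_performance(paths, targets, max_fixations=3))
```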