Vision
Locating What You Need: Towards Adapting Diffusion Models to OOD Concepts In-the-Wild
The recent large-scale text-to-image generative models have attained unprecedented performance, while people established adaptor modules like LoRA and DreamBooth to extend this performance to even more unseen concept tokens. However, we empirically find that this workflow often fails to accurately depict the out-of-distribution concepts. This failure is highly related to the low quality of training data. To resolve this, we present a framework called Controllable Adaptor Towards Out-of-Distribution Concepts (CATOD). Our framework follows the active learning paradigm which includes high-quality data accumulation and adaptor training, enabling a finer-grained enhancement of generative results. The aesthetics score and concept-matching score are two major factors that impact the quality of synthetic results. One key component of CATOD is the weighted scoring system that automatically balances between these two scores and we also offer comprehensive theoretical analysis for this point. Then, it determines how to select data and schedule the adaptor training based on this scoring system. The extensive results show that CATOD significantly outperforms the prior approaches with an 11.10 boost on the CLIP score and a 33.08% decrease on the CMMD metric.
Beyond Euclidean: Dual-Space Representation Learning for Weakly Supervised Video Violence Detection
While numerous Video Violence Detection (VVD) methods have focused on representation learning in Euclidean space, they struggle to learn sufficiently discriminative features, leading to weaknesses in recognizing normal events that are visually similar to violent events (i.e., ambiguous violence). In contrast, hyperbolic representation learning, renowned for its ability to model hierarchical and complex relationships between events, has the potential to amplify the discrimination between visually similar events. Inspired by these, we develop a novel Dual-Space Representation Learning (DSRL) method for weakly supervised VVD to utilize the strength of both Euclidean and hyperbolic geometries, capturing the visual features of events while also exploring the intrinsic relations between events, thereby enhancing the discriminative capacity of the features. DSRL employs a novel information aggregation strategy to progressively learn event context in hyperbolic spaces, which selects aggregation nodes through layer-sensitive hyperbolic association degrees constrained by hyperbolic Dirichlet energy. Furthermore, DSRL attempts to break the cyber-balkanization of different spaces, utilizing cross-space attention to facilitate information interactions between Euclidean and hyperbolic space to capture better discriminative features for final violence detection. Comprehensive experiments demonstrate the effectiveness of our proposed DSRL.
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Vision-language models (VLMs) have made significant progress in recent visualquestion-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using offthe-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with 10,000 human-verified VQA samples.
Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control
Research on video generation has recently made tremendous progress, enabling high-quality videos to be generated from text prompts or images. Adding control to the video generation process is an important goal moving forward and recent approaches that condition video generation models on camera trajectories make strides towards it. Yet, it remains challenging to generate a video of the same scene from multiple different camera trajectories. Solutions to this multi-video generation problem could enable large-scale 3D scene generation with editable camera trajectories, among other applications. We introduce collaborative video diffusion (CVD) as an important step towards this vision. The CVD framework includes a novel cross-video synchronization module that promotes consistency between corresponding frames of the same video rendered from different camera poses using an epipolar attention mechanism. Trained on top of a state-of-the-art camera-control module for video generation, CVD generates multiple videos rendered from different camera trajectories with significantly better consistency than baselines, as shown in extensive experiments.
MTGS: A Novel Framework for Multi-Person Temporal Gaze Following and Social Gaze Prediction
Gaze following and social gaze prediction are fundamental tasks providing insights into human communication behaviors, intent, and social interactions. Most previous approaches addressed these tasks separately, either by designing highly specialized social gaze models that do not generalize to other social gaze tasks or by considering social gaze inference as an ad-hoc post-processing of the gaze following task. Furthermore, the vast majority of gaze following approaches have proposed models that can handle only one person at a time and are static, therefore failing to take advantage of social interactions and temporal dynamics. In this paper, we address these limitations and introduce a novel framework to jointly predict the gaze target and social gaze label for all people in the scene. It comprises (i) a temporal, transformer-based architecture that, in addition to frame tokens, handles personspecific tokens capturing the gaze information related to each individual; (ii) a new dataset, VSGaze, built from multiple gaze following and social gaze datasets by extending and validating head detections and tracks, and unifying annotation types. We demonstrate that our model can address and benefit from training on all tasks jointly, achieving state-of-the-art results for multi-person gaze following and social gaze prediction. Our annotations and code will be made publicly available.
Coarse-to-fine Animal Pose and Shape Estimation: Supplementary Material
We conduct further ablation studies for our approach in this supplementary material, including comparison with test-time optimization and sensitivity analysis of the refinement stage. Additional qualitative results are also provided. We compare our coarse-to-fine approach with the testtime optimization approach. As has been done in our coarse-to-fine pipeline, we also use the output from our coarse estimation stage as an initialization. Instead of apply the mesh refinement GCN, we further optimize the SMAL parameters based on the keypoints and silhouettes for 10, 50, 100, 200 iterations, respectively.
Overleaf Example
Most existing animal pose and shape estimation approaches reconstruct animal meshes with a parametric SMAL model. This is because the low-dimensional pose and shape parameters of the SMAL model makes it easier for deep networks to learn the high-dimensional animal meshes. However, the SMAL model is learned from scans of toy animals with limited pose and shape variations, and thus may not be able to represent highly varying real animals well. This may result in poor fittings of the estimated meshes to the 2D evidences, e.g.
AROMA: Preserving Spatial Structure for Latent PDE Modeling with Local Neural Fields Louis Serrano 1 Thomas X Wang 1 Jean-Noël Vittaut 3
We present AROMA (Attentive Reduced Order Model with Attention), a framework designed to enhance the modeling of partial differential equations (PDEs) using local neural fields. Our flexible encoder-decoder architecture can obtain smooth latent representations of spatial physical fields from a variety of data types, including irregular-grid inputs and point clouds. This versatility eliminates the need for patching and allows efficient processing of diverse geometries. The sequential nature of our latent representation can be interpreted spatially and permits the use of a conditional transformer for modeling the temporal dynamics of PDEs. By employing a diffusion-based formulation, we achieve greater stability and enable longer rollouts compared to conventional MSE training. AROMA's superior performance in simulating 1D and 2D equations underscores the efficacy of our approach in capturing complex dynamical behaviors.
Self Supervised Learning by Cross Modal Audio Video Clustering Supplementary Material
In this section, we give the details of the full optimization cycle and discuss differences between the single-modality baseline and our multi-modal models. As discussed in [1], SDC may converge to trivial solutions, corresponding to empty clusters or encoder parameterizations, where the classifier predicts the same label regardless of the input. DeepCluster proposes workarounds to tackle these issues, involving reassigning empty cluster centers and sampling training images uniformly over the cluster assignments. While these strategies mitigate the issues, they do not fix the main cause of the problem: SDC learns a discriminative classifier on the same input from which it learns the labels. On the other hand, our multi-modal deep clustering models are less prone to trivial solutions because they learn the discriminative classifier on one modality and obtain the labels from a different modality.