Neural Information Processing Systems
DiffuBox: Refining 3D Object Detection with Point Diffusion
Katie Z Luo
Ensuring robust 3D object detection and localization is crucial for many applications in robotics and autonomous driving. Recent models, however, face difficulties in maintaining high performance when applied to domains with differing sensor setups or geographic locations, often resulting in poor localization accuracy due to domain shift. To overcome this challenge, we introduce a novel diffusion-based box refinement approach. This method employs a domain-agnostic diffusion model, conditioned on the LiDAR points surrounding a coarse bounding box, to simultaneously refine the box's location, size, and orientation. We evaluate this approach under various domain adaptation settings, and our results reveal significant improvements across different datasets, object classes and detectors.
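As a rough, illustrative sketch of this kind of LiDAR-conditioned box refinement (not the authors' implementation), the minimal PyTorch example below denoises a 7-D box parameterization (x, y, z, l, w, h, yaw) conditioned on pooled features of nearby points; the network, timestep embedding, and noise schedule are placeholder assumptions.

```python
# Hypothetical sketch of diffusion-based box refinement (not the DiffuBox code).
# A noise-prediction network, conditioned on LiDAR points near a coarse box,
# iteratively denoises the 7-D box parameters (x, y, z, l, w, h, yaw).
import torch
import torch.nn as nn

class BoxDenoiser(nn.Module):
    def __init__(self, point_dim=3, box_dim=7, hidden=128):
        super().__init__()
        self.point_enc = nn.Sequential(nn.Linear(point_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden))
        self.head = nn.Sequential(nn.Linear(hidden + box_dim + 1, hidden), nn.ReLU(),
                                  nn.Linear(hidden, box_dim))

    def forward(self, noisy_box, points, t):
        # Pool a permutation-invariant feature from the points around the box.
        ctx = self.point_enc(points).max(dim=1).values        # (B, hidden)
        t_feat = t.float().unsqueeze(-1) / 1000.0             # crude timestep embedding
        return self.head(torch.cat([ctx, noisy_box, t_feat], dim=-1))  # predicted noise

@torch.no_grad()
def refine_box(model, coarse_box, points, steps=50):
    """DDPM-style reverse pass, started from the coarse box estimate.
    The stochastic noise term is omitted for a deterministic sketch."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = torch.cumprod(1.0 - betas, dim=0)
    box = coarse_box.clone()
    for t in reversed(range(steps)):
        eps = model(box, points, torch.full((box.shape[0],), t))
        box = (box - betas[t] / (1 - alphas[t]).sqrt() * eps) / (1 - betas[t]).sqrt()
    return box

# Usage (untrained toy model, interface only): refine one coarse detection
# given 256 nearby LiDAR points.
model = BoxDenoiser()
refined = refine_box(model, torch.randn(1, 7), torch.randn(1, 256, 3))
```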
OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators
Allen Nie, Christina J. Yuan
Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance by leveraging historical interaction data collected from other policies. Evaluating a new policy online without a confident estimate of its performance can lead to costly, unsafe, or hazardous outcomes, especially in education and healthcare. Several OPE estimators have been proposed in the last decade, many of which have hyperparameters and require training. Unfortunately, it remains unclear how to choose the best OPE algorithm for each task and domain. In this paper, we propose a new algorithm that adaptively blends a set of OPE estimators for a given dataset using a statistical procedure, without relying on an explicit selection step. We prove that our estimator is consistent and satisfies several desirable properties for policy evaluation. Additionally, we demonstrate that, compared to alternative approaches, our estimator can be used to select higher-performing policies in healthcare and robotics. Our work improves the ease of use of a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
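The following is a hedged sketch of one way to re-weight and aggregate several OPE estimates from bootstrap replicates; the MSE proxy and closed-form weights are illustrative assumptions and not necessarily OPERA's exact statistical procedure.

```python
# Hypothetical sketch of re-weighted aggregation of OPE estimators (not OPERA's exact method).
# Given bootstrap replicates of each estimator, pick weights summing to one that minimize
# the estimated mean-squared error of the weighted combination.
import numpy as np

def blend_estimators(bootstrap_estimates, point_estimates):
    """bootstrap_estimates: (n_boot, n_estimators); point_estimates: (n_estimators,)."""
    # Approximate each estimator's error by its deviation from the consensus mean.
    consensus = bootstrap_estimates.mean()
    errors = bootstrap_estimates - consensus
    C = errors.T @ errors / len(bootstrap_estimates)   # estimated error second-moment matrix
    # Closed-form minimizer of w' C w subject to sum(w) = 1 (weights may be negative here;
    # a projection onto the simplex could be added if nonnegativity is desired).
    ones = np.ones(C.shape[0])
    w = np.linalg.solve(C + 1e-6 * np.eye(C.shape[0]), ones)
    w /= w.sum()
    return w, float(w @ point_estimates)

# Usage with three toy estimators (e.g., IS, weighted IS, doubly robust) and 200 bootstraps.
rng = np.random.default_rng(0)
boot = rng.normal(loc=[1.0, 1.1, 0.9], scale=[0.5, 0.2, 0.3], size=(200, 3))
weights, blended_value = blend_estimators(boot, boot.mean(axis=0))
```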
Private Attribute Inference from Images with Vision-Language Models
As large language models (LLMs) become ubiquitous in our daily tasks and digital interactions, associated privacy risks are increasingly in focus. While LLM privacy research has primarily focused on the leakage of model training data, it has recently been shown that LLMs can make accurate privacy-infringing inferences from previously unseen texts. With the rise of vision-language models (VLMs), capable of understanding both images and text, a key question is whether this concern transfers to the previously unexplored domain of benign images posted online. To answer this question, we compile an image dataset with human-annotated labels of the image owner's personal attributes. In order to understand the privacy risks posed by VLMs beyond traditional human attribute recognition, our dataset consists of images where the inferable private attributes do not stem from direct depictions of humans. On this dataset, we evaluate 7 state-of-the-art VLMs, finding that they can infer various personal attributes at up to 77.6% accuracy. Concerningly, we observe that accuracy scales with the general capabilities of the models, implying that future models can be misused as stronger inferential adversaries, establishing an imperative for the development of adequate defenses.
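A minimal sketch of the kind of evaluation loop such a study implies is shown below; `query_vlm` and the dataset fields are hypothetical stand-ins for illustration, not an API or schema from the paper.

```python
# Hedged sketch of measuring attribute-inference accuracy of a VLM on an annotated image set.
from typing import Callable, Dict, List

def evaluate_attribute_inference(
    samples: List[Dict],                      # each: {"image_path": ..., "attribute": ..., "label": ...}
    query_vlm: Callable[[str, str], str],     # hypothetical: (image_path, attribute) -> predicted value
) -> float:
    correct = 0
    for s in samples:
        prediction = query_vlm(s["image_path"], s["attribute"])
        correct += int(prediction.strip().lower() == s["label"].strip().lower())
    return correct / len(samples)             # top-1 accuracy against human-annotated labels

# Usage: accuracy = evaluate_attribute_inference(dataset, query_vlm=my_model_adapter)
```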
SF-V: Single Forward Video Generation Model
Diffusion-based video generation models have demonstrated remarkable success in obtaining high-fidelity videos through the iterative denoising process. However, these models require multiple denoising steps during sampling, resulting in high computational costs. In this work, we propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pretrained video diffusion models. We show that, through adversarial training, the multi-step video diffusion model, i.e., Stable Video Diffusion (SVD), can be trained to perform a single forward pass to synthesize high-quality videos, capturing both temporal and spatial dependencies in the video data. Extensive experiments demonstrate that our method achieves competitive generation quality with significantly reduced computational overhead for the denoising process (i.e., around 23× speedup compared with SVD and 6× speedup compared with existing works, with even better generation quality), paving the way for real-time video synthesis and editing. (Work done during an internship at Snap Inc.)
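A hedged sketch of adversarial one-step fine-tuning in this spirit is shown below; the toy generator and discriminator stand in for the pretrained SVD backbone and a spatio-temporal critic, and the hinge-loss setup is an illustrative assumption rather than the paper's training recipe.

```python
# Hedged sketch of adversarial single-step distillation on toy video latents (not SF-V's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C, H, W = 2, 8, 4, 16, 16                             # toy latent video shape
generator = nn.Conv3d(C, C, kernel_size=3, padding=1)       # stand-in for the diffusion UNet
discriminator = nn.Sequential(nn.Conv3d(C, 32, 3, padding=1), nn.ReLU(),
                              nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, 1))
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

for step in range(2):                                       # a couple of toy iterations
    real = torch.randn(B, C, T, H, W)                       # latents of real videos
    noise = torch.randn_like(real)

    # One forward pass of the generator produces the "single-step" sample.
    fake = generator(noise)

    # Discriminator update: hinge loss on real vs. generated videos.
    d_loss = (F.relu(1 - discriminator(real)).mean()
              + F.relu(1 + discriminator(fake.detach())).mean())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: fool the discriminator with its single-step output.
    g_loss = -discriminator(generator(noise)).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```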
Coordinated hippocampal-entorhinal replay as structural inference
Constructing and maintaining useful representations of sensory experience is essential for reasoning about one's environment. High-level associative (topological) maps can be useful for efficient planning and are easily constructed from experience. Conversely, embedding new experiences within a metric structure allows them to be integrated with existing ones and novel associations to be implicitly inferred. Neurobiologically, the synaptic associations between hippocampal place cells and entorhinal grid cells are thought to represent associative and metric structures, respectively. Learning the place-grid cell associations can therefore be interpreted as learning a mapping between these two spaces. Here, we show how this map could be constructed by probabilistic message passing through the hippocampal-entorhinal system, where messages are scheduled to reduce the propagation of redundant information. We propose that this offline inference corresponds to coordinated hippocampal-entorhinal replay during sharp wave ripples. Our results also suggest that the metric map will contain local distortions that reflect the inferred structure of the environment according to associative experience, explaining observed grid deformations.
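As a loose computational analogy (an assumption-laden toy, not the paper's model), the sketch below infers a metric layout from purely associative links via locally scheduled message passing: the most informative updates propagate first and near-redundant messages are deferred.

```python
# Hedged toy illustration: recover metric coordinates from an associative (adjacency) graph
# using prioritized, scheduled local updates. The update rule and schedule are assumptions.
import heapq
import numpy as np

def infer_metric_map(adjacency, dim=2, iters=200, tol=1e-4, seed=0):
    """adjacency: (N, N) matrix of associative links between 'place' states."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    pos = rng.normal(size=(n, dim))                    # initial metric estimates
    # Priority queue keyed by (negative) last observed change, so the most-changed
    # node propagates first and redundant messages are deferred.
    queue = [(-np.inf, i) for i in range(n)]
    heapq.heapify(queue)
    for _ in range(iters):
        if not queue:
            break
        _, i = heapq.heappop(queue)
        nbrs = np.flatnonzero(adjacency[i])
        if len(nbrs) == 0:
            continue
        # Message from each neighbour: "sit at unit distance from me, along your current direction".
        diffs = pos[i] - pos[nbrs]
        dists = np.linalg.norm(diffs, axis=1, keepdims=True) + 1e-9
        new = (pos[nbrs] + diffs / dists).mean(axis=0)
        change = np.linalg.norm(new - pos[i])
        pos[i] = new
        if change > tol:                               # only reschedule neighbours if informative
            for j in nbrs:
                heapq.heappush(queue, (-change, j))
    return pos

# Usage: a ring of 6 associatively linked states relaxes toward a consistent metric layout.
A = np.roll(np.eye(6), 1, axis=1) + np.roll(np.eye(6), -1, axis=1)
coords = infer_metric_map(A)
```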
Author Feedback (aa36c88c27650af3b9868b723ae15dfc-AuthorFeedback.pdf)
We thank all reviewers for their time and valuable comments. We thank this reviewer for the positive feedback!
"The theoretical sample complexity is not significantly improved over previously-known methods." The main contribution of our paper is to show that an existing and popular algorithm (i.e., group-sparse regularized [...]). We view the sample complexity improvement in the dependence on k as a side benefit of our analysis.
"It would be interesting to see a more thorough empirical evaluation, to compare with the interaction screening [...]" The main contribution of our paper is theoretical. Our graph has a diamond shape (Figure 1 of our paper), 10 variables, and edge weight 0.2. This observation is actually the starting point of our paper. We will include this discussion in our paper.
"The presentation is quite technical... the Ising case seems to be enough to introduce the main idea... but a lot of [...]" For learning non-binary graphical models, we see a benefit of using the group-sparse (i.e., the l [...]).
"Experiments are only presented for rather small examples (up to 14 variables, up to k = 6)."
DA-Ada: Learning Domain-Aware Adapter for Domain Adaptive Object Detection
Domain adaptive object detection (DAOD) aims to generalize detectors trained on an annotated source domain to an unlabelled target domain. Since visual-language models (VLMs) can provide essential general knowledge on unseen images, freezing the visual encoder and inserting a domain-agnostic adapter can learn domain-invariant knowledge for DAOD. However, the domain-agnostic adapter is inevitably biased toward the source domain.
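The snippet below is a minimal, generic sketch of inserting a trainable bottleneck adapter alongside a frozen encoder block; the module names and bottleneck design are illustrative assumptions, not DA-Ada's architecture.

```python
# Hedged sketch: keep the VLM's visual encoder frozen and train only a small residual adapter.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))     # residual, so the frozen path is preserved

class FrozenEncoderWithAdapter(nn.Module):
    def __init__(self, encoder_block, dim):
        super().__init__()
        self.block = encoder_block
        for p in self.block.parameters():
            p.requires_grad = False                    # the VLM encoder stays frozen
        self.adapter = BottleneckAdapter(dim)          # only the adapter receives gradients

    def forward(self, x):
        return self.adapter(self.block(x))

# Usage with a toy "encoder block" of width 256 applied to token features.
layer = FrozenEncoderWithAdapter(nn.Linear(256, 256), dim=256)
out = layer(torch.randn(8, 196, 256))
```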
From Dictionary to Tensor: A Scalable Multi-View Subspace Clustering Framework with Triple Information Enhancement
Zhibin Gu (College of Computer and Cyber Security, Hebei Normal University, China)
While Tensor-based Multi-view Subspace Clustering (TMSC) has garnered significant attention for its capacity to effectively capture high-order correlations among multiple views, current TMSC methods suffer from three notable limitations: 1) high computational complexity and reliance on dictionary completeness resulting from using the observed data as the dictionary, 2) inaccurate subspace representation stemming from overlooking local geometric information, and 3) under-penalization of noise-related singular values within tensor data caused by treating all singular values equally. To address these limitations, this paper presents a Scalable TMSC framework with Triple infOrmatioN Enhancement (STONE). Notably, an enhanced anchor dictionary learning mechanism is utilized to recover the low-rank anchor structure, resulting in reduced computational complexity and increased resilience, especially in scenarios with inadequate dictionaries. Additionally, we introduce an anchor hypergraph Laplacian regularizer to preserve the inherent geometry of the data within the subspace representation. Simultaneously, an improved hyperbolic tangent function is employed as a precise approximation of tensor rank, effectively capturing the significant variations in singular values. Extensive experiments on a variety of datasets show that STONE outperforms state-of-the-art approaches in both effectiveness and efficiency.
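The following sketch illustrates a tanh-based rank surrogate and the corresponding reweighted singular-value shrinkage on a single matrix; the exact surrogate, weighting, and solver in STONE may differ, and `gamma` and `tau` are illustrative hyperparameters.

```python
# Hedged sketch of a hyperbolic-tangent rank surrogate and reweighted singular-value shrinkage.
import numpy as np

def tanh_rank_surrogate(X, gamma=5.0):
    """Approximate rank(X) by sum(tanh(gamma * sigma_i)): close to 1 for large singular
    values and close to 0 for noise-level ones, unlike the nuclear norm which treats
    all singular values equally."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.tanh(gamma * s).sum()

def reweighted_svt(X, tau=0.5, gamma=5.0):
    """One step of weighted singular-value thresholding: the weight is the surrogate's
    derivative, so large (signal) singular values are shrunk less than small (noise) ones."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    weights = gamma * (1.0 - np.tanh(gamma * s) ** 2)
    s_shrunk = np.maximum(s - tau * weights, 0.0)
    return (U * s_shrunk) @ Vt

# Usage: a noisy low-rank matrix is denoised while its dominant singular values survive.
rng = np.random.default_rng(0)
L = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 50))
X_hat = reweighted_svt(L + 0.1 * rng.normal(size=(50, 50)))
```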