

Image Understanding Makes for A Good Tokenizer for Image Generation

Neural Information Processing Systems

Modern image generation (IG) models have been shown to capture rich semantics valuable for image understanding (IU) tasks. However, the potential of IU models to improve IG performance remains uncharted. We address this issue using a token-based IG framework, which relies on effective tokenizers to map images into token sequences. Currently, pixel reconstruction (e.g., VQGAN) dominates the training objective for tokenizers. In contrast, our approach adopts the feature reconstruction objective, where tokenizers are trained by distilling knowledge from pretrained IU encoders. Comprehensive comparisons indicate that tokenizers with strong IU capabilities achieve superior IG performance across a variety of metrics, datasets, tasks, and proposal networks.
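The two ingredients the abstract names, VQ-style tokenization and a feature-reconstruction (distillation) objective, can be sketched minimally in NumPy. This is an illustrative assumption of the setup, not the paper's exact loss: `quantize` is a plain nearest-neighbor VQ step, and the distillation term is a simple L2 regression onto a pretrained IU encoder's features.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbor vector quantization: map each latent vector to
    its closest codebook entry (the standard VQ tokenization step)."""
    # z: (N, D) latents, codebook: (K, D) entries
    dist = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    idx = dist.argmin(axis=1)
    return codebook[idx], idx

def feature_reconstruction_loss(decoded, teacher_features):
    """Feature-reconstruction objective: instead of matching pixels, the
    tokenizer output is regressed onto features from a pretrained image
    understanding encoder (here a simple L2 distillation loss)."""
    return float(np.mean((decoded - teacher_features) ** 2))
```

Under this sketch, a pixel-reconstruction tokenizer would regress onto raw pixels at the same point where `feature_reconstruction_loss` regresses onto teacher features; everything else in the pipeline stays unchanged.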


Graph Contrastive Learning with Augmentations (Appendix)

Neural Information Processing Systems

We evaluate our proposed framework with different augmentation pairs in the semi-supervised learning setting on graph classification [1] via pre-training & finetuning, where pre-training is performed for 100 epochs with a 0.001 learning rate, and finetuning follows the 10-fold evaluation protocol in [2], which achieves comparable SOTA performance in the fully-supervised setting. A graph convolutional network (GCN) is adopted as the GNN-based encoder, also following [2]. Experiments are performed with 1% (when there are over 10 samples per class) and 10% label rates, repeated 5 times, with the mean and standard deviation of accuracy (%) reported. Superpixel graphs (statistics in Table S1) benefit from all augmentations except attribute masking, as shown in Figure S1. Node dropping corresponds to pixel discarding and subgraph sampling to cropping, both of which have already been shown to be useful augmentations for images [3].
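The two augmentations singled out above can be sketched on an adjacency matrix. This is a minimal illustration under assumed conventions (dense symmetric adjacency, no node features); the function names and the breadth-first subgraph strategy are illustrative choices, not necessarily the paper's exact implementation.

```python
import numpy as np

def drop_nodes(adj, drop_ratio=0.2, rng=None):
    """Node dropping: remove a random fraction of nodes and their incident
    edges -- the graph analogue of discarding pixels in an image."""
    rng = np.random.default_rng() if rng is None else rng
    n = adj.shape[0]
    n_keep = max(1, int(round(n * (1 - drop_ratio))))
    keep = np.sort(rng.choice(n, size=n_keep, replace=False))
    return adj[np.ix_(keep, keep)], keep

def subgraph(adj, start, n_keep):
    """Subgraph sampling via breadth-first expansion from a start node --
    the graph analogue of cropping a contiguous image region."""
    frontier, visited = [start], {start}
    while frontier and len(visited) < n_keep:
        node = frontier.pop(0)
        for nb in np.nonzero(adj[node])[0]:
            if nb not in visited and len(visited) < n_keep:
                visited.add(int(nb))
                frontier.append(int(nb))
    keep = np.array(sorted(visited))
    return adj[np.ix_(keep, keep)], keep
```

Applying either function to one graph of a contrastive pair (and a different augmentation to the other) yields the two correlated views that the framework contrasts during pre-training.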



3fe230348e9a12c13120749e3f9fa4cd-AuthorFeedback.pdf

Neural Information Processing Systems

We greatly appreciate all reviewers' suggestions. Below please find our responses to the specific comments. We may instead explore some newer perspectives. Q: Strengthen the writing of the related work, experiment settings, and broader impact. A: We will revise to improve the content and structure as suggested.


Disentangling the Roles of Curation, Data-Augmentation and the Prior in the Cold Posterior Effect
Kevin Roth

Neural Information Processing Systems

The "cold posterior effect" (CPE) in Bayesian deep learning describes the unsettling observation that the predictive performance of Bayesian neural networks can be significantly improved if the Bayes posterior is artificially sharpened using a temperature parameter T < 1. The CPE is problematic in theory and practice, and since the effect was identified, many researchers have proposed hypotheses to explain the phenomenon. However, despite this intensive research effort, the effect remains poorly understood. In this work we provide novel and nuanced evidence relevant to existing explanations for the cold posterior effect, disentangling three hypotheses: 1. The dataset curation hypothesis of Aitchison (2020): we show empirically that the CPE does not arise in a real curated dataset but can be produced in a controlled experiment with varying curation strength.
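The tempering operation at the heart of the CPE is simple to state: raise the posterior to the power 1/T and renormalize, so that T < 1 sharpens it. A minimal log-space sketch over a discrete distribution:

```python
import numpy as np

def temper(log_probs, T):
    """Posterior tempering: raise a (discrete) posterior to the power 1/T
    and renormalize. T < 1 gives the 'cold', sharpened posterior; T = 1
    recovers the ordinary Bayes posterior."""
    scaled = np.asarray(log_probs, dtype=float) / T
    scaled -= scaled.max()          # subtract max for numerical stability
    p = np.exp(scaled)
    return p / p.sum()
```

For example, tempering a two-point posterior [0.6, 0.4] at T = 0.5 squares the probabilities before renormalizing, pushing more mass onto the already-favored point.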


Spatial Pruned Sparse Convolution for Efficient 3D Object Detection
Xiaoqing Ye

Neural Information Processing Systems

In this paper, we analyze the major components of existing sparse 3D CNNs and find that they ignore data redundancy and further amplify it during down-sampling, incurring a huge amount of extra and unnecessary computational overhead. Inspired by this, we propose a new convolution operator named spatial pruned sparse convolution (SPS-Conv), which includes two variants, spatial pruned submanifold sparse convolution (SPSS-Conv) and spatial pruned regular sparse convolution (SPRS-Conv), both based on the idea of dynamically determining crucial areas for redundancy reduction. We validate that feature magnitude can serve as an important cue for identifying crucial areas, avoiding the extra computation of learning-based methods. The proposed modules can be easily incorporated into existing sparse 3D CNNs without extra architectural modifications. Extensive experiments on the KITTI, Waymo and nuScenes datasets demonstrate that our method achieves more than a 50% reduction in GFLOPs without compromising performance. Code and models are available at this link.
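The magnitude-as-importance idea can be sketched in a few lines: score each active voxel by its feature magnitude and keep only the top fraction, with no learned predictor involved. This is an illustrative NumPy sketch on a flat (N, C) feature array, not the paper's sparse-convolution implementation.

```python
import numpy as np

def magnitude_prune_mask(features, keep_ratio=0.5):
    """Magnitude-guided spatial pruning: treat per-voxel feature magnitude
    (L1 norm over channels) as an importance cue and keep only the top
    keep_ratio fraction of voxels, avoiding a learned importance module."""
    mag = np.abs(features).sum(axis=1)          # (N,) magnitude per active voxel
    k = max(1, int(len(mag) * keep_ratio))
    thresh = np.partition(mag, -k)[-k]          # k-th largest magnitude
    return mag >= thresh
```

In the SPS-Conv setting, the convolution would then be evaluated (or dilated) only at voxels where the mask is true, which is where the GFLOPs savings come from.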


Solving Minimum-Cost Reach Avoid using Reinforcement Learning

Neural Information Processing Systems

Current reinforcement-learning methods are unable to directly learn policies for the minimum-cost reach-avoid problem, which minimizes cumulative cost subject to the constraints of reaching the goal and avoiding unsafe states, because the structure of this optimization problem is incompatible with existing methods. Instead, a surrogate problem is solved in which all objectives are combined via a weighted sum. However, this surrogate objective yields suboptimal policies that do not directly minimize the cumulative cost. In this work, we propose RC-PPO, a reinforcement-learning-based method for solving the minimum-cost reach-avoid problem using connections to Hamilton-Jacobi reachability. Empirical results demonstrate that RC-PPO learns policies with goal-reaching rates comparable to those of existing methods while achieving up to 57% lower cumulative cost on a suite of minimum-cost reach-avoid benchmarks in the MuJoCo simulator. The project page can be found at https://oswinso.xyz/rcppo/.
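The constrained objective described above can be stated as a trajectory-level evaluation: cumulative cost is only meaningful for trajectories that reach the goal while staying safe, and infeasible trajectories are assigned infinite cost. A minimal sketch of that objective (not of RC-PPO itself, whose Hamilton-Jacobi machinery the abstract does not detail):

```python
def reach_avoid_cost(step_costs, reached_goal, hit_unsafe):
    """Minimum-cost reach-avoid objective: sum the per-step costs of a
    trajectory, but only if it reached the goal without ever entering an
    unsafe state; otherwise the trajectory is infeasible (infinite cost)."""
    if hit_unsafe or not reached_goal:
        return float("inf")
    return float(sum(step_costs))
```

A weighted-sum surrogate would instead blend cost, goal, and safety terms into one scalar reward, which is exactly the relaxation the abstract argues leads to suboptimal policies.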


main

Neural Information Processing Systems

It is generally recognized that a finite learning rate (LR), in contrast to an infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Itô Stochastic Differential Equations (SDEs), but formal justification for this approximation (e.g., Li et al., 2019a) only applies to SGD with tiny LR. Experimental verification of the approximation appears computationally infeasible. The current paper clarifies the picture with the following contributions: (a) an efficient simulation algorithm, SVAG, that provably converges to the conventionally used Itô SDE approximation.


Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment

Neural Information Processing Systems

Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-optimal cross-modal alignment by over-emphasizing text tokens that are less correlated with, or even contradictory to, the input images. In this paper, we advocate assigning distinct contributions to each text token based on its visual correlation. Specifically, we show that by contrasting image inputs, the difference in prediction logits on each text token provides strong guidance on its visual correlation. We therefore introduce Contrastive ALignment (CAL), a simple yet effective re-weighting strategy that prioritizes visually correlated tokens. Our experimental results demonstrate that CAL consistently improves different types of VLMs across different resolutions and model sizes on various benchmarks. Importantly, our method incurs minimal additional computational overhead, rendering it highly efficient compared to alternative data-scaling strategies.
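The contrastive signal described above can be sketched directly: compare each ground-truth token's logit under image-conditioned and image-free predictions, and use the (clipped, normalized) gap as that token's training weight. The clipping and normalization choices here are illustrative assumptions, not necessarily CAL's exact formulation.

```python
import numpy as np

def cal_token_weights(logits_with_image, logits_no_image, labels):
    """Contrastive re-weighting sketch: the gap in the ground-truth token's
    logit between image-conditioned and image-free predictions serves as
    its visual-correlation score. Tokens whose likelihood the image does
    not raise get weight zero; weights are normalized to sum to one."""
    idx = np.arange(len(labels))
    gap = logits_with_image[idx, labels] - logits_no_image[idx, labels]
    w = np.clip(gap, 0.0, None)        # ignore negatively correlated tokens
    return w / (w.sum() + 1e-8)
```

These weights would then scale the per-token autoregressive loss, so that visually grounded tokens dominate the alignment objective while image-independent ones contribute little.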