Overfitting Can Be Harmless for Basis Pursuit, But Only to a Degree

Neural Information Processing Systems

Recently, there has been significant interest in studying the so-called "double-descent" of the generalization error of linear regression models under the overparameterized and overfitting regime, with the hope that such analysis may provide the first step towards understanding why overparameterized deep neural networks (DNNs) still generalize well.


A Basic Functions

Neural Information Processing Systems

Each question in PTR is associated with a functional program built from a set of basic functions.

A.1 Data Types

Our basic functional building blocks operate on values of the following types:

- Object: A single object in the scene.
- Boolean: Yes or No.
- Value Types:
  - Object Category: Chair, Bed, Table, Refrigerator, Cart
  - Part Category: arm, leg, back, seat, central support pedestal, leg bar, wheel, arm vertical bar, arm horizontal bar, door, sleep area, top, drawer, shelf, body
  - Color: gray, red, blue, green, brown, purple, cyan, yellow
  - Stability: Stable, Unstable
  - Possible Change: to_left, to_right, to_front, to_behind
- Spatial Relationship: left, right, in front of, behind, above, below
- Geometric Relationship: line-line perpendicular, line-line parallel, plane-plane perpendicular, plane-plane parallel, line-plane perpendicular, line-plane parallel

A.2 Object-Level Functions

Object-level functions focus on object-level reasoning, and are listed in Table 3.

A.3 Part-Level Functions

Since concepts, attributes and relationships are defined on the semantic level rather than the instance level, we do not use a single Part type. Rather, we use PartSet to denote both a set of parts with the same semantics and a set of parts with different semantics. A dictionary keeps the correspondence between objects and parts (e.g., {obj0: [part0, part1, part2], obj1: [part3, part4]...}), which facilitates hierarchical reasoning.
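The object-to-part dictionary described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `PartSet`, `parts_of`, `owner_of`, and `merge_part_sets` are hypothetical helpers and are not taken from the PTR code, and only two of the value types are written out as enums.

```python
from enum import Enum
from typing import Dict, List, Optional


class ObjectCategory(Enum):
    # Object categories listed in A.1.
    CHAIR = "Chair"
    BED = "Bed"
    TABLE = "Table"
    REFRIGERATOR = "Refrigerator"
    CART = "Cart"


class Stability(Enum):
    # Stability values listed in A.1.
    STABLE = "Stable"
    UNSTABLE = "Unstable"


# Assumption: a PartSet is modeled as a plain dict from object id to part ids.
PartSet = Dict[str, List[str]]


def parts_of(part_set: PartSet, obj_id: str) -> List[str]:
    """Return a copy of the parts recorded for one object (empty if unknown)."""
    return list(part_set.get(obj_id, []))


def owner_of(part_set: PartSet, part_id: str) -> Optional[str]:
    """Return the object that owns a part, or None if the part is not present."""
    for obj_id, parts in part_set.items():
        if part_id in parts:
            return obj_id
    return None


def merge_part_sets(a: PartSet, b: PartSet) -> PartSet:
    """Union two part sets per object, keeping first-seen order and no duplicates."""
    merged: PartSet = {obj_id: list(parts) for obj_id, parts in a.items()}
    for obj_id, parts in b.items():
        target = merged.setdefault(obj_id, [])
        for part in parts:
            if part not in target:
                target.append(part)
    return merged
```

Keeping the owning object as the dictionary key means a part-level result can always be traced back to its object, which is the hierarchical lookup the text refers to.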


Lookback Prophet Inequalities

Neural Information Processing Systems

Prophet inequalities are fundamental optimal stopping problems, where a decision-maker sequentially observes items with values sampled independently from known distributions and must decide, at each new observation, whether to stop and gain the current value or to reject it irrevocably and move to the next step. This model is often too pessimistic and does not adequately represent real-world online selection processes. In practice, rejected items can potentially be revisited and a fraction of their value recovered.
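To make the lookback idea concrete, the sketch below evaluates a simple threshold stopping rule on one sequence of observed values. It rests on assumptions that are not stated in the abstract: values are non-negative, a rejected item can be recovered at a fixed fraction `gamma` of its value regardless of how long ago it was seen, and the decision-maker is forced to stop at the last step. The function name `lookback_reward` is hypothetical.

```python
from typing import Sequence


def lookback_reward(values: Sequence[float], threshold: float, gamma: float) -> float:
    """Reward of a threshold rule when rejected items keep a fraction of their value.

    The rule stops at the first value that reaches `threshold`, or at the last
    step if none does. On stopping, it collects the better of the current value
    and `gamma` times the best previously rejected value.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")

    best_rejected = 0.0  # assumes non-negative values
    last_index = len(values) - 1
    for index, value in enumerate(values):
        if value >= threshold or index == last_index:
            return max(value, gamma * best_rejected)
        best_rejected = max(best_rejected, value)
    raise AssertionError("unreachable: the loop always returns at the last step")
```

Setting `gamma = 0` recovers the classical setting in which rejections are irrevocable, and `gamma = 1` lets the decision-maker collect the maximum of everything seen so far.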


Supplementary Materials for Equivariant Graph Hierarchy-Based Neural Networks

Neural Information Processing Systems

A.1 Proof of Theorem 1

Theorem 1. EMMP can reduce to EGNN and GMN by specific choices of MLP in Eq. 3.

Proof. We first prove that EMMP can reduce to GMN. [ ] This theorem basically implies that the expressivity of our EMMP is stronger than that of GMN or EGNN.

A.2 Proof of Theorem 2

Theorem 2. EMMP, E-Pool, and E-UnPool are all E(n)-equivariant.

We proceed with the proof step by step, following the definition of EMMP in Eq. 3-6. Indeed, with Theorem 2 we immediately have that any cascade of EMMP, E-Pool, and E-UnPool is also E(n)-equivariant.

For the baselines, we leverage the codebases maintained by [4]. We tune the hyper-parameters around the suggested values specified in [4] and [6].
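The equivariance property claimed in Theorem 2 can be checked numerically on a toy layer. The sketch below does not implement EMMP, E-Pool, or E-UnPool, whose definitions are not reproduced here; it uses a simplified coordinate update (each point moves along its differences to the other points, weighted by a function of squared distance) as a stand-in, and it only tests rotations and translations in 2D. The names `toy_update` and `rigid_motion` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = List[float]


def toy_update(coords: Sequence[Sequence[float]]) -> List[Point]:
    """Move each point along its differences to all other points.

    The weight depends only on the squared distance, so it is unchanged by
    rotations and translations, while the difference vector rotates with the input.
    """
    updated = []
    for i, xi in enumerate(coords):
        new_point = list(xi)
        for j, xj in enumerate(coords):
            if i == j:
                continue
            diff = [a - b for a, b in zip(xi, xj)]
            weight = 1.0 / (1.0 + sum(d * d for d in diff))
            new_point = [p + weight * d for p, d in zip(new_point, diff)]
        updated.append(new_point)
    return updated


def rigid_motion(coords: Sequence[Sequence[float]], angle: float,
                 shift: Tuple[float, float]) -> List[Point]:
    """Rotate 2D points by `angle` (radians) and then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * x - s * y + shift[0], s * x + c * y + shift[1]] for x, y in coords]


def max_equivariance_gap(coords: Sequence[Sequence[float]], angle: float,
                         shift: Tuple[float, float]) -> float:
    """Largest coordinate difference between update(g(x)) and g(update(x))."""
    moved_then_updated = toy_update(rigid_motion(coords, angle, shift))
    updated_then_moved = rigid_motion(toy_update(coords), angle, shift)
    return max(abs(a - b)
               for p, q in zip(moved_then_updated, updated_then_moved)
               for a, b in zip(p, q))
```

A gap close to machine precision indicates that the layer commutes with the tested rigid motion. Because a composition of maps that each commute with the motion also commutes with it, the same check passes for any stack of such layers, which mirrors the cascade remark above.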


Unveiling and Mitigating Backdoor Vulnerabilities based on Unlearning Weight Changes and Backdoor Activeness Shaokui Wei

Neural Information Processing Systems

The security threat of backdoor attacks is a central concern for deep neural networks (DNNs). Recently, without poisoned data, unlearning models with clean data and then learning a pruning mask have contributed to backdoor defense. Additionally, vanilla fine-tuning with those clean data can help recover the lost clean accuracy. However, the behavior of clean unlearning is still under-explored, and vanilla fine-tuning unintentionally reintroduces the backdoor effect. In this work, we first investigate model unlearning from the perspective of weight changes and gradient norms, and make two interesting observations about the backdoored model: 1) the weight changes between poison and clean unlearning are positively correlated, making it possible for us to identify the backdoor-related neurons without using poisoned data; 2) the neurons of the backdoored model are more active (i.e., have a larger gradient norm) than those in the clean model, suggesting the need to suppress the gradient norm during fine-tuning. Then, we propose an effective two-stage defense method. In the first stage, an efficient Neuron Weight Change (NWC)-based Backdoor Reinitialization is proposed based on observation 1). In the second stage, based on observation 2), we design an Activeness-Aware Fine-Tuning to replace the vanilla fine-tuning. Extensive experiments, involving eight backdoor attacks on three benchmark datasets, demonstrate the superior performance of our proposed method compared to recent state-of-the-art backdoor defense approaches.


Deep Transformation-Invariant Clustering

Neural Information Processing Systems

Recent advances in image clustering typically focus on learning better deep representations. In contrast, we present an orthogonal approach that does not rely on abstract features but instead learns to predict transformations and performs clustering directly in pixel space. This learning process naturally fits in the gradient-based training of K-means and Gaussian mixture model, without requiring any additional loss or hyper-parameters. It leads us to two new deep transformation-invariant clustering frameworks, which jointly learn prototypes and transformations. More specifically, we use deep learning modules that enable us to resolve invariance to spatial, color and morphological transformations. Our approach is conceptually simple and comes with several advantages, including the possibility to easily adapt the desired invariance to the task and a strong interpretability of both cluster centers and assignments to clusters. We demonstrate that our novel approach yields competitive and highly promising results on standard image clustering benchmarks. Finally, we showcase its robustness and the advantages of its improved interpretability by visualizing clustering results over real photograph collections.


Deep Transformation-Invariant Clustering

Neural Information Processing Systems

We thank the reviewers (Rs) for their positive feedback. If accepted, we will incorporate all feedback in the final version. As stated in L160-162, we follow STN [34] to model the spatial transformations. We will make this more explicit. To ensure complete reproducibility, we will release code, data and models. Our results indeed depend on initialization; we will make this more explicit.


4a3a96231b8240f11483afd196227278-Paper-Conference.pdf

Neural Information Processing Systems

We propose the new task 'open-world video instance segmentation and captioning'. It requires detecting, segmenting, tracking, and describing with rich captions never-before-seen objects. This challenging task can be addressed by developing "abstractors" which connect a vision model and a language foundation model. Concretely, we connect a multi-scale visual feature extractor and a large language model (LLM) by developing an object abstractor and an object-to-text abstractor. The object abstractor, consisting of a prompt encoder and transformer blocks, introduces spatially diverse open-world object queries to discover never-before-seen objects in videos. An inter-query contrastive loss further encourages the diversity of object queries. The object-to-text abstractor is augmented with masked cross-attention and acts as a bridge between the object queries and a frozen LLM to generate rich and descriptive object-centric captions for each detected object. Our generalized approach surpasses the baseline that jointly addresses the tasks of open-world video instance segmentation and dense video object captioning by 13% on never-before-seen objects, and by 10% on object-centric captions.


Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation

Neural Information Processing Systems

In the past few years, transformers have achieved promising performance on various computer vision tasks. Unfortunately, the immense inference overhead of most existing vision transformers prevents them from being deployed on edge devices such as cell phones and smart watches. Knowledge distillation is a widely used paradigm for compressing cumbersome architectures into compact students by transferring information. However, most existing distillation methods are designed for convolutional neural networks (CNNs) and do not fully exploit the characteristics of vision transformers. In this paper, we fully utilize the patch-level information and propose a fine-grained manifold distillation method for transformer-based networks. Specifically, we train a tiny student model to match a pre-trained teacher model in the patch-level manifold space. Then, we decouple the manifold matching loss into three terms with careful design to further reduce the computational costs for the patch relationship. Equipped with the proposed method, a DeiT-Tiny model containing 5M parameters achieves 76.5% top-1 accuracy on ImageNet-1k, which is +2.0%
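As a rough illustration of matching a student to a teacher in a patch-level manifold space, the sketch below compares pairwise cosine-similarity matrices of patch embeddings. This is an assumed, simplified form of the idea and not the paper's loss: the three-term decoupling mentioned above is not reproduced, and the names `relation_matrix` and `manifold_matching_loss` are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def _normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def relation_matrix(patches: Sequence[Sequence[float]]) -> Matrix:
    """Pairwise cosine similarities between patch embeddings (N x N)."""
    unit = [_normalize(p) for p in patches]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def manifold_matching_loss(student: Sequence[Sequence[float]],
                           teacher: Sequence[Sequence[float]]) -> float:
    """Squared Frobenius distance between student and teacher relation matrices.

    Only the number of patches must agree; the embedding widths may differ,
    because the comparison happens between the two N x N relation matrices.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must have the same number of patches")
    rel_s = relation_matrix(student)
    rel_t = relation_matrix(teacher)
    n = len(student)
    return sum((rel_s[i][j] - rel_t[i][j]) ** 2 for i in range(n) for j in range(n))
```

The full relation matrix grows quadratically with the number of patches, which gives some intuition for why reducing the cost of the patch relationship matters.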


MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models Gongwei Chen

Neural Information Processing Systems

Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, a generalist MLLM typically underperforms compared with a specialist MLLM on most VL tasks, which can be attributed to task interference. In this paper, we propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM. Our MoME is composed of two key components, a mixture of vision experts (MoVE) and a mixture of language experts (MoLE). MoVE can adaptively modulate the features transformed from various vision encoders, and has a strong compatibility in transformation architecture. MoLE incorporates sparsely gated experts into LLMs to achieve painless improvements with roughly unchanged inference costs. In response to task interference, our MoME specializes in both vision and language modality to adapt to task discrepancies. Extensive experiments show that MoME significantly improves the performance of generalist MLLMs across various VL tasks.