What Can Transformers Learn In-Context? A Case Study of Simple Function Classes

Neural Information Processing Systems

In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs from some task) along with a new query input, and generate the corresponding output. Crucially, in-context learning happens only at inference time, without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning, it is unclear how the tasks on which this succeeds relate to what is present in the training data. To make progress towards understanding in-context learning, we consider the well-defined problem of training a model to in-context learn a function class (e.g., linear functions): given data derived from some functions in the class, can we train a model to in-context learn "most" functions from this class? We show empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions: that is, the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least squares estimator. In fact, in-context learning is possible even under two forms of distribution shift: (i) between the training data of the model and inference-time prompts; (ii) between the in-context examples and the query input during inference. We also show that we can train Transformers to in-context learn more complex function classes, namely sparse linear functions, two-layer neural networks, and decision trees, with performance that matches or exceeds task-specific learning algorithms.
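To make the setup concrete, here is a minimal sketch (our illustration, not the authors' code) of sampling one linear-function task, building its in-context examples, and computing the optimal least-squares baseline against which the trained model is compared; the Transformer, which would consume the same `(X, y, x_query)` prompt, is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 20                        # input dimension, number of in-context examples

w = rng.standard_normal(d)          # an unseen linear function f(x) = w @ x
X = rng.standard_normal((n, d))     # in-context inputs
y = X @ w                           # in-context outputs
x_query = rng.standard_normal(d)    # the new query input

# Least-squares estimator fitted on the in-context examples only; a trained
# Transformer would instead read (X, y, x_query) as a prompt and emit its
# own prediction, which is scored against this baseline.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"least-squares prediction: {w_hat @ x_query:.4f}")
print(f"ground truth:             {w @ x_query:.4f}")
```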


VCR-GauS: View Consistent Depth-Normal Regularizer for Gaussian Surface Reconstruction

Neural Information Processing Systems

Although 3D Gaussian Splatting has been widely studied because of its realistic and efficient novel-view synthesis, it remains challenging to extract a high-quality surface from the point-based representation. Previous works improve the surface by incorporating geometric priors from an off-the-shelf normal estimator. However, there are two main limitations: 1) supervising normals rendered from 3D Gaussians effectively updates the rotation parameter but is less effective for the other geometric parameters; 2) the inconsistency of predicted normal maps across multiple views may lead to severe reconstruction artifacts. In this paper, we propose a Depth-Normal regularizer that directly couples normals with the other geometric parameters, leading to full updates of the geometric parameters from normal regularization. We further propose a confidence term to mitigate inconsistencies of normal predictions across multiple views. Moreover, we introduce a densification and splitting strategy to regularize the size and distribution of 3D Gaussians for more accurate surface modeling. Experiments show that, compared with Gaussian-based baselines, our approach achieves better reconstruction quality and maintains competitive appearance quality, with faster training speed and 100+ FPS rendering.
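As a rough illustration of the depth-normal coupling described above, the sketch below (hypothetical; the function names, intrinsics handling, and loss form are our assumptions, not the released code) derives a normal map from a rendered depth map, so that a normal loss back-propagates through depth into all geometric parameters, and weights disagreement with the predicted normal prior by a per-pixel confidence.

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth, fx, fy):
    """Derive a per-pixel normal map from a rendered depth map (H, W):
    back-project pixels to camera space and take the cross product of the
    spatial gradients. Borders and full intrinsics are simplified."""
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    X = (xs - W / 2) / fx * depth                       # camera-space x
    Y = (ys - H / 2) / fy * depth                       # camera-space y
    pts = torch.stack([X, Y, depth], dim=-1)            # (H, W, 3) points
    dx = pts[:, 1:] - pts[:, :-1]                       # horizontal differences
    dy = pts[1:, :] - pts[:-1, :]                       # vertical differences
    n = torch.cross(dx[:-1], dy[:, :-1], dim=-1)        # (H-1, W-1, 3)
    return F.normalize(n, dim=-1)

def depth_normal_loss(depth, pred_normals, confidence, fx, fy):
    """Confidence-weighted disagreement between depth-derived normals and the
    monocular normal prior; low-confidence pixels (where the prior is
    inconsistent across views) contribute less."""
    n_d = normals_from_depth(depth, fx, fy)
    n_p = pred_normals[:-1, :-1]                        # crop prior to matching size
    c = confidence[:-1, :-1]
    return (c * (1.0 - (n_d * n_p).sum(-1))).mean()
```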


LG-VQ: Language-Guided Codebook Learning

Neural Information Processing Systems

Vector quantization (VQ) is a key technique in high-resolution and high-fidelity image synthesis, which aims to learn a codebook to encode an image as a sequence of discrete codes and then generate an image in an auto-regressive manner. Although existing methods have shown superior performance, most learn a single-modal codebook (e.g., image only), resulting in suboptimal performance when the codebook is applied to multi-modal downstream tasks (e.g., text-to-image, image captioning) due to modal gaps. In this paper, we propose a novel language-guided codebook learning framework, called LG-VQ, which aims to learn a codebook that can be aligned with text to improve the performance of multi-modal downstream tasks. Specifically, we first introduce pre-trained text semantics as prior knowledge, then design two novel alignment modules (i.e., a Semantic Alignment Module and a Relationship Alignment Module) to transfer such prior knowledge into the codes and achieve codebook-text alignment. Notably, LG-VQ is model-agnostic and can be easily integrated into existing VQ models. Experimental results show that our method achieves superior performance on reconstruction and various multi-modal downstream tasks.
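For readers unfamiliar with VQ, the sketch below shows the standard codebook lookup with a straight-through estimator that frameworks like LG-VQ build on; the class and its parameters are illustrative, and LG-VQ's alignment modules, which would add text-alignment losses on top of these codes, are omitted.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck: each encoder feature is replaced by its nearest
    codebook entry; gradients pass straight through the discrete lookup."""

    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                               # z: (batch, dim)
        d = torch.cdist(z, self.codebook.weight)        # distances to all codes
        idx = d.argmin(dim=1)                           # nearest code per vector
        z_q = self.codebook(idx)
        # Commitment + codebook losses (VQ-VAE style).
        loss = self.beta * ((z_q.detach() - z) ** 2).mean() \
             + ((z_q - z.detach()) ** 2).mean()
        z_q = z + (z_q - z).detach()                    # straight-through estimator
        return z_q, idx, loss
```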


Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs

Neural Information Processing Systems

Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend to yield separate models, leading to disjoint inference pipelines with increased memory requirements and redundancies. This paper proposes unified training strategies for these systems. We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance, overcoming typical optimisation challenges when training from scratch. Moreover, we introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples, addressing shortcomings in related self-supervised methods. Finally, we develop a self-supervised pretraining method within our framework, proving its effectiveness alongside our semi-supervised approach. Despite using a single model for all tasks, our unified approach achieves state-of-the-art performance compared to recent methods on LRS3 and LRS2 for ASR, VSR, and AVSR, as well as on the newly released WildVSR dataset. Code and models are available at https://github.com/
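As a rough sketch of what greedy pseudo-labelling can look like in a CTC-style speech system (the paper's exact procedure may differ; `model`, the loader, and `blank_id` are placeholders), the unified model transcribes unlabelled clips greedily and the resulting pairs are added to training:

```python
import torch

@torch.no_grad()
def greedy_pseudo_labels(model, unlabeled_loader, blank_id=0):
    """Greedy CTC-style pseudo-labelling sketch: transcribe unlabelled clips
    with the current unified model, taking the argmax token per frame, then
    collapse repeats and strip blanks to obtain a pseudo-transcript."""
    model.eval()
    pairs = []
    for batch in unlabeled_loader:              # batch: (B, T, feat) input clips
        logits = model(batch)                   # (B, T, vocab_size)
        tokens = logits.argmax(dim=-1)          # greedy frame-wise decisions
        for clip, seq in zip(batch, tokens):
            collapsed = torch.unique_consecutive(seq)   # merge repeated tokens
            label = collapsed[collapsed != blank_id]    # drop CTC blanks
            pairs.append((clip, label))         # pseudo-labelled training pair
    return pairs
```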


Dispelling the Mirage of Progress in Offline MARL through Standardised Baselines and Evaluation

Neural Information Processing Systems

Offline multi-agent reinforcement learning (MARL) is an emerging field with great promise for real-world applications. Unfortunately, the current state of research in offline MARL is plagued by inconsistencies in baselines and evaluation protocols, which ultimately makes it difficult to accurately assess progress, trust newly proposed innovations, and build easily upon prior work. In this paper, we first identify significant shortcomings in existing methodologies for measuring the performance of novel algorithms through a representative study of published offline MARL work. Second, by comparing directly to this prior work, we demonstrate that simple, well-implemented baselines can achieve state-of-the-art (SOTA) results across a wide range of tasks. Specifically, we show that on 35 out of 47 datasets used in prior work (almost 75% of cases), we match or surpass the performance of the current purported SOTA. Strikingly, our baselines often substantially outperform these more sophisticated algorithms. Finally, we correct for the shortcomings highlighted in this prior work by introducing a straightforward standardised methodology for evaluation and by providing our baseline implementations with statistically robust results across several scenarios, useful for comparisons in future work. Our proposal includes simple and sensible steps that are easy to adopt, which, in combination with solid baselines and comparative results, could substantially improve the overall rigour of empirical science in offline MARL moving forward.
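The flavour of statistically robust reporting advocated here can be illustrated with a small sketch over normalised episode returns across independent seeds; the interquartile-mean-plus-bootstrap recipe below is a common choice for such protocols, not necessarily the paper's exact methodology, and the numbers are illustrative only.

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: average after trimming the lowest and highest
    quartiles (floor rounding); a robust aggregate for RL evaluation."""
    s = np.sort(np.asarray(scores))
    n = len(s)
    return s[n // 4 : n - n // 4].mean()

def bootstrap_ci(scores, stat=iqm, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for an aggregate statistic
    over independent evaluation runs (e.g., seeds)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    boots = [stat(rng.choice(scores, size=len(scores), replace=True))
             for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return stat(scores), (lo, hi)

# Illustrative numbers only: normalised returns for one algorithm/dataset pair.
returns = [0.82, 0.79, 0.91, 0.67, 0.88, 0.84, 0.73, 0.90, 0.81, 0.86]
point, (lo, hi) = bootstrap_ci(returns)
print(f"IQM = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```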


DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations (Supplementary Material)

Neural Information Processing Systems

In this section, we provide the average per-class and average overall precisions (CP and OP), recalls (CR and OR), and F1 scores (CF1 and OF1) of DualCoOp in the MLR with Partial Labels experiment on MS-COCO [3], VOC2007 [2], and BigEarth [1] (see Tables 3, 4, and 5 in the supplementary material), supplementing the corresponding tables in the main paper. We also visualize the class-specific region feature aggregation on the MS-COCO dataset (see Figure 1): DualCoOp produces high attention scores at the correct objects.
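The metrics referenced above have standard definitions in multi-label recognition; for reference, the following sketch (our code, with illustrative names) computes them from binary prediction and label matrices.

```python
import numpy as np

def mlr_metrics(preds, labels, eps=1e-12):
    """Per-class and overall precision/recall/F1 for multi-label recognition.
    `preds` and `labels` are binary arrays of shape (num_images, num_classes)."""
    tp = (preds * labels).sum(0).astype(float)      # per-class true positives
    pp = preds.sum(0).astype(float)                 # per-class predicted positives
    gp = labels.sum(0).astype(float)                # per-class ground-truth positives

    cp = (tp / (pp + eps)).mean()                   # CP: average per-class precision
    cr = (tp / (gp + eps)).mean()                   # CR: average per-class recall
    cf1 = 2 * cp * cr / (cp + cr + eps)             # CF1

    op = tp.sum() / (pp.sum() + eps)                # OP: overall precision
    orr = tp.sum() / (gp.sum() + eps)               # OR: overall recall
    of1 = 2 * op * orr / (op + orr + eps)           # OF1
    return {"CP": cp, "CR": cr, "CF1": cf1, "OP": op, "OR": orr, "OF1": of1}
```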


DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations

Neural Information Processing Systems

Solving multi-label recognition (MLR) for images in the low-label regime is a challenging task that has many real-world applications. Recent work learns an alignment between textual and visual spaces to compensate for insufficient image labels, but loses accuracy because of the limited amount of available MLR annotations. In this work, we utilize the strong alignment of textual and visual features pretrained with millions of auxiliary image-text pairs and propose Dual Context Optimization (DualCoOp) as a unified framework for partial-label MLR and zero-shot MLR.
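A minimal sketch of the dual-context idea, assuming a CLIP-like frozen text encoder (the encoder interface, shapes, and names are stand-ins, not the released implementation): each class learns a positive and a negative prompt context, and the class logit contrasts the image feature's similarity to the two resulting text embeddings.

```python
import torch
import torch.nn as nn

class DualPrompts(nn.Module):
    """Sketch of dual context optimization: per-class learnable positive and
    negative prompt contexts are fed through a frozen text encoder, and each
    class's binary logit contrasts the positive vs. negative similarity."""

    def __init__(self, num_classes, ctx_len=16, dim=512):
        super().__init__()
        self.pos_ctx = nn.Parameter(0.02 * torch.randn(num_classes, ctx_len, dim))
        self.neg_ctx = nn.Parameter(0.02 * torch.randn(num_classes, ctx_len, dim))

    def forward(self, image_feat, text_encoder, class_tokens):
        # class_tokens: (num_classes, name_len, dim) frozen class-name embeddings;
        # text_encoder maps a token sequence to one (num_classes, dim) embedding.
        pos = text_encoder(torch.cat([self.pos_ctx, class_tokens], dim=1))
        neg = text_encoder(torch.cat([self.neg_ctx, class_tokens], dim=1))
        pos = pos / pos.norm(dim=-1, keepdim=True)
        neg = neg / neg.norm(dim=-1, keepdim=True)
        img = image_feat / image_feat.norm(dim=-1, keepdim=True)
        return img @ pos.T - img @ neg.T    # (batch, num_classes) binary logits
```

The class-specific region feature aggregation visualized in the supplementary material would replace the single pooled `image_feat` here with spatially aggregated per-class features.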


The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information

Neural Information Processing Systems

The rising footprint of machine learning has led to a focus on imposing model sparsity as a means of reducing computational and memory costs. For deep neural networks (DNNs), the state-of-the-art accuracy-versus-sparsity trade-off is achieved by heuristics inspired by the classical Optimal Brain Surgeon (OBS) framework [LeCun et al., 1989, Hassibi and Stork, 1992, Hassibi et al., 1993], which leverages loss curvature information to make better pruning decisions. Yet, these results still lack a solid theoretical understanding, and it is unclear whether they can be improved by leveraging connections to the wealth of work on sparse recovery algorithms. In this paper, we draw new connections between these two areas and present new sparse recovery algorithms inspired by the OBS framework that come with theoretical guarantees under reasonable assumptions and have strong practical performance. Specifically, our work starts from the observation that we can leverage curvature information in an OBS-like fashion within the projection step of classic iterative sparse recovery algorithms such as IHT. We show for the first time that this leads to improved convergence bounds under standard assumptions.
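For reference, here is vanilla iterative hard thresholding (IHT) for sparse linear regression, the classic algorithm named above; the paper's contribution replaces the plain magnitude-based projection with an OBS-style, curvature-aware one, which this sketch does not implement.

```python
import numpy as np

def hard_threshold(w, k):
    """Plain magnitude projection: keep the k largest-magnitude entries.
    The paper's variant swaps this step for a curvature-aware projection."""
    out = np.zeros_like(w)
    idx = np.argpartition(np.abs(w), -k)[-k:]
    out[idx] = w[idx]
    return out

def iht(X, y, k, iters=300):
    """Vanilla IHT: a gradient step on the least-squares loss followed by
    projection onto the set of k-sparse vectors."""
    n, d = X.shape
    lr = n / np.linalg.norm(X, 2) ** 2     # 1/L for the averaged quadratic loss
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = hard_threshold(w - lr * grad, k)
    return w

# Toy usage: recover a 5-sparse vector from noisy linear measurements.
rng = np.random.default_rng(0)
n, d, k = 200, 50, 5
w_true = np.zeros(d)
w_true[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.01 * rng.standard_normal(n)
w_hat = iht(X, y, k)
print("relative error:", np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true))
```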


Supplementary Material for "Augmentation-Free Dense Contrastive Knowledge Distillation for Efficient Semantic Segmentation"

Neural Information Processing Systems

A.1 Datasets

We evaluate our Af-DCD method on five mainstream semantic segmentation datasets, following standard training/validation/test splits. Cityscapes [1] is a dataset for real-world semantic urban scene understanding. It contains 5,000 images with high-quality pixel-level annotations and 20,000 images with coarse annotations, collected from 50 different cities. For semantic segmentation, only the samples with pixel-level annotations are used: 2,975 training samples, 500 validation samples, and 1,525 test samples, covering 19 classes in total. Pascal VOC [2] is a competition dataset whose samples are collected from the Flickr photo-sharing website.