Plotting

LG-VQ: Language-Guided Codebook Learning

Neural Information Processing Systems

Vector quantization (VQ) is a key technique in high-resolution and high-fidelity image synthesis, which aims to learn a codebook to encode an image with a sequence of discrete codes and then generate an image in an autoregressive manner. Although existing methods have shown superior performance, most of them learn a single-modal codebook (e.g., image), resulting in suboptimal performance when the codebook is applied to multi-modal downstream tasks (e.g., text-to-image, image captioning) due to the existence of modal gaps. In this paper, we propose a novel language-guided codebook learning framework, called LG-VQ, which aims to learn a codebook that can be aligned with the text to improve the performance of multi-modal downstream tasks. Specifically, we first introduce pre-trained text semantics as prior knowledge, then design two novel alignment modules (i.e., Semantic Alignment Module and Relationship Alignment Module) to transfer such prior knowledge into codes for achieving codebook-text alignment. In particular, our LG-VQ method is model-agnostic and can be easily integrated into existing VQ models. Experimental results show that our method achieves superior performance on reconstruction and various multi-modal downstream tasks.
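
For readers unfamiliar with the codebook lookup that VQ models rely on, the following is a minimal sketch of nearest-code quantization in plain Python. It is not the LG-VQ implementation and does not include the language-guided alignment modules; the function names, the toy codebook, and the toy features are all hypothetical and chosen only for illustration.

```python
# Minimal vector-quantization lookup (illustrative only; not the LG-VQ code).

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Returns the list of code indices and the corresponding code vectors.
    """
    indices = []
    for f in features:
        nearest = min(range(len(codebook)),
                      key=lambda j: squared_distance(f, codebook[j]))
        indices.append(nearest)
    quantized = [list(codebook[i]) for i in indices]
    return indices, quantized


# Toy example (assumed values): a 3-entry codebook of 2-D codes
# and three encoder outputs to be discretised.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
features = [[0.1, 0.2], [0.9, 0.8], [-0.7, 1.6]]

indices, quantized = quantize(features, codebook)
print(indices)
```

The sequence of indices is the discrete representation that an autoregressive generator would model; training the codebook itself (and aligning it with text) is outside the scope of this sketch.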


Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs

Neural Information Processing Systems

Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend to yield separate models, leading to disjoint inference pipelines with increased memory requirements and redundancies. This paper proposes unified training strategies for these systems. We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance, overcoming typical optimisation challenges when training from scratch. Moreover, we introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples, addressing shortcomings in related self-supervised methods. Finally, we develop a self-supervised pretraining method within our framework, proving its effectiveness alongside our semi-supervised approach. Despite using a single model for all tasks, our unified approach achieves state-of-the-art performance compared to recent methods on LRS3 and LRS2 for ASR, VSR, and AVSR, as well as on the newly released WildVSR dataset. Code and models are available at https://github.com/


Dispelling the Mirage of Progress in Offline MARL through Standardised Baselines and Evaluation

Claude Formanek, Louise Beyers, Jonathan Shock

Neural Information Processing Systems

Offline multi-agent reinforcement learning (MARL) is an emerging field with great promise for real-world applications. Unfortunately, the current state of research in offline MARL is plagued by inconsistencies in baselines and evaluation protocols, which ultimately makes it difficult to accurately assess progress, trust newly proposed innovations, and allow researchers to easily build upon prior work. In this paper, we first identify significant shortcomings in existing methodologies for measuring the performance of novel algorithms through a representative study of published offline MARL work. Second, by directly comparing to this prior work, we demonstrate that simple, well-implemented baselines can achieve state-of-the-art (SOTA) results across a wide range of tasks. Specifically, we show that on 35 out of 47 datasets used in prior work (almost 75% of cases), we match or surpass the performance of the current purported SOTA. Strikingly, our baselines often substantially outperform these more sophisticated algorithms. Finally, we correct for the shortcomings highlighted in this prior work by introducing a straightforward standardised methodology for evaluation and by providing our baseline implementations with statistically robust results across several scenarios, useful for comparisons in future work. Our proposal includes simple and sensible steps that are easy to adopt, which, in combination with solid baselines and comparative results, could substantially improve the overall rigour of empirical science in offline MARL moving forward.


DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations (Supplementary Material)

Ximeng Sun, Ping Hu, Kate Saenko
Boston University

Neural Information Processing Systems

In this section, we provide the average per-class and average overall precisions (CP and OP), recalls (CR and OR) and F1 scores (CF1 and OF1) of DualCoOp in the experiment of MLR with Partial Labels on MS-COCO [3], VOC2007 [2] and BigEarth [1] (see Tables 3, 4 and 5 in the supplementary material) as a supplement to Tables ?? and ?? in the main paper. We have also visualized the class-specific region feature aggregation on the MS-COCO dataset (Figure 1). DualCoOp assigns high attention scores to the correct objects. The PASCAL Visual Object Classes (VOC) Challenge.
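
The six metrics named above can be illustrated with a short sketch. The source does not spell out the formulas, so the code below assumes the definitions conventionally used in the multi-label recognition literature: per-class scores (CP, CR) average the precision and recall of each class, overall scores (OP, OR) pool counts across classes, and each F1 is the harmonic mean of the matching precision and recall. The function names and the toy labels are hypothetical.

```python
# Sketch of CP/CR/CF1 and OP/OR/OF1 under the conventional definitions
# (an assumption; the source does not give the formulas explicitly).

def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def harmonic_f1(precision, recall):
    return safe_div(2 * precision * recall, precision + recall)


def multilabel_scores(y_true, y_pred):
    """y_true and y_pred are lists of per-image 0/1 label lists."""
    n_classes = len(y_true[0])
    tp = [0] * n_classes
    fp = [0] * n_classes
    fn = [0] * n_classes
    for true_row, pred_row in zip(y_true, y_pred):
        for c in range(n_classes):
            if pred_row[c] and true_row[c]:
                tp[c] += 1
            elif pred_row[c]:
                fp[c] += 1
            elif true_row[c]:
                fn[c] += 1

    # Per-class: average the class-wise precision and recall.
    cp = sum(safe_div(tp[c], tp[c] + fp[c]) for c in range(n_classes)) / n_classes
    cr = sum(safe_div(tp[c], tp[c] + fn[c]) for c in range(n_classes)) / n_classes
    # Overall: pool the counts over all classes before dividing.
    op = safe_div(sum(tp), sum(tp) + sum(fp))
    orc = safe_div(sum(tp), sum(tp) + sum(fn))

    return {"CP": cp, "CR": cr, "CF1": harmonic_f1(cp, cr),
            "OP": op, "OR": orc, "OF1": harmonic_f1(op, orc)}


# Toy data (assumed): 4 images, 2 classes.
y_true = [[1, 0], [1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [1, 0], [0, 1], [1, 1]]
scores = multilabel_scores(y_true, y_pred)
print(scores)
```

On this toy input the per-class and overall scores differ, because the class with errors weighs equally with the error-free class in the per-class average but contributes more predictions to the pooled counts.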


DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations

Ximeng Sun, Ping Hu, Kate Saenko
Boston University

Neural Information Processing Systems

Solving multi-label recognition (MLR) for images in the low-label regime is a challenging task that has many real-world applications. Recent work learns an alignment between textual and visual spaces to compensate for insufficient image labels, but loses accuracy because of the limited amount of available MLR annotations. In this work, we utilize the strong alignment of textual and visual features pretrained with millions of auxiliary image-text pairs and propose Dual Context Optimization (DualCoOp) as a unified framework for partial-label MLR and zero-shot MLR.


The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information

Diyuan Wu, Denis Kuznedelev

Neural Information Processing Systems

The rising footprint of machine learning has led to a focus on imposing model sparsity as a means of reducing computational and memory costs. For deep neural networks (DNNs), the state-of-the-art accuracy-vs-sparsity trade-off is achieved by heuristics inspired by the classical Optimal Brain Surgeon (OBS) framework [LeCun et al., 1989, Hassibi and Stork, 1992, Hassibi et al., 1993], which leverages loss curvature information to make better pruning decisions. Yet, these results still lack a solid theoretical understanding, and it is unclear whether they can be improved by leveraging connections to the wealth of work on sparse recovery algorithms. In this paper, we draw new connections between these two areas and present new sparse recovery algorithms inspired by the OBS framework that come with theoretical guarantees under reasonable assumptions and have strong practical performance. Specifically, our work starts from the observation that we can leverage curvature information in OBS-like fashion in the projection step of classic iterative sparse recovery algorithms such as IHT. We show for the first time that this leads to improved convergence bounds under standard assumptions.
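
As background for the abstract above, here is a minimal sketch of classic iterative hard thresholding (IHT) for the least-squares objective 0.5 * ||Ax - y||^2: a gradient step followed by projection onto k-sparse vectors (keeping the k largest-magnitude entries). This shows only the baseline that the paper starts from; the curvature-aware, OBS-like projection is not reproduced here because the abstract does not specify it. All function names, the step size, and the toy problem are assumptions made for illustration.

```python
# Classic IHT baseline (illustrative; the OBS-like variant is NOT shown).

def matvec(A, x):
    """Compute A @ x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Compute A^T @ r."""
    return [sum(A[i][j] * r[i] for i in range(len(A)))
            for j in range(len(A[0]))]


def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [xi if i in keep else 0.0 for i, xi in enumerate(x)]


def iht(A, y, k, step=1.0, iters=50):
    """Iterate x <- H_k(x - step * A^T (A x - y)), starting from zero."""
    x = [0.0] * len(A[0])
    for _ in range(iters):
        residual = [p - t for p, t in zip(matvec(A, x), y)]
        grad = rmatvec(A, residual)
        x = hard_threshold([xi - step * g for xi, g in zip(x, grad)], k)
    return x


# Toy problem (assumed): identity measurements, so the 2-sparse solution
# simply keeps the two largest-magnitude entries of y.
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
y = [3.0, 0.1, -2.0]
x_hat = iht(A, y, k=2)
print(x_hat)
```

The projection `hard_threshold` is the step where, according to the abstract, curvature information would be introduced; in plain IHT it depends only on entry magnitudes.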


Supplementary Material for "Augmentation-Free Dense Contrastive Knowledge Distillation for Efficient Semantic Segmentation"

Neural Information Processing Systems

A.1 Datasets

We evaluate our Af-DCD method on five mainstream semantic segmentation datasets following standard training/validation/test splits. Cityscapes [1] is a dataset for real-world semantic urban scene understanding. It has 5,000 image samples with high-quality pixel-level annotations and 20,000 image samples with coarse annotations, collected from 50 different cities. In semantic segmentation, only the samples with pixel-level annotations are used: 2,975 training samples, 500 validation samples and 1,525 testing samples, with 19 classes in total. Pascal VOC [2] is a competition dataset whose samples are collected from the Flickr photo-sharing website.




A Appendix

Neural Information Processing Systems

A.1 Pseudocode for our search algorithm

Our framework follows a standard search pipeline:

1. Candidate proposal: the search algorithm samples an optimizer from the search space.

This procedure is commonly used in other AutoML domains, such as Neural Architecture Search [47, 67] and Hyperparameter Optimization [23]. Algorithms 1 and 2 summarize the complete search process.

Input: Candidate set A, constraints C, operator set O, maximum super-tree depth D, maximum traversal level L, MC sample size M for each level, score threshold, proposal size K.

Following NOS-RL, we use n = 0.5 for cosine decay and n = 20 for restart decay. We set the bound for the clip operator to 0.003 and the dropout ratio for the drop operator to 0.1. Note that one can always include more options for these values by adding new operator variants to the space (e.g. drop ...). For all input operators, we use their default PyTorch implementations and hyper-parameters.
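
To make the two stated hyper-parameters concrete, the following plain-Python sketch shows one plausible reading of the clip and drop operators with the values given above (bound 0.003, ratio 0.1). It is an assumption, not the authors' code: clip is taken to be element-wise clamping to [-bound, bound], and drop is taken to mirror the default training-mode behaviour of PyTorch dropout, which zeroes entries with the given probability and rescales the survivors by 1 / (1 - ratio). The function names and the sample update vector are hypothetical.

```python
import random

# Assumed readings of the "clip" and "drop" operators (not the authors' code).

CLIP_BOUND = 0.003   # bound for the clip operator, as stated in the text
DROP_RATIO = 0.1     # dropout ratio for the drop operator, as stated in the text


def clip(values, bound=CLIP_BOUND):
    """Clamp every entry to the interval [-bound, bound]."""
    return [max(-bound, min(bound, v)) for v in values]


def drop(values, ratio=DROP_RATIO, rng=None):
    """Zero each entry with probability `ratio`; rescale the rest.

    The 1 / (1 - ratio) rescaling follows the usual dropout convention
    and is an assumption about how the operator behaves here.
    """
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - ratio)
    return [0.0 if rng.random() < ratio else v * scale for v in values]


# Hypothetical update vector used only to exercise the two operators.
update = [0.01, -0.01, 0.001]
clipped = clip(update)
dropped = drop(update, rng=random.Random(0))
print(clipped)
```

Adding a new operator variant to the search space, as the text suggests, would amount to registering the same function with a different default value.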