
DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations (Supplementary Material). Ximeng Sun, Ping Hu, Kate Saenko (Boston University)

Neural Information Processing Systems

In this section, we provide the average per-class and average overall precisions (CP and OP), recalls (CR and OR), and F1 scores (CF1 and OF1) of DualCoOp in the experiment of MLR with partial labels on MS-COCO [3], VOC2007 [2], and BigEarth [1] (see Tables 3, 4, and 5 in the supplementary material), supplementing Tables ?? and ?? in the main paper. We also visualize the class-specific region feature aggregation on the MS-COCO dataset (Figure 1); DualCoOp produces high attention scores at the correct objects.
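For readers who want to reproduce these six numbers, the sketch below follows the convention commonly used in MLR papers: per-class metrics average precision and recall over classes before combining them into CF1, while overall metrics pool true positives across all classes. This is a minimal NumPy illustration, not the authors' evaluation code, and it assumes binarized prediction and label matrices.

```python
import numpy as np

def multilabel_prf1(preds, labels, eps=1e-12):
    """Per-class (CP/CR/CF1) and overall (OP/OR/OF1) precision, recall, F1.

    preds, labels: binary arrays of shape (num_images, num_classes),
    where preds is typically obtained by thresholding per-class scores.
    """
    preds = preds.astype(bool)
    labels = labels.astype(bool)

    tp = (preds & labels).sum(axis=0).astype(float)   # correct predictions per class
    pred_pos = preds.sum(axis=0).astype(float)        # predicted positives per class
    true_pos = labels.sum(axis=0).astype(float)       # ground-truth positives per class

    # Per-class: average precision/recall over classes, then F1 of the averages
    cp = np.mean(tp / (pred_pos + eps))
    cr = np.mean(tp / (true_pos + eps))
    cf1 = 2 * cp * cr / (cp + cr + eps)

    # Overall: pool counts over all classes before computing precision/recall
    op = tp.sum() / (pred_pos.sum() + eps)
    or_ = tp.sum() / (true_pos.sum() + eps)
    of1 = 2 * op * or_ / (op + or_ + eps)

    return dict(CP=cp, CR=cr, CF1=cf1, OP=op, OR=or_, OF1=of1)
```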


DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations. Ximeng Sun, Ping Hu, Kate Saenko (Boston University)

Neural Information Processing Systems

Solving multi-label recognition (MLR) for images in the low-label regime is a challenging task that has many real-world applications. Recent work learns an alignment between textual and visual spaces to compensate for insufficient image labels, but loses accuracy because of the limited amount of available MLR annotations. In this work, we utilize the strong alignment of textual and visual features pretrained with millions of auxiliary image-text pairs and propose Dual Context Optimization (DualCoOp) as a unified framework for partial-label MLR and zero-shot MLR.


The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information. Diyuan Wu, Denis Kuznedelev

Neural Information Processing Systems

The rising footprint of machine learning has led to a focus on imposing model sparsity as a means of reducing computational and memory costs. For deep neural networks (DNNs), the state-of-the-art accuracy-versus-sparsity trade-off is achieved by heuristics inspired by the classical Optimal Brain Surgeon (OBS) framework [LeCun et al., 1989, Hassibi and Stork, 1992, Hassibi et al., 1993], which leverages loss curvature information to make better pruning decisions. Yet, these results still lack a solid theoretical understanding, and it is unclear whether they can be improved by leveraging connections to the wealth of work on sparse recovery algorithms. In this paper, we draw new connections between these two areas and present new sparse recovery algorithms inspired by the OBS framework that come with theoretical guarantees under reasonable assumptions and have strong practical performance. Specifically, our work starts from the observation that curvature information can be leveraged, in OBS-like fashion, within the projection step of classic iterative sparse recovery algorithms such as IHT. We show for the first time that this leads to improved convergence bounds under standard assumptions.
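To make the idea concrete, here is a minimal sketch of an IHT-style loop whose projection step keeps the weights with the largest OBS-like saliency w_i^2 * H_ii computed from a diagonal curvature estimate, rather than the largest magnitudes. This is an illustration of the general idea under simplifying assumptions (a dense iterate and a user-supplied diagonal Hessian approximation), not the algorithm or the guarantees from the paper.

```python
import numpy as np

def curvature_weighted_iht(grad_fn, hess_diag_fn, x0, k, lr=0.1, steps=100):
    """Iterative hard thresholding with an OBS-style projection.

    Instead of keeping the k largest-magnitude entries, the projection keeps
    the k entries with the largest saliency w_i^2 * H_ii, where H_ii is a
    diagonal curvature estimate (cf. the OBS pruning criterion).

    grad_fn(x)      -> gradient of the loss at x
    hess_diag_fn(x) -> diagonal of (an approximation to) the Hessian at x
    """
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * grad_fn(x)                 # gradient step
        h = np.maximum(hess_diag_fn(x), 1e-8)   # curvature estimate, kept positive
        saliency = (x ** 2) * h                 # OBS-like importance of each weight
        keep = np.argsort(saliency)[-k:]        # indices of the k most salient weights
        mask = np.zeros_like(x, dtype=bool)
        mask[keep] = True
        x = np.where(mask, x, 0.0)              # hard-threshold the remaining weights
    return x
```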


Supplementary Material for "Augmentation-Free Dense Contrastive Knowledge Distillation for Efficient Semantic Segmentation"

Neural Information Processing Systems

A.1 Datasets We evaluate our Af-DCD method on five mainstream semantic segmentation datasets following standard training/validation/test splits. Cityscapes [1] is a dataset for real-world semantic urban scene understanding. It contains 5,000 images with high-quality pixel-level annotations and 20,000 images with coarse annotations, collected from 50 different cities. For semantic segmentation, only the samples with pixel-level annotations are used: 2,975 training samples, 500 validation samples, and 1,525 test samples covering 19 classes in total. Pascal VOC [2] is a competition dataset whose samples are collected from the flickr photo-sharing website.




A Appendix

Neural Information Processing Systems

A.1 Pseudocode for our search algorithm Our framework follows a standard search pipeline: 1. Candidate proposal: the search algorithm samples an optimizer from the search space. This procedure is commonly used in other AutoML domains, such as Neural Architecture Search [47, 67] and Hyperparameter Optimization [23]. Algorithms 1 and 2 summarize the complete search process; a simplified sketch is given below. Input: candidate set A, constraints C, operator set O, maximum super-tree depth D, maximum traversal level L, MC sample size M for each level, score threshold, and proposal size K. Following NOS-RL, we use n = 0.5 for cosine decay and n = 20 for restart decay. We set the bound of the clip operator to 0.003 and the dropout ratio of the drop operator to 0.1. Note that one can always include more options for these values by adding new operator variants to the space (e.g., additional variants of the drop operator). For all input operators, we use their default PyTorch implementations and hyper-parameters.
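The following sketch illustrates the candidate-proposal loop described above in simplified form. Names such as sample_fn and score_fn are placeholders, and the actual Algorithms 1 and 2 additionally handle the super-tree depth D, traversal level L, and MC sample size M, which are omitted here.

```python
def search(candidates, constraints, score_fn, sample_fn, threshold, k, budget=1000):
    """Minimal candidate-proposal search loop.

    sample_fn(candidates)  -> an optimizer proposed from the search space
    score_fn(optimizer)    -> a cheap proxy score (e.g., short-training validation metric)
    constraints(optimizer) -> True if the optimizer satisfies the structural constraints C
    Returns up to k proposals whose score meets the threshold, best first.
    """
    proposals = []
    for _ in range(budget):
        opt = sample_fn(candidates)       # 1. candidate proposal from the search space
        if not constraints(opt):          # 2. reject candidates violating constraints
            continue
        score = score_fn(opt)             # 3. cheap evaluation of the candidate
        if score >= threshold:            # 4. keep only promising candidates
            proposals.append((score, opt))
    proposals.sort(key=lambda item: item[0], reverse=True)
    return [opt for _, opt in proposals[:k]]
```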


A Brief Review of The Shapley Value

Neural Information Processing Systems

Given a value function v, the Shapley value is a solution to distributing the payoff v(N) to the parties in N [14]. Given an order of parties (i.e., a permutation π of N), party i joins the coalition P_i^π of parties that precede it in π and contributes the marginal value v(P_i^π ∪ {i}) − v(P_i^π); the Shapley value of party i is this marginal contribution averaged over all permutations, φ_i = (1/|N|!) Σ_π [v(P_i^π ∪ {i}) − v(P_i^π)]. The Shapley value is 'fair' since it is the unique solution that satisfies several desirable properties, as elaborated below. It ensures that all of v(N) is distributed to the parties (efficiency). It implies that parties with equal marginal contributions to any coalition receive the same payoff (symmetry). A reward allocation scheme is replication-robust if a party cannot increase its rewards by replicating its data and participating in the collaboration as multiple parties.
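As an illustration of this definition, the snippet below computes exact Shapley values by enumerating all permutations of the parties, which is only feasible for small N; the value function v is assumed to be supplied as a Python callable on coalitions.

```python
from itertools import permutations
from math import factorial

def shapley_values(parties, v):
    """Exact Shapley values by averaging marginal contributions over all permutations.

    parties: list of hashable party identifiers (the set N)
    v:       value function mapping a frozenset coalition to a real payoff
    """
    phi = {i: 0.0 for i in parties}
    n_perms = factorial(len(parties))
    for order in permutations(parties):
        coalition = frozenset()
        for i in order:
            # Marginal contribution of party i to the coalition of its predecessors
            phi[i] += v(coalition | {i}) - v(coalition)
            coalition = coalition | {i}
    return {i: total / n_perms for i, total in phi.items()}

# Example: three parties whose coalition value is the squared coalition size;
# by symmetry and efficiency, each party receives v(N)/3 = 3.
if __name__ == "__main__":
    print(shapley_values(["a", "b", "c"], lambda S: len(S) ** 2))
```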



Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models

Neural Information Processing Systems

Imagine observing someone scratching their arm; to understand why, additional context would be necessary. However, spotting a mosquito nearby would immediately offer a likely explanation for the person's discomfort, thereby alleviating the need for further information. This example illustrates how subtle visual cues can challenge our cognitive skills and demonstrates the complexity of interpreting visual scenarios. To study these skills, we present Visual Riddles, a benchmark aimed at testing vision and language models on visual riddles that require commonsense and world knowledge. The benchmark comprises 400 visual riddles, each featuring a unique image created by a variety of text-to-image models, a question, a ground-truth answer, a textual hint, and attribution. Human evaluation reveals that existing models lag significantly behind human performance (82% accuracy), with Gemini-Pro-1.5 leading at 40% accuracy. Our benchmark comes with automatic evaluation tasks to make assessment scalable. These findings underscore the potential of Visual Riddles as a valuable resource for enhancing vision and language models' capabilities in interpreting complex visual scenarios.