Inductive Learning
Reviews: Joint-task Self-supervised Learning for Temporal Correspondence
The paper presents a new approach to tracking and pixel-level correspondence using self-supervised learning in video. It goes in the direction of multi-task learning, and the results are solid. The reviewers initially gave scores of 5, 6, and 7; after the rebuttal, the more skeptical reviewer was also convinced to raise their rating.
Reviews: Localized Structured Prediction
The model is learned by breaking the structure into parts and performing kernel ridge regression on the parts. They provide an elaborate convergence rate analysis for the estimation. The theoretical analysis is the strong part of this paper. In a lot of computer vision and NLP applications the latest research is about capturing long-range dependencies. The correlation in Figure 1 is highly concentrated at the central patch because it's the average of many different images, but on individual images the correlation pattern can be very different.
Reviews: Localized Structured Prediction
The authors propose a general theoretical framework for structured prediction that deals with cases where the data exhibits a local structure, so that the inputs and outputs can be decomposed into parts. The reviewers deemed the theoretical contributions to be original and of high quality. The author response addressed the perceived weaknesses, in particular in the empirical evaluation, in a satisfactory way.
Review for NeurIPS paper: Rankmax: An Adaptive Projection Alternative to the Softmax Function
Strengths: * The paper is concerned with deriving a continuous approximation of the k-argmax function as a generic projection of a score vector onto the (n, 1)-simplex or the (n, k)-simplex (for predicting the top-k relevant labels) based on a strongly convex function g. The first interesting contribution shows how to obtain such an approximation and derives the general solution of this problem given some properties of g (it is separable and 1-strongly convex). Relevant choices of g are the quadratic function and negative entropy. Specifically, the Euclidean projection, with the Lipschitz constant \alpha of the projection adapted to the training instance, is devised as the Rankmax operator. The key element is that \alpha can be computed such that the sample's labels occur in the top-k.
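To make the projection idea in this review concrete, here is a minimal sketch of the standard sort-based Euclidean projection of a score vector onto the probability simplex (the (n, 1) case). It is not the paper's Rankmax operator: the function name `project_to_simplex` and the use of `alpha` as a plain multiplicative scale on the scores are assumptions made for illustration only, and the per-instance choice of \alpha described in the review is not implemented.

```python
import numpy as np

def project_to_simplex(scores, alpha=1.0):
    """Euclidean projection of alpha * scores onto {p : p >= 0, sum(p) = 1}.

    `alpha` is a hypothetical scaling knob for this sketch; how Rankmax
    actually adapts it per training instance is not reproduced here.
    """
    z = alpha * np.asarray(scores, dtype=float)
    u = np.sort(z)[::-1]                    # scores in descending order
    cssv = np.cumsum(u) - 1.0               # cumulative sums minus the simplex total
    idx = np.arange(1, len(z) + 1)
    support = u - cssv / idx > 0            # entries that stay strictly positive
    rho = idx[support][-1]                  # size of the support
    tau = cssv[rho - 1] / rho               # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)

p_sharp = project_to_simplex([2.0, 1.0, 0.1])             # one-hot on the top score
p_soft = project_to_simplex([2.0, 1.0, 0.1], alpha=0.5)   # mass spread over two labels
```

Unlike softmax, the projection can return exact zeros, and in this sketch a smaller `alpha` spreads mass over more labels.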
Review for NeurIPS paper: FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
I cite from the ReMixMatch figure caption: "Augmentation anchoring. We use the prediction for a weakly augmented image (green, middle) as the target for predictions on strong augmentations of the same image". This sounds to me like a summary of the presented work, and as such I consider it a special case of ReMixMatch. The authors have discussed the differences between their work and ReMixMatch, mentioning that (1) "ReMixMatch don't use pseudo labeling", and (2) ReMixMatch uses sharpening of pseudolabels and weight annealing of the unlabeled data loss. However, in section 3.2.1 of ReMixMatch, it is stated that the guessed labels are used as targets (for strongly augmented images) using cross-entropy loss.
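The mechanism under discussion, using the prediction on a weakly augmented image as a hard target for the strongly augmented view, can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not code from either paper: the function names, the `threshold` value of 0.95, and averaging over the full batch are assumptions, and the logits stand in for the outputs of some model on the two views.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def pseudo_label_loss(weak_logits, strong_logits, threshold=0.95):
    """Cross-entropy of strong-view predictions against weak-view hard labels.

    Only samples whose weak-view confidence reaches `threshold` contribute;
    the threshold value is an assumption for this sketch.
    """
    weak_probs = softmax(np.asarray(weak_logits, dtype=float))
    strong_probs = softmax(np.asarray(strong_logits, dtype=float))
    pseudo = weak_probs.argmax(axis=1)             # hard pseudo-labels from the weak view
    mask = weak_probs.max(axis=1) >= threshold     # keep confident samples only
    picked = strong_probs[np.arange(len(pseudo)), pseudo]
    ce = -np.log(picked + 1e-12)                   # per-sample cross-entropy
    return float((ce * mask).mean())               # masked mean over the whole batch

weak = [[10.0, 0.0, 0.0], [0.1, 0.0, 0.0]]    # first sample confident, second not
strong = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # uniform predictions on the strong view
loss = pseudo_label_loss(weak, strong)
```

In this toy batch only the first sample passes the confidence mask, so the second contributes nothing to the loss.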
Review for NeurIPS paper: One-bit Supervision for Image Classification
Additional Feedback: I consider this work a new method in the context of semi-supervised learning and active learning. Indeed, these are the two topics the authors of the paper reviewed as related work. The method essentially is yet another way to rearrange labeled samples and unlabeled samples in order to identify "active" samples to improve the learning accuracy. Thus, it is not an eye-opening, truly novel approach. I would argue that this method is incrementally novel at best.
Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities
Dai, Qirun, Zhang, Dylan, Ma, Jiaqi W., Peng, Hao
Selecting appropriate training data is crucial for effective instruction fine-tuning of large language models (LLMs), which aims to (1) elicit strong capabilities, and (2) achieve balanced performance across a diverse range of tasks. Influence-based methods show promise in achieving (1) by estimating the contribution of each training example to the model's predictions, but often struggle with (2). Our systematic investigation reveals that this underperformance can be attributed to an inherent bias where certain tasks intrinsically have greater influence than others. As a result, data selection is often biased towards these tasks, not only hurting the model's performance on others but also, counterintuitively, harming performance on these high-influence tasks themselves. As a remedy, we propose BIDS, a Balanced and Influential Data Selection algorithm. BIDS first normalizes influence scores of the training data, and then iteratively balances data selection by choosing the training example with the highest influence on the most underrepresented task. Experiments with both Llama-3 and Mistral-v0.3 on seven benchmarks spanning five diverse capabilities show that BIDS consistently outperforms both state-of-the-art influence-based algorithms and other non-influence-based selection frameworks. Surprisingly, training on a 15% subset selected by BIDS can even outperform full-dataset training with a much more balanced performance. Our analysis further highlights the importance of both instance-level normalization and iterative optimization of selected data for balanced learning of diverse capabilities.
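The two steps named in the abstract, normalizing influence scores and then repeatedly picking the example with the highest influence on the most underrepresented task, can be sketched roughly as below. The abstract does not spell out the details, so everything specific here is an assumption: the function name `balanced_select`, z-scoring each task column as the normalization, and measuring "underrepresented" as the lowest accumulated normalized influence of the examples chosen so far. It should be read as an illustration of the idea, not as the BIDS algorithm itself.

```python
import numpy as np

def balanced_select(influence, budget):
    """Greedy balanced selection over an (n_examples, n_tasks) influence matrix.

    Assumptions of this sketch: column-wise z-scoring as the normalization,
    and accumulated normalized influence as the per-task coverage measure.
    """
    scores = np.asarray(influence, dtype=float)
    std = scores.std(axis=0)
    std[std == 0] = 1.0                               # guard against constant columns
    norm = (scores - scores.mean(axis=0)) / std

    selected = []
    available = np.ones(len(norm), dtype=bool)
    coverage = np.zeros(norm.shape[1])                # influence gathered per task so far
    for _ in range(min(budget, len(norm))):
        task = int(np.argmin(coverage))               # currently least-covered task
        candidates = np.where(available, norm[:, task], -np.inf)
        best = int(np.argmax(candidates))             # top remaining example for that task
        selected.append(best)
        available[best] = False
        coverage += norm[best]
    return selected

# Task 0 has much larger raw influence values than task 1.
toy_influence = [[10.0, 0.0], [9.0, 0.0], [0.0, 1.0], [0.0, 0.9]]
chosen = balanced_select(toy_influence, budget=2)
```

On this toy matrix, ranking by raw influence would take the two task-0 examples, whereas the balanced loop takes one example per task.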
Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos
Self-supervised learning holds the promise to learn good representations from real-world continuous uncurated data streams. However, most existing works in visual self-supervised learning focus on static images or artificial data streams. Towards exploring a more realistic learning substrate, we investigate streaming self-supervised learning from long-form real-world egocentric video streams. Inspired by the event segmentation mechanism in human perception and memory, we propose "Memory Storyboard" that groups recent past frames into temporal segments for more effective summarization of the past visual streams for memory replay. To accommodate efficient temporal segmentation, we propose a two-tier memory hierarchy: the recent past is stored in a short-term memory, and the storyboard temporal segments are then transferred to a long-term memory. Experiments on real-world egocentric video datasets including SAYCam and KrishnaCam show that contrastive learning objectives on top of storyboard frames result in semantically meaningful representations which outperform those produced by state-of-the-art unsupervised continual learning methods.
Reviews: Contextual semibandits via supervised learning oracles
This paper is very interesting in general, and I believe that it meets the standard of a NIPS poster. In particular, to the best of my knowledge, this is the first paper considering contextual combinatorial semi-bandits with *unknown* weights. However, I think some parts of the paper can still be improved, and will appreciate it if the authors polish the final version of the paper accordingly: 1) In Theorem 2: the O(T^{2/3}) regret bound is somewhat unsatisfactory since I was expecting an O(T^{1/2}) regret bound. If the authors believe that the O(T^{2/3}) regret bound is intrinsic, please discuss. If the authors believe that it is due to unsatisfactory analysis, please also discuss (i.e. which step of the analysis leads to this non-tight regret bound). Please rewrite the motivation and explanation of the algorithm.
Reviews: Sampling for Bayesian Program Learning
I found this paper interesting and well-written, but I have some significant questions and comments about the approach. The paper argues that sampling is useful because we can find the C most frequently sampled programs and show them to a user. As shown in Figure 6, there is more likely to be a correct program in the top 3 programs than in the top 1. But if we want to show the top C programs, do we really need to perform sampling, which the paper says is complicated by the existence of many long and unlikely programs that match the training examples? Why can't we simply find the MDL program and then run the solver again with length restrictions to find other consistent programs of the same length, or slightly longer lengths?