Plotting

Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation

Neural Information Processing Systems

Fine-grained visual classification (FGVC) involves classifying closely related sub-classes. This task is difficult due to the subtle differences between classes and the high intra-class variance. Moreover, FGVC datasets are typically small and challenging to gather, thus highlighting a significant need for effective data augmentation. Recent advancements in text-to-image diffusion models offer new possibilities for augmenting classification datasets. While these models have been used to generate training data for classification tasks, their effectiveness in fulldataset training of FGVC models remains under-explored.


Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure

Neural Information Processing Systems

Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the tasks into the positional encoding of a (decoder-only) Transformer. Taking a departure from the vanilla absolute position mechanism assigning unique position IDs to each of the tokens, we assign the same position IDs to two or more "relevant" tokens; for integer addition tasks, we regard digits of the same significance as in the same position. On the empirical side, we show that with the proposed position coupling, a small (1-layer) Transformer trained on 1 to 30-digit additions can generalize up to 200-digit additions (6.67 of the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it. We also demonstrate that position coupling can be applied to other algorithmic tasks such as N 2 multiplication and a two-dimensional task.


Rethinking the Power of Timestamps for Robust Time Series Forecasting: A Global-Local Fusion Perspective Qi Qi1

Neural Information Processing Systems

Time series forecasting has played a pivotal role across various industries, including finance, transportation, energy, healthcare, and climate. Due to the abundant seasonal information they contain, timestamps possess the potential to offer robust global guidance for forecasting techniques. However, existing works primarily focus on local observations, with timestamps being treated merely as an optional supplement that remains underutilized. When data gathered from the real world is polluted, the absence of global information will damage the robust prediction capability of these algorithms. To address these problems, we propose a novel framework named GLAFF. Within this framework, the timestamps are modeled individually to capture the global dependencies. Working as a plugin, GLAFF adaptively adjusts the combined weights for global and local information, enabling seamless collaboration with any time series forecasting backbone. Extensive experiments conducted on nine real-world datasets demonstrate that GLAFF significantly enhances the average performance of widely used mainstream forecasting models by 12.5%, surpassing the previous state-of-the-art method by 5.5%.



Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning Brandon Huang 1* Chancharik Mitra 1* Leonid Karlinsky 3

Neural Information Processing Systems

The recent success of interleaved Large Multimodal Models (LMMs) in fewshot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens.


Supplementary Material: SeafloorAI: A Large-scale Vision-Language Dataset for Seafloor Geological Survey Kien X. Nguyen 1

Neural Information Processing Systems

A.1 Motivation For what purpose was the dataset created? The dataset was created to further advance machine learning techniques in the field of marine science. Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)? The dataset was created by the Deep-REAL and CSHEL labs at the University of Delaware. The sources of the data are from USGS and NOAA. Who funded the creation of the dataset? The Department of Defense funded the project under the DEPSCoR Award. A.2 Composition What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? An instance is a sonar image (2D grid data), containing different geographic layers, each of which is a channel of the image. How many instances are there in total (of each type, if appropriate)? Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?


SeafloorAI: A Large-scale Vision-Language Dataset for Seafloor Geological Survey 1 1

Neural Information Processing Systems

A major obstacle to the advancements of machine learning models in marine science, particularly in sonar imagery analysis, is the scarcity of AI-ready datasets. While there have been efforts to make AI-ready sonar image dataset publicly available, they suffer from limitations in terms of environment setting and scale. To bridge this gap, we introduce SeafloorAI, the first extensive AI-ready datasets for seafloor mapping across 5 geological layers that is curated in collaboration with marine scientists. We further extend the dataset to SeafloorGenAI by incorporating the language component in order to facilitate the development of both visionand language-capable machine learning models for sonar imagery. The dataset consists of 62 geo-distributed data surveys spanning 17,300 square kilometers, with 696K sonar images, 827K annotated segmentation masks, 696K detailed language descriptions and approximately 7M question-answer pairs. By making our data processing source code publicly available, we aim to engage the marine science community to enrich the data pool and inspire the machine learning community to develop more robust models. This collaborative approach will enhance the capabilities and applications of our datasets within both fields.


Adaptive Variance Reduction for Stochastic Optimization under Weaker Assumptions Wei Jiang 1

Neural Information Processing Systems

This paper explores adaptive variance reduction methods for stochastic optimization based on the STORM technique. Existing adaptive extensions of STORM rely on strong assumptions like bounded gradients and bounded function values, or suffer an additional O(log T) term in the convergence rate.


UNION: Unsupervised 3D Object Detection using Object Appearance-based Pseudo-Classes

Neural Information Processing Systems

Unsupervised 3D object detection methods have emerged to leverage vast amounts of data without requiring manual labels for training. Recent approaches rely on dynamic objects for learning to detect mobile objects but penalize the detections of static instances during training. Multiple rounds of (self) training are used to add detected static instances to the set of training targets; this procedure to improve performance is computationally expensive. To address this, we propose the method UNION. We use spatial clustering and self-supervised scene flow to obtain a set of static and dynamic object proposals from LiDAR.


A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation

Neural Information Processing Systems

Despite Retrieval-Augmented Generation (RAG) showing promising capability in leveraging external knowledge, a comprehensive evaluation of RAG systems is still challenging due to the modular nature of RAG, evaluation of long-form responses and reliability of measurements.