

Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Brandon Huang*, Chancharik Mitra*, Leonid Karlinsky

Neural Information Processing Systems

The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length, which is fixed at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images and therefore requires additional tokens.
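The context-length bottleneck described above is easy to see with back-of-the-envelope arithmetic. The sketch below uses illustrative numbers (a 4096-token window and ~576 tokens per image, a common ViT patch count), not values from the paper:

```python
# Back-of-the-envelope token budget for many-shot multimodal ICL.
# All numeric defaults are illustrative assumptions, not paper values.
def prompt_tokens(n_shots, tokens_per_image=576, tokens_per_text=64):
    """Total prompt tokens for n_shots interleaved (image, text) examples."""
    return n_shots * (tokens_per_image + tokens_per_text)

def max_shots(context_length, tokens_per_image=576, tokens_per_text=64):
    """Largest number of shots that fits in a given context window."""
    return context_length // (tokens_per_image + tokens_per_text)

print(max_shots(4096))     # 4096 // 640 = 6 examples fit
print(prompt_tokens(100))  # 100 shots would need 64000 tokens
```

Under these assumptions only a handful of multimodal examples fit, while a many-shot regime of 100 examples would need a context an order of magnitude larger.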


Supplementary Material: SeafloorAI: A Large-scale Vision-Language Dataset for Seafloor Geological Survey
Kien X. Nguyen

Neural Information Processing Systems

A.1 Motivation

For what purpose was the dataset created?
The dataset was created to further advance machine learning techniques in the field of marine science.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?
The dataset was created by the Deep-REAL and CSHEL labs at the University of Delaware. The sources of the data are USGS and NOAA.

Who funded the creation of the dataset?
The Department of Defense funded the project under the DEPSCoR Award.

A.2 Composition

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
An instance is a sonar image (2D grid data) containing different geographic layers, each of which is a channel of the image.

How many instances are there in total (of each type, if appropriate)? Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?


SeafloorAI: A Large-scale Vision-Language Dataset for Seafloor Geological Survey

Neural Information Processing Systems

A major obstacle to the advancement of machine learning models in marine science, particularly in sonar imagery analysis, is the scarcity of AI-ready datasets. While there have been efforts to make AI-ready sonar image datasets publicly available, they suffer from limitations in environmental settings and scale. To bridge this gap, we introduce SeafloorAI, the first extensive AI-ready dataset for seafloor mapping across 5 geological layers, curated in collaboration with marine scientists. We further extend the dataset to SeafloorGenAI by incorporating a language component in order to facilitate the development of both vision- and language-capable machine learning models for sonar imagery. The dataset consists of 62 geo-distributed data surveys spanning 17,300 square kilometers, with 696K sonar images, 827K annotated segmentation masks, 696K detailed language descriptions, and approximately 7M question-answer pairs. By making our data processing source code publicly available, we aim to engage the marine science community to enrich the data pool and inspire the machine learning community to develop more robust models. This collaborative approach will enhance the capabilities and applications of our datasets within both fields.


Adaptive Variance Reduction for Stochastic Optimization under Weaker Assumptions
Wei Jiang

Neural Information Processing Systems

This paper explores adaptive variance reduction methods for stochastic optimization based on the STORM technique. Existing adaptive extensions of STORM rely on strong assumptions such as bounded gradients and bounded function values, or incur an additional O(log T) term in the convergence rate.
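For context, the classical STORM estimator combines a fresh stochastic gradient with a recursive momentum correction that reuses the same sample at the previous iterate. The sketch below runs it on a toy 1-D quadratic; the step size, momentum schedule, and noise level are illustrative choices, not the paper's settings:

```python
import random

def storm(x0=5.0, steps=2000, eta=0.05, seed=0):
    """STORM on f(x) = x^2 / 2 with stochastic gradient g(y; xi) = y + xi.

    Recursion: d_t = g(x_t; xi_t) + (1 - a_t) * (d_{t-1} - g(x_{t-1}; xi_t)),
    where xi_t is a single shared noise sample per step.
    """
    rng = random.Random(seed)
    x = x0
    d = x + rng.gauss(0.0, 0.1)      # d_1 = g(x_1; xi_1)
    for t in range(2, steps + 1):
        x_prev, x = x, x - eta * d   # gradient step with the estimator
        a = min(1.0, 1.0 / t)        # momentum parameter a_t (illustrative)
        xi = rng.gauss(0.0, 0.1)     # one shared sample for both gradients
        d = (x + xi) + (1.0 - a) * (d - (x_prev + xi))
    return x

print(storm())  # converges near the minimizer x* = 0
```

The key point the recursion illustrates: because xi is shared between the two gradient evaluations, the correction term has low variance when consecutive iterates are close, which is what yields variance reduction without large batches.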


UNION: Unsupervised 3D Object Detection using Object Appearance-based Pseudo-Classes

Neural Information Processing Systems

Unsupervised 3D object detection methods have emerged to leverage vast amounts of data without requiring manual labels for training. Recent approaches rely on dynamic objects for learning to detect mobile objects but penalize the detections of static instances during training. Multiple rounds of (self-)training are used to add detected static instances to the set of training targets; this procedure improves performance but is computationally expensive. To address this, we propose the method UNION. We use spatial clustering and self-supervised scene flow to obtain a set of static and dynamic object proposals from LiDAR.
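To make the spatial-clustering step concrete, here is a minimal pure-Python density-based clustering routine in the spirit of DBSCAN, run on 2-D points standing in for bird's-eye-view LiDAR returns. The routine, its parameters, and the toy points are illustrative assumptions, not UNION's actual pipeline:

```python
from math import dist

def dbscan(points, eps=1.0, min_pts=3):
    """Naive DBSCAN: returns a cluster id per point, -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, p in enumerate(points) if dist(points[i], p) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1            # noise (may become a border point later)
            continue
        labels[i] = cluster           # i is a core point: start a cluster
        frontier = [j for j in nbrs if j != i]
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster   # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors(j)) >= min_pts:
                frontier.extend(neighbors(j))  # expand from core points only
        cluster += 1
    return labels

# Two well-separated blobs plus one outlier:
pts = [(0, 0), (0.2, 0.1), (0.1, -0.1),
       (5, 5), (5.1, 5.2), (4.9, 5.0), (20, 20)]
print(dbscan(pts, eps=0.5, min_pts=3))
```

Each resulting cluster of points would serve as one object proposal; in practice a scalable implementation (e.g., with a spatial index) would replace this O(n^2) neighbor search.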


A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation

Neural Information Processing Systems

Despite Retrieval-Augmented Generation (RAG) showing promising capability in leveraging external knowledge, a comprehensive evaluation of RAG systems is still challenging due to the modular nature of RAG, the evaluation of long-form responses, and the reliability of measurements.


2718a032d15e0b80cd164b240220df89-Paper-Conference.pdf

Neural Information Processing Systems

The generalization ability of deepfake detectors is vital for their applications in real-world scenarios. One effective solution to enhance this ability is to train the models with manually-blended data, which we term "blendfake", encouraging models to learn generic forgery artifacts such as blending boundaries. Interestingly, current SoTA methods utilize blendfake without incorporating any deepfake data in their training process. This is likely because previous empirical observations suggest that vanilla hybrid training (VHT), which combines deepfake and blendfake data, results in inferior performance to methods using only blendfake data (so-called "1+1<2"). Therefore, a critical question arises: Can we leave deepfake behind and rely solely on blendfake data to train an effective deepfake detector? As deepfakes also contain additional informative forgery clues (e.g., deep generative artifacts), excluding all deepfake data when training deepfake detectors seems counter-intuitive.


Absorb & Escape: Overcoming Single Model Limitations in Generating Genomic Sequences

Neural Information Processing Systems

Recent advances in immunology and synthetic biology have accelerated the development of deep generative methods for DNA sequence design. Two dominant approaches in this field are AutoRegressive (AR) models and Diffusion Models (DMs). However, genomic sequences are functionally heterogeneous, consisting of multiple connected regions (e.g., Promoter Regions, Exons, and Introns) where elements within each region come from the same probability distribution, but the overall sequence is non-homogeneous. This heterogeneous nature presents challenges for a single model to accurately generate genomic sequences. In this paper, we analyze the properties of AR models and DMs in heterogeneous genomic sequence generation, pointing out crucial limitations in both methods: (i) AR models capture the underlying distribution of data by factorizing and learning the transition probability but fail to capture the global property of DNA sequences.
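The AR limitation mentioned above, that such models learn transition probabilities rather than global sequence properties, follows from the standard chain-rule factorization they are trained under. For a DNA sequence $x_1, \dots, x_L$ over the nucleotide alphabet:

```latex
p(x_1, \dots, x_L) = \prod_{i=1}^{L} p(x_i \mid x_1, \dots, x_{i-1})
```

Each factor is a local conditional (transition) probability, so region-local structure is captured directly, while any global property of the full sequence is only enforced implicitly through the product of local decisions.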


Transferring disentangled representations: bridging the gap between synthetic and real images

Neural Information Processing Systems

Developing meaningful and efficient representations that separate the fundamental structure of the data generation mechanism is crucial in representation learning. However, Disentangled Representation Learning has not fully shown its potential on real images, because of correlated generative factors, limited resolution, and restricted access to ground-truth labels. Focusing on the latter, we investigate the possibility of leveraging synthetic data to learn general-purpose disentangled representations applicable to real data, discussing the effect of fine-tuning and what properties of disentanglement are preserved after the transfer. We provide an extensive empirical study to address these issues. In addition, we propose a new interpretable intervention-based metric to measure the quality of factor encoding in the representation. Our results indicate that transferring a representation from synthetic to real data while preserving some level of disentanglement is possible and effective.