Accelerating data-driven algorithm selection for combinatorial partitioning problems

Neural Information Processing Systems

Data-driven algorithm selection is a powerful approach for choosing effective heuristics for computational problems. It operates by evaluating a set of candidate algorithms on a collection of representative training instances and selecting the one with the best empirical performance. However, running each algorithm on every training instance is computationally expensive, making scalability a central challenge. In practice, a common workaround is to evaluate algorithms on smaller proxy instances derived from the original inputs. However, this practice has remained largely ad hoc and lacked theoretical grounding. We provide the first theoretical foundations for this practice by formalizing the notion of size generalization: predicting an algorithm's performance on a large instance by evaluating it on a smaller, representative instance, subsampled from the original instance. We provide size generalization guarantees for three widely used clustering algorithms (single-linkage, k-means++, and Gonzalez's k-centers heuristic) and two canonical max-cut algorithms (Goemans-Williamson and Greedy). We characterize the subsample size sufficient to ensure that performance on the subsample reflects performance on the full instance, and our experiments support these findings.
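The idea of size generalization described above can be illustrated with a minimal sketch. It runs Gonzalez's farthest-first k-centers heuristic on a full instance and on a uniform random subsample, then compares the resulting k-center cost. The synthetic three-blob instance, the sample size of 300, the helper names, and the use of the k-center objective as the performance measure are all assumptions made for illustration; the paper's own performance measure and sampling scheme may differ.

```python
import math
import random

def gonzalez_k_centers(points, k, first=0):
    """Farthest-first traversal: repeatedly add the point farthest from the chosen centers."""
    centers = [points[first]]
    # dist[i] holds the distance from points[i] to its nearest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[idx])
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return centers

def k_center_cost(points, centers):
    """k-center objective: largest distance from any point to its nearest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)

def make_instance(n, rng):
    """Hypothetical instance (an assumption): three Gaussian blobs in the plane."""
    blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in (blobs[i % 3] for i in range(n))]

rng = random.Random(0)
full = make_instance(3000, rng)
sample = rng.sample(full, 300)  # smaller proxy instance, subsampled uniformly

# Evaluate the heuristic on the full instance and on the proxy instance.
full_cost = k_center_cost(full, gonzalez_k_centers(full, k=3))
proxy_cost = k_center_cost(sample, gonzalez_k_centers(sample, k=3))
print(f"full instance cost:  {full_cost:.3f}")
print(f"proxy instance cost: {proxy_cost:.3f}")
```

If the two costs are close, the proxy run is a cheap stand-in for the full run; how large the subsample must be for this to hold is the question the paper's guarantees address.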



Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Neural Information Processing Systems

Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by combining knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages.


Identifying interactions across brain areas while accounting for individual-neuron dynamics with a Transformer-based variational autoencoder

Neural Information Processing Systems

Advances in large-scale recording technologies now enable simultaneous measurements from multiple brain areas, offering new opportunities to study signal transmission across interacting components of neural circuits. However, neural responses exhibit substantial trial-to-trial variability, often driven by unobserved factors such as subtle changes in animal behavior or internal states. To prevent evolving background dynamics from contaminating identification of functional coupling, we developed a hybrid neural spike train model, GLM-Transformer, that incorporates flexible, deep latent variable models into a point process generalized linear model (GLM) having an interpretable component for cross-population interactions. A Transformer-based variational autoencoder captures nonstationary individual-neuron dynamics that vary across trials, while standard nonparametric regression GLM coupling terms provide estimates of directed interactions between neural populations. We incorporate a low-rank structure on population-to-population coupling effects to improve scalability. Across synthetic datasets and mechanistic simulations, GLM-Transformer recovers known coupling structure and remains robust to shared background fluctuations. When applied to the Allen Institute Visual Coding dataset, it identifies feedforward pathways consistent with established visual hierarchy. This work offers a step toward improved identification of neural population interactions, and contributes to ongoing efforts aimed at achieving interpretable results while harvesting the benefits of deep learning.


UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection

Neural Information Processing Systems

The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection.
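To make the IoU-based Average Precision metric concrete, here is a minimal sketch in the style of object-detection evaluation. It assumes each binding site is represented as a set of residue indices, that each prediction carries a confidence score, and that a prediction counts as correct when its IoU with a not-yet-matched ground-truth site reaches a threshold of 0.5. The function names, the threshold, and the greedy matching rule are assumptions made for illustration and are not taken from the UniSite evaluation protocol.

```python
def iou(a, b):
    """Intersection over Union of two residue-index sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def average_precision(preds, truths, thresh=0.5):
    """preds: list of (score, residue_set); truths: list of residue sets."""
    if not truths:
        return 0.0
    matched = set()  # indices of ground-truth sites already claimed
    hits = []        # one True/False per prediction, in descending score order
    for _, site in sorted(preds, key=lambda p: p[0], reverse=True):
        best, best_iou = None, thresh
        for j, truth in enumerate(truths):
            if j in matched:
                continue
            value = iou(site, truth)
            if value >= best_iou:
                best, best_iou = j, value
        if best is not None:
            matched.add(best)
        hits.append(best is not None)
    # Average the precision at each true-positive rank, normalized by the
    # number of ground-truth sites so that missed sites lower the score.
    total, true_pos = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_pos += 1
            total += true_pos / rank
    return total / len(truths)

# Hypothetical protein with two ground-truth sites and three predictions.
truths = [{1, 2, 3, 4}, {10, 11, 12}]
preds = [(0.9, {1, 2, 3}), (0.8, {20, 21}), (0.6, {10, 11, 12, 13})]
ap = average_precision(preds, truths)
print(round(ap, 4))  # 0.8333
```

In this example the first and third predictions each reach an IoU of 0.75 with a distinct ground-truth site, while the second overlaps nothing, so the score is (1/1 + 2/3) / 2. Unlike a per-residue segmentation score, this kind of metric penalizes both spurious sites and missed sites.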


InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras

Neural Information Processing Systems

Accurately tracking camera intrinsics is crucial for achieving 3D understanding from 2D video. However, most 3D algorithms assume that camera intrinsics stay constant throughout a video, which is often not true for real-world in-the-wild videos. A major obstacle in this field is a lack of dynamic camera intrinsics benchmarks: existing benchmarks typically offer limited diversity in scene content and intrinsics variation, and none provide per-frame intrinsic changes for consecutive video frames. In this paper, we present Intrinsics in Flux (InFlux), a real-world benchmark that provides per-frame ground truth intrinsics annotations for videos with dynamic intrinsics. Compared to prior benchmarks, InFlux captures a wider range of intrinsic variations and scene diversity, featuring 143K+ annotated frames from 386 high-resolution indoor and outdoor videos with dynamic camera intrinsics. To ensure accurate per-frame intrinsics, we build a comprehensive lookup table of calibration experiments and extend the Kalibr toolbox to improve its accuracy and robustness. Using our benchmark, we evaluate existing baseline methods for predicting camera intrinsics and find that most struggle to achieve accurate predictions on videos with dynamic intrinsics. For the dataset, code, videos, and submission, please visit https://influx.cs.princeton.edu/.


Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model (Supplementary Materials)

Neural Information Processing Systems

The data generation process includes situation sampling, long-form text generation, query generation for the long-form text, and QA generation. It is based on human observations of changes, object attributes, and allocentric object relationships in 3DSSG [9], as well as egocentric relationships between the human and the objects.

A.1 Situation Sampling

We follow the situation categories of MSQA [4], namely sitting, interacting, and standing, but with more detailed geometric analysis:

Sitting. The 28 seat categories in 3RScan [8] are grouped into four types: 3 large seats with backrests (e.g., sofa), 16 small seats with backrests (e.g., armchair), 1 large seat without a backrest (bed), and 8 small seats without backrests (e.g., beanbag). Seatable and backrest areas are classified by surface normals, or by nearby walls within 0.5 m if no backrest exists. For small seats, the seating point is the bounding box center, oriented away from the backrest. For large seats, we select a point with a backrest behind and open space (0.5-1 m) in front.



VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Neural Information Processing Systems

Vision-Language Models (VLMs) have achieved remarkable success in video understanding tasks. Yet, a key question remains: do they comprehend visual information, or merely learn superficial mappings between visual and textual patterns? Understanding visual cues, particularly those related to physics and common sense, is crucial for AI systems interacting with the physical world. However, existing VLM evaluations primarily rely on positive-control tests using real-world videos that resemble training distributions. While VLMs perform well on such benchmarks, it is unclear whether they grasp underlying visual and contextual signals or simply exploit visual-language correlations. To fill this gap, we propose incorporating negative-control tests, i.e., videos depicting physically impossible or logically inconsistent scenarios, and evaluating whether models can recognize these violations.


Diffusion StateSpaceDiffuser Ours

Neural Information Processing Systems

World models have recently gained prominence for action-conditioned visual prediction in complex environments. However, relying on only a few recent observations causes them to lose long-term context. Consequently, within a few steps, the generated scenes drift from what was previously observed, undermining temporal coherence. This limitation, common in state-of-the-art world models, which are diffusion-based, stems from the lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, where a diffusion model is enabled to perform long-context tasks by integrating features from a state-space model, representing the entire interaction history.