Goto

Collaborating Authors

 Industry


Sketch-Augmented Features Improve Learning Long-Range Dependencies in Graph Neural Networks

Neural Information Processing Systems

Graph Neural Networks learn on graph-structured data by iteratively aggregating local neighborhood information. While this local message passing paradigm imparts a powerful inductive bias and exploits graph sparsity, it also yields three key challenges: (i) oversquashing of long-range information, (ii) oversmoothing of node representations, and (iii) limited expressive power. In this work we inject randomized global embeddings of node features, which we term Sketched Random Features, into standard GNNs, enabling them to efficiently capture long-range dependencies. The embeddings are unique, distance-sensitive, and topology-agnostic--properties which we analytically and empirically show alleviate the aforementioned limitations when injected into GNNs. Experimental results on real-world graph learning tasks confirm that this strategy consistently improves performance over baseline GNNs, offering both a standalone solution and a complementary enhancement to existing techniques such as graph positional encodings.


SOFAR: Language-Grounded Orientation Bridges Spatial Reasoningand Object Manipulation

Neural Information Processing Systems

While spatial reasoning has made progress in object localization relationships, it often overlooks object orientation--a key factor in 6-DoF fine-grained manipulation. Traditional pose representations rely on pre-defined frames or templates, limiting generalization and semantic grounding. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a cup). To support this, we construct OrienText300K, a large-scale dataset of 3D objects annotated with semantic orientations, and develop PointSO, a general model for zero-shot semantic orientation prediction. By integrating semantic orientation into VLM agents, our SOFAR framework enables 6-DoF spatial reasoning and generates robotic actions. Extensive experiments demonstrated the effectiveness and generalization of our SOFAR, e.g., zero-shot 48.7% successful rate on Open6DOR and zero-shot 74.9% successful rate on SIMPLER-Env.


Learning Efficient Fuse-and-Refine for Feed-Forward 3DGaussian Splatting

Neural Information Processing Systems

Recent advances in feed-forward 3DGaussian Splatting have led to rapid improvements in efficient scene reconstruction from sparse views. However, most existing approaches construct Gaussian primitives directly aligned with the pixels in one or more of the input images. This leads to redundancies in the representation when input views overlap and constrains the position of the primitives to lie along the input rays without full flexibility in 3D space. Moreover, these pixel-aligned approaches do not naturally generalize to dynamic scenes, where effectively leveraging temporal information requires resolving both redundant and newly appearing content across frames. To address these limitations, we introduce a novel Fuseand-Refine module that enhances existing feed-forward models by merging and refining the primitives in a canonical 3D space.


Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment

Neural Information Processing Systems

Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time.


TimeEmb: ALightweight Static-Dynamic Disentanglement Framework for Time Series Forecasting

Neural Information Processing Systems

Temporal non-stationarity, the phenomenon that time series distributions change over time, poses fundamental challenges to reliable time series forecasting. Intuitively, the complex time series can be decomposed into two factors, i.e., timeinvariant and time-varying components, which indicate static and dynamic patterns, respectively. Nonetheless, existing methods often conflate the time-varying and time-invariant components, and jointly learn the combined long-term patterns and short-term fluctuations, leading to suboptimal performance facing distribution shifts. To address this issue, we initiatively propose a lightweight static-dynamic decomposition framework, TimeEmb, for time series forecasting. TimeEmb innovatively separates time series into two complementary components: (1) time-invariant component, captured by a novel global embedding module that learns persistent representations across time series, and (2) time-varying component, processed by an efficient frequency-domain filtering mechanism inspired by full-spectrum analysis in signal processing. Experiments on real-world datasets demonstrate that TimeEmb outperforms state-of-the-art baselines and requires fewer computational resources. We conduct comprehensive quantitative and qualitative analyses to verify the efficacy of static-dynamic disentanglement. This lightweight framework can also improve existing time-series forecasting methods with simple integration.


Visual Anagrams Reveal Hidden Differences in Holistic Shape Processing Across Vision Models

Neural Information Processing Systems

Humans are able to recognize objects based on both local texture cues and the configuration of object parts, yet contemporary vision models primarily harvest local texture cues, yielding brittle, non-compositional features. Work on shape-vstexture bias has pitted shape and texture representations in opposition, measuring shape relative to texture, ignoring the possibility that models (and humans) can simultaneously rely on both types of cues, and obscuring the absolute quality of both types of representation. We therefore recast shape evaluation as a matter of absolute configural competence, operationalized by the Configural Shape Score (CSS), which (i) measures the ability to recognize both images in Object-Anagram pairs that preserve local texture while permuting global part arrangement to depict different object categories. Across 86 convolutional, transformer, and hybrid models, CSS (ii) uncovers a broad spectrum of configural sensitivity with fully selfsupervised and language-aligned transformers - exemplified by DINOv2, SigLIP2 and EVA-CLIP - occupying the top end of the CSS spectrum. Mechanistic probes reveal that (iii) high-CSS networks depend on long-range interactions: radiuscontrolled attention masks abolish performance showing a distinctive U-shaped integration profile, and representational-similarity analyses expose a mid-depth transition from local to global coding. ABagNet control, whose receptive fields straddle patch seams, remains at chance (iv), ruling out any "border-hacking" strategies. Finally, (v) we show that configural shape score also predicts other shapedependent evals (e.g.,foreground bias, spectral and noise robustness). Overall, we propose that the path toward truly robust, generalizable, and human-like vision systems may not lie in forcing an artificial choice between shape and texture, but rather in architectural and learning frameworks that seamlessly integrate both local-texture and global configural shape. 1


MuRating: AHigh Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

Neural Information Processing Systems

Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English, neglecting other languages that are essential in the training mix for multilingual LLMs. We introduce MuRating, a scalable framework that transfers high-quality English dataquality signals into a multilingual autorater, capable of handling 17 languages. MuRating aggregates multiple English autoraters via pairwise comparisons to learn unified document quality scores, then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain LLaMA-architecture models of 1.2B and 7B parameters. Compared to strong baselines, including QuRater, FineWeb2HQ, AskLLM, DCLM, our approach increases average accuracy on both English benchmarks and multilingual evaluations. Extensive analyses further validate that pairwise training provides greater stability and robustness than pointwise scoring, underscoring the effectiveness of MuRating as a general multilingual data-selection framework.



Confusion-Driven Self-Supervised Progressively Weighted Ensemble Learning for Non-Exemplar Class Incremental Learning

Neural Information Processing Systems

Non-exemplar class incremental learning (NECIL) aims to continuously assimilate new knowledge while retaining previously acquired knowledge in scenarios where prior examples are unavailable.


VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Neural Information Processing Systems

Vision-Language Models (VLMs) have achieved remarkable success in video understanding tasks. Yet, a key question remains: do they comprehend visual information, or merely learn superficial mappings between visual and textual patterns? Understanding visual cues, particularly those related to physics and common sense, is crucial for AI systems interacting with the physical world. However, existing VLM evaluations primarily rely on positivecontrol tests using real-world videos that resemble training distributions. While VLMs perform well on such benchmarks, it is unclear whether they grasp underlying visual and contextual signals or simply exploit visual-language correlations. To fill this gap, we propose incorporating negative-control tests, i.e., videos depicting physically impossible or logically inconsistent scenarios, and evaluating whether models can recognize these violations.