Goto

Collaborating Authors

 Genre


Information-theoretic Generalization Analysis for VQ-VAEs: ARole of Latent Variables

Neural Information Processing Systems

Latent variables (LVs) play a crucial role in encoder-decoder models by enabling effective data compression, prediction, and generation. Although their theoretical properties, such as generalization, have been extensively studied in supervised learning, similar analyses for unsupervised models such as variational autoencoders (VAEs) remain insufficiently explored. In this work, we extend informationtheoretic generalization analysis to vector-quantized (VQ) VAEs with discrete latent spaces, introducing a novel data-dependent prior to rigorously analyze the relationship among LVs, generalization, and data generation. We derive a novel generalization error bound of the reconstruction loss of VQ-VAEs, which depends solely on the complexity of LVs and the encoder, independent of the decoder. Additionally, we provide the upper bound of the 2-Wasserstein distance between the distributions of the true data and the generated data, explaining how the regularization of the LVs contributes to the data generation performance.


Long term Consistent

Neural Information Processing Systems

World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WORLDMEM, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing state-aware memory attention that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps.


Towards Predicting Any Human Trajectory In Context

Neural Information Processing Systems

Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenariospecific data and performing fine-tuning via backpropagation. However, the need to fine-tune for each new scenario is often impractical for deployment on edge devices. To address this challenge, we introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables adaptation without fine-tuning on the scenario-specific data at inference time without requiring weight updates. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks.


Generating Creative Chess Puzzles

Neural Information Processing Systems

While Generative AI rapidly advances in various domains, generating truly creative, aesthetic, and counter-intuitive outputs remains a challenge. This paper presents an approach to tackle these difficulties in the domain of chess puzzles. We start by benchmarking Generative AI architectures, and then introduce an RL framework with novel rewards based on chess engine search statistics to overcome some of those shortcomings. The rewards are designed to enhance a puzzle's uniqueness, counter-intuitiveness, diversity, and realism. Our RL approach dramatically increases counter-intuitive puzzle generation by 10x, from 0.22% (supervised) to 2.5%, surpassing existing dataset rates (2.1%) and the best Lichess-trained model (0.4%).


Breaking the Order Barrier: Off-Policy Evaluation for Confounded POMDPs

Neural Information Processing Systems

We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs) with unobserved confounding. Recent advances have introduced bridge-function to circumvent unmeasured confounding and develop estimators for the policy value, yet the statistical error bounds of them related to the length of horizon T and the size of the state-action space |O||A| remain largely unexplored. In this paper, we systematically investigate the finite-sample error bounds of OPE estimators in finite-horizon tabular confounded POMDPs. Specifically, we show that under certain rank conditions, the estimation error for policy value can achieve a rate of O(T1.5/ n), excluding the cardinality of the observation space |O| and the action space |A|. With an additional mild condition on the concentrability coefficients in confounded POMDPs, the rate of estimation error can be improved to O(T/ n).


MEgo Hand Egocentric Hand Object Interaction Motion Generation

Neural Information Processing Systems

Egocentric hand-object motion generation is crucial for immersive AR/VR and robotic imitation but remains challenging due to unstable viewpoints, selfocclusions, perspective distortion, and noisy ego-motion. Existing methods rely on predefined 3D object priors, limiting generalization to novel objects, which restricts their generalizability to novel objects. Meanwhile, recent multimodal approaches suffer from ambiguous generation from abstract textual cues, intricate pipelines for modeling 3D hand-object correlation, and compounding errors in open-loop prediction. We propose MEgoHand, a multimodal framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and initial hand pose. MEgoHand introduces a bi-level architecture: a high-level "cerebrum" leverages a vision language model (VLM) to infer motion priors from visual-textual context and a monocular depth estimator for object-agnostic spatial reasoning, while a low-level DiT-based flow-matching policy generates fine-grained trajectories with temporal orthogonal filtering to enhance stability.


CoTRed-Handed: Stress Testing Chain-of-Thought Monitoring

Neural Information Processing Systems

As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that while CoT monitoring is more effective than overseeing only model outputs in scenarios where action-only monitoring fails to reliably identify sabotage, reasoning traces can contain misleading rationalizations that deceive the CoT monitors, reducing performance in obvious sabotage cases. To address this, we introduce a hybrid protocol that independently scores model reasoning and actions, and combines them using a weighted average. Our hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates twice higher than action-only monitoring for subtle deception scenarios.


Cost-Efficient LLMTraining with Lifetime-Aware Tensor Offloading via GPUDirect Storage

Neural Information Processing Systems

We present the design and implementation of a new lifetime-aware tensor offloading framework for GPU memory expansion using low-cost PCIe-based solid-state drives (SSDs). Our framework, TERAIO, is developed explicitly for large language model (LLM) training with multiple GPUs and multiple SSDs. Its design is driven by our observation that the active tensors take only a small fraction (1.7% on average) of allocated GPU memory in each LLM training iteration, the inactive tensors are usually large and will not be used for a long period of time, creating ample opportunities for offloading/prefetching tensors to/from slow SSDs without stalling the GPU training process. TERAIO accurately estimates the lifetime (active period of time in GPU memory) of each tensor with the profiling of the first few iterations in the training process. With the tensor lifetime analysis, TERAIO will generate an optimized tensor offloading/prefetching plan and integrate it into the compiled LLM program via PyTorch. TERAIO has a runtime tensor migration engine to execute the offloading/prefetching plan via GPUDirect storage, which allows direct tensor migration between GPUs and SSDs for alleviating the CPU bottleneck and maximizing the SSD bandwidth utilization. In comparison with state-of-the-art studies such as ZeRO-Offload and ZeRO-Infinity, we show that TERAIO improves the training performance of various LLMs by 1.47 on average, and achieves 80.7% of the ideal performance assuming unlimited GPU memory.


PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement

Neural Information Processing Systems

A penguin is standing on the lawn, with a giraffe behind it. A young man stands in front of the Statue of Liberty. A man in a tuxedo stands beside Tokyo Tower. A man is standing next to a traditional Japanese lantern. A woman is looking at a small, fluffy dog.


467655d26fcc207bca08915dc91964c6-Paper-Conference.pdf

Neural Information Processing Systems

World models are generative systems that learn to predict an environment in response to actions, making them well suited for simulating complex, interactive settings [28, 2, 30, 74, 90]. Video diffusion models [11, 37, 44, 79, 55] have emerged as a powerful approach to architecting world models, especially when used with autoregressive next-frame prediction [1, 12, 18, 22, 41, 53, 60, 65, 73, 81, 35]. Existing video generation models, however, often struggle with long-horizon consistency due to limited temporal context windows, frequently forgetting previously seen scenes during revisits. This is due to the relatively small number of previously generated context frames that the model can consider when generating new frames--a problem primarily caused by the quadratic growth of computational complexity in the attention module of the underlying diffusion transformers. To address this challenge, current world models simply keep the number of context frames low to maintain computational feasibility.