STSBENCH: A Large-Scale Dataset for Modeling Neuronal Activity in the Dorsal Stream of Primate Visual Cortex

Neural Information Processing Systems

The primate visual system is typically divided into two streams -- the ventral stream, responsible for object recognition, and the dorsal stream, responsible for encoding spatial relations and motion. Recent studies have shown that convolutional neural networks (CNNs) pretrained on object recognition tasks are remarkably effective at predicting neuronal responses in the ventral stream, shedding light on the neural mechanisms underlying object recognition. However, similar models of the dorsal stream remain underdeveloped due to the lack of large-scale datasets encompassing dorsal stream areas. To address this gap, we present STSBENCH, a dataset of large-scale, single-neuron recordings from over 2,000 neurons in the superior temporal sulcus (STS), a nearly 50-fold increase over existing dorsal stream datasets, collected while rhesus macaques viewed thousands of unique, natural videos. We show that our dataset can be used for benchmarking encoding models of dorsal stream neuronal responses and reconstructing visual input from neural activity.


VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

Neural Information Processing Systems

Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent vision-language-action models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of directly transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments' skills and handling novel objects. This dual-prediction strategy--forecasting both actions and their visual consequences--explores a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.


MTL-KD: Multi-Task Learning Via Knowledge Distillation for Generalizable Neural Vehicle Routing Solver

Neural Information Processing Systems

Multi-Task Learning (MTL) in Neural Combinatorial Optimization (NCO) is a promising approach for training a unified model capable of solving multiple Vehicle Routing Problem (VRP) variants. However, existing Reinforcement Learning (RL)-based multi-task methods can only train light decoder models on small-scale problems, exhibiting limited generalization ability when solving large-scale problems. To overcome this limitation, this work introduces a novel multi-task learning method driven by knowledge distillation (MTL-KD), which enables efficient training of heavy decoder models with strong generalization ability. The proposed MTL-KD method transfers policy knowledge from multiple distinct RL-based single-task models to a single heavy decoder model, facilitating label-free training and effectively improving the model's generalization ability across diverse tasks. In addition, we introduce a flexible inference strategy termed Random Reordering Re-Construction (R3C), which is specifically adapted for diverse VRP tasks and further boosts the performance of the multi-task model. Experimental results on 6 seen and 10 unseen VRP variants with up to 1,000 nodes indicate that our proposed method consistently achieves superior performance on both uniform and real-world benchmarks, demonstrating robust generalization abilities.


Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-based Decoding

Neural Information Processing Systems

Diffusion models excel at capturing the natural design spaces of images, molecules, and biological sequences. However, for many applications, rather than merely generating designs that are natural, we aim to optimize downstream reward functions while preserving the naturalness of these design spaces. Existing methods for achieving this goal often require "differentiable" proxy models (e.g., classifier guidance) or computationally expensive fine-tuning of diffusion models (e.g., classifier-free guidance, RL-based fine-tuning). Here, we propose a new method, Soft Value-based Decoding in Diffusion models (SVDD), to address these challenges. SVDD is an iterative sampling method that integrates soft value functions, which look ahead to how intermediate noisy states lead to high rewards in the future, into the standard inference procedure of pre-trained diffusion models. Notably, SVDD avoids fine-tuning generative models and eliminates the need to construct differentiable models. This enables us to (1) directly use non-differentiable features/reward feedback, commonly used in many scientific domains, and (2) apply our method to recent discrete diffusion models in a principled way. Finally, we demonstrate the effectiveness of SVDD across several domains, including image generation, molecule generation (optimization of docking scores, QED, SA), and DNA/RNA generation (optimization of activity levels). The code is available at https://github.com/masa-ue/SVDD.
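
As a rough illustration of the idea in this abstract, the sketch below shows one way value-guided sampling without gradients could look: at each step, draw several candidate next states from a proposal, score them with a value function, and resample one in proportion to its exponentiated value. This is a toy reading of the abstract, not the authors' implementation. The names `toy_propose`, `toy_value`, `value_guided_step`, `num_candidates`, and `alpha` are all hypothetical, and the one-dimensional "denoiser" is a stand-in for a pre-trained diffusion model.

```python
import math
import random


def value_guided_step(state, propose, value_fn, num_candidates=8, alpha=0.5, rng=random):
    """Draw candidates from the proposal, then resample one by soft value."""
    candidates = [propose(state, rng) for _ in range(num_candidates)]
    values = [value_fn(c) for c in candidates]
    # Subtract the max before exponentiating for numerical stability.
    v_max = max(values)
    weights = [math.exp((v - v_max) / alpha) for v in values]
    return rng.choices(candidates, weights=weights, k=1)[0]


def toy_propose(state, rng):
    # Hypothetical stand-in for one step of a pre-trained sampler.
    return 0.9 * state + rng.gauss(0.0, 0.3)


def toy_value(state):
    # Hypothetical reward proxy; only evaluated, never differentiated.
    return -abs(state - 2.0)


def sample(num_steps=20, seed=0, num_candidates=8, alpha=0.5):
    rng = random.Random(seed)
    state = rng.gauss(0.0, 1.0)
    for _ in range(num_steps):
        state = value_guided_step(
            state, toy_propose, toy_value,
            num_candidates=num_candidates, alpha=alpha, rng=rng,
        )
    return state


if __name__ == "__main__":
    guided = sum(sample(seed=s) for s in range(50)) / 50
    unguided = sum(sample(seed=s, num_candidates=1) for s in range(50)) / 50
    print(f"guided mean: {guided:.2f}, unguided mean: {unguided:.2f}")
```

Setting `num_candidates=1` recovers the unguided proposal, which makes the effect of the resampling step easy to compare. The value function is only called, never differentiated, which is the property the abstract emphasizes.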


894403f9604374a7a003063e480f65b9-Paper-Conference.pdf

Neural Information Processing Systems

Transformers have theoretical limitations in modeling certain sequence-to-sequence tasks, yet it remains largely unclear if these limitations play a role in large-scale pretrained LLMs, or whether LLMs might effectively overcome these constraints in practice due to the scale of both the models themselves and their pretraining data. We explore how these architectural constraints manifest after pretraining, by studying a family of retrieval and copying tasks inspired by Liu et al. [2024a]. We use a recently proposed framework for studying length generalization [Huang et al., 2025] to provide guarantees for each of our settings.


The Growing Political Power of Anti-Data Center Activists

TIME - Tech


GTA 6 - all you need to know about Rockstar's blockbuster game

BBC News

The latest instalment in Rockstar's blockbuster game franchise, Grand Theft Auto, is set to be the biggest games launch of the year. Details are still scant, although we do now know that GTA 6 will be available to pre-order on 25 June, the developer has announced. Analysts believe Rockstar's action adventure could become the most expensive game ever made, with estimates putting development costs at more than $1bn (£866m). We're still awaiting some crucial information about the game - but here's what we do and don't know about GTA 6 so far. When is GTA 6 coming out?


Uncover Governing Law of Pathology Propagation Mechanism Through A Mean-Field Game

Neural Information Processing Systems

Alzheimer's disease (AD) is marked by cognitive decline along with the spread of tau aggregates across the brain cortex. Due to the challenges of imaging pathology spreading flows in vivo, however, quantitative analysis of the cortical pathways of tau propagation and its interaction with the cascade of amyloid-beta (Aβ) plaques lags behind the experimental insights into the underlying pathophysiological mechanisms. To address this challenge, we present a physics-informed neural network, empowered by mean-field theory, to uncover the biologically meaningful spreading pathways of tau aggregates between two longitudinal snapshots. Following the notion of a 'prion-like' mechanism in AD, we first formulate the dynamics of tau propagation as a mean-field game (MFG), where the spread of tau aggregate at each location (aka.


VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Neural Information Processing Systems

Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token reduction, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving.