Genre
Sketched Adaptive Distributed Deep Learning: ASharp Convergence Analysis
Combining gradient compression with adaptive optimizers is a highly desirable goal in distributed learning, with potential benefits in both fewer communication rounds and less per-round communication. In spite of preliminary empirical promise, certain major challenges in the convergence analysis of such methods have stayed open: handling compression based approximation of both first and second moments (pre-conditioner) which appear as a ratio; avoiding dependence on the number of parameters, which is extremely large in modern deep models; and providing high-probability guarantees instead of in-expectation, which can hide high variance behavior. In this work, we introduce a family of Sketched Adaptive Distributed Learning (SADL) algorithms which can use suitable unbiased gradient sketching for compression with suitable adaptive optimization algorithms. As our main contribution, we provide theoretical convergence guarantees of SADL algorithms which addresses all of the existing challenges. In particular, our guarantees hold with high probability, picks up only a logarithmic dependence on the number of parameters, and the first and second moment approximation is handled precisely yielding a dependence on the intrinsic dimension of the loss Hessian, which is significantly smaller than the full dimensionality of deep learning models. Empirically, the SADL algorithms are shown to be competitive with and often outperform baselines on both vision and language tasks, in both supervised fine-tuning and training-from-scratch regimes. Further, the SADL algorithms are also competitive with the state-of-the-art communication-efficient distributed learning algorithms based on error feedback.
307f375e35616bbc2861033966b44976-Paper-Conference.pdf
W Structural pix idely el-wise adopted Similarity fidelity ev b aluation ut Inde often x Measure metrics fail to capture for and sparse-vie Peak the completeness Signal-to-Noise w CT reconstruction--such of critical Ratio--prioritize anatomical as structures, this limitation, particularly we propose small a or suite thin of re no gions vel anatomy-a that are easily ware missed.
RoME Domain Robust Mixture of Experts for Solution Prediction across Domains
Mixed-Integer Linear Programming (MILP) is a fundamental and powerful framework for modeling complex optimization problems across diverse domains. Recently, learning-based methods have shown great promise in accelerating MILP solvers by predicting high-quality solutions. However, most existing approaches are developed and evaluated in single-domain settings, limiting their ability to generalize to unseen problem distributions. This limitation poses a major obstacle to building scalable and general-purpose learning-based solvers. To address this challenge, we introduce RoME, a domain-Robust Mixture-of-Experts framework for predicting MILP solutions across domains.
30697d9ef8ce55de6ccc38e043a94142-Paper-Datasets_and_Benchmarks_Track.pdf
Polarization images provide rich physical information that is fundamentally absent from standard RGB images, benefiting a wide range of computer vision applications such as reflection separation and material classification. However, the acquisition of polarization images typically requires additional optical components, which increases both the cost and the complexity of the applications. To bridge this gap, we introduce a new task: RGB-to-polarization image estimation, which aims to infer polarization information directly from RGB images. In this work, we establish the first comprehensive benchmark for this task by leveraging existing polarization datasets and evaluating a diverse set of state-of-the-art deep learning models, including both restoration-oriented and generative architectures. Through extensive quantitative and qualitative analysis, our benchmark not only establishes the current performance ceiling of RGB-to-polarization estimation, but also systematically reveals the respective strengths and limitations of different model families -- such as direct reconstruction versus generative synthesis, and task-specific training versus large-scale pre-training. In addition, we provide some potential directions for future research on polarization estimation. This benchmark is intended to serve as a foundational resource to facilitate the design and evaluation of future methods for polarization estimation from standard RGB inputs.
FlashMo: Geometric Interpolants and Frequency-Aware Sparsity for Scalable Efficient Motion Generation
Notably, recent progress in text-to-motion generation, particularly with autoregressive [100, 60, 59, 30, 85, 39, 92] and diffusion models [70, 93, 7, 94, 44, 73], has enabled the synthesis of natural human motion from natural language. While VQ-VAE-based autoregressive methods achieve outstanding quantitative results, they generate less natural motion with jitters due to frame-wise noise arising from directly decoding discrete tokens, and fine-grained motion details are sometimes lost during token discretization [13]. In contrast, motion diffusion models generate smoother and more realistic human motion, showing a promising trend in human motion generation [73, 74, 95]. However, despite their strengths, diffusion-based approaches still face two significant challenges, collectively limiting their applicability in real-world scenarios.
Eulerian Neural Network Informed by Chemical Transport for Air Quality Forecasting
Air pollution remains one of the most critical environmental challenges globally, posing severe threats to public health, ecological sustainability, and climate governance. While existing physics-based and data-driven models have made progress in air quality forecasting, they often struggle to jointly capture the complex spatiotemporal dynamics and ensure spatial continuity of pollutant distributions. In this study, we introduce CTENet, a novel chemical transport deep learning model that embeds the Advection-Diffusion-Reaction equation into a Physics-Informed Neural Network (PINN) framework using an Eulerian representation to model the spatiotemporal evolution of pollutants. Extensive experiments on two realworld datasets demonstrate that CTENet consistently outperforms state-of-the-art (SOTA) baselines, achieving a remarkable RMSE improvement of 45.8% on the USA dataset and 21.0% on the China dataset.
Let the LLMStick to Its Strengths: Learning to Route Economical LLM
Recently, test-time scaling of Large Language Models (LLMs) has emerged as a practical alternative to parameter and data scaling. Reasoning tasks often require large-scale, RLVR-based LLMs, while more economical LLMs can handle simpler tasks. Routing an LLM tailored to suitability (i.e., capability and cost) ensures usability and efficiency. We introduce LLMRec, which routes the most suitable LLM to the user query without pre-inference on the candidate LLM zoo.
Transformers Learn Faster with Semantic Focus
Various forms of sparse attention have been explored to mitigate the quadratic computational and memory cost of the attention mechanism in transformers. We study sparse transformers not through a lens of efficiency but rather in terms of learnability and generalization. Empirically studying a range of attention mechanisms, we find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models, while input-agnostic sparse attention models show no such benefits - a phenomenon that is robust across architectural and optimization hyperparameter choices. This can be interpreted as demonstrating that concentrating a model's "semantic focus" with respect to the tokens currently being considered (in the form of input-dependent sparse attention) accelerates learning. We develop a theoretical characterization of the conditions that explain this behavior. We establish a connection between the stability of the standard softmax and the loss function's Lipschitz properties, then show how sparsity affects the stability of the softmax and the subsequent convergence and generalization guarantees resulting from the attention mechanism. This allows us to theoretically establish that input-agnostic sparse attention does not provide any benefits. We also characterize conditions when semantic focus (input-dependent sparse attention) can provide improved guarantees, and we validate that these conditions are in fact met in our empirical evaluations.
FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation
Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling, enabling both accurate content understanding and flexible editing. However, current approaches treat "what to see" and "how to edit" separately: they either perform isolated object segmentation or utilize segmentation masks merely as conditional prompts for local edit generation tasks, often relying on multiple disjointed models. To bridge these gaps, we introduce FOCUS, a unified LVLM that integrates segmentationaware perception and controllable object-centric generation within an end-to-end framework. FOCUS employs a dual-branch visual encoder to simultaneously capture global semantic context and fine-grained spatial details. In addition, we leverage a MoVQGAN-based visual tokenizer to produce discrete visual tokens that enhance generation quality. To enable accurate and controllable image editing, we propose a progressive multi-stage training pipeline, where segmentation masks are jointly optimized and used as spatial condition prompts to guide the diffusion decoder.
Real-Time Execution of Action Chunking Flow Policies
Modern AI systems, especially those interacting with the physical world, increasingly require real-time performance. However, the high latency of state-of-the-art generalist models, including recent vision-language-action models (VLAs), poses a significant challenge. While action chunking has enabled temporal consistency in high-frequency control tasks, it does not fully address the latency problem, leading to pauses or out-of-distribution jerky movements at chunk boundaries. This paper presents a novel inference-time algorithm that enables smooth asynchronous execution of action chunking policies. Our method, real-time chunking (RTC), is applicable to any diffusion-or flow-based VLA out of the box with no re-training. It generates the next action chunk while executing the current one, "freezing" actions guaranteed to execute and "inpainting" the rest. To test RTC, we introduce a new benchmark of 12 highly dynamic tasks in the Kinetix simulator, as well as evaluate 6 challenging real-world bimanual manipulation tasks. Results demonstrate that RTC is fast, performant, and uniquely robust to inference delay, significantly improving task throughput and enabling high success rates in precise tasks--such as lighting a match--even in the presence of significant latency.