Genre
On the Convergence of Single-Timescale Actor-Critic
We analyze the global convergence of the single-timescale actor-critic (AC) algorithm for the infinite-horizon discounted Markov Decision Processes (MDPs) with finite state spaces. To this end, we introduce an elegant analytical framework for handling complex, coupled recursions inherent in the algorithm. Leveraging this framework, we establish that the algorithm converges to an ฯต-close globally optimal policy with a sample complexity of O(ฯต 3). This significantly improves upon the existing complexity of O(ฯต 2)to achieve ฯต-close stationary policy, which is equivalent to the complexity of O(ฯต 4)to achieve ฯต-close globally optimal policy using gradient domination lemma.
GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation
Retrieval-augmented generation (RAG) has proven effective in integrating knowledge into large language models (LLMs). However, conventional RAGs struggle to capture complex relationships between pieces of knowledge, limiting their performance in intricate reasoning that requires integrating knowledge from multiple sources. Recently, graph-enhanced retrieval augmented generation (GraphRAG) builds graph structure to explicitly model these relationships, enabling more effective and efficient retrievers. Nevertheless, its performance is still hindered by the noise and incompleteness within the graph structure. To address this, we introduce GFM-RAG, a novel graph foundation model (GFM) for retrieval augmented generation. GFM-RAG is powered by an innovative graph neural network that reasons over graph structure to capture complex query-knowledge relationships.
PlanarGS: High-Fidelity Indoor 3DGaussian Splatting Guided by Vision-Language Planar Priors
Three-dimensional Gaussian Splatting (3DGS) has recently emerged as an efficient representation for novel-view synthesis, achieving impressive visual quality. However, in scenes dominated by large and low-texture regions, common in indoor environments, the photometric loss used to optimize 3DGS yields ambiguous geometry and fails to recover high-fidelity 3D surfaces. To overcome this limitation, we introduce PlanarGS, a 3DGS-based framework tailored for indoor scene reconstruction. Specifically, we design a pipeline for Language-Prompted Planar Priors (LP3) that employs a pretrained vision-language segmentation model and refines its region proposals via cross-view fusion and inspection with geometric priors.
AREAL: ALarge-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Reinforcement learning (RL) has become a trending paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous by alternating generation and training in a batch setting, where the rollouts in each training batch are generated by the same (or latest) model. This stabilizes RL training but suffers from severe system-level inefficiency. Generation must wait until the longest output in the batch is completed before model update, resulting in GPU underutilization.
VideoCAD: ADataset and Model for Learning Long-Horizon 3DCADUIInteractions from Video
Computer-Aided Design (CAD) is a time-consuming and complex process, requiring precise, long-horizon user interactions with intricate 3D interfaces. While recent advances in AI-driven user interface (UI) agents show promise, most existing datasets and methods focus on short, low-complexity tasks in mobile or web applications, failing to capture the demands of professional engineering tools. In this work, we introduce VideoCAD, the first attempt to model UI interactions for precision engineering tasks. Specifically, VIDEOCAD is a large-scale synthetic dataset consisting of over 41K annotated video recordings of CAD operations, generated using an automated framework for collecting high-fidelity UI action data from human-made CAD designs. Compared to existing datasets, VIDEOCAD offers an order-of-magnitude increase in complexity for real-world engineering UI tasks, with time horizons up to 20 longer than those in other datasets. We show two important downstream applications of VIDEOCAD: (1) learning UI interactions from professional 3DCAD tools for precision tasks and (2) a visual question-answering (VQA) benchmark designed to evaluate multimodal large language models (LLMs) on spatial reasoning and video understanding. To learn the UI interactions, we propose VIDEOCADFORMER, a state-of-the-art model for learning CAD interactions directly from video, which outperforms existing behavior cloning baselines. Both VIDEOCADFORMER and the VQA benchmark derived from VIDEOCAD reveal key challenges in the current state of video-based UI understanding, including the need for precise action grounding, multi-modal and spatial reasoning, and long-horizon dependencies.
Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation
We present MGAudio, a novel flow-based framework for open-domain video-toaudio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifierfree guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudioachieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics.
Exploiting LLMs for Automatic Hypothesis Assessment via a Based Calibrated Prior
As hypothesis generation becomes increasingly automated, a new bottleneck has emerged: hypothesis assessment. Modern systems can surface thousands of statistical relationships-correlations, trends, causal links-but offer little guidance on which ones are novel, non-trivial, or worthy of expert attention. In this work, we study the complementary problem to hypothesis generation: automatic hypothesis assessment. Specifically, we ask-given a large set of statistical relationships, can we automatically assess which ones are novel and worth further exploration? We focus on correlations as they are a common entry point in exploratory data analysis that often serve as the basis for forming deeper scientific or causal hypotheses.
Mixture of Scope Experts at Test Generalizing Deeper Graph Neural Networks with Shallow Variants
Heterophilous graphs, where dissimilar nodes tend to connect, pose a challenge for graph neural networks (GNNs). Increasing the GNN depth can expand the scope (i.e., receptive field), potentially finding homophily from the higher-order neighborhoods. However, GNNs suffer from performance degradation as depth increases. Despite having better expressivity, state-of-the-art deeper GNNs achieve only marginal improvements compared to their shallow variants. Through theoretical and empirical analysis, we systematically demonstrate a shift in GNN generalization preferences across nodes with different homophily levels as depth increases. This creates a disparity in generalization patterns between GNN models with varying depth. Based on these findings, we propose to improve deeper GNN generalization while maintaining high expressivity by Mixture of scope experts at test (Moscat). Experimental results show that Moscat works flexibly with various GNNs across a wide range of datasets while significantly improving accuracy. Our code is available at https://github.com/Hydrapse/moscat.
Self-Verification Provably Prevents Model Collapse in Recursive Synthetic Training
Large generative models are increasingly trained on synthetic data from earlier generations, raising concerns about model collapse, a progressive performance decline consistently observed in empirical studies. However, theoretical understanding of recursive training dynamics and their failure modes remains limited. In this work, we theoretically show that recursive training inherently leads to exponential error growth unless mitigated by sufficient real data. Addressing the growing scarcity of real data, we introduce a self-verification mechanism enabling models to filter their outputs based on internal confidence scores without external validation. Through rigorous analysis, we derive finite-sample error bounds demonstrating that self-verification alone can prevent collapse, even in fully synthetic training regimes. Our theoretical framework extends to large language models (LLMs), characterizing the conditions under which recursive training can maintain stability without performance degradation.
A natural photo, the background is beach, sand, treeThe ruins of a terrible war, 4kIn the field, strong afternoon light, 4k
We introduce a model named DreamLight for universal image relighting in this work, which can seamlessly composite subjects into a new background while maintaining aesthetic uniformity in terms of lighting and color tone. The background can be specified by natural images (image-based relighting) or generated from unlimited text prompts (text-based relighting). Existing studies primarily focus on image-based relighting, while with scant exploration into text-based scenarios. Some works employ intricate disentanglement pipeline designs relying on environment maps to provide relevant information, which grapples with the expensive data cost required for intrinsic decomposition and light source. Other methods take this task as an image translation problem and perform pixel-level transformation with autoencoder architecture.