AITopics | Genre

Collaborating Authors

Genre

AGUVIS-7BUI-TARS-7BOS-Atlas-7BUGround-7BSeeClick+VerifierGUI-Actor-7BUI-TARS-2BGUI-Actor-2BShowUI-2BAriaUI-3.9BUGround-2B+Verifier

Neural Information Processing SystemsJun-15-2026, 02:34:04 GMT

One of the principal challenges in building VLM-powered GUI agents is visual grounding--localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment due to lack of explicit spatial supervision; inability to handle ambiguous supervision targets, as singlepoint predictions penalize valid variations; and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actorintroduces an attention-based action head that learns to align a dedicated token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actoroutperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B achieves scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones, outperforming UI-TARS72B (38.1) on ScreenSpot-Pro, with significantly fewer parameters and training data. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head ( 100M parameters for 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.

arxiv preprint arxiv, large language model, machine learning, (21 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

Bridging Human and LLMJudgments: Understanding and Narrowing the Gap

Neural Information Processing SystemsJun-15-2026, 02:32:37 GMT

Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments.

density 0, large language model, machine learning, (19 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Add feedback

Señorita-2M: AHigh-Quality Instruction-based Dataset for General Video Editing by Video Specialists

Neural Information Processing SystemsJun-15-2026, 02:32:23 GMT

Video content editing has a wide range of applications. With the advancement of diffusion-based generative models, video editing techniques have made remarkable progress, yet they still remain far from practical usability. Existing inversion-based video editing methods are time-consuming and struggle to maintain consistency in unedited regions. Although instruction-based methods have high theoretical potential, they face significant challenges in constructing high-quality training datasets - current datasets suffer from issues such as editing correctness, frame consistency, and sample diversity. To bridge these gaps, we introduce the Señorita2M dataset, a large-scale, diverse, and high-quality video editing dataset.

art style, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry:

Leisure & Entertainment (1.00)
Media > Photography (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

Sampling from multi-modal distributions with polynomial query complexity in fixed dimension via reverse diffusion

Neural Information Processing SystemsJun-15-2026, 02:31:11 GMT

Even in low dimensions, sampling from multi-modal distributions is challenging. We provide the first sampling algorithm for a broad class of distributions -- including all Gaussian mixtures -- with a query complexity that is polynomial in the parameters governing multi-modality, assuming fixed dimension. Our sampling algorithm simulates a time-reversed diffusion process, using a self-normalized Monte Carlo estimator of the intermediate score functions. Unlike previous works, it avoids metastability, requires no prior knowledge of the mode locations, and relaxes the well-known log-smoothness assumption which excluded general Gaussian mixtures so far.

log pvt, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

AMinimalist Example of Edge-of-Stability and Progressive Sharpening

Neural Information Processing SystemsJun-15-2026, 02:18:11 GMT

Recent advances in deep learning optimization have unveiled two intriguing phenomena under large learning rates: Edge of Stability (EoS) and Progressive Sharpening (PS), challenging classical Gradient Descent (GD) analyses.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

HyGen: Efficient LLMServing via Elastic Online-Offline Request Co-location

Neural Information Processing SystemsJun-15-2026, 02:17:52 GMT

Large language models (LLMs) have facilitated a wide range of applications with distinct service-level objectives (SLOs), from latency-sensitive online tasks like interactive chatbots to throughput-oriented offline workloads like data synthesis. The existing deployment model, which dedicates machines to each workload, simplifies SLO management but often leads to poor resource utilization. This paper introduces HyGen, an interference-aware LLM serving system that enables efficient co-location of online and offline workloads while preserving SLOs. HyGen incorporates two key innovations: (1) performance control mechanisms, including a latency predictor to estimate batch execution time and an SLO-aware profiler to quantify latency interference, and (2) SLO-aware offline scheduling policies that maximize serving throughput and prevent starvation. Our evaluation on production workloads shows that HyGen achieves up to 3.9-5.8

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States (0.67)
Europe (0.46)

Genre: Research Report > Experimental Study (1.00)

Industry: Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

On Local Limits of Sparse Random Graphs: Color Convergence and the Refined Configuration Model

Neural Information Processing SystemsJun-15-2026, 02:17:10 GMT

Local convergence has emerged as a fundamental tool for analyzing sparse random graph models. We introduce a new notion of local convergence, color convergence, based on the Weisfeiler-Leman algorithm. Color convergence fully characterizes the class of random graphs that are well-behaved in the limit for message-passing graph neural networks. Building on this, we propose the Refined Configuration Model (RCM), a random graph model that generalizes the configuration model. The RCM is universal with respect to local convergence among locally tree-like random graph models, including Erd os-Rényi, stochastic block and configuration models. Finally, this framework enables a complete characterization of the random trees that arise as local limits of such graphs.

artificial intelligence, graph, machine learning, (18 more...)

Neural Information Processing Systems

Country: Europe > Austria (0.28)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.89)

Add feedback

Scale-invariant attention

Neural Information Processing SystemsJun-15-2026, 02:16:47 GMT

One persistent challenge in LLM research is the development of attention mechanisms that are able to generalise from training on shorter contexts to inference on longer contexts. We propose two conditions that we expect all effective longcontext attention mechanisms to have: scale-invariant total attention, and scaleinvariant attention sparsity. Under a Gaussian assumption, we show that a simple position-dependent transformation of the attention logits is sufficient for these conditions to hold. Experimentally we find that the resulting scale-invariant attention scheme gives considerable benefits in terms of validation loss when zero-shot generalising from training on short contexts to validation on longer contexts, and is effective at long-context retrieval.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

15bbe6ddfc88d8e7f59c8f7d4e2541f5-Paper-Conference.pdf

Neural Information Processing SystemsJun-15-2026, 02:07:21 GMT

In concept erasure, a model is modified to selectively prevent it from generating a target concept. Despite the rapid development of new methods, it remains unclear how thoroughly these approaches remove the target concept from the model. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) interfering with the model's internal guidance processes, and (ii) reducing the unconditional likelihood of generating the target concept, potentially removing it entirely. To assess whether a concept has been truly erased from the model, we introduce a comprehensive suite of independent probing techniques: supplying visual context, modifying the diffusion trajectory, applying classifier guidance, and analyzing the model's alternative generations that emerge in place of the erased concept. Our results shed light on the value of exploring concept erasure robustness outside of adversarial text inputs, and emphasize the importance of comprehensive evaluations for erasure in diffusion models1.

artificial intelligence, arxivpreprintarxiv, machine learning, (16 more...)

Neural Information Processing Systems

Country: Europe > Austria > Vienna (0.14)

Genre: Research Report (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.34)

Add feedback

Unifying Symbolic Music Arrangement: Track-Aware Reconstruction and Structured Tokenization

Neural Information Processing SystemsJun-15-2026, 02:06:54 GMT

We present a unified framework for automatic multitrack music arrangement that enables a single pre-trained symbolic music model to handle diverse arrangement scenarios, including reinterpretation, simplification, and additive generation. At its core is a segment-level reconstruction objective operating on token-level disentangled content and style, allowing for flexible any-to-any instrumentation transformations at inference time. To support track-wise modeling, we introduce REMI-z, a structured tokenization scheme for multitrack symbolic music that enhances modeling efficiency and effectiveness for both arrangement tasks and unconditional generation. Our method outperforms task-specific state-of-the-art models on representative tasks in different arrangement scenarios--band arrangement, piano reduction, and drum arrangement, in both objective metrics and perceptual evaluations. Taken together, our framework demonstrates strong generality and suggests broader applicability in symbolic music-to-music transformation.1

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country: North America > United States > California (0.28)

Genre: