AITopics | current screen

current screen

GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior

Penghao Wu, Shengnan Ma, Bo Wang, Jiaheng Yu, Lewei Lu, Ziwei Liu

Neural Information Processing Systems, Jun-19-2026, 21:27:51 GMT

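Multimodal Large Language Models (MLLMs) have shown great potential in revolutionizing Graphical User Interface (GUI) automation. However, existing GUI models mostly rely on learning from nearly error-free offline trajectories, thus lacking reflection and error recovery capabilities.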

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Genre:

Workflow (1.00)
Research Report > Experimental Study (0.46)

Industry: Education (0.94)

Technology:

Information Technology > Graphics (1.00)
Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior

Wu, Penghao, Ma, Shengnan, Wang, Bo, Yu, Jiaheng, Lu, Lewei, Liu, Ziwei

arXiv.org Artificial Intelligence, Jun-10-2025

Multimodal Large Language Models (MLLMs) have shown great potential in revolutionizing Graphical User Interface (GUI) automation. However, existing GUI models mostly rely on learning from nearly error-free offline trajectories, thus lacking reflection and error recovery capabilities. To bridge this gap, we propose GUI-Reflection, a novel framework that explicitly integrates self-reflection and error correction capabilities into end-to-end multimodal GUI models throughout dedicated training stages: GUI-specific pre-training, offline supervised fine-tuning (SFT), and online reflection tuning. GUI-Reflection enables self-reflection behavior emergence with fully automated data generation and learning processes without requiring any human annotation. Specifically, 1) we first propose scalable data pipelines to automatically construct reflection and error correction data from existing successful trajectories. While existing GUI models mainly focus on grounding and UI understanding ability, we propose the GUI-Reflection Task Suite to learn and evaluate reflection-oriented abilities explicitly. 2) Furthermore, we built a diverse and efficient environment for online training and data collection of GUI models on mobile devices. 3) We also present an iterative online reflection tuning algorithm leveraging the proposed environment, enabling the model to continuously enhance its reflection and error correction abilities. Our framework equips GUI agents with self-reflection and correction capabilities, paving the way for more robust, adaptable, and intelligent GUI automation, with all data, models, environments, and tools to be released publicly.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2506.08012

Genre:

Workflow (1.00)
Research Report (0.81)

Industry: Education > Educational Setting > Online (0.69)

Technology:

Information Technology > Graphics (1.00)
Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Bonatti, Rogerio, Zhao, Dan, Bonacci, Francesco, Dupont, Dillon, Abdali, Sara, Li, Yinheng, Lu, Yadong, Wagle, Justin, Koishida, Kazuhito, Bucker, Arthur, Jang, Lawrence, Hui, Zack

arXiv.org Artificial Intelligence, Sep-13-2024

Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on the order of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena. Webpage: https://microsoft.github.io/WindowsAgentArena Code: https://github.com/microsoft/WindowsAgentArena

agent, computer, window, (14 more...)

arXiv.org Artificial Intelligence

2409.08264

Country:

Europe > Sweden > Stockholm > Stockholm (0.04)
Asia > India > Maharashtra > Mumbai (0.04)
North America > United States > Michigan > Jackson County > Jackson (0.04)
(3 more...)

Genre:

Research Report (0.82)
Workflow (0.68)

Industry:

Information Technology > Software (0.67)
Health & Medicine > Therapeutic Area (0.46)
Leisure & Entertainment > Sports > Baseball (0.45)
Health & Medicine > Consumer Health (0.45)

Technology:

Information Technology > Software (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(4 more...)

Latent State Estimation Helps UI Agents to Reason

Bishop, William E, Li, Alice, Rawles, Christopher, Riva, Oriana

arXiv.org Artificial Intelligence, May-17-2024

A common problem for agents operating in real-world environments is that the response of an environment to their actions may be non-deterministic and observed through noise. This renders environmental state and progress towards completing a task latent. Despite recent impressive demonstrations of LLMs' reasoning abilities on various benchmarks, whether LLMs can build estimates of latent state and leverage them for reasoning has not been explicitly studied. We investigate this problem in the real-world domain of autonomous UI agents. We establish that appropriately prompting LLMs in a zero-shot manner can be formally understood as forming point estimates of latent state in a textual space. In the context of autonomous UI agents we then show that LLMs used in this manner are more than 76% accurate at inferring various aspects of latent state, such as performed (vs. commanded) actions and task progression. Using both public and internal benchmarks and three reasoning methods (zero-shot, CoT-SC & ReAct), we show that LLM-powered agents that explicitly estimate and reason about latent state are able to successfully complete up to 1.6x more tasks than those that do not.

agent, current screen, latent state, (14 more...)

arXiv.org Artificial Intelligence

2405.11120

Country: Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)

Genre:

Workflow (1.00)
Research Report > New Finding (0.46)

Industry:

Health & Medicine (0.67)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)
