cog
Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback
Li, Aiden Yiliu, Yu, Bizhi, Lei, Daoan, Ren, Tianhe, Liu, Shilong
GUI grounding aims to align natural-language instructions with precise regions in complex user interfaces (UIs). While advanced MLLMs have demonstrated strong capabilities in visual GUI grounding, they still struggle with small or visually similar targets and with ambiguity in real-world layouts. We argue that these limitations stem not only from the models' inherent grounding capacity, but also from an overlooked underutilization of their existing reasoning potential. To address this, we present Chain-of-Ground (CoG), a training-free multi-step grounding framework that leverages MLLMs for iterative visual reasoning and refinement. Instead of relying on direct prediction, Chain-of-Ground enables the model to progressively reflect and adjust its hypotheses, achieving more accurate and interpretable localization. Our approach establishes a new state of the art on the ScreenSpot-Pro benchmark with 68.4% accuracy, surpassing the previous best by 4.8%. To evaluate real-world generalization, we introduce TPanel-UI, a dataset of 420 labeled industrial control panels featuring visual distortions such as blur and masking to test robustness. On TPanel-UI, Chain-of-Ground outperforms the SOTA MLLM Qwen3-VL-235B by 6.9%, demonstrating the effectiveness of multi-step, training-free grounding across real-world and digital interfaces. Together, these results point to a new direction for unlocking MLLMs' grounding potential, through structured, iterative refinement rather than additional training.
- Workflow (0.93)
- Research Report > New Finding (0.67)
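The abstract describes iterative, training-free grounding without spelling out the loop. A minimal sketch of such a zoom-and-relocalize loop is below; the `locate` callback, the zoom factor, and the fixed step count are illustrative assumptions, not the paper's exact procedure.

```python
def refine_ground(locate, image_size, steps=3, zoom=0.5):
    """Iterative grounding sketch: start from the full screen, ask the
    localizer for a box, then zoom into a window around the hypothesis
    and ask again, mapping each local answer back to global coordinates.
    `locate(roi)` is a hypothetical stand-in for an MLLM grounding call
    that returns a box relative to the cropped region."""
    W, H = image_size
    roi = (0, 0, W, H)  # current region of interest, global coords
    box = None
    for _ in range(steps):
        x0, y0, x1, y1 = roi
        bx0, by0, bx1, by1 = locate(roi)  # answer relative to the crop
        box = (x0 + bx0, y0 + by0, x0 + bx1, y0 + by1)
        # shrink the region of interest around the current hypothesis
        cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        w = max((x1 - x0) * zoom, box[2] - box[0])
        h = max((y1 - y0) * zoom, box[3] - box[1])
        roi = (max(0, cx - w / 2), max(0, cy - h / 2),
               min(W, cx + w / 2), min(H, cy + h / 2))
    return box

def mock_locate(roi):
    # toy localizer that always finds a fixed target; stands in for
    # the model so the loop can be exercised without an MLLM
    tx0, ty0, tx1, ty1 = 800, 400, 840, 420
    x0, y0, _, _ = roi
    return (tx0 - x0, ty0 - y0, tx1 - x0, ty1 - y0)
```

With a consistent localizer the loop converges on the same global box regardless of how far the crop has zoomed in.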
Chain-of-Generation: Progressive Latent Diffusion for Text-Guided Molecular Design
Li, Lingxiao, Zhang, Haobo, Chen, Bin, Zhou, Jiayu
Text-conditioned molecular generation aims to translate natural-language descriptions into chemical structures, enabling scientists to specify functional groups, scaffolds, and physicochemical constraints without handcrafted rules. Diffusion-based models, particularly latent diffusion models (LDMs), have recently shown promise by performing stochastic search in a continuous latent space that compactly captures molecular semantics. Yet existing methods rely on one-shot conditioning, where the entire prompt is encoded once and applied throughout diffusion, making it hard to satisfy all the requirements in the prompt. We discuss three outstanding challenges of one-shot conditioning: poor interpretability of the generated components, failure to generate all substructures, and the overambition of considering all requirements simultaneously. We then propose three principles to address these challenges and, building on them, introduce Chain-of-Generation (CoG), a training-free multi-stage latent diffusion framework. CoG decomposes each prompt into curriculum-ordered semantic segments and progressively incorporates them as intermediate goals, guiding the denoising trajectory toward molecules that satisfy increasingly rich linguistic constraints. To reinforce semantic guidance, we further introduce a post-alignment learning phase that strengthens the correspondence between textual and molecular latent spaces. Extensive experiments on benchmark and real-world tasks demonstrate that CoG yields higher semantic alignment, diversity, and controllability than one-shot baselines, producing molecules that more faithfully reflect complex, compositional prompts while offering transparent insight into the generation process.
- North America > United States > Michigan (0.04)
- Europe > Greece (0.04)
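The curriculum-ordered conditioning idea can be sketched with a toy scalar latent: the denoising budget is split across prompt segments, and stage k guides the latent toward the joint embedding of the first k+1 segments rather than conditioning on the whole prompt at once. The 1-D latent, the `embed` callback, and the relaxation update below are illustrative stand-ins, not the paper's diffusion machinery.

```python
def chain_of_generation(segments, embed, total_steps=90, lr=0.2):
    """Curriculum-ordered conditioning sketch: each stage denoises
    toward the mean embedding of the segment prefix seen so far.
    `embed` is a hypothetical text-to-latent encoder; a real LDM
    would use tensors and a learned denoiser instead of this
    exponential relaxation toward the stage goal."""
    latent = 0.0  # toy 1-D latent
    per_stage = total_steps // len(segments)
    for k in range(len(segments)):
        # cumulative goal: mean embedding of the active prefix
        goal = sum(embed(s) for s in segments[:k + 1]) / (k + 1)
        for _ in range(per_stage):
            latent += lr * (goal - latent)  # stand-in denoise update
    return latent
```

Because later stages include all earlier segments in the goal, the final latent settles near the joint target for the full prompt.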
Composition-Grounded Instruction Synthesis for Visual Reasoning
Gu, Xinyi, Mao, Jiayuan, Hong, Zhang-Wei, Yu, Zhuoran, Li, Pengyuan, Joshi, Dhiraj, Feris, Rogerio, He, Zexue
Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Dominican Republic (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
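The factor recomposition step can be sketched as a cross product of perception factors, reasoning factors, and new images; the template strings, field names, and subquestion wording below are illustrative assumptions, not the paper's actual synthesis pipeline.

```python
import itertools

def recompose(perception_factors, reasoning_factors, images):
    """COGS-style recomposition sketch: each synthetic question pairs
    one perception factor (what to read off the image) with one
    reasoning factor (what to compute over it), instantiated on a new
    image. Subquestions are kept alongside each question so that
    factor-level process rewards can be computed during RL."""
    out = []
    for img, p, r in itertools.product(images, perception_factors,
                                       reasoning_factors):
        out.append({
            "image": img,
            "question": r.format(value=p),
            "subquestions": [f"What is {p} in the chart?",
                             r.format(value=p)],
        })
    return out
```

A handful of seed factors therefore fans out multiplicatively: P perception factors, R reasoning factors, and N images yield P x R x N synthetic pairs.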
Generative Modeling for Robust Deep Reinforcement Learning on the Traveling Salesman Problem
Li, Michael, Bae, Eric, Haberland, Christopher, Jaques, Natasha
The Traveling Salesman Problem (TSP) is a classic NP-hard combinatorial optimization task with numerous practical applications. Classic heuristic solvers can attain near-optimal performance for small problem instances, but become computationally intractable for larger problems. Real-world logistics problems such as dynamically re-routing last-mile deliveries demand a solver with fast inference time, which has led researchers to investigate specialized neural network solvers. However, neural networks struggle to generalize beyond the synthetic data they were trained on. In particular, we show that there exist TSP distributions that are realistic in practice, which also consistently lead to poor worst-case performance for existing neural approaches. To address this issue of distribution robustness, we present Combinatorial Optimization with Generative Sampling (COGS), where training data is sampled from a generative TSP model. We show that COGS provides better data coverage and interpolation in the space of TSP training distributions. We also present TSPLib50, a dataset of realistically distributed TSP samples, which tests real-world generalization ability without conflating this issue with instance size. We evaluate our method on various synthetic datasets as well as TSPLib50, and compare to state-of-the-art neural baselines. We demonstrate that COGS improves distribution robustness, with most performance gains coming from worst-case scenarios.
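The distribution-coverage idea can be illustrated with a toy sampler that interpolates between a uniform city distribution and a clustered one via a mixture weight; the paper uses a learned generative TSP model, so the cluster count, spread, and mixing scheme below are purely illustrative.

```python
import random

def sample_instance(n, mix, seed=None):
    """Toy stand-in for COGS's generative sampler: draw each city from
    a clustered component with probability `mix`, otherwise uniformly,
    giving training coverage between the two regimes. Coordinates are
    clamped to the unit square."""
    rng = random.Random(seed)
    centers = [(rng.random(), rng.random()) for _ in range(3)]
    cities = []
    for _ in range(n):
        if rng.random() < mix:  # clustered component
            cx, cy = rng.choice(centers)
            cities.append((min(1, max(0, cx + rng.gauss(0, 0.05))),
                           min(1, max(0, cy + rng.gauss(0, 0.05)))))
        else:  # uniform component
            cities.append((rng.random(), rng.random()))
    return cities
```

Sweeping `mix` over [0, 1] during training exposes the solver to instances between the purely uniform and purely clustered extremes.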
Foundation Model-Driven Grasping of Unknown Objects via Center of Gravity Estimation
Xiangli, Kang, He, Yage, Gong, Xianwu, Liu, Zehan, Bai, Yuru
This study presents a grasping method for objects with uneven mass distribution by leveraging diffusion models to localize the center of gravity (CoG) on unknown objects. In robotic grasping, CoG deviation often leads to postural instability, where existing keypoint-based or affordance-driven methods exhibit limitations. We constructed a dataset of 790 images featuring unevenly distributed objects with keypoint annotations for CoG localization. A vision-driven framework based on foundation models was developed to achieve CoG-aware grasping. Experimental evaluations across real-world scenarios demonstrate that our method achieves a 49% higher success rate compared to conventional keypoint-based approaches and an 11% improvement over state-of-the-art affordance-driven methods. The system exhibits strong generalization with a 76% CoG localization accuracy on unseen objects, providing a novel solution for precise and stable grasping tasks.
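The final grasp-selection step of a CoG-aware pipeline can be sketched as picking the candidate grasp closest to the estimated center of gravity; the `pick_grasp` helper and 2-D image coordinates are illustrative assumptions, with the diffusion-based CoG keypoint model mocked out entirely.

```python
def pick_grasp(candidates, cog):
    """CoG-aware grasp selection sketch: among candidate grasp points,
    prefer the one closest to the estimated center of gravity so the
    object is less likely to rotate under load. `cog` would come from
    the diffusion-based keypoint localizer in the actual system."""
    def dist2(p):
        return (p[0] - cog[0]) ** 2 + (p[1] - cog[1]) ** 2
    return min(candidates, key=dist2)
```

This is only the last stage; the hard part the paper addresses is producing a reliable `cog` estimate on unseen, unevenly weighted objects.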
ZeloS -- A Research Platform for Early-Stage Validation of Research Findings Related to Automated Driving
Bohn, Christopher, Siebenrock, Florian, Bosch, Janne, Hetzner, Tobias, Mauch, Samuel, Reis, Philipp, Staudt, Timo, Hess, Manuel, Piscol, Ben-Micha, Hohmann, Sören
This paper presents ZeloS, a research platform designed and built for practical validation of automated driving methods at an early stage of research. We give an overview of ZeloS' hardware setup and automation architecture, with a focus on motion planning and control. ZeloS weighs 69 kg, measures 117 cm in length, and is equipped with all-wheel steering, all-wheel drive, and various onboard sensors for localization. The hardware setup and automation architecture are designed and built with a focus on modularity and the goal of being simple yet effective. The modular design allows individual automation modules to be modified without extensive onboarding into the automation architecture. As such, this design supports ZeloS as a versatile research platform for validating various automated driving methods. ZeloS' motion planning and control components feature optimization-based methods that explicitly account for constraints. We demonstrate the hardware and automation setup by presenting experimental data.
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Europe > Switzerland > Geneva > Geneva (0.04)
- (2 more...)
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks (1.00)
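Explicit constraint handling of the kind the abstract mentions can be illustrated with a toy speed planner that tracks a target while enforcing an acceleration bound each step; this clamp-based update is a stand-in for the platform's actual constrained optimization, and all names and numbers are illustrative.

```python
def clamp(v, lo, hi):
    return max(lo, min(hi, v))

def plan_speed(target, v0, a_max, dt, steps):
    """Toy constraint-aware planning sketch in the spirit of ZeloS'
    optimization-based stack: track a target speed while explicitly
    enforcing |a| <= a_max at every step. A real implementation would
    solve a constrained optimization problem over a full trajectory."""
    v, out = v0, []
    for _ in range(steps):
        a = clamp((target - v) / dt, -a_max, a_max)  # bounded accel
        v += a * dt
        out.append(round(v, 6))
    return out
```

The speed ramps up at the acceleration limit and then holds the target, which is exactly the behavior an unconstrained tracker would violate.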
Improving Consistency in Large Language Models through Chain of Guidance
Raj, Harsh, Gupta, Vipul, Rosati, Domenic, Majumdar, Subhabrata
Consistency is a fundamental dimension of trustworthiness in Large Language Models (LLMs). For humans to be able to trust LLM-based applications, their outputs should be consistent when prompted with inputs that carry the same meaning or intent. Despite this need, there is no known mechanism to control and guide LLMs to be more consistent at inference time. In this paper, we introduce a novel alignment strategy to maximize semantic consistency in LLM outputs. Our proposal is based on Chain of Guidance (CoG), a multi-step prompting technique that generates highly consistent outputs from LLMs. For closed-book question-answering (Q&A) tasks, when compared to direct prompting, the outputs generated using CoG show improved consistency. While other approaches like template-based responses and majority voting may offer alternative paths to consistency, our work focuses on exploring the potential of guided prompting. We use synthetic datasets comprising consistent input-output pairs to fine-tune LLMs to produce consistent and correct outputs. Our fine-tuned models are more than twice as consistent as base models and show strong generalization by producing consistent outputs on datasets not used during fine-tuning.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- North America > United States > Pennsylvania (0.04)
- (9 more...)
- Leisure & Entertainment > Sports > Football (1.00)
- Law (1.00)
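One way to picture multi-step guided prompting is a two-step pipeline: first ask the model to canonicalize the question, then answer the canonical form, so paraphrases share a single answer path. The prompts, the `ask` callback, and the toy mock below are illustrative assumptions, not the paper's exact CoG prompts.

```python
def chain_of_guidance(ask, question):
    """Two-step guided prompting sketch in the spirit of Chain of
    Guidance: canonicalize, then answer the canonical form.
    `ask(prompt)` is a hypothetical stand-in for an LLM call."""
    canonical = ask(f"Rewrite as a minimal canonical question: {question}")
    return ask(f"Answer concisely: {canonical}")

def mock_ask(prompt):
    # toy LLM: canonicalization lowercases and strips punctuation;
    # answering is a substring lookup on the canonical text
    if prompt.startswith("Rewrite"):
        return prompt.split(": ", 1)[1].lower().rstrip("?!. ")
    return "Paris" if "capital of france" in prompt else "unknown"
```

Because both paraphrases collapse to the same canonical form, the second call sees identical input and the outputs agree.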
VISION: A Modular AI Assistant for Natural Human-Instrument Interaction at Scientific User Facilities
Mathur, Shray, van der Vleuten, Noah, Yager, Kevin, Tsai, Esther
Scientific user facilities, such as synchrotron beamlines, are equipped with a wide array of hardware and software tools that require a codebase for human-computer interaction. This often requires developers to establish the connection between users/researchers and the complex instrumentation. The advent of generative AI presents an opportunity to bridge this knowledge gap, enabling seamless communication and efficient experimental workflows. Here we present a modular architecture for the Virtual Scientific Companion (VISION) by assembling multiple AI-enabled cognitive blocks that each scaffolds large language models (LLMs) for a specialized task. With VISION, we performed LLM-based operation on the beamline workstation with low latency and demonstrated the first voice-controlled experiment at an X-ray scattering beamline. The modular and scalable architecture allows for easy adaptation to new instruments and capabilities. Development of natural-language-based scientific experimentation is a building block for an impending future where a science exocortex -- a synthetic extension to the cognition of scientists -- may radically transform scientific practice and discovery.
- North America > United States (0.28)
- Asia > South Korea > Gyeonggi-do > Suwon (0.04)
- Research Report > Strength High (0.54)
- Research Report > Experimental Study (0.54)
- Energy > Energy Storage (0.67)
- Electrical Industrial Apparatus (0.67)
- Health & Medicine > Therapeutic Area (0.46)
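The modular cognitive-block architecture can be sketched as a router that dispatches a natural-language utterance to the block whose domain fits; the block names, keyword matching, and handler strings below are illustrative assumptions, since the real system scaffolds an LLM classifier rather than substring checks.

```python
def route(blocks, utterance):
    """Modular dispatch sketch in the spirit of VISION's cognitive
    blocks: each block advertises keywords and a handler; the router
    (an LLM classifier in the real system, keyword matching here)
    picks the first block whose domain matches the utterance."""
    for keywords, handler in blocks:
        if any(k in utterance.lower() for k in keywords):
            return handler(utterance)
    return "no block available"

# hypothetical beamline blocks: acquisition, motor control, analysis
blocks = [
    (("scan", "measure"), lambda u: "ACQ: start scan"),
    (("move", "motor"), lambda u: "MOT: move stage"),
    (("analyze", "fit"), lambda u: "ANA: run analysis"),
]
```

Adding a new instrument capability then amounts to registering one more (keywords, handler) pair, which is the modularity the abstract emphasizes.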
If AI can provide a better diagnosis than a doctor, what's the prognosis for medics? John Naughton
AI means too many (different) things to too many people. We need better ways of talking – and thinking – about it. Cue Drew Breunig, a gifted geek and cultural anthropologist, who has come up with a neat categorisation of the technology into three use cases: gods, interns and cogs. "Gods", in this sense, would be "super-intelligent, artificial entities that do things autonomously". In other words, the AGI (artificial general intelligence) that OpenAI's Sam Altman and his crowd are trying to build (at unconscionable expense), while at the same time warning that it could be an existential threat to humanity. AI gods are, Breunig says, the "human replacement use cases".
- Research Report > Experimental Study (0.51)
- Research Report > Strength High (0.32)
- Research Report > New Finding (0.31)
BEATLE - Self-Reconfigurable Aerial Robot: Design, Control and Experimental Validation
Sugihara, Junichiro, Zhao, Moju, Nishio, Takuzumi, Okada, Kei, Inaba, Masayuki
Modular self-reconfigurable robots (MSRRs) offer enhanced task flexibility by constructing various structures suitable for each task. However, conventional terrestrial MSRRs equipped with wheels face critical challenges, including limitations in the size of constructible structures and system robustness due to elevated wrench loads applied to each module. In this work, we introduce an Aerial MSRR (A-MSRR) system named BEATLE, capable of merging and separating in-flight. BEATLE can merge without applying wrench loads to adjacent modules, thereby expanding the scalability and robustness of conventional terrestrial MSRRs. In this article, we propose a system configuration for BEATLE, including mechanical design, a control framework for multi-connected flight, and a motion planner for reconfiguration motion. The design of a docking mechanism and housing structure aims to balance the durability of the constructed structure with ease of separation. Furthermore, the proposed flight control framework achieves stable multi-connected flight based on contact wrench control. Moreover, the proposed motion planner based on a finite state machine (FSM) achieves precise and robust reconfiguration motion. We also introduce the actual implementation of the prototype and validate the robustness and scalability of the proposed system design through experiments and simulation studies.
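The finite-state-machine planner mentioned in the abstract can be sketched as a transition table over reconfiguration phases; the state and event names below are illustrative, not the paper's exact machine.

```python
# hypothetical reconfiguration phases for an in-flight docking cycle
TRANSITIONS = {
    ("separated", "approach"): "aligned",
    ("aligned", "dock"): "docked",
    ("docked", "undock"): "aligned",
    ("aligned", "depart"): "separated",
}

def step(state, event):
    """Minimal FSM sketch of a reconfiguration planner like BEATLE's.
    Unknown (state, event) pairs leave the state unchanged, a simple
    robustness guard against out-of-order commands."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the planner as an explicit table makes each reconfiguration phase auditable and makes invalid transitions impossible by construction.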