Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

Wang, Suyuchen, Zhang, Tianyu, Masry, Ahmed, Pal, Christopher, Gella, Spandana, Liu, Bang, Taslakian, Perouz

arXiv.org Artificial Intelligence

GUI grounding is the task of mapping natural language instructions to precise pixel coordinates in graphical user interfaces, enabling autonomous agents to interact with software as humans do (Zhang et al., 2025a; Wang et al., 2024a; Zheng et al., 2024). This capability is fundamental for computer automation: without accurate grounding, agents cannot click buttons, fill forms, or navigate interfaces reliably. Although early approaches relied on structured metadata from HTML/DOM trees or accessibility APIs (Li et al., 2020; Deng et al., 2023), these methods face significant limitations: they require access to the underlying UI structure, which is often unavailable in desktop applications, inconsistent across platforms, or completely absent in legacy systems. Pure vision-based grounding, which operates directly on screenshots, offers universal applicability across any visual interface without requiring special access or instrumentation (Qin et al., 2025; Wang et al., 2025b; Guo et al., 2025). This approach mirrors human interaction with GUIs and enables automation of any software visible on screen, from modern web applications to legacy desktop tools. Current vision-based approaches typically formulate GUI grounding as a coordinate generation task, where models output pixel positions as text tokens (e.g., "x=523, y=217").
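The coordinate-as-text formulation described above implies a small but essential post-processing step: extracting numeric pixel coordinates from the generated tokens. A minimal sketch, assuming the "x=523, y=217" template quoted in the abstract (real models may emit other output formats):

```python
import re


def parse_coordinates(model_output: str):
    """Parse a pixel coordinate pair from generated text such as 'x=523, y=217'.

    The 'x=..., y=...' template follows the example in the abstract;
    the regex is illustrative only and would need adapting to a given
    model's actual output format.
    """
    match = re.search(r"x\s*=\s*(\d+)\s*,\s*y\s*=\s*(\d+)", model_output)
    if match is None:
        return None  # no parsable coordinate pair in the output
    return int(match.group(1)), int(match.group(2))
```

In practice such a parser is the bridge between a language model's token stream and the click action an agent executes.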


Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Xu, Runsen, Wang, Weiyao, Tang, Hao, Chen, Xingyu, Wang, Xiaodong, Chu, Fu-Jen, Lin, Dahua, Feiszli, Matt, Liang, Kevin J.

arXiv.org Artificial Intelligence

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.


Keypoint Detection Technique for Image-Based Visual Servoing of Manipulators

Amiri, Niloufar, Wang, Guanghui, Janabi-Sharifi, Farrokh

arXiv.org Artificial Intelligence

This paper introduces an innovative keypoint detection technique based on Convolutional Neural Networks (CNNs) to enhance the performance of existing Deep Visual Servoing (DVS) models. To validate the convergence of the Image-Based Visual Servoing (IBVS) algorithm, real-world experiments using fiducial markers for feature detection are conducted before the CNN-based feature detector is designed. To address the limitations of fiducial markers, the novel feature detector focuses on extracting keypoints that represent the corners of a more realistic object than a fiducial marker. A dataset is generated from sample data captured by the camera mounted on the robot end-effector while the robot moves randomly in the task space. The samples are automatically labeled, and the dataset is enlarged through flipping and rotation. The CNN model is developed by modifying a VGG-19 network pre-trained on the ImageNet dataset. While the weights in the base model remain fixed, the fully connected layers' weights are updated to minimize the mean absolute error, defined as the deviation of the predictions from the true pixel coordinates of the corners. The model undergoes two modifications: replacing max-pooling with average-pooling in the base model and implementing an adaptive learning rate that decreases over the epochs. These changes lead to a 50 percent reduction in validation loss. Finally, the trained model's reliability is assessed through k-fold cross-validation.
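The two training ingredients named in the abstract, a mean-absolute-error objective over corner coordinates and a learning rate that decreases over epochs, can be sketched in plain Python. The exponential decay rule below is an assumption for illustration; the abstract does not specify the paper's exact schedule:

```python
def mean_absolute_error(predicted, actual):
    """Mean absolute deviation between predicted and ground-truth corner
    coordinates, both flattened to a sequence like (x1, y1, ..., x4, y4)."""
    assert len(predicted) == len(actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def decayed_learning_rate(initial_lr, epoch, decay=0.9):
    """One plausible per-epoch decay schedule (exponential); the paper's
    exact adaptive rule is not given in the abstract."""
    return initial_lr * (decay ** epoch)
```

In a real training loop these would be replaced by a framework's loss and scheduler objects, but the quantities being computed are the same.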


Collaborative Aquatic Positioning System Utilising Multi-beam Sonar and Depth Sensors

Cheng, Xueliang, Lennox, Barry, Groves, Keir

arXiv.org Artificial Intelligence

Accurate positioning of remotely operated underwater vehicles (ROVs) in confined environments is crucial for inspection and mapping tasks and is also a prerequisite for autonomous operations. At present, no positioning system is available that is suited to real-world use in confined underwater environments, is unconstrained by environmental lighting and water turbidity, and has sufficient accuracy for long-term, reliable and repeatable navigation. This shortage presents a significant barrier to enhancing the capabilities of ROVs in such scenarios. This paper introduces an innovative positioning system for ROVs operating in confined, cluttered underwater settings, achieved through the collaboration of an omnidirectional surface vehicle and an ROV. A formulation is proposed and evaluated in simulation against ground truth. The simulation results provide a proof of principle of the proposed system and demonstrate its deployability. Unlike many previous approaches, the system does not rely on fixed infrastructure or tracking of environmental features, and it can cover large enclosed areas without additional equipment.
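The collaborative idea, a surface vehicle with a known pose ranging to a submerged ROV whose depth is measured directly, admits a simple geometric sketch. Everything below (function name, the slant-range/bearing measurement model, the draft parameter) is a hypothetical illustration of that geometry, not the paper's actual formulation:

```python
import math


def rov_position(usv_xy, bearing_rad, slant_range, rov_depth, usv_draft=0.0):
    """Estimate an ROV's 3D position from a surface vehicle's known planar
    position, a sonar slant range and bearing to the ROV, and depth-sensor
    readings on both vehicles.

    Illustrative geometry: the vertical separation comes from the depth
    sensors, and the horizontal offset is the slant range projected into
    the horizontal plane along the measured bearing.
    """
    dz = rov_depth - usv_draft                         # vertical separation
    horiz = math.sqrt(max(slant_range**2 - dz**2, 0.0))  # horizontal range
    x = usv_xy[0] + horiz * math.cos(bearing_rad)
    y = usv_xy[1] + horiz * math.sin(bearing_rad)
    return (x, y, rov_depth)
```

Combining a direct depth measurement with an acoustic range in this way avoids relying on fixed infrastructure, which is consistent with the system concept described above.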


High-accuracy Vision-Based Attitude Estimation System for Air-Bearing Spacecraft Simulators

Ornati, Fabio, Di Domenico, Gianfranco, Panicucci, Paolo, Topputo, Francesco

arXiv.org Artificial Intelligence

Air-bearing platforms for simulating the rotational dynamics of satellites require highly precise ground-truth systems. Unfortunately, the commercial motion capture systems used for this purpose are complex and expensive. This paper presents a novel and versatile method for computing the attitude of rotational air-bearing platforms using a monocular camera and sets of fiducial markers. The work proposes a geometry-based iterative algorithm that is significantly more accurate than other literature methods involving the solution of the Perspective-n-Point problem. Additionally, auto-calibration procedures for a preliminary estimation of the system parameters are presented. The developed methodology is deployed onto a Raspberry Pi 4 micro-computer and tested with a set of LED markers. Data obtained with this setup are compared against computer simulations of the same system to understand and validate the attitude estimation performance. Simulation results show expected 1-sigma accuracies on the order of $\sim$12 arcsec and $\sim$37 arcsec for about- and cross-boresight rotations of the platform, and average latency times of 6 ms.
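The paper's geometry-based iterative algorithm is not detailed in the abstract, but the underlying problem, recovering a platform's attitude from observed marker directions, can be illustrated with the classical closed-form TRIAD method, which builds a rotation matrix from two reference directions and their body-frame observations. This is a simpler stand-in for illustration, not the authors' algorithm:

```python
import math


def _norm(v):
    m = math.sqrt(sum(c * c for c in v))
    return [c / m for c in v]


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def triad(r1, r2, b1, b2):
    """TRIAD attitude determination: return the 3x3 rotation matrix R that
    maps reference-frame directions (r1, r2) onto the corresponding
    body-frame observations (b1, b2), i.e. b = R r."""
    # Orthonormal triad in the reference frame
    tr = [_norm(r1), _norm(_cross(r1, r2))]
    tr.append(_cross(tr[0], tr[1]))
    # Matching triad in the body frame
    tb = [_norm(b1), _norm(_cross(b1, b2))]
    tb.append(_cross(tb[0], tb[1]))
    # R = sum_k tb_k tr_k^T (sum of outer products)
    return [[sum(tb[k][i] * tr[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]
```

Iterative schemes such as the one the paper proposes typically refine this kind of closed-form initial estimate against the full marker geometry.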


Monocular UAV Localisation with Deep Learning and Uncertainty Propagation

Oh, Xueyan, Lim, Ryan, Loh, Leonard, Tan, Chee How, Foong, Shaohui, Tan, U-Xuan

arXiv.org Artificial Intelligence

In this paper, we propose a ground-based monocular UAV localisation system that detects and localises an LED marker attached to the underside of a UAV. Our system removes the need for extensive infrastructure and calibration, unlike existing technologies such as UWB, radio-frequency and multi-camera systems often used for localisation in GPS-denied environments. To improve deployability for real-world applications without the need to collect an extensive real dataset, we train a CNN on synthetic binary images rather than the real images used in existing monocular UAV localisation methods, and factor in the camera's zoom to allow tracking of UAVs flying at greater distances. We propose the NoisyCutout algorithm for augmenting synthetic binary images to simulate the binary images processed from real images, and show that it improves localisation accuracy compared to the existing salt-and-pepper and Cutout augmentation methods. We also leverage uncertainty propagation to modify the CNN's loss function and show that this further improves localisation accuracy. Real-world experiments are conducted to evaluate our methods, and we achieve an overall 3D RMSE of approximately 0.41 m.
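Since NoisyCutout is described as bridging salt-and-pepper noise and Cutout, one plausible blend is to cut out a random patch and fill it with random binary noise instead of a constant value. The sketch below is an assumption about how such a blend could look on a binary image; the paper's exact NoisyCutout procedure may differ:

```python
import random


def noisy_cutout(image, patch_size, flip_prob=0.5, rng=None):
    """Augment a binary image (list of lists of 0/1 pixels) by overwriting a
    random square patch with random binary noise: a hypothetical blend of
    Cutout (random patch) and salt-and-pepper (random 0/1 values).

    Returns a new image; the input is left unmodified.
    """
    rng = rng or random.Random()
    h, w = len(image), len(image[0])
    top = rng.randrange(max(h - patch_size, 0) + 1)
    left = rng.randrange(max(w - patch_size, 0) + 1)
    out = [row[:] for row in image]  # copy so the original stays intact
    for i in range(top, min(top + patch_size, h)):
        for j in range(left, min(left + patch_size, w)):
            out[i][j] = 1 if rng.random() < flip_prob else 0
    return out
```

Applied to synthetic binary renders of the LED marker, this kind of corruption mimics the segmentation artifacts that thresholding real images would introduce.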