Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

Wang, Suyuchen, Zhang, Tianyu, Masry, Ahmed, Pal, Christopher, Gella, Spandana, Liu, Bang, Taslakian, Perouz

arXiv.org Artificial Intelligence

GUI grounding is the task of mapping natural language instructions to precise pixel coordinates in graphical user interfaces, enabling autonomous agents to interact with software as humans do (Zhang et al., 2025a; Wang et al., 2024a; Zheng et al., 2024). This capability is fundamental for computer automation: without accurate grounding, agents cannot click buttons, fill forms, or navigate interfaces reliably. Although early approaches relied on structured metadata from HTML/DOM trees or accessibility APIs (Li et al., 2020; Deng et al., 2023), these methods face significant limitations: they require access to the underlying UI structure, which is often unavailable in desktop applications, inconsistent across platforms, or completely absent in legacy systems. Pure vision-based grounding, which operates directly on screenshots, offers universal applicability across any visual interface without requiring special access or instrumentation (Qin et al., 2025; Wang et al., 2025b; Guo et al., 2025). This approach mirrors human interaction with GUIs and enables automation of any software visible on screen, from modern web applications to legacy desktop tools. Current vision-based approaches typically formulate GUI grounding as a coordinate generation task, where models output pixel positions as text tokens (e.g., "x=523, y=217").
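The coordinate-as-text formulation described above implies a small but essential post-processing step: extracting numeric pixel coordinates from the generated tokens. A minimal sketch, assuming the "x=523, y=217" template quoted in the abstract (real models may emit other output formats):

```python
import re


def parse_coordinates(model_output: str):
    """Parse a pixel coordinate pair from generated text such as 'x=523, y=217'.

    The 'x=..., y=...' template follows the example in the abstract;
    the regex is illustrative only and would need adapting to a given
    model's actual output format.
    """
    match = re.search(r"x\s*=\s*(\d+)\s*,\s*y\s*=\s*(\d+)", model_output)
    if match is None:
        return None  # no parsable coordinate pair in the output
    return int(match.group(1)), int(match.group(2))
```

In practice such a parser is the bridge between a language model's token stream and the click action an agent executes.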


Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Xu, Runsen, Wang, Weiyao, Tang, Hao, Chen, Xingyu, Wang, Xiaodong, Chu, Fu-Jen, Lin, Dahua, Feiszli, Matt, Liang, Kevin J.

arXiv.org Artificial Intelligence

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.


Keypoint Detection Technique for Image-Based Visual Servoing of Manipulators

Amiri, Niloufar, Wang, Guanghui, Janabi-Sharifi, Farrokh

arXiv.org Artificial Intelligence

This paper introduces an innovative keypoint detection technique based on Convolutional Neural Networks (CNNs) to enhance the performance of existing Deep Visual Servoing (DVS) models. To validate the convergence of the Image-Based Visual Servoing (IBVS) algorithm, real-world experiments using fiducial markers for feature detection are conducted before the CNN-based feature detector is designed. To address the limitations of fiducial markers, the novel feature detector focuses on extracting keypoints that represent the corners of a more realistic object than a fiducial marker. A dataset is generated from sample data captured by the camera mounted on the robot end-effector while the robot moves randomly in the task space. The samples are automatically labeled, and the dataset is enlarged through flipping and rotation. The CNN model is developed by modifying a VGG-19 network pre-trained on the ImageNet dataset. While the weights in the base model remain fixed, the fully connected layers' weights are updated to minimize the mean absolute error, defined as the deviation of the predictions from the true pixel coordinates of the corners. The model undergoes two modifications: replacing max-pooling with average-pooling in the base model and implementing an adaptive learning rate that decreases over the epochs. These changes lead to a 50 percent reduction in validation loss. Finally, the trained model's reliability is assessed through k-fold cross-validation.
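The two training ingredients named in the abstract, a mean-absolute-error objective over corner coordinates and a learning rate that decreases over epochs, can be sketched in plain Python. The exponential decay rule below is an assumption for illustration; the abstract does not specify the paper's exact schedule:

```python
def mean_absolute_error(predicted, actual):
    """Mean absolute deviation between predicted and ground-truth corner
    coordinates, both flattened to a sequence like (x1, y1, ..., x4, y4)."""
    assert len(predicted) == len(actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def decayed_learning_rate(initial_lr, epoch, decay=0.9):
    """One plausible per-epoch decay schedule (exponential); the paper's
    exact adaptive rule is not given in the abstract."""
    return initial_lr * (decay ** epoch)
```

In a real training loop these would be replaced by a framework's loss and scheduler objects, but the quantities being computed are the same.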


Collaborative Aquatic Positioning System Utilising Multi-beam Sonar and Depth Sensors

Cheng, Xueliang, Lennox, Barry, Groves, Keir

arXiv.org Artificial Intelligence

Accurate positioning of remotely operated underwater vehicles (ROVs) in confined environments is crucial for inspection and mapping tasks and is also a prerequisite for autonomous operations. At present, no positioning system is available that is suited to real-world use in confined underwater environments, is unconstrained by environmental lighting and water turbidity, and has sufficient accuracy for long-term, reliable and repeatable navigation. This shortage presents a significant barrier to enhancing the capabilities of ROVs in such scenarios. This paper introduces an innovative positioning system for ROVs operating in confined, cluttered underwater settings, achieved through the collaboration of an omnidirectional surface vehicle and an ROV. A formulation is proposed and evaluated in simulation against ground truth. The simulation results provide a proof of principle of the proposed system and demonstrate its deployability. Unlike many previous approaches, the system does not rely on fixed infrastructure or tracking of environmental features, and it can cover large enclosed areas without additional equipment.
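The collaborative idea, a surface vehicle with a known pose ranging to a submerged ROV whose depth is measured directly, admits a simple geometric sketch. Everything below (function name, the slant-range/bearing measurement model, the draft parameter) is a hypothetical illustration of that geometry, not the paper's actual formulation:

```python
import math


def rov_position(usv_xy, bearing_rad, slant_range, rov_depth, usv_draft=0.0):
    """Estimate an ROV's 3D position from a surface vehicle's known planar
    position, a sonar slant range and bearing to the ROV, and depth-sensor
    readings on both vehicles.

    Illustrative geometry: the vertical separation comes from the depth
    sensors, and the horizontal offset is the slant range projected into
    the horizontal plane along the measured bearing.
    """
    dz = rov_depth - usv_draft                         # vertical separation
    horiz = math.sqrt(max(slant_range**2 - dz**2, 0.0))  # horizontal range
    x = usv_xy[0] + horiz * math.cos(bearing_rad)
    y = usv_xy[1] + horiz * math.sin(bearing_rad)
    return (x, y, rov_depth)
```

Combining a direct depth measurement with an acoustic range in this way avoids relying on fixed infrastructure, which is consistent with the system concept described above.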


High-accuracy Vision-Based Attitude Estimation System for Air-Bearing Spacecraft Simulators

Ornati, Fabio, Di Domenico, Gianfranco, Panicucci, Paolo, Topputo, Francesco

arXiv.org Artificial Intelligence

Air-bearing platforms for simulating the rotational dynamics of satellites require highly precise ground-truth systems. Unfortunately, the commercial motion capture systems used for this purpose are complex and expensive. This paper presents a novel and versatile method for computing the attitude of rotational air-bearing platforms using a monocular camera and sets of fiducial markers. The work proposes a geometry-based iterative algorithm that is significantly more accurate than other literature methods involving the solution of the Perspective-n-Point problem. Additionally, auto-calibration procedures for a preliminary estimation of the system parameters are presented. The developed methodology is deployed onto a Raspberry Pi 4 micro-computer and tested with a set of LED markers. Data obtained with this setup are compared against computer simulations of the same system to understand and validate the attitude estimation performance. Simulation results show expected 1-sigma accuracies on the order of $\sim$12 arcsec and $\sim$37 arcsec for about- and cross-boresight rotations of the platform, and average latency times of 6 ms.
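The paper's geometry-based iterative algorithm is not detailed in the abstract, but the underlying problem, recovering a platform's attitude from observed marker directions, can be illustrated with the classical closed-form TRIAD method, which builds a rotation matrix from two reference directions and their body-frame observations. This is a simpler stand-in for illustration, not the authors' algorithm:

```python
import math


def _norm(v):
    m = math.sqrt(sum(c * c for c in v))
    return [c / m for c in v]


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def triad(r1, r2, b1, b2):
    """TRIAD attitude determination: return the 3x3 rotation matrix R that
    maps reference-frame directions (r1, r2) onto the corresponding
    body-frame observations (b1, b2), i.e. b = R r."""
    # Orthonormal triad in the reference frame
    tr = [_norm(r1), _norm(_cross(r1, r2))]
    tr.append(_cross(tr[0], tr[1]))
    # Matching triad in the body frame
    tb = [_norm(b1), _norm(_cross(b1, b2))]
    tb.append(_cross(tb[0], tb[1]))
    # R = sum_k tb_k tr_k^T (sum of outer products)
    return [[sum(tb[k][i] * tr[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]
```

Iterative schemes such as the one the paper proposes typically refine this kind of closed-form initial estimate against the full marker geometry.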


Monocular UAV Localisation with Deep Learning and Uncertainty Propagation

Oh, Xueyan, Lim, Ryan, Loh, Leonard, Tan, Chee How, Foong, Shaohui, Tan, U-Xuan

arXiv.org Artificial Intelligence

In this paper, we propose a ground-based monocular UAV localisation system that detects and localises an LED marker attached to the underside of a UAV. Our system removes the need for extensive infrastructure and calibration, unlike existing technologies such as UWB, radio-frequency and multi-camera systems often used for localisation in GPS-denied environments. To improve deployability for real-world applications without the need to collect an extensive real dataset, we train a CNN on synthetic binary images rather than the real images used in existing monocular UAV localisation methods, and factor in the camera's zoom to allow tracking of UAVs flying at greater distances. We propose the NoisyCutout algorithm for augmenting synthetic binary images to simulate the binary images processed from real images, and show that it improves localisation accuracy compared to the existing salt-and-pepper and Cutout augmentation methods. We also leverage uncertainty propagation to modify the CNN's loss function and show that this further improves localisation accuracy. Real-world experiments are conducted to evaluate our methods, and we achieve an overall 3D RMSE of approximately 0.41 m.
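Since NoisyCutout is described as bridging salt-and-pepper noise and Cutout, one plausible blend is to cut out a random patch and fill it with random binary noise instead of a constant value. The sketch below is an assumption about how such a blend could look on a binary image; the paper's exact NoisyCutout procedure may differ:

```python
import random


def noisy_cutout(image, patch_size, flip_prob=0.5, rng=None):
    """Augment a binary image (list of lists of 0/1 pixels) by overwriting a
    random square patch with random binary noise: a hypothetical blend of
    Cutout (random patch) and salt-and-pepper (random 0/1 values).

    Returns a new image; the input is left unmodified.
    """
    rng = rng or random.Random()
    h, w = len(image), len(image[0])
    top = rng.randrange(max(h - patch_size, 0) + 1)
    left = rng.randrange(max(w - patch_size, 0) + 1)
    out = [row[:] for row in image]  # copy so the original stays intact
    for i in range(top, min(top + patch_size, h)):
        for j in range(left, min(left + patch_size, w)):
            out[i][j] = 1 if rng.random() < flip_prob else 0
    return out
```

Applied to synthetic binary renders of the LED marker, this kind of corruption mimics the segmentation artifacts that thresholding real images would introduce.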