Efficient Identification of High Similarity Clusters in Polygon Datasets

Sep-30-2025–arXiv.org Artificial Intelligence

Abstract: Advancements in tools like Shapely 2.0 and Triton can significantly improve the efficiency of spatial similarity computations by enabling faster and more scalable geometric operations [1], [2]. For extremely large datasets, however, even these optimizations can be overwhelmed by the sheer volume of computations required. To address this, we propose a framework that reduces the number of clusters requiring verification, thereby decreasing the computational load on these systems. The framework integrates dynamic similarity index thresholding, supervised scheduling [3], and recall-constrained optimization to efficiently identify clusters with the highest spatial similarity while meeting user-defined precision and recall requirements [4]. By using Kernel Density Estimation (KDE) to determine similarity thresholds dynamically [5] and machine learning models to prioritize clusters, our approach achieves substantial reductions in computational cost without sacrificing accuracy. Experimental results demonstrate the scalability and effectiveness of the method, offering a practical solution for large-scale geospatial analysis.

Geospatial data is the cornerstone of numerous applications across domains including urban planning, environmental monitoring, infrastructure development, and medicine. For example, OpenStreetMap contains over 1.5 terabytes of global data [6], while GeoNames describes more than 12 million locations, providing extensive point geometries such as latitude and longitude [7]. Expanding on these datasets, geospatial knowledge graphs like YAGO2geo integrate millions of lines, polygons, and multipolygons from OpenStreetMap and administrative divisions [8], while WorldKG represents around 113.4 million geographic entities [9].
KnowWhereGraph, a more recent initiative, comprises over 12 billion RDF triples, including data on polygons and multipolygons, and supports applications in disaster relief, agricultural land use, and food-related supply chains [10]. Even cross-domain knowledge graphs such as DBpedia and Wikidata incorporate a substantial amount of geospatial information, underscoring the critical role of spatial data on the Web. Beyond these well-known repositories, spatial datasets also play a transformative role in medicine, particularly in the analysis and modeling of organ structures. For instance, the Visible Human Project provides high-resolution spatial data for anatomical structures [11], while the Human Connectome Project captures detailed spatial relationships within the brain [12].
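
The KDE-based thresholding mentioned in the abstract can be illustrated with a small sketch. The text above does not say how the threshold is derived from the density estimate, so this example assumes one common choice: place the threshold at the lowest-density point (the valley) between the two strongest modes of the similarity-score distribution. The function names, the bandwidth, and the sample scores are all hypothetical, and the code uses only the Python standard library rather than Shapely or Triton.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def kde_valley_threshold(scores, bandwidth=0.1, grid_size=201):
    """Assumed rule: threshold = lowest-density point between the two highest peaks.

    Scores are expected to lie in [0, 1]. Returns None if the estimated
    density has fewer than two interior peaks (no valley to cut at).
    """
    density = gaussian_kde(scores, bandwidth)
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    dens = [density(x) for x in grid]

    # Interior local maxima of the estimated density.
    peaks = [
        i
        for i in range(1, grid_size - 1)
        if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]
    ]
    if len(peaks) < 2:
        return None

    # Keep the two highest peaks, ordered left to right on the grid.
    lo, hi = sorted(sorted(peaks, key=lambda i: dens[i], reverse=True)[:2])

    # The valley is the grid point with minimal density between them.
    valley = min(range(lo, hi + 1), key=lambda i: dens[i])
    return grid[valley]


# Hypothetical similarity scores: a low-similarity group and a high-similarity group.
scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.82, 0.85, 0.88, 0.90, 0.93]
threshold = kde_valley_threshold(scores)
selected = [s for s in scores if s >= threshold]
print(threshold, selected)
```

Under this assumption, only the clusters whose scores fall above the valley would be passed on for exact verification. The actual framework may derive its threshold differently and combines it with scheduling and recall constraints that are not shown here.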

artificial intelligence, geometry, machine learning


