Goto

Collaborating Authors

 Gui, Zhipeng


PEACE: Empowering Geologic Map Holistic Understanding with MLLMs

arXiv.org Artificial Intelligence

Geologic map, as a fundamental diagram in geology science, provides critical insights into the structure and composition of Earth's subsurface and surface. These maps are indispensable in various fields, including disaster detection, resource exploration, and civil engineering. Despite their significance, current Multimodal Large Language Models (MLLMs) often fall short in geologic map understanding. This gap is primarily due to the challenging nature of cartographic generalization, which involves handling high-resolution map, managing multiple associated components, and requiring domain-specific knowledge. To quantify this gap, we construct GeoMap-Bench, the first-ever benchmark for evaluating MLLMs in geologic map understanding, which assesses the full-scale abilities in extracting, referring, grounding, reasoning, and analyzing. To bridge this gap, we introduce GeoMap-Agent, the inaugural agent designed for geologic map understanding, which features three modules: Hierarchical Information Extraction (HIE), Domain Knowledge Injection (DKI), and Prompt-enhanced Question Answering (PEQA). Inspired by the interdisciplinary collaboration among human scientists, an AI expert group acts as consultants, utilizing a diverse tool pool to comprehensively analyze questions. Through comprehensive experiments, GeoMap-Agent achieves an overall score of 0.811 on GeoMap-Bench, significantly outperforming 0.369 of GPT-4o. Our work, emPowering gEologic mAp holistiC undErstanding (PEACE) with MLLMs, paves the way for advanced AI applications in geology, enhancing the efficiency and accuracy of geological investigations.


GeoCode-GPT: A Large Language Model for Geospatial Code Generation Tasks

arXiv.org Artificial Intelligence

The increasing demand for spatiotemporal data and modeling tasks in geosciences has made geospatial code generation technology a critical factor in enhancing productivity. Although large language models (LLMs) have demonstrated potential in code generation tasks, they often encounter issues such as refusal to code or hallucination in geospatial code generation due to a lack of domain-specific knowledge and code corpora. To address these challenges, this paper presents and open-sources the GeoCode-PT and GeoCode-SFT corpora, along with the GeoCode-Eval evaluation dataset. Additionally, by leveraging QLoRA and LoRA for pretraining and fine-tuning, we introduce GeoCode-GPT-7B, the first LLM focused on geospatial code generation, fine-tuned from Code Llama-7B. Furthermore, we establish a comprehensive geospatial code evaluation framework, incorporating option matching, expert validation, and prompt engineering scoring for LLMs, and systematically evaluate GeoCode-GPT-7B using the GeoCode-Eval dataset. Experimental results show that GeoCode-GPT outperforms other models in multiple-choice accuracy by 9.1% to 32.1%, in code summarization ability by 1.7% to 25.4%, and in code generation capability by 1.2% to 25.1%. This paper provides a solution and empirical validation for enhancing LLMs' performance in geospatial code generation, extends the boundaries of domain-specific model applications, and offers valuable insights into unlocking their potential in geospatial code generation.


Research on Foundation Model for Spatial Data Intelligence: China's 2024 White Paper on Strategic Development of Spatial Data Intelligence

arXiv.org Artificial Intelligence

Research status and development trends; on this basis, this report proposes three major challenges faced by large spatial data intelligent models today. This report focuses on the current research status of spatial data intelligent large-scale models and sorts out the research progress in four major thematic areas of spatial data intelligent large-scale models: cities, air and space remote sensing, geography, and transportation. This report systematically introduces the key technologies, characteristics and advantages, research status, future development and other core information of spatial data intelligent large models, involving spatiotemporal big data platforms, distributed computing, 3D virtual reality, space The basic performance of large models such as analysis and visualization, as well as the complex spatial comprehensive performance of large models such as geospatial intelligent computing, deep learning, high-performance processing of big data, geographical knowledge graphs, and geographical intelligent multi-scenario simulation, analyze the application of the above key technologies in spatial data The location and role of smart large models.


Interpreting the Curse of Dimensionality from Distance Concentration and Manifold Effect

arXiv.org Artificial Intelligence

The characteristics of data like distribution and heterogeneity, become more complex and counterintuitive as the dimensionality increases. This phenomenon is known as curse of dimensionality, where common patterns and relationships (e.g., internal and boundary pattern) that hold in low-dimensional space may be invalid in higher-dimensional space. It leads to a decreasing performance for the regression, classification or clustering models or algorithms. Curse of dimensionality can be attributed to many causes. In this paper, we first summarize five challenges associated with manipulating high-dimensional data, and explains the potential causes for the failure of regression, classification or clustering tasks. Subsequently, we delve into two major causes of the curse of dimensionality, distance concentration and manifold effect, by performing theoretical and empirical analyses. The results demonstrate that nearest neighbor search (NNS) using three typical distance measurements, Minkowski distance, Chebyshev distance, and cosine distance, becomes meaningless as the dimensionality increases. Meanwhile, the data incorporates more redundant features, and the variance contribution of principal component analysis (PCA) is skewed towards a few dimensions. By interpreting the causes of the curse of dimensionality, we can better understand the limitations of current models and algorithms, and drive to improve the performance of data analysis and machine learning tasks in high-dimensional space.


Scalable manifold learning by uniform landmark sampling and constrained locally linear embedding

arXiv.org Artificial Intelligence

Abstract: As a pivotal approach in machine learning and data science, manifold learning aims to uncover the intrinsic low-dimensional structure within complex nonlinear manifolds in highdimensional space. By exploiting the manifold hypothesis, various techniques for nonlinear dimension reduction have been developed to facilitate visualization, classification, clustering, and gaining key insights. Although existing manifold learning methods have achieved remarkable successes, they still suffer from extensive distortions incurred in the global structure, which hinders the understanding of underlying patterns. Scalability issues also limit their applicability for handling large-scale data. Here, we propose a scalable manifold learning (scML) method that can manipulate large-scale and high-dimensional data in an efficient manner. It starts by seeking a set of landmarks to construct the low-dimensional skeleton of the entire data, and then incorporates the nonlandmarks into the learned space based on the constrained locally linear embedding (CLLE). We empirically validated the effectiveness of scML on synthetic datasets and real-world benchmarks of different types, and applied it to analyze the single-cell transcriptomics and detect anomalies in electrocardiogram (ECG) signals. The experiments demonstrate notable robustness in embedding quality as the sample rate decreases. Dimension reduction plays an indispensable role in both preprocessing for machine learning tasks and visualization for high-dimensional data [1, 2]. It is often applied to address the curse of dimensionality in data science, which refers to the phenomenon where the amount of data required to achieve a certain level of accuracy increases exponentially as the number of dimensions increases [3]. This makes models difficult to represent the features comprehensively and may lead to an overfitting problem [4].


MeanCut: A Greedy-Optimized Graph Clustering via Path-based Similarity and Degree Descent Criterion

arXiv.org Artificial Intelligence

As the most typical graph clustering method, spectral clustering is popular and attractive due to the remarkable performance, easy implementation, and strong adaptability. Classical spectral clustering measures the edge weights of graph using pairwise Euclidean-based metric, and solves the optimal graph partition by relaxing the constraints of indicator matrix and performing Laplacian decomposition. However, Euclidean-based similarity might cause skew graph cuts when handling non-spherical data distributions, and the relaxation strategy introduces information loss. Meanwhile, spectral clustering requires specifying the number of clusters, which is hard to determine without enough prior knowledge. In this work, we leverage the path-based similarity to enhance intra-cluster associations, and propose MeanCut as the objective function and greedily optimize it in degree descending order for a nondestructive graph partition. This algorithm enables the identification of arbitrary shaped clusters and is robust to noise. To reduce the computational complexity of similarity calculation, we transform optimal path search into generating the maximum spanning tree (MST), and develop a fast MST (FastMST) algorithm to further improve its time-efficiency. Moreover, we define a density gradient factor (DGF) for separating the weakly connected clusters. The validity of our algorithm is demonstrated by testifying on real-world benchmarks and application of face recognition. The source code of MeanCut is available at https://github.com/ZPGuiGroupWhu/MeanCut-Clustering.


A Robust and Efficient Boundary Point Detection Method by Measuring Local Direction Dispersion

arXiv.org Artificial Intelligence

Boundary points pose a significant challenge for machine learning tasks, including classification, clustering, and dimensionality reduction. Due to the similarity of features, boundary areas can result in mixed-up classes or clusters, leading to a crowding problem in dimensionality reduction. To address this challenge, numerous boundary point detection methods have been developed, but they are insufficiently to accurately and efficiently identify the boundary points in non-convex structures and high-dimensional manifolds. In this work, we propose a robust and efficient method for detecting boundary points using Local Direction Dispersion (LoDD). LoDD considers that internal points are surrounded by neighboring points in all directions, while neighboring points of a boundary point tend to be distributed only in a certain directional range. LoDD adopts a density-independent K-Nearest Neighbors (KNN) method to determine neighboring points, and defines a statistic-based metric using the eigenvalues of the covariance matrix of KNN coordinates to measure the centrality of a query point. We demonstrated the validity of LoDD on five synthetic datasets (2-D and 3-D) and ten real-world benchmarks, and tested its clustering performance by equipping with two typical clustering methods, K-means and Ncut. Our results show that LoDD achieves promising and robust detection accuracy in a time-efficient manner.