Scale-Free Graph-Language Models

Jianglin Lu, Yixuan Liu, Yitian Zhang, Yun Fu

arXiv.org Artificial Intelligence 

Graph-language models (GLMs) have demonstrated great potential in graph-based semi-supervised learning. A typical GLM consists of two key stages: graph generation and text embedding, which are usually implemented by inferring a latent graph and finetuning a language model (LM), respectively. However, the former often relies on artificial assumptions about the underlying edge distribution, while the latter requires extensive data annotations. To tackle these challenges, this paper introduces a novel GLM that integrates graph generation and text embedding within a unified framework. For graph generation, we leverage the scale-free property, an inherent structural characteristic of real-world document networks, as a structural prior. We unexpectedly find that this natural property can be effectively approximated by a simple k-nearest neighbor (KNN) graph. For text embedding, we develop a graph-based pseudo-labeler that utilizes scale-free graphs to provide complementary supervision for improved LM finetuning. Extensive experiments on representative datasets validate our findings on the scale-free structural approximation of KNN graphs and demonstrate the effectiveness of integrating graph generation and text embedding with a real structural prior.

Recently, graph-language models (GLMs) have been widely explored in graph-based semi-supervised classification on documents, especially for citation networks (Qin et al., 2023; Yu et al., 2025; Lu et al., 2023; He et al., 2024). When designing a GLM for classification, two key challenges arise: graph generation, i.e., how to generate a reasonable graph structure for the given documents, and text embedding, i.e., how to encode the textual sequences into meaningful semantic features. To address these problems, various GLMs have been proposed, which can be broadly categorized into latent graph inference (LGI) models and language-assisted graph (LAG) models. LGI models focus on graph generation and typically rely on feature engineering approaches, such as bag-of-words (Harris, 1954), TF-IDF (Aizawa, 2003), and skip-gram (Mikolov et al., 2013), to encode textual sequences into shallow representations.
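The following is a minimal sketch of the two stages discussed above, assuming TF-IDF features, scikit-learn's kneighbors_graph, a toy corpus, and hand-picked seed labels; it is not the authors' implementation. It builds a KNN graph over shallow text features, inspects the degree distribution that the paper argues approximates a scale-free structure, and derives pseudo-labels by simple label propagation as an illustrative stand-in for the graph-based pseudo-labeler.

```python
# Illustrative sketch only: corpus, k, seed labels, and iteration count are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import kneighbors_graph

documents = [
    "graph neural networks for citation analysis",
    "language models finetuned on scientific abstracts",
    "semi-supervised node classification with graph convolutions",
    "pretrained transformers for document understanding",
    "latent graph inference from text features",
    "power-law degree distributions in citation networks",
]

# Shallow text features, as used by latent graph inference models.
features = TfidfVectorizer().fit_transform(documents)

# Sparse KNN adjacency; symmetrize so edges are undirected.
k = 2  # typically 5-20 on real corpora
adj = kneighbors_graph(features, n_neighbors=k, mode="connectivity")
adj = adj.maximum(adj.T)

# Degree distribution: a scale-free graph exhibits a heavy, power-law-like tail.
degrees = np.asarray(adj.sum(axis=1)).ravel()
print("degree histogram:", np.unique(degrees, return_counts=True))

# Graph-based pseudo-labeling via simple label propagation: spread a few seed
# labels along the graph and clamp the labeled nodes at every step.
labels = np.array([0, 1, -1, 1, -1, -1])  # -1 = unlabeled (assumed seeds)
n_classes = 2
Y = np.zeros((len(labels), n_classes))
labeled = labels >= 0
Y[labeled, labels[labeled]] = 1.0

row_norm = 1.0 / np.maximum(degrees, 1.0)
P = adj.multiply(row_norm[:, None]).tocsr()  # row-normalized transition matrix

F = Y.copy()
for _ in range(20):
    F = P @ F
    F[labeled] = Y[labeled]

pseudo_labels = F.argmax(axis=1)  # complementary supervision for LM finetuning
print("pseudo-labels:", pseudo_labels)
```

In this reading, the propagated pseudo-labels would supplement the scarce ground-truth annotations when finetuning the LM, which is the role the abstract assigns to the graph-based pseudo-labeler.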