Park, Youngser
Principal Graph Encoder Embedding and Principal Community Detection
Shen, Cencheng, Dong, Yuexiao, Priebe, Carey E., Larson, Jonathan, Trinh, Ha, Park, Youngser
In this paper, we introduce the concept of principal communities and propose a principal graph encoder embedding method that concurrently detects these communities and achieves vertex embedding. Given a graph adjacency matrix with vertex labels, the method computes a sample community score for each community, ranking them to measure community importance and estimate a set of principal communities. The method then produces a vertex embedding by retaining only the dimensions corresponding to these principal communities. Theoretically, we define the population version of the encoder embedding and the community score based on a random Bernoulli graph distribution. We prove that the population principal graph encoder embedding preserves the conditional density of the vertex labels and that the population community score successfully distinguishes the principal communities. We conduct a variety of simulations to demonstrate the finite-sample accuracy in detecting ground-truth principal communities, as well as the advantages in embedding visualization and subsequent vertex classification. The method is further applied to a set of real-world graphs, showcasing its numerical advantages, including robustness to label noise and computational scalability.
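The one-hot encoder embedding underlying the method admits a compact sketch: column k of a weight matrix W indicates community k scaled by that community's size, and the embedding is Z = AW. A minimal sketch follows; the community score used here is a hypothetical placeholder for illustration, not the paper's sample community score.

```python
import numpy as np

def graph_encoder_embedding(A, labels, K):
    """One-hot graph encoder embedding: Z = A @ W, where column k of W
    indicates community k, scaled by 1 / (size of community k)."""
    n = A.shape[0]
    W = np.zeros((n, K))
    for k in range(K):
        idx = labels == k
        W[idx, k] = 1.0 / max(idx.sum(), 1)
    return A @ W  # n x K embedding

def community_scores(Z):
    # Hypothetical score (illustration only, not the paper's statistic):
    # the average magnitude of each embedding dimension.
    return np.abs(Z).mean(axis=0)

rng = np.random.default_rng(0)
n, K = 200, 3
labels = rng.integers(0, K, size=n)
P = np.array([[0.5, 0.1, 0.1],
              [0.1, 0.4, 0.1],
              [0.1, 0.1, 0.05]])                  # block connectivity probabilities
A = (rng.random((n, n)) < P[labels][:, labels]).astype(float)
A = np.triu(A, 1); A = A + A.T                    # undirected simple graph
Z = graph_encoder_embedding(A, labels, K)
ranking = np.argsort(community_scores(Z))[::-1]   # communities ranked by score
Z_principal = Z[:, ranking[:2]]                   # retain only principal dimensions
```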
Embedding-based statistical inference on generative models
Helm, Hayden, Acharyya, Aranyak, Duderstadt, Brandon, Park, Youngser, Priebe, Carey E.
The recent cohort of publicly available generative models can produce human-expert-level content across a variety of topics and domains. Given a model in this cohort as a base model, methods such as parameter-efficient fine-tuning, in-context learning, and constrained decoding have further increased generative capabilities and improved both computational and data efficiency. Entire collections of derivative models have emerged as a byproduct of these methods, and each of these models has a set of associated covariates, such as a score on a benchmark, an indicator of whether the model has (or had) access to sensitive information, etc., that may or may not be available to the user. For some model-level covariates, it is possible to use "similar" models to predict an unknown covariate. In this paper we extend recent results related to embedding-based representations of generative models - the data kernel perspective space - to classical statistical inference settings. We demonstrate that using the perspective space as the basis of a notion of "similar" is effective for multiple model-level inference tasks.

Generative models have recently met or surpassed human-level standards on benchmarks across a range of tasks (Nori et al., 2023; Katz et al., 2024; Dubey et al., 2024). While these claims warrant skepticism and further robustness evaluation (Ness et al., 2024), the impressive capabilities have created a competitive environment for training state-of-the-art models and have inspired the development of complementary methods to adapt models to particular use cases.
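As a hedged illustration of covariate prediction from model-level representations (not the paper's data kernel perspective-space construction, which is built from comparisons of model outputs), the sketch below represents each model by a single vector and predicts an unknown covariate by averaging over the nearest models; the names and toy data are hypothetical.

```python
import numpy as np

def knn_predict_covariate(vectors, covariates, query_vec, k=3):
    """Predict an unknown model-level covariate as the mean of that covariate
    over the k models nearest to the query model in representation space."""
    dists = np.linalg.norm(vectors - query_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    return covariates[nearest].mean()

rng = np.random.default_rng(1)
reps = rng.normal(size=(20, 128))               # one vector per model (toy data)
bench = reps[:, 0] + 0.1 * rng.normal(size=20)  # toy benchmark covariate
estimate = knn_predict_covariate(reps[1:], bench[1:], reps[0])
print(estimate, bench[0])                       # estimate vs. held-out truth
```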
Tracking the perspectives of interacting language models
Helm, Hayden, Duderstadt, Brandon, Park, Youngser, Priebe, Carey E.
Large language models (LLMs) are capable of producing high quality information at unprecedented rates. As these models continue to entrench themselves in society, the content they produce will become increasingly pervasive in databases that are, in turn, incorporated into the pre-training data, fine-tuning data, retrieval data, etc. of other language models. In this paper we formalize the idea of a communication network of LLMs and introduce a method for representing the perspective of individual models within a collection of LLMs. Given these tools we systematically study information diffusion in the communication network of LLMs in various simulated settings.

The success of large pre-trained models across various computing and human benchmarks in natural language processing (Devlin et al., 2018), computer vision (Oquab et al., 2023), signal processing (Radford et al., 2023), and other domains (Jumper et al., 2021) has brought them to the forefront of the technology-centric world. Given their ability to produce human-expert-level responses for a large set of knowledge-based questions (Touvron et al., 2023; Achiam et al., 2023), the content they produce is often propagated throughout forums that have influence over other models and human users (Brinkmann et al., 2023). As such, it is important to develop sufficient frameworks and complementary tools to understand how information produced by these models affects the behavior of other models and human users. We refer to a system where a model can potentially influence other models as a system of interacting language models.
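A toy sketch of information diffusion in a network of interacting models, assuming a simple averaging dynamic (the paper's perspective representation and simulations are richer than this): each model holds a "perspective" vector and repeatedly nudges it toward the mean of its neighbors in the communication network.

```python
import numpy as np

def diffuse(perspectives, adjacency, rate=0.2, steps=50):
    # Hypothetical dynamics, not the paper's simulator: at each round a model
    # moves its perspective toward the average perspective of its neighbors.
    P = perspectives.copy()
    deg = adjacency.sum(axis=1, keepdims=True).clip(min=1)
    for _ in range(steps):
        neighbor_mean = adjacency @ P / deg
        P = (1 - rate) * P + rate * neighbor_mean
    return P

rng = np.random.default_rng(2)
A = (rng.random((10, 10)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T          # undirected communication network
P0 = rng.normal(size=(10, 4))           # initial model perspectives
Pt = diffuse(P0, A)
print(P0.std(), Pt.std())               # perspectives contract under diffusion
```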
Discovering Communication Pattern Shifts in Large-Scale Labeled Networks using Encoder Embedding and Vertex Dynamics
Shen, Cencheng, Larson, Jonathan, Trinh, Ha, Qin, Xihan, Park, Youngser, Priebe, Carey E.
Analyzing large-scale time-series network data, such as social media and email communications, poses a significant challenge in understanding social dynamics, detecting anomalies, and predicting trends. In particular, the scalability of graph analysis is a critical hurdle impeding progress in large-scale downstream inference. To address this challenge, we introduce a temporal encoder embedding method. This approach leverages ground-truth or estimated vertex labels, enabling an efficient embedding of large-scale graph data and the processing of billions of edges within minutes. Furthermore, this embedding unveils a temporal dynamic statistic capable of detecting communication pattern shifts across all levels, ranging from individual vertices to vertex communities and the overall graph structure. We provide theoretical support to confirm its soundness under random graph models, and demonstrate its numerical advantages in capturing evolving communities and identifying outliers. Finally, we showcase the practical application of our approach by analyzing an anonymized time-series communication network from a large organization spanning 2019-2020, enabling us to assess the impact of COVID-19 on workplace communication patterns.
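A minimal sketch of how a temporal encoder embedding and a per-vertex dynamic statistic might be computed, assuming a fixed label vector across time slices; the distance-based statistic below is an illustrative stand-in, and the paper's exact normalization may differ. Community- and graph-level versions follow by averaging over communities or over all vertices.

```python
import numpy as np

def encoder_embed(A, labels, K):
    # One-hot encoder embedding of a single time slice: Z = A @ W.
    n = A.shape[0]
    W = np.zeros((n, K))
    for k in range(K):
        idx = labels == k
        W[idx, k] = 1.0 / max(idx.sum(), 1)
    return A @ W

def vertex_dynamics(A_slices, labels, K):
    """Hypothetical vertex dynamic statistic: per-vertex distance between
    consecutive time-slice embeddings; rows index time steps."""
    Z = [encoder_embed(A, labels, K) for A in A_slices]
    return np.stack([np.linalg.norm(Z[t + 1] - Z[t], axis=1)
                     for t in range(len(Z) - 1)])
```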
Graph Encoder Ensemble for Simultaneous Vertex Embedding and Community Detection
Shen, Cencheng, Park, Youngser, Priebe, Carey E.
Typically, a graph (or network) on n vertices is represented by an n × n adjacency matrix A, where A(i, j) denotes the edge weight between the ith and jth vertices. Alternatively, the graph can be stored in an edgelist E of size s × 3, where s is the number of edges, with the first two columns indicating the vertex indices of each edge and the last column representing the edge weight. Community detection, also known as vertex clustering or graph partitioning, is a fundamental problem in graph analysis [6, 8, 10, 13]. The primary objective is to identify natural groups of vertices where intra-group connections are stronger than inter-group connections. Over the years, various approaches have been proposed, including modularity-based methods [2, 22], spectral-based methods [15, 21], and likelihood-based techniques [1, 7], among others.
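A small sketch of the two storage formats described above, converting an s × 3 edgelist into a dense n × n adjacency matrix for an undirected weighted graph (for large graphs a sparse matrix would be used instead):

```python
import numpy as np

def edgelist_to_adjacency(E, n):
    """Convert an s x 3 edgelist of (i, j, weight) rows into an n x n
    adjacency matrix, symmetrized for an undirected graph."""
    A = np.zeros((n, n))
    for i, j, w in E:
        A[int(i), int(j)] = w
        A[int(j), int(i)] = w
    return A

E = np.array([[0, 1, 1.0],
              [1, 2, 0.5],
              [0, 3, 2.0]])
A = edgelist_to_adjacency(E, n=4)
```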
Semisupervised regression in latent structure networks on unknown manifolds
Acharyya, Aranyak, Agterberg, Joshua, Trosset, Michael W., Park, Youngser, Priebe, Carey E.
Random graphs are increasingly becoming objects of interest for modeling networks in a wide range of applications. Latent position random graph models posit that each node is associated with a latent position vector, and that these vectors follow some geometric structure in the latent space. In this paper, we consider random dot product graphs, in which an edge is formed between two nodes with probability given by the inner product of their respective latent positions. We assume that the latent position vectors lie on an unknown one-dimensional curve and are coupled with a response covariate via a regression model. Using the geometry of the underlying latent position vectors, we propose a manifold learning and graph embedding technique to predict the response variable on out-of-sample nodes, and we establish convergence guarantees for these responses. Our theoretical results are supported by simulations and an application to Drosophila brain data.
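A sketch of the in-sample pipeline under stated assumptions: adjacency spectral embedding via eigendecomposition, Isomap to recover a scalar coordinate along the assumed one-dimensional latent curve, and a linear regression of the response on that coordinate. The paper's out-of-sample extension and convergence guarantees are not reproduced here, and the regression model below is a simplifying choice.

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.linear_model import LinearRegression

def ase(A, d):
    # Adjacency spectral embedding: scale the top-|eigenvalue| eigenvectors.
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, order] * np.sqrt(np.abs(vals[order]))

def fit_manifold_regression(A, y, d=2, n_neighbors=10):
    X = ase(A, d)                                   # latent position estimates
    t = Isomap(n_neighbors=n_neighbors,
               n_components=1).fit_transform(X)     # 1-d curve coordinate
    return LinearRegression().fit(t, y)             # response ~ coordinate
```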
Dynamic Network Sampling for Community Detection
Mu, Cong, Park, Youngser, Priebe, Carey E.
We propose a dynamic network sampling scheme to optimize block recovery for a stochastic blockmodel (SBM) in the case where it is prohibitively expensive to observe the entire graph. Theoretically, we provide justification for our proposed Chernoff-optimal dynamic sampling scheme via the Chernoff information. Practically, we evaluate the performance, in terms of block recovery, of our method on several real datasets from different domains. Both theoretical and practical results suggest that our method can identify the vertices that have the most impact on block structure, so that one need only check whether there are edges between them, saving significant resources while still recovering the block structure.
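The Chernoff-information criterion for comparing a pair of blocks can be evaluated with a short numerical optimization. The sketch below computes the Chernoff information between two Gaussian approximations N(mu1, S1) and N(mu2, S2), as used in Chernoff-based analyses of SBM block separability; it only evaluates the criterion and is not the paper's sampling scheme.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_information(mu1, S1, mu2, S2):
    """Chernoff information between N(mu1, S1) and N(mu2, S2):
    sup over t in (0,1) of the Chernoff exponent."""
    def neg_exponent(t):
        St = (1 - t) * S1 + t * S2
        diff = mu2 - mu1
        quad = t * (1 - t) / 2 * diff @ np.linalg.solve(St, diff)
        logdet = 0.5 * (np.log(np.linalg.det(St))
                        - (1 - t) * np.log(np.linalg.det(S1))
                        - t * np.log(np.linalg.det(S2)))
        return -(quad + logdet)
    res = minimize_scalar(neg_exponent, bounds=(1e-6, 1 - 1e-6), method="bounded")
    return -res.fun

mu1, mu2 = np.array([0.0, 0.0]), np.array([1.0, 0.5])
S = 0.1 * np.eye(2)
print(chernoff_information(mu1, S, mu2, S))  # equal covariances: (1/8) * d' S^-1 d
```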
Learning to rank via combining representations
Helm, Hayden S., Basu, Amitabh, Athreya, Avanti, Park, Youngser, Vogelstein, Joshua T., Winding, Michael, Zlatic, Marta, Cardona, Albert, Bourke, Patrick, Larson, Jonathan, White, Chris, Priebe, Carey E.
Learning to rank - producing a ranked list of items specific to a query and with respect to a set of supervisory items - is a problem of general interest. The setting we consider is one in which no analytic description of what constitutes a good ranking is available. Instead, we have a collection of representations and supervisory information consisting of a (target item, interesting items set) pair. We demonstrate - analytically, in simulation, and in real data examples - that learning to rank via combining representations using an integer linear program is effective when the supervision is as light as "these few items are similar to your item of interest." While this nomination task is of general interest, for specificity we present our methodology from the perspective of vertex nomination in graphs. The methodology described herein is model agnostic.

Given a query, a collection of items, and supervisory information, producing a ranked list relative to the query is of general interest. In particular, learning to rank [1] and algorithms from related problem settings [2] have been used to improve popular search engines and recommender systems and, impressively, aid in the identification of human traffickers [3]. When learning to rank, for each training query researchers typically have access to (feature vector, ordinal) pairs that are used to learn an ordinal regressor via fitting a model under a set of probabilistic assumptions [4] or via deep learning techniques [5] that generalize to ranking items for never-before-seen queries.
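The paper formulates the combination as an integer linear program; as a hedged illustration for a handful of representations, the sketch below brute-forces binary inclusion weights and keeps the combination that best ranks the supervisory items. All names here are hypothetical stand-ins, not the paper's formulation.

```python
import numpy as np
from itertools import product

def combine_representations(dist_lists, interesting):
    """dist_lists: list of length-n arrays, each giving query-to-item distances
    under one representation (assumed pre-normalized to comparable scales).
    interesting: indices of the known 'interesting' supervisory items."""
    D = np.stack(dist_lists)                    # (m representations) x (n items)
    best_w, best_cost = None, np.inf
    for w in product([0, 1], repeat=D.shape[0]):
        if sum(w) == 0:
            continue                            # must include at least one rep
        combined = np.array(w) @ D              # weighted distance to the query
        ranks = combined.argsort().argsort()    # rank of each item, 0 = best
        cost = ranks[interesting].mean()        # mean rank of supervisory items
        if cost < best_cost:
            best_w, best_cost = w, cost
    return best_w, best_cost
```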
Vertex Nomination, Consistent Estimation, and Adversarial Modification
Agterberg, Joshua, Park, Youngser, Larson, Jonathan, White, Christopher, Priebe, Carey E., Lyzinski, Vince
Given a pair of graphs $G_1$ and $G_2$ and a vertex set of interest in $G_1$, the vertex nomination problem seeks to find the corresponding vertices of interest in $G_2$ (if they exist) and produce a rank list of the vertices in $G_2$, with the corresponding vertices of interest in $G_2$ concentrating, ideally, at the top of the rank list. In this paper we study the effect of an adversarial contamination model on the performance of a spectral graph embedding-based vertex nomination scheme. In both real and simulated examples, we demonstrate that this vertex nomination scheme performs effectively in the uncontaminated setting; adversarial network contamination adversely impacts the performance of our VN scheme; and network regularization successfully mitigates the impact of the contamination. In addition to furthering the theoretical basis of consistency in vertex nomination, the adversarial noise model posited herein is grounded in theoretical developments that allow us to frame the role of an adversary in terms of maximal vertex nomination consistency classes.
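A minimal sketch of a spectral embedding-based VN scheme, assuming seed vertices with known correspondence are available to align the two embeddings (an assumption made here for simplicity, not necessarily by every variant in the paper): embed both graphs, Procrustes-align, and rank the vertices of $G_2$ by distance to the centroid of the vertices of interest.

```python
import numpy as np

def ase(A, d):
    # Adjacency spectral embedding via eigendecomposition.
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, order] * np.sqrt(np.abs(vals[order]))

def nominate(A1, A2, interest, seeds1, seeds2, d=2):
    """Rank vertices of G2 by proximity to the vertices of interest in G1,
    after an orthogonal Procrustes alignment computed from seed pairs."""
    X1, X2 = ase(A1, d), ase(A2, d)
    U, _, Vt = np.linalg.svd(X2[seeds2].T @ X1[seeds1])
    Q = U @ Vt                           # orthogonal map taking X2 toward X1
    X2_aligned = X2 @ Q
    center = X1[interest].mean(axis=0)   # centroid of the interest set in G1
    dists = np.linalg.norm(X2_aligned - center, axis=1)
    return np.argsort(dists)             # rank list for G2, closest first
```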
Simultaneous Dimensionality and Complexity Model Selection for Spectral Graph Clustering
Yang, Congyuan, Priebe, Carey E., Park, Youngser, Marchette, David J.
Our problem of interest is to cluster vertices of a graph by identifying its underlying community structure. Among various vertex clustering approaches, spectral clustering is one of the most popular methods, because it is easy to implement while often outperforming traditional clustering algorithms. However, there are two inherent model selection problems in spectral clustering, namely estimating the embedding dimension and the number of clusters. This paper addresses these issues by establishing a novel model selection framework specifically for vertex clustering on graphs under a stochastic block model. The first contribution is a probabilistic model which approximates the distribution of the extended spectral embedding of a graph. The model is constructed based on a theoretical result of asymptotic normality of the informative part of the embedding, and on a simulation result of limiting behavior of the redundant part of the embedding. The second contribution is a simultaneous model selection framework. In contrast with traditional approaches, our model selection procedure estimates the embedding dimension and the number of clusters simultaneously. Based on our proposed distributional model, a theorem on the consistency of the estimates of model parameters is stated and proven. The theorem provides statistical support for the validity of our method. Heuristic algorithms via the simultaneous model selection framework for vertex clustering are proposed, with good performance shown in experiments on synthetic data and in a real application to connectome analysis.
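A generic stand-in for simultaneous selection of embedding dimension and number of clusters (the paper's procedure additionally models the redundant part of the extended embedding, which this sketch omits): sweep (d, K) jointly and score each pair with BIC on a Gaussian mixture fit to the d-dimensional spectral embedding.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_d_and_K(A, d_max=10, K_max=10):
    """Jointly choose embedding dimension d and cluster count K by BIC
    over GMM fits; returns (best d, best K, best BIC)."""
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(np.abs(vals))[::-1][:d_max]
    X_full = vecs[:, order] * np.sqrt(np.abs(vals[order]))  # spectral embedding
    best = (None, None, np.inf)
    for d in range(1, d_max + 1):
        X = X_full[:, :d]
        for K in range(1, K_max + 1):
            gmm = GaussianMixture(n_components=K, n_init=3).fit(X)
            bic = gmm.bic(X)
            if bic < best[2]:
                best = (d, K, bic)
    return best
```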