
Collaborating Authors: Shen, Cencheng


Explaining Categorical Feature Interactions Using Graph Covariance and LLMs

arXiv.org Machine Learning

Modern datasets often consist of numerous samples with abundant features and associated timestamps. Analyzing such datasets to uncover underlying events typically requires complex statistical methods and substantial domain expertise. A notable example, and the primary data focus of this paper, is the global synthetic dataset from the Counter Trafficking Data Collaborative (CTDC) -- a global hub of human trafficking data containing over 200,000 anonymized records spanning from 2002 to 2022, with numerous categorical features for each record. In this paper, we propose a fast and scalable method for analyzing and extracting significant categorical feature interactions, and querying large language models (LLMs) to generate data-driven insights that explain these interactions. Our approach begins with a binarization step for categorical features using one-hot encoding, followed by the computation of graph covariance at each time point. This graph covariance quantifies temporal changes in dependence structures within categorical data and is established as a consistent dependence measure under the Bernoulli distribution. We use this measure to identify significant feature pairs, such as those with the most frequent trends over time or those exhibiting sudden spikes in dependence at specific moments. These extracted feature pairs, along with their timestamps, are subsequently passed to an LLM tasked with generating potential explanations of the underlying events driving these dependence changes. The effectiveness of our method is demonstrated through extensive simulations, and its application to the CTDC dataset reveals meaningful feature pairs and potential data stories underlying the observed feature interactions.
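
A minimal sketch of the binarization and per-time covariance step, assuming the graph covariance of a categorical feature pair can be approximated by the sample cross-covariance of their one-hot indicator columns within each time bin; the paper's exact estimator, as well as the function and column names below, are illustrative assumptions.

    import numpy as np
    import pandas as pd

    def pairwise_covariance_by_time(df, col_a, col_b, time_col="year"):
        # Binarization step: one-hot encode the two categorical features.
        a = pd.get_dummies(df[col_a], prefix=col_a)
        b = pd.get_dummies(df[col_b], prefix=col_b)
        out = {}
        for t, idx in df.groupby(time_col).groups.items():
            xa = a.loc[idx].to_numpy(dtype=float)
            xb = b.loc[idx].to_numpy(dtype=float)
            xa -= xa.mean(axis=0)
            xb -= xb.mean(axis=0)
            # Cross-covariance between every pair of indicator columns in this time bin.
            out[t] = xa.T @ xb / max(len(idx) - 1, 1)
        return out  # time -> (levels of col_a) x (levels of col_b) covariance matrix

Entries whose magnitude persists or spikes at particular times would then be the candidate feature pairs passed, with their timestamps, to the LLM for explanation.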


Principal Graph Encoder Embedding and Principal Community Detection

arXiv.org Machine Learning

In this paper, we introduce the concept of principal communities and propose a principal graph encoder embedding method that concurrently detects these communities and achieves vertex embedding. Given a graph adjacency matrix with vertex labels, the method computes a sample community score for each community, ranking them to measure community importance and estimate a set of principal communities. The method then produces a vertex embedding by retaining only the dimensions corresponding to these principal communities. Theoretically, we define the population version of the encoder embedding and the community score based on a random Bernoulli graph distribution. We prove that the population principal graph encoder embedding preserves the conditional density of the vertex labels and that the population community score successfully distinguishes the principal communities. We conduct a variety of simulations to demonstrate the finite-sample accuracy in detecting ground-truth principal communities, as well as the advantages in embedding visualization and subsequent vertex classification. The method is further applied to a set of real-world graphs, showcasing its numerical advantages, including robustness to label noise and computational scalability.
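
A rough sketch of the idea: the base encoder embedding below follows the published one-hot construction (adjacency matrix times a column-normalized one-hot label matrix), while the dimension-ranking score is a hypothetical stand-in; the paper defines its community score at the population level and it need not coincide with this simple between-class variance.

    import numpy as np

    def graph_encoder_embedding(A, labels, K):
        # A: n x n adjacency matrix; labels: length-n integers in {0, ..., K-1}.
        n = A.shape[0]
        counts = np.bincount(labels, minlength=K).astype(float)
        W = np.zeros((n, K))
        W[np.arange(n), labels] = 1.0 / counts[labels]  # column-normalized one-hot labels
        return A @ W                                    # n x K vertex embedding

    # Hypothetical community score: rank each embedding dimension by how far apart
    # the class means are along it, then keep only the top-ranked ("principal") dimensions.
    def principal_dimensions(Z, labels, num_keep):
        classes = np.unique(labels)
        score = np.array([np.var([Z[labels == c, k].mean() for c in classes])
                          for k in range(Z.shape[1])])
        keep = np.argsort(score)[::-1][:num_keep]
        return Z[:, keep], keep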


Efficient Graph Encoder Embedding for Large Sparse Graphs in Python

arXiv.org Artificial Intelligence

Graphs are a ubiquitous representation of data across research fields, and graph embedding is a prevalent machine learning technique for capturing key features and generating fixed-size attributes. However, most state-of-the-art graph embedding methods are computationally and spatially expensive. Recently, the Graph Encoder Embedding (GEE) has been shown to be the fastest graph embedding technique and is suitable for a variety of network data applications. As real-world data often involves large and sparse graphs, this sparsity usually results in redundant computation and storage. To address this issue, we propose an improved version of GEE, sparse GEE, which optimizes the handling of zero entries in sparse matrices to further reduce running time. Our experiments demonstrate that the sparse version achieves a significant speedup over the original Python implementation of GEE on large sparse graphs, and that sparse GEE can process millions of edges within minutes on a standard laptop.
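
A minimal sketch of the edgelist idea, assuming zero-indexed vertices and an undirected graph with each edge stored once; the actual sparse GEE implementation handles more cases, but the core point is that only nonzero entries are ever touched.

    import numpy as np

    def sparse_gee(edges, labels, K):
        # edges: (s, 3) array of [source, target, weight] rows; labels: length-n integers.
        n = len(labels)
        counts = np.bincount(labels, minlength=K).astype(float)
        Z = np.zeros((n, K))
        for u, v, w in edges:
            u, v = int(u), int(v)
            # Each nonzero edge updates only the two endpoint rows, scaled by community size.
            Z[u, labels[v]] += w / counts[labels[v]]
            Z[v, labels[u]] += w / counts[labels[u]]
        return Z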


Fast and Scalable Multi-Kernel Encoder Classifier

arXiv.org Artificial Intelligence

This paper introduces a new kernel-based classifier by viewing kernel matrices as generalized graphs and leveraging recent progress in graph embedding techniques. The proposed method facilitates fast and scalable kernel matrix embedding, and seamlessly integrates multiple kernels to enhance the learning process. Our theoretical analysis offers a population-level characterization of this approach using random variables. Empirically, our method demonstrates superior running time compared to standard approaches such as support vector machines and two-layer neural networks, while achieving comparable classification accuracy across various simulated and real datasets.
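
A hedged sketch of how multiple kernels might be integrated, assuming each kernel matrix is embedded like a weighted graph and the resulting features are concatenated; the choice of RBF and polynomial kernels and the use of linear discriminant analysis here are illustrative assumptions, not the paper's exact construction.

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def kernel_encoder_features(K_mat, labels, num_classes):
        # Treat the kernel matrix as a weighted graph: each feature is the average
        # kernel similarity from a sample to the training samples of one class.
        W = np.zeros((len(labels), num_classes))
        counts = np.bincount(labels, minlength=num_classes).astype(float)
        W[np.arange(len(labels)), labels] = 1.0 / counts[labels]
        return K_mat @ W

    def multi_kernel_classify(X_train, y_train, X_test, num_classes):
        feats_train, feats_test = [], []
        for kern in (rbf_kernel, polynomial_kernel):
            K_train = kern(X_train, X_train)
            K_test = kern(X_test, X_train)   # similarities of test points to training points
            feats_train.append(kernel_encoder_features(K_train, y_train, num_classes))
            feats_test.append(kernel_encoder_features(K_test, y_train, num_classes))
        clf = LinearDiscriminantAnalysis().fit(np.hstack(feats_train), y_train)
        return clf.predict(np.hstack(feats_test))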


Encoder Embedding for General Graph and Node Classification

arXiv.org Machine Learning

Graph encoder embedding, a recent technique for graph data, offers speed and scalability in producing vertex-level representations from binary graphs. In this paper, we extend the applicability of this method to a general graph model, which includes weighted graphs, distance matrices, and kernel matrices. We prove that the encoder embedding satisfies the law of large numbers and the central limit theorem on a per-observation basis. Under certain conditions, it achieves asymptotic normality on a per-class basis, enabling optimal classification through discriminant analysis. These theoretical findings are validated through a series of experiments involving weighted graphs, as well as text and image data transformed into general graph representations using appropriate distance metrics.
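
For vector data such as vectorized text or images, one hedged reading of the general-graph extension is to build a pairwise distance matrix and reuse the same encoder step; the Euclidean metric below is an arbitrary illustrative choice, and the resulting per-class profiles could then feed the discriminant analysis mentioned above.

    import numpy as np
    from scipy.spatial.distance import cdist

    def distance_graph_embedding(X, labels, num_classes, metric="euclidean"):
        D = cdist(X, X, metric=metric)   # n x n distance matrix used as a general graph
        W = np.zeros((len(labels), num_classes))
        counts = np.bincount(labels, minlength=num_classes).astype(float)
        W[np.arange(len(labels)), labels] = 1.0 / counts[labels]
        return D @ W                     # each row: average distance to every class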


Refined Graph Encoder Embedding via Self-Training and Latent Community Recovery

arXiv.org Machine Learning

This paper introduces a refined graph encoder embedding method, enhancing the original graph encoder embedding using linear transformation, self-training, and hidden community recovery within observed communities. We provide the theoretical rationale for the refinement procedure, demonstrating how and why our proposed method can effectively identify useful hidden communities via stochastic block models, and how the refinement method leads to improved vertex embedding and better decision boundaries for subsequent vertex classification. The efficacy of our approach is validated through a collection of simulated and real-world graph data.
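
A hypothetical sketch of the latent-community-recovery step, assuming hidden communities are sought by splitting each observed community in embedding space with k-means and accepting only balanced splits; the paper's actual acceptance criterion, linear transformation, and self-training steps may differ.

    import numpy as np
    from sklearn.cluster import KMeans

    def refine_labels(Z, labels, min_size=10, balance=0.2):
        # Z: current vertex embedding; labels: observed community labels.
        refined = labels.copy()
        next_label = labels.max() + 1
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            if len(idx) < min_size:
                continue
            km = KMeans(n_clusters=2, n_init=10).fit(Z[idx])
            sizes = np.bincount(km.labels_, minlength=2)
            if sizes.min() > balance * len(idx):   # accept only reasonably balanced splits
                refined[idx[km.labels_ == 1]] = next_label
                next_label += 1
        return refined   # refined labels can then be fed back into the encoder embedding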


Edge-Parallel Graph Encoder Embedding

arXiv.org Artificial Intelligence

New algorithms for embedding graphs have reduced the asymptotic complexity of finding low-dimensional representations. One-Hot Graph Encoder Embedding (GEE) uses a single, linear pass over edges and produces an embedding that converges asymptotically to the spectral embedding. The scaling and performance benefits of this approach have been limited by a serial implementation in an interpreted language. We refactor GEE into a parallel program in the Ligra graph engine that maps functions over the edges of the graph and uses lock-free atomic instructions to prevent data races. On a graph with 1.8B edges, this results in a 500 times speedup over the original implementation and a 17 times speedup over a just-in-time compiled version.


Discovering Communication Pattern Shifts in Large-Scale Labeled Networks using Encoder Embedding and Vertex Dynamics

arXiv.org Machine Learning

Analyzing large-scale time-series network data, such as social media and email communications, poses a significant challenge in understanding social dynamics, detecting anomalies, and predicting trends. In particular, the scalability of graph analysis is a critical hurdle impeding progress in large-scale downstream inference. To address this challenge, we introduce a temporal encoder embedding method. This approach leverages ground-truth or estimated vertex labels, enabling an efficient embedding of large-scale graph data and the processing of billions of edges within minutes. Furthermore, this embedding unveils a temporal dynamic statistic capable of detecting communication pattern shifts across all levels, ranging from individual vertices to vertex communities and the overall graph structure. We provide theoretical support to confirm its soundness under random graph models, and demonstrate its numerical advantages in capturing evolving communities and identifying outliers. Finally, we showcase the practical application of our approach by analyzing an anonymized time-series communication network from a large organization spanning 2019-2020, enabling us to assess the impact of Covid-19 on workplace communication patterns.
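
A minimal sketch of one way to obtain a per-vertex dynamic from the temporal embedding, assuming fixed labels across snapshots and measuring how far each vertex's embedding row moves between consecutive time steps; the paper's dynamic statistic may be normalized or defined differently.

    import numpy as np

    def vertex_dynamics(snapshots, labels, K):
        # snapshots: list of n x n adjacency matrices, one per time step; labels fixed over time.
        n = len(labels)
        counts = np.bincount(labels, minlength=K).astype(float)
        W = np.zeros((n, K))
        W[np.arange(n), labels] = 1.0 / counts[labels]
        Z_prev, shifts = None, []
        for A in snapshots:
            Z = A @ W                                # encoder embedding at this time step
            if Z_prev is not None:
                shifts.append(np.linalg.norm(Z - Z_prev, axis=1))   # per-vertex shift
            Z_prev = Z
        return np.array(shifts)   # (num_steps - 1) x n; average by community or over all vertices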


Learning sources of variability from high-dimensional observational studies

arXiv.org Machine Learning

Causal inference studies whether the presence of a variable influences an observed outcome. As measured by quantities such as the "average treatment effect," this paradigm is employed across numerous biological fields, from vaccine and drug development to policy interventions. Unfortunately, the majority of these methods are often limited to univariate outcomes. Our work generalizes causal estimands to outcomes with any number of dimensions or any measurable space, and formulates traditional causal estimands for nominal variables as causal discrepancy tests. We propose a simple technique for adjusting universally consistent conditional independence tests and prove that these tests are universally consistent causal discrepancy tests. Numerical experiments illustrate that our method, Causal CDcorr, leads to improvements in both finite sample validity and power when compared to existing strategies. Our methods are all open source and available at github.com/ebridge2/cdcorr.
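
A simplified stand-in for the pipeline, not the paper's exact method: trim to the propensity-overlap region, then test dependence between the (possibly multivariate) outcome and the treatment indicator with a plain distance correlation. Causal CDcorr itself adjusts a conditional independence test, and the thresholds and estimators below are assumptions for illustration.

    import numpy as np
    from scipy.spatial.distance import cdist
    from sklearn.linear_model import LogisticRegression

    def dcorr(x, y):
        # Biased sample distance correlation via double-centered distance matrices.
        def center(D):
            return D - D.mean(axis=0) - D.mean(axis=1, keepdims=True) + D.mean()
        A, B = center(cdist(x, x)), center(cdist(y, y))
        return (A * B).mean() / np.sqrt((A * A).mean() * (B * B).mean())

    def causal_discrepancy_sketch(Y, T, X, lo=0.1, hi=0.9):
        # Y: (n, d) outcomes; T: length-n binary treatment; X: (n, p) covariates.
        prop = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
        keep = (prop > lo) & (prop < hi)     # positivity / overlap trimming
        return dcorr(Y[keep], T[keep].reshape(-1, 1).astype(float))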


Graph Encoder Ensemble for Simultaneous Vertex Embedding and Community Detection

arXiv.org Machine Learning

Typically, a graph (or network) is represented by an adjacency matrix A of size n × n, where A(i, j) denotes the edge weight between the ith and jth vertices. Alternatively, the graph can be stored in an edgelist E of size s × 3, where s is the number of edges, with the first two columns indicating the vertex indices of each edge and the last column representing the edge weight. Community detection, also known as vertex clustering or graph partitioning, is a fundamental problem in graph analysis [6, 8, 10, 13]. The primary objective is to identify natural groups of vertices where intra-group connections are stronger than inter-group connections. Over the years, various approaches have been proposed, including modularity-based methods [2, 22], spectral-based methods [15, 21], and likelihood-based techniques [1, 7], among others.
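
To make the two storage formats concrete, a small helper that converts an adjacency matrix into the s × 3 edgelist described above (assuming an undirected graph so each edge is written once):

    import numpy as np

    def adjacency_to_edgelist(A):
        rows, cols = np.nonzero(np.triu(A))   # keep each undirected edge (and self-loop) once
        return np.column_stack([rows, cols, A[rows, cols]])   # columns: source, target, weight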