Faloutsos, Christos
GLEMOS: Benchmark for Instantaneous Graph Learning Model Selection
Park, Namyong, Rossi, Ryan, Wang, Xing, Simoulin, Antoine, Ahmed, Nesreen, Faloutsos, Christos
The choice of a graph learning (GL) model (i.e., a GL algorithm and its hyperparameter settings) has a significant impact on the performance of downstream tasks. However, selecting the right GL model becomes increasingly difficult and time-consuming as more and more GL models are developed. Accordingly, it is of great significance and practical value to equip users of GL with the ability to perform a near-instantaneous selection of an effective GL model without manual intervention. Despite recent attempts to tackle this important problem, there has been no comprehensive benchmark environment to evaluate the performance of GL model selection methods. To bridge this gap, we present GLEMOS, a comprehensive benchmark for instantaneous GL model selection.
NetInfoF Framework: Measuring and Exploiting Network Usable Information
Lee, Meng-Chieh, Yu, Haiyang, Zhang, Jian, Ioannidis, Vassilis N., Song, Xiang, Adeshina, Soji, Zheng, Da, Faloutsos, Christos
Given a node-attributed graph and a graph task (link prediction or node classification), can we tell whether a graph neural network (GNN) will perform well? More specifically, do the graph structure and the node features carry enough usable information for the task? Our goals are (1) to develop a fast tool to measure how much information is in the graph structure and in the node features, and (2) to exploit that information to solve the task when there is enough of it. We propose NetInfoF, a framework including NetInfoF_Probe and NetInfoF_Act for the measurement and the exploitation of network usable information (NUI), respectively. Given a graph, NetInfoF_Probe measures NUI without any model training, and NetInfoF_Act solves link prediction and node classification, with the two modules sharing the same backbone. In summary, NetInfoF has the following notable advantages: (a) General: it handles both link prediction and node classification; (b) Principled: it comes with a theoretical guarantee and a closed-form solution; (c) Effective: thanks to the proposed adjustment to node similarity; (d) Scalable: it scales linearly with the input size. On our carefully designed synthetic datasets, NetInfoF correctly identifies the ground truth of NUI and is the only method robust to all graph scenarios. Applied to real-world datasets, NetInfoF wins 11 out of 12 times on link prediction compared to general GNN baselines.
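As a rough illustration of probing a graph for usable information before training any GNN, here is a minimal sketch (not NetInfoF_Probe itself, which comes with a theoretical guarantee and a closed-form solution): it compares how well raw node features and neighbor-averaged features predict labels with a cheap linear model. The probe design, thresholds, and toy data below are assumptions for the sketch.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def quick_probe(features, edges, labels, n_nodes):
    # Neighbor-averaged features: a crude stand-in for structural information.
    agg = np.zeros_like(features)
    deg = np.zeros(n_nodes)
    for u, v in edges:
        agg[u] += features[v]
        agg[v] += features[u]
        deg[u] += 1
        deg[v] += 1
    agg /= np.maximum(deg, 1)[:, None]

    clf = LogisticRegression(max_iter=1000)
    return {
        "features_only": cross_val_score(clf, features, labels, cv=3).mean(),
        "neighbor_averaged": cross_val_score(clf, agg, labels, cv=3).mean(),
        "majority_baseline": np.bincount(labels).max() / len(labels),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 16))                    # toy node features
    y = rng.integers(0, 3, size=300)                  # toy labels (3 classes)
    E = [tuple(e) for e in rng.integers(0, 300, size=(900, 2))]
    print(quick_probe(X, E, y, n_nodes=300))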
McCatch: Scalable Microcluster Detection in Dimensional and Nondimensional Datasets
Vinces, Braulio V. Sánchez, Cordeiro, Robson L. F., Faloutsos, Christos
How could we have an outlier detector that works even with nondimensional data and ranks together both singleton microclusters ('one-off' outliers) and nonsingleton microclusters by their anomaly scores? How can we obtain scores that are principled, in a scalable and 'hands-off' manner? Microclusters of outliers indicate coalition or repetition in fraud activities, among others; their identification is thus highly desirable. This paper presents McCatch: a new algorithm that detects microclusters by leveraging our proposed 'Oracle' plot (1NN Distance versus Group 1NN Distance). We study 31 real and synthetic datasets with up to 1M data elements to show that McCatch is the only method that answers both of the questions above, and that it outperforms 11 other methods, especially when the data has nonsingleton microclusters or is nondimensional. We also showcase McCatch's ability to detect meaningful microclusters in graphs, fingerprints, logs of network connections, text data, and satellite imagery. For example, it found a 30-element microcluster of confirmed 'Denial of Service' attacks in the network logs, taking only ~3 minutes for 222K data elements on a stock desktop.
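To make the 'Oracle' plot coordinates concrete, here is an illustrative sketch for dimensional data only: each point's 1NN distance is paired with the 1NN distance of the small group it belongs to. The grouping rule and distance cutoff below are simplifying assumptions, not McCatch's actual procedure.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def oracle_plot_coordinates(X, group_cutoff_quantile=0.9):
    D = cdist(X, X)                       # pairwise distances
    np.fill_diagonal(D, np.inf)
    nn_dist = D.min(axis=1)               # per-point 1NN distance

    # Assumed grouping rule: link points closer than a distance cutoff.
    cutoff = np.quantile(nn_dist, group_cutoff_quantile)
    _, labels = connected_components(csr_matrix(D <= cutoff), directed=False)

    # Group 1NN distance: nearest distance from a point's group to any
    # point outside that group.
    group_nn_dist = np.empty(len(X))
    for g in np.unique(labels):
        inside = labels == g
        if inside.all():                  # one single group: no outside points
            group_nn_dist[inside] = np.inf
            continue
        group_nn_dist[inside] = D[np.ix_(inside, ~inside)].min()
    return nn_dist, group_nn_dist, labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (200, 2)),     # inliers
                   rng.normal(8, 0.05, (5, 2)),    # nonsingleton microcluster
                   [[15.0, 15.0]]])                # 'one-off' outlier
    nn, gnn, labels = oracle_plot_coordinates(X)
    # Points whose group sits far from everything else stand out on both axes.
    print(np.argsort(-np.minimum(nn, gnn))[:10])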
Automatic Question-Answer Generation for Long-Tail Knowledge
Kumar, Rohan, Kim, Youngmin, Ravi, Sunitha, Sun, Haitian, Faloutsos, Christos, Salakhutdinov, Ruslan, Yoon, Minji
Pretrained Large Language Models (LLMs) have gained significant attention for addressing open-domain Question Answering (QA). While they exhibit high accuracy in answering questions about common knowledge, LLMs struggle to learn uncommon, long-tail knowledge (tail entities). Since manually constructing QA datasets demands substantial human resources, existing QA datasets are limited in variety, leaving a scarcity of datasets for studying LLM performance on tail entities. In this paper, we propose an automatic approach to generate specialized QA datasets for tail entities and present the associated research challenges. We conduct extensive experiments with pretrained LLMs on our newly generated long-tail QA datasets, comparing their performance with and without external resources, including Wikipedia and Wikidata knowledge graphs.
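As a very rough, hedged sketch of the general idea of turning knowledge-graph facts about tail entities into question-answer pairs, the snippet below templatizes (subject, relation, object) triples. The relations, templates, and entities are invented for illustration; the paper's generation pipeline is more involved than this.

# The relations and templates here are made up for the sketch.
TEMPLATES = {
    "occupation": "What is the occupation of {subject}?",
    "place_of_birth": "Where was {subject} born?",
    "author": "Who is the author of {subject}?",
}

def triples_to_qa(triples):
    """triples: iterable of (subject, relation, object) strings."""
    qa_pairs = []
    for subject, relation, obj in triples:
        template = TEMPLATES.get(relation)
        if template is None:
            continue                      # skip relations without a template
        qa_pairs.append({"question": template.format(subject=subject),
                         "answer": obj})
    return qa_pairs

if __name__ == "__main__":
    sample = [("Alice Example", "occupation", "botanist"),   # made-up tail entity
              ("An Obscure Novel", "author", "B. Writer")]
    for qa in triples_to_qa(sample):
        print(qa)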
EBV: Electronic Bee-Veterinarian for Principled Mining and Forecasting of Honeybee Time Series
Hossain, Mst. Shamima, Faloutsos, Christos, Baer, Boris, Kim, Hyoseung, Tsotras, Vassilis J.
Honeybees are vital for pollination and food production. Among many factors, extreme temperature (e.g., due to climate change) is particularly dangerous for bee health. Anticipating such extremes would allow beekeepers to take early preventive action. Thus, given sensor (temperature) time series data from beehives, how can we find patterns and do forecasting? Forecasting is crucial, as it helps spot unexpected behavior and thus issue warnings to beekeepers. In that case, what are the right models for forecasting: ARIMA, RNNs, or something else? We propose the EBV (Electronic Bee-Veterinarian) method, which has the following desirable properties: (i) principled: it is based on (a) diffusion equations from physics and (b) control theory for feedback-loop controllers; (ii) effective: it works well on multiple real-world time sequences; (iii) explainable: it needs only a handful of parameters (e.g., bee strength) that beekeepers can easily understand and trust; and (iv) scalable: it runs in time linear in the input size. We applied our method to multiple real-world time sequences and found that it yields accurate forecasting (up to 49% improvement in RMSE over baselines) and segmentation. Specifically, discontinuities detected by EBV mostly coincide with domain experts' opinions, showcasing our approach's potential and practical feasibility. Moreover, EBV is scalable and fast, taking about 20 minutes on a stock laptop to reconstruct two months of sensor data.
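To make the two ingredients concrete, here is a toy simulation that combines a heat-exchange (cooling/diffusion) term with a feedback-loop control term standing in for the bees' thermoregulation. The parameter names and the proportional controller are assumptions for this sketch, not EBV's actual model.

import numpy as np

def simulate_hive_temperature(t_ambient, t0=35.0, t_target=35.0,
                              k_diffusion=0.05, bee_strength=0.8, dt=1.0):
    """Forward-simulate hive temperature given an ambient-temperature series."""
    t_hive = np.empty(len(t_ambient))
    t_hive[0] = t0
    for i in range(1, len(t_ambient)):
        leak = k_diffusion * (t_ambient[i - 1] - t_hive[i - 1])   # heat exchange with outside
        control = bee_strength * (t_target - t_hive[i - 1])       # bees push back toward target
        t_hive[i] = t_hive[i - 1] + dt * (leak + control)
    return t_hive

if __name__ == "__main__":
    hours = np.arange(0, 48)
    ambient = 20 + 10 * np.sin(2 * np.pi * hours / 24)   # daily temperature cycle
    print(simulate_hive_temperature(ambient)[:5])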
Which Examples to Annotate for In-Context Learning? Towards Effective and Efficient Selection
Mavromatis, Costas, Srinivasan, Balasubramaniam, Shen, Zhengyuan, Zhang, Jiani, Rangwala, Huzefa, Faloutsos, Christos, Karypis, George
Large Language Models (LLMs) can adapt to new tasks via in-context learning (ICL). ICL is efficient as it requires no parameter updates to the trained LLM, only a few annotated examples as input. In this work, we investigate an active learning approach for ICL, where there is a limited budget for annotating examples. We propose a model-adaptive, optimization-free algorithm, termed AdaICL, which identifies examples that the model is uncertain about and performs semantic-diversity-based example selection. Diversity-based sampling improves overall effectiveness, while uncertainty sampling improves budget efficiency and helps the LLM learn new information. Moreover, AdaICL poses its sampling strategy as a Maximum Coverage problem that dynamically adapts based on the model's feedback and can be approximately solved via greedy algorithms. Extensive experiments on nine datasets and seven LLMs show that AdaICL improves performance by 4.4 accuracy points over SOTA (a 7.7% relative improvement), is up to 3x more budget-efficient than annotating examples uniformly at random, and outperforms SOTA with 2x fewer ICL examples.
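A minimal sketch of the "uncertainty plus diversity as greedy maximum coverage" idea follows: each candidate annotation covers the uncertain examples that are semantically close to it, and we greedily pick the candidate that covers the most still-uncovered examples. The similarity threshold, uncertainty cutoff, and scoring are assumptions, not AdaICL's exact formulation.

import numpy as np

def greedy_coverage_selection(embeddings, uncertainty, budget,
                              sim_threshold=0.8, uncertainty_quantile=0.5):
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T
    # Only examples the model is uncertain about need to be "covered".
    uncertain = uncertainty >= np.quantile(uncertainty, uncertainty_quantile)
    covers = (sim >= sim_threshold) & uncertain[None, :]

    selected, covered = [], np.zeros(len(embeddings), dtype=bool)
    for _ in range(budget):
        gains = (covers & ~covered[None, :]).sum(axis=1)
        gains[selected] = -1              # never pick the same example twice
        best = int(np.argmax(gains))
        if gains[best] <= 0:              # nothing left to cover
            break
        selected.append(best)
        covered |= covers[best]
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(500, 32))      # stand-in sentence embeddings
    unc = rng.uniform(size=500)           # stand-in model uncertainty scores
    print(greedy_coverage_selection(emb, unc, budget=10))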
Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space
Zhang, Hengrui, Zhang, Jiani, Srinivasan, Balasubramaniam, Shen, Zhengyuan, Qin, Xiao, Faloutsos, Christos, Rangwala, Huzefa, Karypis, George
Recent advances in tabular data generation have greatly enhanced synthetic data quality. However, extending diffusion models to tabular data is challenging due to the intricately varied distributions and the mix of data types in tabular data. This paper introduces TABSYN, a methodology that synthesizes tabular data by leveraging a diffusion model within a latent space crafted by a variational autoencoder (VAE). The key advantages of the proposed TABSYN include (1) Generality: the ability to handle a broad spectrum of data types by converting them into a single unified space and to explicitly capture inter-column relations; (2) Quality: optimizing the distribution of latent embeddings to enhance the subsequent training of diffusion models, which helps generate high-quality synthetic data; (3) Speed: far fewer reverse steps and faster synthesis than existing diffusion-based methods. Extensive experiments on six datasets with five metrics demonstrate that TABSYN outperforms existing methods. Specifically, it reduces the error rates by 86% and 67% for column-wise distribution and pair-wise column correlation estimation, respectively, compared with the most competitive baselines.
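A conceptual sketch of the two-stage recipe follows: a small VAE maps preprocessed rows (numeric plus one-hot categorical columns) into one continuous latent space, and a noise-prediction network is then trained on those latents. The network sizes, noise schedule, and loss weights are placeholders for the sketch, not TABSYN's architecture.

import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    def __init__(self, in_dim, latent_dim=16, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return self.dec(z), mu, logvar

def train_latent_diffusion(rows, latent_dim=16, steps=200):
    """rows: (N, D) tensor of numeric + one-hot-encoded columns."""
    # Stage 1: fit the latent space.
    vae = TabularVAE(rows.shape[1], latent_dim)
    opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
    for _ in range(steps):
        recon, mu, logvar = vae(rows)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        loss = (recon - rows).pow(2).mean() + 1e-3 * kl
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        latents = vae.enc(rows).chunk(2, dim=-1)[0]   # use posterior means

    # Stage 2: train a noise-prediction network on the latents.
    denoiser = nn.Sequential(nn.Linear(latent_dim + 1, 64), nn.ReLU(),
                             nn.Linear(64, latent_dim))
    opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)
    for _ in range(steps):
        t = torch.rand(latents.shape[0], 1)           # noise level in (0, 1)
        noise = torch.randn_like(latents)
        noisy = (1 - t) * latents + t * noise         # simple interpolation schedule
        pred = denoiser(torch.cat([noisy, t], dim=-1))
        loss = (pred - noise).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return vae, denoiser

if __name__ == "__main__":
    fake_rows = torch.randn(256, 10)                  # stand-in for preprocessed table rows
    vae, denoiser = train_latent_diffusion(fake_rows)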
TouchUp-G: Improving Feature Representation through Graph-Centric Finetuning
Zhu, Jing, Song, Xiang, Ioannidis, Vassilis N., Koutra, Danai, Faloutsos, Christos
How can we enhance the node features acquired from Pretrained Models (PMs) to better suit downstream graph learning tasks? Graph Neural Networks (GNNs) have become the state-of-the-art approach for many high-impact, real-world graph applications. For feature-rich graphs, a prevalent practice is to use a PM directly to generate features, without incorporating any domain adaptation techniques. Nevertheless, this practice is suboptimal: the node features extracted from the PM are graph-agnostic and prevent GNNs from fully exploiting the potential correlations between the graph structure and node features, leading to a decline in GNN performance. In this work, we seek to improve the node features obtained from a PM for downstream graph tasks and introduce TOUCHUP-G, which has several advantages. It is (a) General: applicable to any downstream graph task, including link prediction, which is often employed in recommender systems; (b) Multi-modal: able to improve raw features of any modality (e.g., images, texts, audio); (c) Principled: closely tied to a novel metric, feature homophily, which we propose to quantify the potential correlation between the graph structure and node features; we show that TOUCHUP-G can effectively shrink the discrepancy between the two; (d) Effective: achieving state-of-the-art results on four real-world datasets spanning different tasks and modalities.
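One simple way to make the spirit of feature homophily concrete is to compare feature similarity across edges with feature similarity across random node pairs, as sketched below; the metric actually proposed in the paper may be defined differently, so treat this only as an illustrative proxy.

import numpy as np

def feature_structure_agreement(features, edges, n_random=10000, seed=0):
    """features: (N, d) array; edges: (E, 2) array of node index pairs."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    edge_sim = (f[edges[:, 0]] * f[edges[:, 1]]).sum(axis=1).mean()

    rng = np.random.default_rng(seed)
    u = rng.integers(0, len(f), n_random)
    v = rng.integers(0, len(f), n_random)
    random_sim = (f[u] * f[v]).sum(axis=1).mean()
    return edge_sim - random_sim          # > 0: features align with the structure

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 8))                     # toy node features
    E = rng.integers(0, 100, size=(400, 2))           # toy edge list
    print(feature_structure_agreement(X, E))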
Discovery and Exploitation of Generalized Network Effects
Lee, Meng-Chieh, Shekhar, Shubhranshu, Yoo, Jaemin, Faloutsos, Christos
Given a large graph with few node labels, how can we (a) identify whether the graph exhibits generalized network effects (GNE), (b) estimate GNE to explain the interrelations among node classes, and (c) exploit GNE to improve downstream tasks such as predicting unknown labels accurately and efficiently? Knowledge of GNE is valuable for various tasks like node classification and targeted advertising. However, identifying and understanding GNE, such as homophily, heterophily, or their combination, is challenging in real-world graphs due to the limited availability of node labels and noisy edges. We propose NetEffect, a graph mining approach that addresses the above issues and enjoys the following properties: (i) Principled: a statistical test to determine the presence of GNE in a graph with few node labels; (ii) General and Explainable: a closed-form solution to estimate the specific type of GNE observed; and (iii) Accurate and Scalable: the integration of GNE for accurate and fast node classification. Applied to public, real-world graphs, NetEffect discovers an unexpected absence of GNE in numerous graphs that were previously thought to exhibit heterophily. Further, we show that incorporating GNE is effective for node classification. On a large real-world graph with 1.6M nodes and 22.3M edges, NetEffect achieves a speedup of over 7x (14 minutes vs. 2 hours) over most competitors.
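The following small sketch estimates a class-compatibility matrix from the few labeled nodes: a diagonal-heavy matrix suggests homophily, an off-diagonal-heavy one suggests heterophily. It is a plain empirical count, not NetEffect's closed-form estimator or its statistical test.

import numpy as np

def empirical_compatibility(edges, labels, n_classes):
    """labels[i] is the class of node i, or -1 if unlabeled."""
    H = np.zeros((n_classes, n_classes))
    for u, v in edges:
        if labels[u] >= 0 and labels[v] >= 0:         # use only fully labeled edges
            H[labels[u], labels[v]] += 1
            H[labels[v], labels[u]] += 1
    row_sums = H.sum(axis=1, keepdims=True)
    return np.divide(H, row_sums, out=np.zeros_like(H), where=row_sums > 0)

if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
    labels = np.array([0, 1, 0, 1, -1])               # one node left unlabeled
    print(empirical_compatibility(edges, labels, n_classes=2))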
Concept2Box: Joint Geometric Embeddings for Learning Two-View Knowledge Graphs
Huang, Zijie, Wang, Daheng, Huang, Binxuan, Zhang, Chenwei, Shang, Jingbo, Liang, Yan, Wang, Zhengyang, Li, Xian, Faloutsos, Christos, Sun, Yizhou, Wang, Wei
Knowledge graph embeddings (KGE) have been extensively studied to embed large-scale relational data for many real-world applications. Existing methods have long ignored the fact that many KGs contain two fundamentally different views: high-level ontology-view concepts and fine-grained instance-view entities. They usually embed all nodes as vectors in one latent space. However, a single geometric representation fails to capture the structural differences between the two views and lacks probabilistic semantics for concept granularity. We propose Concept2Box, a novel approach that jointly embeds the two views of a KG using dual geometric representations. We model concepts with box embeddings, which learn the hierarchy structure and complex relations such as overlap and disjointness among them. Box volumes can be interpreted as concepts' granularity. Unlike concepts, we model entities as vectors. To bridge the gap between concept box embeddings and entity vector embeddings, we propose a novel vector-to-box distance metric and learn both embeddings jointly. Experiments on both the public DBpedia KG and a newly created industrial KG show the effectiveness of Concept2Box.
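To make the geometry concrete, the sketch below represents a concept box by per-dimension (min, max) intervals, uses box volume as a proxy for granularity, and computes a simple vector-to-box distance (zero inside the box, distance to the nearest face outside). The actual metric learned in Concept2Box differs; this is only illustrative.

import numpy as np

def box_volume(box_min, box_max):
    return float(np.prod(np.maximum(box_max - box_min, 0.0)))

def vector_to_box_distance(point, box_min, box_max):
    # Component-wise distance to the interval [box_min, box_max].
    below = np.maximum(box_min - point, 0.0)
    above = np.maximum(point - box_max, 0.0)
    return float(np.linalg.norm(below + above))

if __name__ == "__main__":
    concept_min = np.array([0.0, 0.0])                # a toy 2-D "concept" box
    concept_max = np.array([2.0, 1.0])
    entity_inside = np.array([1.0, 0.5])
    entity_outside = np.array([3.0, 2.0])
    print(box_volume(concept_min, concept_max))                               # 2.0
    print(vector_to_box_distance(entity_inside, concept_min, concept_max))    # 0.0
    print(vector_to_box_distance(entity_outside, concept_min, concept_max))   # sqrt(1^2 + 1^2)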