reference graph
KGpipe: Generation and Evaluation of Pipelines for Data Integration into Knowledge Graphs
Building high-quality knowledge graphs (KGs) from diverse sources requires combining methods for information extraction, data transformation, ontology mapping, entity matching, and data fusion. Numerous methods and tools exist for each of these tasks, but support for combining them into reproducible and effective end-to-end pipelines is still lacking. We present a new framework, KGpipe for defining and executing integration pipelines that can combine existing tools or LLM (Large Language Model) functionality. To evaluate different pipelines and the resulting KGs, we propose a benchmark to integrate heterogeneous data of different formats (RDF, JSON, text) into a seed KG. We demonstrate the flexibility of KGpipe by running and comparatively evaluating several pipelines integrating sources of the same or different formats using selected performance and quality metrics.
Practical Causal Evaluation Metrics for Biological Networks
Sato, Noriaki, Scutari, Marco, Kawano, Shuichi, Yamaguchi, Rui, Imoto, Seiya
Estimating causal networks from biological data is a critical step in systems biology. When evaluating the inferred network, assessing the networks based on their intervention effects is particularly important for downstream probabilistic reasoning and the identification of potential drug targets. In the context of gene regulatory network inference, biological databases are often used as reference sources. These databases typically describe relationships in a qualitative rather than quantitative manner. However, few evaluation metrics have been developed that take this qualitative nature into account. To address this, we developed a metric, the sign-augmented Structural Intervention Distance (sSID), and a weighted sSID that incorporates the net effects of the intervention. Through simulations and analyses of real transcriptomic datasets, we found that our proposed metrics could identify a different algorithm as optimal compared to conventional metrics, and the network selected by sSID had a superior performance in the classification task of clinical covariates using transcriptomic data. This suggests that sSID can distinguish networks that are structurally correct but functionally incorrect, highlighting its potential as a more biologically meaningful and practical evaluation metric.
Unpacking Hateful Memes: Presupposed Context and False Claims
Cai, Weibin, Li, Jiayu, Zafarani, Reza
While memes are often humorous, they are frequently used to disseminate hate, causing serious harm to individuals and society. Current approaches to hateful meme detection mainly rely on pre-trained language models. However, less focus has been dedicated to \textit{what make a meme hateful}. Drawing on insights from philosophy and psychology, we argue that hateful memes are characterized by two essential features: a \textbf{presupposed context} and the expression of \textbf{false claims}. To capture presupposed context, we develop \textbf{PCM} for modeling contextual information across modalities. To detect false claims, we introduce the \textbf{FACT} module, which integrates external knowledge and harnesses cross-modal reference graphs. By combining PCM and FACT, we introduce \textbf{\textsf{SHIELD}}, a hateful meme detection framework designed to capture the fundamental nature of hate. Extensive experiments show that SHIELD outperforms state-of-the-art methods across datasets and metrics, while demonstrating versatility on other tasks, such as fake news detection.
It Takes a Graph to Know a Graph: Rewiring for Homophily with a Reference Graph
Mendelman, Harel, Maron, Haggai, Talmon, Ronen
Graph Neural Networks (GNNs) excel at analyzing graph-structured data but struggle on heterophilic graphs, where connected nodes often belong to different classes. While this challenge is commonly addressed with specialized GNN architectures, graph rewiring remains an underexplored strategy in this context. We provide theoretical foundations linking edge homophily, GNN embedding smoothness, and node classification performance, motivating the need to enhance homophily. Building on this insight, we introduce a rewiring framework that increases graph homophily using a reference graph, with theoretical guarantees on the homophily of the rewired graph. To broaden applicability, we propose a label-driven diffusion approach for constructing a homophilic reference graph from node features and training labels. Through extensive simulations, we analyze how the homophily of both the original and reference graphs influences the rewired graph homophily and downstream GNN performance. We evaluate our method on 11 real-world heterophilic datasets and show that it outperforms existing rewiring techniques and specialized GNNs for heterophilic graphs, achieving improved node classification accuracy while remaining efficient and scalable to large graphs.
Synthesizing Diverse Network Flow Datasets with Scalable Dynamic Multigraph Generation
Grayeli, Arya, Swarup, Vipin, Noel, Steven E.
Obtaining real-world network datasets is often challenging because of privacy, security, and computational constraints. In the absence of such datasets, graph generative models become essential tools for creating synthetic datasets. In this paper, we introduce a novel machine learning model for generating high-fidelity synthetic network flow datasets that are representative of real-world networks. Our approach involves the generation of dynamic multigraphs using a stochastic Kronecker graph generator for structure generation and a tabular generative adversarial network for feature generation. We further employ an XGBoost (eXtreme Gradient Boosting) model for graph alignment, ensuring accurate overlay of features onto the generated graph structure. We evaluate our model using new metrics that assess both the accuracy and diversity of the synthetic graphs. Our results demonstrate improvements in accuracy over previous large-scale graph generation methods while maintaining similar efficiency. We also explore the trade-off between accuracy and diversity in synthetic graph dataset creation, a topic not extensively covered in related works. Our contributions include the synthesis and evaluation of large real-world netflow datasets and the definition of new metrics for evaluating synthetic graph generative models.
A Theoretical Analysis of Compositional Generalization in Neural Networks: A Necessary and Sufficient Condition
Compositional generalization is a crucial property in artificial intelligence, enabling models to handle novel combinations of known components. While most deep learning models lack this capability, certain models succeed in specific tasks, suggesting the existence of governing conditions. This paper derives a necessary and sufficient condition for compositional generalization in neural networks. Conceptually, it requires that (i) the computational graph matches the true compositional structure, and (ii) components encode just enough information in training. The condition is supported by mathematical proofs. This criterion combines aspects of architecture design, regularization, and training data properties. A carefully designed minimal example illustrates an intuitive understanding of the condition. We also discuss the potential of the condition for assessing compositional generalization before training. This work is a fundamental theoretical study of compositional generalization in neural networks.
Can Graph Neural Networks Expose Training Data Properties? An Efficient Risk Assessment Approach
Yuan, Hanyang, Xu, Jiarong, Huang, Renhong, Song, Mingli, Wang, Chunping, Yang, Yang
Graph neural networks (GNNs) have attracted considerable attention due to their diverse applications. However, the scarcity and quality limitations of graph data present challenges to their training process in practical settings. To facilitate the development of effective GNNs, companies and researchers often seek external collaboration. Yet, directly sharing data raises privacy concerns, motivating data owners to train GNNs on their private graphs and share the trained models. Unfortunately, these models may still inadvertently disclose sensitive properties of their training graphs (e.g., average default rate in a transaction network), leading to severe consequences for data owners. In this work, we study graph property inference attack to identify the risk of sensitive property information leakage from shared models. Existing approaches typically train numerous shadow models for developing such attack, which is computationally intensive and impractical. To address this issue, we propose an efficient graph property inference attack by leveraging model approximation techniques. Our method only requires training a small set of models on graphs, while generating a sufficient number of approximated shadow models for attacks. To enhance diversity while reducing errors in the approximated models, we apply edit distance to quantify the diversity within a group of approximated models and introduce a theoretically guaranteed criterion to evaluate each model's error. Subsequently, we propose a novel selection mechanism to ensure that the retained approximated models achieve high diversity and low error. Extensive experiments across six real-world scenarios demonstrate our method's substantial improvement, with average increases of 2.7% in attack accuracy and 4.1% in ROC-AUC, while being 6.5$\times$ faster compared to the best baseline.