WikiDBGraph: A Data Management Benchmark Suite for Collaborative Learning over Database Silos

Wu, Zhaomin, Wang, Ziyang, He, Bingsheng

Oct-28-2025–arXiv.org Artificial Intelligence

Relational databases are often fragmented across organizations, creating data silos that hinder distributed data management and mining. Collaborative learning (CL), a family of techniques that enable multiple parties to train models jointly without sharing raw data, offers a principled approach to this challenge. However, existing CL frameworks (e.g., federated and split learning) remain limited in real-world deployments. Current CL benchmarks and algorithms primarily target the learning step under assumptions of isolated, aligned, and joinable databases, and they typically neglect the end-to-end data management pipeline, especially preprocessing steps such as table joins and data alignment. In contrast, our analysis of the real-world corpus WikiDBs shows that databases are interconnected, unaligned, and sometimes unjoinable, exposing a significant gap between CL algorithm design and practical deployment. To close this evaluation gap, we build WikiDBGraph, a large-scale dataset constructed from 100,000 real-world relational databases linked by 17 million weighted edges. Each node (database) and edge (relationship) is annotated with 13 and 12 properties, respectively, capturing a hybrid of instance- and feature-level overlap across databases. Experiments on WikiDBGraph demonstrate both the effectiveness and limitations of existing CL methods under realistic conditions, highlighting previously overlooked gaps in managing real-world data silos and pointing to concrete directions for practical deployment of collaborative learning systems.

data mining, machine learning, natural language

Country:
- Asia > Singapore (0.14)

Genre:
- Research Report > New Finding (0.67)

Industry:
- Information Technology > Security & Privacy (0.46)

Technology:
- Information Technology
  - Information Management (1.00)
  - Databases (1.00)
  - Communications (1.00)
  - Data Science > Data Mining
    - Big Data (0.46)
  - Artificial Intelligence
    - Natural Language (1.00)
    - Machine Learning > Statistical Learning (0.68)
