WikiDBGraph: A Data Management Benchmark Suite for Collaborative Learning over Database Silos
Wu, Zhaomin, Wang, Ziyang, He, Bingsheng
arXiv.org Artificial Intelligence
Relational databases are often fragmented across organizations, creating data silos that hinder distributed data management and mining. Collaborative learning (CL), which comprises techniques that enable multiple parties to train models jointly without sharing raw data, offers a principled approach to this challenge. However, existing CL frameworks (e.g., federated and split learning) remain limited in real-world deployments. Current CL benchmarks and algorithms primarily target the learning step under assumptions of isolated, aligned, and joinable databases, and they typically neglect the end-to-end data management pipeline, especially preprocessing steps such as table joins and data alignment. In contrast, our analysis of the real-world corpus WikiDBs shows that databases are interconnected, unaligned, and sometimes unjoinable, exposing a significant gap between CL algorithm design and practical deployment. To close this evaluation gap, we build WikiDBGraph, a large-scale dataset constructed from 100,000 real-world relational databases linked by 17 million weighted edges. Each node (database) and edge (relationship) is annotated with 13 and 12 properties, respectively, capturing a hybrid of instance- and feature-level overlap across databases. Experiments on WikiDBGraph demonstrate both the effectiveness and limitations of existing CL methods under realistic conditions, highlighting previously overlooked gaps in managing real-world data silos and pointing to concrete directions for practical deployment of collaborative learning systems.
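To make the graph structure described above concrete, here is a minimal Python sketch of a weighted property graph whose nodes are databases. It is an illustration only, not the authors' data format or API: the class names (`DatabaseNode`, `DatabaseEdge`, `DatabaseGraph`), the `properties` dictionaries, and the `min_weight` filter are all assumptions, and the abstract does not list the actual 13 node and 12 edge properties.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseNode:
    """One relational database (a graph node). All names are hypothetical."""
    db_id: str
    # Placeholder for the node properties; the real property names
    # are not given in the abstract.
    properties: dict = field(default_factory=dict)


@dataclass
class DatabaseEdge:
    """A weighted, undirected relationship between two databases."""
    source: str
    target: str
    weight: float
    # Placeholder for the edge properties (names not given in the abstract).
    properties: dict = field(default_factory=dict)


class DatabaseGraph:
    """Undirected weighted graph stored as an adjacency list."""

    def __init__(self):
        self.nodes = {}
        self.adjacency = {}

    def add_node(self, node):
        self.nodes[node.db_id] = node
        self.adjacency.setdefault(node.db_id, [])

    def add_edge(self, edge):
        for endpoint in (edge.source, edge.target):
            if endpoint not in self.nodes:
                raise KeyError(f"unknown database: {endpoint}")
        # Register the edge on both endpoints because it is undirected.
        self.adjacency[edge.source].append(edge)
        self.adjacency[edge.target].append(edge)

    def neighbors(self, db_id, min_weight=0.0):
        """Return (neighbor_id, weight) pairs, strongest first."""
        pairs = []
        for edge in self.adjacency[db_id]:
            if edge.weight >= min_weight:
                other = edge.target if edge.source == db_id else edge.source
                pairs.append((other, edge.weight))
        return sorted(pairs, key=lambda pair: -pair[1])


def average_degree(num_nodes, num_edges):
    """Average degree of an undirected graph: each edge touches two nodes."""
    return 2 * num_edges / num_nodes
```

With the figures quoted in the abstract, `average_degree(100_000, 17_000_000)` gives roughly 340 related databases per node, assuming the edges are undirected and the counts are taken at face value. That density is consistent with the observation that real-world databases are interconnected rather than isolated.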
Oct-28-2025
- Genre:
- Research Report > New Finding (0.67)
- Industry:
- Information Technology > Security & Privacy (0.46)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning > Statistical Learning (0.68)
- Natural Language (1.00)
- Communications (1.00)
- Data Science > Data Mining
- Big Data (0.46)
- Databases (1.00)
- Information Management (1.00)