Lightspeed Geometric Dataset Distance via Sliced Optimal Transport
Nguyen, Khai, Nguyen, Hai, Pham, Tuan, Ho, Nhat
Dataset distances provide a powerful framework for comparing datasets based on their underlying structures, distributions, or content. These measures are essential in applications where understanding the relationships between datasets drives decision-making, such as assessing data quality, detecting distributional shifts, or quantifying biases. They play a critical role in machine learning workflows, enabling tasks like domain adaptation, transfer learning, continual learning, and fairness evaluation. Additionally, dataset distances are valuable in emerging areas such as synthetic data evaluation, 3D shape comparison, and federated learning, where comparing heterogeneous data distributions is fundamental. By capturing meaningful similarities and differences between datasets, these measures facilitate data-driven insights, enhance model robustness, and support novel applications across diverse fields. A common approach to comparing datasets relies on proxies, such as analyzing the learning curves of a predefined model [28, 16] or examining its optimal parameters [1, 22] on a given task. Another strategy involves making strong assumptions about the similarity or co-occurrence of labels between datasets [47]. However, these methods often lack theoretical guarantees, are heavily dependent on the choice of the probe model, and require training the model to completion (e.g., to identify optimal parameters) for each dataset under comparison. To address limitations of previous approaches, model-agnostic approaches are developed.
Jan-31-2025
- Country:
- North America
- Canada > Ontario
- Toronto (0.14)
- United States > Texas (0.14)
- Canada > Ontario
- North America
- Genre:
- Overview > Innovation (0.34)
- Research Report > New Finding (0.46)
- Technology: