Lightspeed Geometric Dataset Distance via Sliced Optimal Transport

Nguyen, Khai, Nguyen, Hai, Pham, Tuan, Ho, Nhat

Jan-31-2025–arXiv.org Machine Learning

Dataset distances provide a powerful framework for comparing datasets based on their underlying structures, distributions, or content. These measures are essential in applications where understanding the relationships between datasets drives decision-making, such as assessing data quality, detecting distributional shifts, or quantifying biases. They play a critical role in machine learning workflows, enabling tasks such as domain adaptation, transfer learning, continual learning, and fairness evaluation. Dataset distances are also valuable in emerging areas such as synthetic data evaluation, 3D shape comparison, and federated learning, where comparing heterogeneous data distributions is fundamental. By capturing meaningful similarities and differences between datasets, these measures facilitate data-driven insights, enhance model robustness, and support novel applications across diverse fields. A common approach to comparing datasets relies on proxies, such as analyzing the learning curves of a predefined model [28, 16] or examining its optimal parameters [1, 22] on a given task. Another strategy makes strong assumptions about the similarity or co-occurrence of labels between datasets [47]. However, these methods often lack theoretical guarantees, depend heavily on the choice of the probe model, and require training the model to completion (e.g., to identify optimal parameters) for each dataset under comparison. To address these limitations, model-agnostic approaches have been developed.



