GIST: Greedy Independent Set Thresholding for Diverse Data Summarization

Fahrbach, Matthew, Ramalingam, Srikumar, Zadimoghaddam, Morteza, Ahmadian, Sara, Citovsky, Gui, DeSalvo, Giulia

May-29-2024–arXiv.org Artificial Intelligence

Subset selection is a challenging optimization problem with a wide variety of applications in machine learning, including feature selection, recommender systems, news aggregation, drug discovery, data summarization, and designing pretraining sets for large language models (Anil et al., 2023). Data sampling in particular is a salient problem due to unprecedented and continuous data collection. For example, LiDAR and imaging devices in one self-driving vehicle can easily capture ~80 terabytes of data per day (Kazhamiaka et al., 2021). In most subset selection tasks, we rely on the weight (or utility) of the objects to rank one over the other, and also to avoid selecting duplicate or near-duplicate objects. If we select a small subset, then we also want to ensure that the selected subset is a good representation of the original set. These utility, diversity, and coverage criteria can be expressed through objective functions, and the interesting research lies in developing efficient algorithms with strong approximation guarantees. The underlying machinery used in constrained subset selection algorithms shares many similarities with techniques from other areas of combinatorial optimization such as submodular maximization, -center clustering, and convex hull approximations. In this work, we study the problem of selecting a set of points in a metric space that maximizes an objective that combines their utility and a minimum pairwise-distance diversity measure.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Artificial Intelligence

May-29-2024

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.40)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks (0.46)
  - Representation & Reasoning > Optimization (0.48)
  - Robots > Autonomous Vehicles (0.68)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found