GIST: Greedy Independent Set Thresholding for Diverse Data Summarization
Fahrbach, Matthew, Ramalingam, Srikumar, Zadimoghaddam, Morteza, Ahmadian, Sara, Citovsky, Gui, DeSalvo, Giulia
–arXiv.org Artificial Intelligence
Subset selection is a challenging optimization problem with a wide variety of applications in machine learning, including feature selection, recommender systems, news aggregation, drug discovery, data summarization, and designing pretraining sets for large language models (Anil et al., 2023). Data sampling in particular is a salient problem due to unprecedented and continuous data collection. For example, LiDAR and imaging devices in one self-driving vehicle can easily capture ~80 terabytes of data per day (Kazhamiaka et al., 2021). In most subset selection tasks, we rely on the weight (or utility) of the objects to rank one over the other, and also to avoid selecting duplicate or near-duplicate objects. If we select a small subset, then we also want to ensure that the selected subset is a good representation of the original set. These utility, diversity, and coverage criteria can be expressed through objective functions, and the interesting research lies in developing efficient algorithms with strong approximation guarantees. The underlying machinery used in constrained subset selection algorithms shares many similarities with techniques from other areas of combinatorial optimization such as submodular maximization, -center clustering, and convex hull approximations. In this work, we study the problem of selecting a set of points in a metric space that maximizes an objective that combines their utility and a minimum pairwise-distance diversity measure.
arXiv.org Artificial Intelligence
May-29-2024