Distributed Submodular Cover: Succinctly Summarizing Massive Data

Mirzasoleiman, Baharan, Karbasi, Amin, Badanidiyuru, Ashwinkumar, Krause, Andreas

Neural Information Processing Systems 

How can one find a subset, ideally as small as possible, that well represents a massive dataset? I.e., its corresponding utility, measured according to a suitable utility function, should be comparable to that of the whole dataset. Here, the utility is assumed to exhibit submodularity, a natural diminishing returns condition preva- lent in many data summarization applications. The classical greedy algorithm is known to provide solutions with logarithmic approximation guarantees compared to the optimum solution. However, this sequential, centralized approach is imprac- tical for truly large-scale problems.