Distributed Submodular Maximization: Identifying Representative Elements in Massive Data

Mirzasoleiman, Baharan, Karbasi, Amin, Sarkar, Rik, Krause, Andreas

Neural Information Processing Systems 

Many large-scale machine learning problems (such as clustering, non-parametric learning, kernel machines, etc.) require selecting, out of a massive data set, a manageable, representative subset. Such problems can often be reduced to maximizing a submodular set function subject to cardinality constraints. Classical approaches require centralized access to the full data set; but for truly large-scale problems, rendering the data centrally is often impractical. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple, two-stage protocol GreeDI, that is easily implemented using MapReduce style computations.