Distributed Submodular Maximization: Identifying Representative Elements in Massive Data

Mirzasoleiman, Baharan, Karbasi, Amin, Sarkar, Rik, Krause, Andreas

Dec-31-2013–Neural Information Processing Systems

Many large-scale machine learning problems (such as clustering, nonparametric learning, kernel machines, etc.) require selecting, out of a massive data set, a manageable yet representative subset. Such problems can often be reduced to maximizing a submodular set function subject to cardinality constraints. Classical approaches require centralized access to the full data set; but for truly large-scale problems, rendering the data centrally is often impractical. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple, two-stage protocol, GREEDI, that is easily implemented using MapReduce-style computations. We theoretically analyze our approach and show that, under certain natural conditions, performance close to the (impractical) centralized approach can be achieved. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference and exemplar-based clustering, on tens of millions of data points using Hadoop.
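
The abstract does not spell out the two stages, so the following is only a minimal sketch of how such a protocol might look, not the authors' implementation. It assumes that stage one partitions the data across m workers and runs the standard greedy algorithm on each part, and that stage two runs greedy again on the union of the local solutions and returns the best candidate seen. The function names (`greedy`, `two_stage_greedy`, `coverage`), the round-robin partitioning, and the toy set-cover objective are all hypothetical choices made for illustration; a real deployment would run stage one as parallel map tasks rather than a Python loop.

```python
from typing import Callable, FrozenSet, List, Sequence

Element = FrozenSet[int]
Objective = Callable[[Sequence[Element]], float]


def coverage(selected: Sequence[Element]) -> float:
    """Toy monotone submodular objective: number of distinct items covered."""
    return float(len(set().union(*selected)))


def greedy(f: Objective, candidates: Sequence[Element], k: int) -> List[Element]:
    """Standard greedy: repeatedly add the element with the largest marginal gain."""
    selected: List[Element] = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        base = f(selected)
        best = max(remaining, key=lambda e: f(selected + [e]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


def two_stage_greedy(
    f: Objective, data: Sequence[Element], k: int, m: int
) -> List[Element]:
    """Hypothetical two-stage sketch: local greedy per partition, then greedy on the merge."""
    # Stage 1 (map-like): split the data into m parts and solve each independently.
    parts = [list(data[i::m]) for i in range(m)]
    local_solutions = [greedy(f, part, k) for part in parts]

    # Stage 2 (reduce-like): run greedy on the union of the local solutions.
    merged = [e for sol in local_solutions for e in sol]
    merged_solution = greedy(f, merged, k)

    # Return whichever candidate solution scores best under f.
    return max(local_solutions + [merged_solution], key=f)


# Small illustrative data set: each element is a set of covered item ids.
DATA: List[Element] = [
    frozenset({1, 2, 3}),
    frozenset({3, 4}),
    frozenset({4, 5, 6, 7}),
    frozenset({1}),
    frozenset({8, 9}),
    frozenset({2, 5}),
]
```

Calling `two_stage_greedy(coverage, DATA, k=2, m=2)` selects two sets from the toy data. Only the small local solutions move between the two stages, so the full data set is never gathered in one place.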

artificial intelligence, data mining, machine learning

Country:
- Europe (0.46)

Technology:
- Information Technology
  - Data Science > Data Mining
    - Big Data (0.49)
  - Artificial Intelligence
    - Machine Learning > Statistical Learning (0.66)
    - Representation & Reasoning > Optimization (0.46)