Median Selection Subset Aggregation for Parallel Inference

Wang, Xiangyu, Peng, Peichao, Dunson, David B.

Neural Information Processing Systems 

For massive data sets, efficient computation commonly relies on distributed algorithms that store and process subsets of the data on different machines, minimizing communication costs. Our focus is on regression and classification problems involving many features. A variety of distributed algorithms have been proposed in this context, but it remains challenging to define an algorithm that combines low communication, theoretical guarantees and excellent practical performance in general settings. We propose a MEdian Selection Subset AGgregation Estimator (message) algorithm, which attempts to address these challenges. The algorithm applies feature selection in parallel for each subset using Lasso or another method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates.
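The four steps in the last sentence can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several pieces are assumptions made only for illustration: the helper names (`make_data`, `select_features`, `least_squares`, `message`), a marginal-correlation threshold standing in for Lasso as the "other method" of feature selection, ordinary least squares without an intercept for the coefficient step, the rule that a tied median excludes a feature, and a sequential loop in place of truly parallel workers.

```python
import random
from statistics import median


def make_data(n, beta, noise_sd, seed=0):
    """Hypothetical helper: simulate y = X @ beta + Gaussian noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in beta] for _ in range(n)]
    y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0, noise_sd)
         for row in X]
    return X, y


def select_features(X, y, threshold=0.3):
    """Stand-in selector (assumption, not Lasso): flag a feature with 1
    when its absolute sample correlation with y exceeds the threshold."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    y_dev = [v - y_mean for v in y]
    y_norm = sum(d * d for d in y_dev) ** 0.5
    flags = []
    for j in range(p):
        col = [row[j] for row in X]
        col_mean = sum(col) / n
        dev = [v - col_mean for v in col]
        norm = sum(d * d for d in dev) ** 0.5
        corr = sum(a * b for a, b in zip(dev, y_dev)) / (norm * y_norm)
        flags.append(1 if abs(corr) > threshold else 0)
    return flags


def least_squares(X, y, cols):
    """OLS on the chosen columns (no intercept): solve the normal
    equations by Gaussian elimination with partial pivoting."""
    k = len(cols)
    # Augmented matrix [X'X | X'y] restricted to the chosen columns.
    A = [[sum(row[a] * row[b] for row in X) for b in cols]
         + [sum(row[a] * yi for row, yi in zip(X, y))] for a in cols]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for c in range(i, k + 1):
                A[r][c] -= f * A[i][c]
    beta = [0.0] * k
    for i in reversed(range(k)):
        rest = sum(A[i][c] * beta[c] for c in range(i + 1, k))
        beta[i] = (A[i][k] - rest) / A[i][i]
    return beta


def message(X, y, n_subsets):
    """Sketch of the four steps; each loop over subsets could be
    dispatched to separate machines instead of run sequentially."""
    p = len(X[0])
    subsets = [(X[i::n_subsets], y[i::n_subsets]) for i in range(n_subsets)]
    # Step 1: feature selection on every subset.
    flags = [select_features(Xs, ys) for Xs, ys in subsets]
    # Step 2: median inclusion index per feature (ties excluded: assumption).
    selected = [j for j in range(p)
                if median(f[j] for f in flags) > 0.5]
    # Steps 3 and 4: per-subset coefficient estimates, then their average.
    coef = [0.0] * p
    for Xs, ys in subsets:
        for j, b in zip(selected, least_squares(Xs, ys, selected)):
            coef[j] += b / n_subsets
    return selected, coef


if __name__ == "__main__":
    X, y = make_data(600, [2.0, -1.5, 0.0, 0.0, 0.0, 0.0], noise_sd=0.5)
    selected, coef = message(X, y, n_subsets=3)
    print("selected features:", selected)
    print("averaged coefficients:", [round(c, 3) for c in coef])
```

In this sketch only two small objects per subset would need to cross the network: a binary inclusion vector after step 1 and a short coefficient vector after step 3, which is one way to read the low-communication claim. Because the inclusion flags are binary, the median over subsets acts as a majority vote, so a feature wrongly selected or missed on a minority of subsets does not change the final model.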