FAST-PCA: A Fast and Exact Algorithm for Distributed Principal Component Analysis
Gang, Arpita, Bajwa, Waheed U.
Principal Component Analysis (PCA) is a fundamental data preprocessing tool in the world of machine learning. While PCA is often thought of as a dimensionality reduction method, the purpose of PCA is actually two-fold: dimension reduction and uncorrelated feature learning. Furthermore, the enormous dimensionality and sample sizes of modern-day datasets have rendered centralized PCA solutions unusable. In that vein, this paper reconsiders the problem of PCA when data samples are distributed across nodes in an arbitrarily connected network. While a few solutions for distributed PCA exist, they either overlook the uncorrelated feature learning aspect of PCA, tend to have high communication overhead that makes them inefficient, and/or lack 'exact' or 'global' convergence guarantees. To overcome these issues, this paper proposes a distributed PCA algorithm termed FAST-PCA (Fast and exAct diSTributed PCA). The proposed algorithm is efficient in terms of communication and is proven to converge linearly and exactly to the principal components, leading to dimension reduction as well as uncorrelated features. These claims are further supported by experimental results.
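As a reference point for the distributed setting considered above, the following is a minimal sketch of the two-fold purpose of (centralized) PCA: projecting data onto the top-K eigenvectors of the sample covariance both reduces dimension and yields empirically uncorrelated features. The synthetic data, the choice K = 2, and all variable names are illustrative assumptions, not taken from the paper.

```python
# Minimal centralized-PCA sketch: dimension reduction + uncorrelated features.
# Synthetic data and K = 2 are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))  # 500 samples, 10 features
X -= X.mean(axis=0)                                                 # center the data

C = (X.T @ X) / X.shape[0]            # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]     # reorder eigenpairs by decreasing eigenvalue
W = eigvecs[:, order[:2]]             # top-2 principal components (K = 2)

Y = X @ W                             # reduced-dimension representation
print(np.round(np.cov(Y.T), 3))       # off-diagonal entries ~ 0: features are uncorrelated
```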
A Linearly Convergent Algorithm for Distributed Principal Component Analysis
Gang, Arpita, Bajwa, Waheed U.
Principal Component Analysis (PCA) is the workhorse tool for dimensionality reduction in this era of big data. While often overlooked, the purpose of PCA is not only to reduce data dimensionality, but also to yield features that are uncorrelated. This paper focuses on this dual objective of PCA, namely, dimensionality reduction and decorrelation of features, which requires estimating the eigenvectors of a data covariance matrix, as opposed to only estimating the subspace spanned by the eigenvectors. The ever-increasing volume of data in the modern world often requires storage of data samples across multiple machines, which precludes the use of centralized PCA algorithms. Although a few distributed solutions to the PCA problem have been proposed recently, convergence guarantees and/or communications overhead of these solutions remain a concern. With an eye towards communications efficiency, this paper introduces a feedforward neural network-based, one-time-scale distributed PCA algorithm termed Distributed Sanger's Algorithm (DSA) that estimates the eigenvectors of a data covariance matrix when data are distributed across an undirected and arbitrarily connected network of machines. Furthermore, the proposed algorithm is shown to converge linearly to a neighborhood of the true solution. Numerical results are also provided to demonstrate the efficacy of the proposed solution.
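The paper's exact DSA update is not reproduced here; the sketch below only illustrates the general flavor of a one-time-scale distributed eigenvector estimator, combining a Sanger-rule (generalized Hebbian) step on a local covariance with averaging over network neighbors. The ring topology, mixing weights, step size, and iteration count are illustrative assumptions.

```python
# Sketch: Sanger-rule step + neighbor averaging over a ring of nodes.
# Topology, mixing weights, step size, and iteration count are assumptions,
# not the paper's actual specification.
import numpy as np

rng = np.random.default_rng(1)
d, K, n_nodes = 8, 3, 4
# Each node holds a local sample covariance C_i; their average is the global covariance.
A = rng.standard_normal((200, d))
C_local = [(Ai.T @ Ai) / Ai.shape[0] for Ai in np.split(A, n_nodes)]

# Doubly stochastic mixing weights for a ring: self plus two neighbors.
W_mix = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    W_mix[i, i] = 0.5
    W_mix[i, (i - 1) % n_nodes] = 0.25
    W_mix[i, (i + 1) % n_nodes] = 0.25

X = [rng.standard_normal((d, K)) for _ in range(n_nodes)]  # local eigenvector estimates
eta = 0.05
for _ in range(2000):
    # Consensus step: mix estimates with neighbors.
    X_mixed = [sum(W_mix[i, j] * X[j] for j in range(n_nodes)) for i in range(n_nodes)]
    # Sanger (generalized Hebbian) step using the local covariance.
    X = [Xi + eta * (Ci @ Xi - Xi @ np.triu(Xi.T @ Ci @ Xi))
         for Xi, Ci in zip(X_mixed, C_local)]

C_global = sum(C_local) / n_nodes
print(np.round(X[0].T @ C_global @ X[0], 3))  # ~diagonal: columns align with eigenvectors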
Adversary-resilient Inference and Machine Learning: From Distributed to Decentralized
Yang, Zhixiong, Gang, Arpita, Bajwa, Waheed U.
Statistical inference and machine learning algorithms have traditionally been developed for data available at a single location. Unlike this centralized setting, modern datasets are increasingly being distributed across multiple physical entities (sensors, devices, machines, data centers, etc.) for a multitude of reasons that range from storage, memory, and computational constraints to privacy concerns and engineering needs. This has necessitated the development of inference and learning algorithms capable of operating on non-collocated data. Such algorithms can be divided into two broad categories, namely, distributed algorithms and decentralized algorithms. Distributed algorithms correspond to the setup in which the data-bearing entities (henceforth referred to as "nodes") only communicate with a single entity (referred to as master node, central server, parameter server, fusion center, etc.), which is tasked with generating the final result. Such distributed setups arise in the context of parallel computing, where the focus is computational speedups and/or overcoming memory/storage bottlenecks, and federated systems, where "raw" data collected by individual nodes cannot be shared with the master node due to either communication constraints (e.g., sensor networks) or privacy concerns (e.g., smartphone data). Decentralized algorithms, on the other hand, correspond to the setup that lacks a central server; instead, individual nodes in this setup communicate among themselves over a network (often ad hoc) to reach a common solution (i.e., achieve consensus) at all nodes. Such decentralized setups arise either out of the need to eliminate single points of failure in distributed setups or due to practical constraints, as in the internet of things and autonomous systems. We refer the reader to Figure 1 for examples of distributed and decentralized setups.

Is it distributed or is it decentralized? Inference and learning from non-collocated data have been studied for decades in computer science, control, signal processing, and statistics. Both among and within these disciplines, however, there is no consensus on the use of the terms "distributed" and "decentralized." Though many works share the definitions provided here, numerous authors use these two terms interchangeably, while some others reverse these definitions. Inference and machine learning algorithms involving non-collocated data are broadly divisible into the categories of (i) distributed algorithms and (ii) decentralized algorithms.
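To make the architectural distinction above concrete, the toy sketch below contrasts the two setups on the simple task of averaging local vectors: a distributed (master/worker) aggregation versus a decentralized consensus over a ring with no central server. The topology, mixing weights, and iteration count are illustrative assumptions, not drawn from the article.

```python
# Toy contrast of distributed vs. decentralized averaging of local vectors.
# Ring topology and mixing weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n_nodes, d = 5, 3
local = [rng.standard_normal(d) for _ in range(n_nodes)]

# Distributed: every node sends its vector to a master node, which aggregates.
master_result = sum(local) / n_nodes

# Decentralized: no master; nodes repeatedly average with their ring neighbors.
x = [v.copy() for v in local]
for _ in range(200):
    x = [0.5 * x[i] + 0.25 * x[(i - 1) % n_nodes] + 0.25 * x[(i + 1) % n_nodes]
         for i in range(n_nodes)]

print(np.allclose(x[0], master_result, atol=1e-6))  # True: consensus reaches the same average
```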