Tarnawski, Jakub
Integrated Hardware Architecture and Device Placement Search
Wang, Irene, Tarnawski, Jakub, Phanishayee, Amar, Mahajan, Divya
Distributed execution of deep learning training involves a dynamic interplay between hardware accelerator architecture and device placement strategy. This is the first work to explore co-optimizing the accelerator architecture and the device placement strategy through novel algorithms, improving the balance of computational resources, memory usage, and data distribution. Our architecture search spans tensor and vector units, determining their quantity and dimensionality, as well as on-chip and off-chip memory configurations. It also determines the microbatch size and decides whether to recompute or stash activations, balancing the memory footprint of training against storage size. For each explored architecture configuration, we use an Integer Linear Program (ILP) to find the optimal schedule for executing operators on the accelerator. The ILP results then feed into a dynamic programming solution to identify the most effective device placement strategy, combining data, pipeline, and tensor model parallelism across multiple accelerators. Our approach achieves higher throughput on large language models compared to the state-of-the-art TPUv4 and the Spotlight accelerator search framework. The entire source code of PHAZE is available at https://github.com/msr-fiddle/phaze.
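As a rough illustration of the device-placement step, the sketch below (illustrative only, not PHAZE's actual implementation) shows a textbook dynamic program that splits a chain of operators, with latencies such as those estimated by the ILP scheduler, into a fixed number of contiguous pipeline stages so that the slowest stage is as fast as possible; the paper's DP additionally handles data and tensor model parallelism and memory constraints. The function name and toy latencies are hypothetical.

```python
# Illustrative only: split a chain of operators (with per-operator latencies)
# into k contiguous pipeline stages, minimizing the latency of the slowest
# stage, which bounds pipeline throughput. PHAZE's actual placement DP also
# covers data and tensor model parallelism and memory limits.

def partition_pipeline(latencies, k):
    n = len(latencies)
    prefix = [0.0]
    for t in latencies:
        prefix.append(prefix[-1] + t)

    INF = float("inf")
    # dp[j][i]: best achievable bottleneck using j stages for operators 0..i-1
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            for s in range(j - 1, i):            # last stage is operators s..i-1
                cost = max(dp[j - 1][s], prefix[i] - prefix[s])
                if cost < dp[j][i]:
                    dp[j][i], cut[j][i] = cost, s

    stages, i = [], n                            # recover stage boundaries
    for j in range(k, 0, -1):
        stages.append((cut[j][i], i))
        i = cut[j][i]
    return dp[k][n], stages[::-1]

# Toy example: 6 operators split across 3 devices -> bottleneck 9.0
print(partition_pipeline([4.0, 2.0, 3.0, 7.0, 1.0, 5.0], 3))
```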
Efficiently Computing Similarities to Private Datasets
Backurs, Arturs, Lin, Zinan, Mahabadi, Sepideh, Silwal, Sandeep, Tarnawski, Jakub
Many methods in differentially private model training rely on computing the similarity between a query point (such as public or synthetic data) and private data. We abstract out this common subroutine and study the following fundamental algorithmic problem: Given a similarity function $f$ and a large high-dimensional private dataset $X \subset \mathbb{R}^d$, output a differentially private (DP) data structure which approximates $\sum_{x \in X} f(x,y)$ for any query $y$. We consider the cases where $f$ is a kernel function, such as $f(x,y) = e^{-\|x-y\|_2^2/\sigma^2}$ (also known as DP kernel density estimation), or a distance function such as $f(x,y) = \|x-y\|_2$, among others. Our theoretical results improve upon prior work and give better privacy-utility trade-offs as well as faster query times for a wide range of kernels and distance functions. The unifying approach behind our results is leveraging `low-dimensional structures' present in the specific functions $f$ that we study, using tools such as provable dimensionality reduction, approximation theory, and one-dimensional decomposition of the functions. Our algorithms empirically exhibit improved query times and accuracy over prior state of the art. We also present an application to DP classification. Our experiments demonstrate that the simple methodology of classifying based on average similarity is orders of magnitude faster than prior DP-SGD based approaches for comparable accuracy.
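As a point of reference for the problem statement, the sketch below shows a naive per-query DP baseline for the Gaussian kernel: since $f(x,y) \in [0,1]$, the kernel sum has sensitivity 1 under adding or removing one private point, so the Gaussian mechanism applies. This is not the paper's data structure (which answers arbitrarily many queries and is far faster); the helper name dp_kernel_sum and all parameters are illustrative, and this baseline spends privacy budget on every single query.

```python
# Illustrative only: naive per-query DP estimate of sum_x exp(-||x - y||^2 / sigma^2).
# The kernel is bounded in [0, 1], so the sum has sensitivity 1 and the
# Gaussian mechanism gives (eps, delta)-DP for a single query.
import numpy as np

def dp_kernel_sum(X, y, sigma, eps, delta, rng):
    dists2 = np.sum((X - y) ** 2, axis=1)
    true_sum = np.sum(np.exp(-dists2 / sigma ** 2))
    noise_scale = np.sqrt(2.0 * np.log(1.25 / delta)) / eps  # Gaussian mechanism, sensitivity 1
    return true_sum + rng.normal(0.0, noise_scale)

# Toy usage: 10,000 private points in 50 dimensions, one query at the origin.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))
print(dp_kernel_sum(X, np.zeros(50), sigma=5.0, eps=1.0, delta=1e-6, rng=rng))
```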
Fairness in Submodular Maximization over a Matroid Constraint
Halabi, Marwa El, Tarnawski, Jakub, Norouzi-Fard, Ashkan, Vuong, Thuy-Duong
Machine learning algorithms are increasingly used in decision-making processes. This can potentially lead to the introduction or perpetuation of bias and discrimination in automated decisions. Of particular concern are sensitive areas such as education, hiring, credit access, bail decisions, and law enforcement (Munoz et al., 2016; White House OSTP, 2022; European Union FRA, 2022). There has been a growing body of work attempting to mitigate these risks by developing fair algorithms for fundamental problems including classification (Zafar et al., 2017), ranking (Celis et al., 2018c), clustering (Chierichetti et al., 2017), voting (Celis et al., 2018a), matching (Chierichetti et al., 2019), influence maximization (Tsang et al., 2019), data summarization (Celis et al., 2018b), and many others. In this work, we address fairness in the fundamental problem of submodular maximization over a matroid constraint, in the offline setting. Submodular functions model a diminishing returns property that naturally occurs in a variety of machine learning problems such as active learning (Golovin and Krause, 2011), data summarization (Lin and Bilmes, 2011), feature selection (Das and Kempe, 2011), and recommender systems (El-Arini and Guestrin, 2011). Matroids represent a popular and expressive notion of independence systems that encompasses a broad spectrum of useful constraints, e.g., cardinality and partition constraints.
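(For reference, the standard diminishing-returns definition: a set function $f : 2^V \to \mathbb{R}$ is submodular if $f(A \cup \{e\}) - f(A) \geq f(B \cup \{e\}) - f(B)$ for all $A \subseteq B \subseteq V$ and all $e \in V \setminus B$; a matroid constraint requires the selected set to be independent in a given matroid over the ground set.)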
Fairness in Streaming Submodular Maximization over a Matroid Constraint
Halabi, Marwa El, Fusco, Federico, Norouzi-Fard, Ashkan, Tardos, Jakab, Tarnawski, Jakub
Streaming submodular maximization is a natural model for the task of selecting a representative subset from a large-scale dataset. If datapoints have sensitive attributes such as gender or race, it becomes important to enforce fairness to avoid bias and discrimination. This has spurred significant interest in developing fair machine learning algorithms. Recently, such algorithms have been developed for monotone submodular maximization under a cardinality constraint. In this paper, we study the natural generalization of this problem to a matroid constraint. We give streaming algorithms as well as impossibility results that provide trade-offs between efficiency, quality and fairness. We validate our findings empirically on a range of well-known real-world applications: exemplar-based clustering, movie recommendation, and maximum coverage in social networks.
Efficient Algorithms for Device Placement of DNN Graph Operators
Tarnawski, Jakub, Phanishayee, Amar, Devanur, Nikhil R., Mahajan, Divya, Paravecino, Fanny Nina
Modern machine learning workloads use large models, with complex structures, that are very expensive to execute. The devices that execute these models are becoming increasingly heterogeneous, as domain-specific hardware accelerators flourish alongside CPUs. These trends necessitate distributing the workload across multiple devices. Recent work has shown that significant gains can be obtained with model parallelism, i.e., partitioning a neural network's computational graph onto multiple devices. In particular, this form of parallelism assumes a pipeline of devices, which is fed a stream of samples and yields high throughput for training and inference of DNNs. However, for such settings (large models and multiple heterogeneous devices), we require automated algorithms and toolchains that can partition the ML workload across devices. In this paper, we identify and isolate the structured optimization problem at the core of device placement of DNN operators, for both inference and training, especially in modern pipelined settings. We then provide algorithms that solve this problem to optimality. We demonstrate the applicability and efficiency of our approaches using several contemporary DNN computation graphs.
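To see why a pipeline of devices fed with a stream of samples is throughput-bound by its slowest stage, the toy simulation below (illustrative only, not the paper's cost model; the function name and latencies are hypothetical) computes microbatch completion times for a three-stage pipeline: in steady state, one sample completes every max stage-latency units of time.

```python
# Illustrative only: simulate when each of m microbatches finishes a chain of
# pipeline stages, assuming each device runs one microbatch at a time. In
# steady state, completions are spaced max(stage_times) apart, so the slowest
# stage bounds throughput.

def pipeline_finish_times(stage_times, m):
    free_at = [0.0] * len(stage_times)     # when each stage next becomes free
    done = []
    for _ in range(m):
        t = 0.0
        for s, cost in enumerate(stage_times):
            t = max(t, free_at[s]) + cost   # wait for the stage, then run on it
            free_at[s] = t
        done.append(t)
    return done

times = pipeline_finish_times([2.0, 5.0, 3.0], 20)
print(times[-1] - times[-2])                # steady-state completion gap: 5.0
```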
Beyond $1/2$-Approximation for Submodular Maximization on Massive Data Streams
Norouzi-Fard, Ashkan, Tarnawski, Jakub, Mitrović, Slobodan, Zandieh, Amir, Mousavifar, Aida, Svensson, Ola
Many tasks in machine learning and data mining, such as data diversification, non-parametric learning, kernel machines, and clustering, require extracting a small but representative summary from a massive dataset. Often, such problems can be posed as maximizing a submodular set function subject to a cardinality constraint. We consider this question in the streaming setting, where elements arrive over time at a fast pace and thus we need to design an efficient, low-memory algorithm. One such method, proposed by Badanidiyuru et al. (2014), always finds a $0.5$-approximate solution. Can this approximation factor be improved? We answer this question affirmatively by designing a new algorithm SALSA for streaming submodular maximization. It is the first low-memory, single-pass algorithm that improves upon the factor of $0.5$, under the natural assumption that elements arrive in a random order. We also show that this assumption is necessary, i.e., that there is no such algorithm with a better-than-$0.5$ approximation when elements arrive in arbitrary order. Our experiments demonstrate that SALSA significantly outperforms the state of the art in applications related to exemplar-based clustering, social graph analysis, and recommender systems.
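For context, the sketch below shows, in simplified form, a single-threshold streaming baseline in the spirit of Badanidiyuru et al. (2014): given a guess $v$ of the optimum value, an arriving element is kept iff its marginal gain is at least $v/(2k)$, which with $v$ close to OPT yields a $0.5$-approximation. This is the kind of baseline that SALSA improves upon under random-order arrivals; the coverage function and parameters below are toy examples, not from the paper.

```python
# Illustrative only: single-threshold streaming baseline. Given a guess v of
# the optimum, keep an element iff its marginal gain is at least v / (2k).

def threshold_stream(stream, f, k, v):
    S, fS = [], 0.0
    for e in stream:
        gain = f(S + [e]) - fS
        if len(S) < k and gain >= v / (2 * k):
            S.append(e)
            fS += gain
    return S, fS

# Toy coverage instance: the value of a set of elements is how many points they cover.
sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {5, 6, 7, 8}, 3: {1, 5}, 4: {9}}
cov = lambda S: float(len(set().union(*(sets[e] for e in S)))) if S else 0.0
print(threshold_stream(stream=[0, 1, 2, 3, 4], f=cov, k=2, v=7.0))  # -> ([0, 2], 7.0)
```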
Streaming Robust Submodular Maximization: A Partitioned Thresholding Approach
Mitrović, Slobodan, Bogunovic, Ilija, Norouzi-Fard, Ashkan, Tarnawski, Jakub, Cevher, Volkan
We study the classical problem of maximizing a monotone submodular function subject to a cardinality constraint $k$, with two additional twists: (i) elements arrive in a streaming fashion, and (ii) $m$ items from the algorithm's memory are removed after the stream is finished. We develop a robust submodular algorithm STAR-T. It is based on a novel partitioning structure and an exponentially decreasing thresholding rule. STAR-T makes one pass over the data and retains a short but robust summary. We show that after the removal of any $m$ elements from the obtained summary, a simple greedy algorithm STAR-T-GREEDY that runs on the remaining elements achieves a constant-factor approximation guarantee. In two different data summarization tasks, we demonstrate that it matches or outperforms existing greedy and streaming methods, even if they are allowed the benefit of knowing the removed subset in advance.
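The sketch below illustrates the partitioned-thresholding idea in a simplified form (it is not STAR-T's exact partitioning structure, thresholds, or guarantees; all names and parameters are hypothetical): elements are routed into buckets whose acceptance thresholds decrease exponentially and whose capacities grow, so that enough value survives deletions, and a greedy pass over the surviving summary selects the final $k$ elements.

```python
# Illustrative only: simplified partitioned thresholding. Bucket i accepts an
# element if its marginal gain within the bucket is at least tau / 2**i;
# lower-threshold buckets may hold more elements. After deletions, a greedy
# pass over the surviving summary picks the final k items.

def partitioned_threshold_stream(stream, f, k, tau, levels=4):
    buckets = [{"thr": tau / 2 ** i, "cap": k * (i + 1), "S": []} for i in range(levels)]
    for e in stream:
        for b in buckets:
            if len(b["S"]) < b["cap"] and f(b["S"] + [e]) - f(b["S"]) >= b["thr"]:
                b["S"].append(e)
                break                          # store each element at most once
    return [e for b in buckets for e in b["S"]]

def greedy(candidates, f, k):
    S = []
    for _ in range(k):
        best = max(candidates, key=lambda e: f(S + [e]) - f(S), default=None)
        if best is None or f(S + [best]) - f(S) <= 0:
            break
        S.append(best)
        candidates = [c for c in candidates if c != best]
    return S

# Toy usage with a coverage function; pretend elements {2, 3} are later removed.
sets = {0: {1, 2}, 1: {2, 3}, 2: {4, 5, 6}, 3: {1, 4}, 4: {7}, 5: {5, 7, 8}}
cov = lambda S: float(len(set().union(*(sets[e] for e in S)))) if S else 0.0
summary = partitioned_threshold_stream(range(6), cov, k=2, tau=3.0)
survivors = [e for e in summary if e not in {2, 3}]
print(summary, greedy(survivors, cov, k=2))
```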