Collaborating Authors

 Lattanzi, Silvio


The Cost of Consistency: Submodular Maximization with Constant Recourse

arXiv.org Machine Learning

In this work, we study online submodular maximization and how the requirement of maintaining a stable solution impacts the approximation. In particular, we seek bounds on the best-possible approximation ratio that is attainable when the algorithm is allowed to make at most a constant number of updates per step. We show a tight information-theoretic bound of $\tfrac{2}{3}$ for general monotone submodular functions, and an improved (also tight) bound of $\tfrac{3}{4}$ for coverage functions. Since both of these bounds are attained by algorithms that do not run in polynomial time, we also give a poly-time randomized algorithm that achieves a $0.51$-approximation. Combined with an information-theoretic hardness of $\tfrac{1}{2}$ for deterministic algorithms from prior work, our work thus shows a separation between deterministic and randomized algorithms, both information-theoretically and for poly-time algorithms.
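For context, coverage functions are the canonical example of monotone submodular functions; the following minimal Python sketch (data and names are purely illustrative) defines one and checks the diminishing-returns property:

```python
# A coverage function: each ground-set element covers a set of items, and
# f(S) is the number of items covered by the union of the chosen elements.
covers = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
}

def f(S):
    """Coverage value of a set S of ground-set elements."""
    covered = set()
    for e in S:
        covered |= covers[e]
    return len(covered)

def marginal(e, S):
    """Marginal gain of adding element e to set S."""
    return f(S | {e}) - f(S)

# Diminishing returns: for S contained in T and e outside T, the gain of e
# with respect to S is at least its gain with respect to T.
S, T, e = {"a"}, {"a", "b"}, "c"
assert marginal(e, S) >= marginal(e, T)
print(f({"a", "b", "c"}))  # 6 distinct items covered in total
```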


Dynamic Correlation Clustering in Sublinear Update Time

arXiv.org Artificial Intelligence

Clustering is a cornerstone of contemporary machine learning and data analysis. A successful clustering algorithm partitions data elements so that similar items reside within the same group, while dissimilar items are separated. Introduced in 2004 by Bansal, Blum, and Chawla (Bansal et al., 2004), the correlation clustering objective offers a natural approach to model this problem. Due to its concise and elegant formulation, this problem has drawn significant interest from researchers and practitioners, leading to applications across diverse domains. These include ensemble clustering identification (Bonchi et al., 2013), duplicate detection (Arasu et al., 2009), community mining (Chen et al., 2012), disambiguation tasks (Kalashnikov et al., 2008), automated labeling (Agrawal et al., 2009; Chakrabarti et al., 2008), and many more. In the correlation clustering problem we are given a graph where each edge carries either a positive or a negative label: a positive edge (u, v) indicates that u and v are similar elements, and a negative edge (u, v) indicates that they are dissimilar. The objective is to compute a partition of the graph that minimizes the number of negative edges within clusters plus the number of positive edges between clusters. Since the problem is NP-hard, researchers have focused on designing approximation algorithms. The algorithm proposed by Cao et al. (2024) achieves an approximation ratio of 1.43 + ϵ, improving upon the previous 1.73 + ϵ and 1.994 + ϵ achieved by Cohen-Addad et al. (2023, 2022b). Prior to these developments, the best approximation guarantee of 2.06 was attained by the algorithm of Chawla et al. (2015).
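As a concrete illustration of the objective, here is a minimal Python sketch (function and variable names are our own) that evaluates the correlation clustering cost of a given partition:

```python
def correlation_clustering_cost(edges, labels, cluster_of):
    """Disagreement cost of a clustering.

    edges: iterable of (u, v) pairs.
    labels: dict mapping (u, v) to +1 (similar) or -1 (dissimilar).
    cluster_of: dict mapping each vertex to its cluster id.
    """
    cost = 0
    for (u, v) in edges:
        same_cluster = cluster_of[u] == cluster_of[v]
        if labels[(u, v)] == -1 and same_cluster:
            cost += 1  # negative edge inside a cluster
        elif labels[(u, v)] == +1 and not same_cluster:
            cost += 1  # positive edge cut between clusters
    return cost

# Tiny example: two positive edges and one negative edge on three vertices.
edges = [("a", "b"), ("b", "c"), ("a", "c")]
labels = {("a", "b"): +1, ("b", "c"): +1, ("a", "c"): -1}
print(correlation_clustering_cost(edges, labels, {"a": 0, "b": 0, "c": 0}))  # 1
print(correlation_clustering_cost(edges, labels, {"a": 0, "b": 0, "c": 1}))  # 1
```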


Multi-View Stochastic Block Models

arXiv.org Machine Learning

Graph clustering is a central topic in unsupervised learning with a multitude of practical applications. In recent years, multi-view graph clustering has gained a lot of attention for its applicability to real-world instances where one has access to multiple data sources. In this paper we formalize a new family of models, called \textit{multi-view stochastic block models}, that captures this setting. For this model, we first study efficient algorithms that naively work on the union of multiple graphs. Then, we introduce a new efficient algorithm that provably outperforms previous approaches by analyzing the structure of each graph separately. Furthermore, we complement our results with an information-theoretic lower bound on the limits of what can be done in this model. Finally, we corroborate our results with experimental evaluations.
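A minimal sketch (parameter names are illustrative, not the paper's notation) of sampling a two-community multi-view stochastic block model, where every view is an independent graph drawn over the same planted partition with its own edge probabilities:

```python
import random

def sample_multi_view_sbm(n, num_views, p_in, p_out, seed=0):
    """Sample `num_views` graphs sharing one planted two-community partition.

    Within-community edges appear with probability p_in[v] in view v,
    across-community edges with probability p_out[v].
    """
    rng = random.Random(seed)
    community = [i % 2 for i in range(n)]  # planted balanced partition
    views = []
    for v in range(num_views):
        edges = []
        for i in range(n):
            for j in range(i + 1, n):
                p = p_in[v] if community[i] == community[j] else p_out[v]
                if rng.random() < p:
                    edges.append((i, j))
        views.append(edges)
    return community, views

community, views = sample_multi_view_sbm(
    n=100, num_views=3, p_in=[0.8, 0.6, 0.7], p_out=[0.1, 0.2, 0.15])
print([len(e) for e in views])  # edge counts per view
```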


Consistent Submodular Maximization

arXiv.org Machine Learning

Submodular optimization is a powerful framework for modeling and solving problems that exhibit the widespread diminishing returns property. Thanks to its effectiveness, it has been applied across diverse domains, including video analysis [Zheng et al., 2014], data summarization [Lin and Bilmes, 2011, Bairi et al., 2015], sparse reconstruction [Bach, 2010, Das and Kempe, 2011], and active learning [Golovin and Krause, 2011, Amanatidis et al., 2022]. In this paper, we focus on submodular maximization under cardinality constraints: given a submodular function f, a universe of elements V, and a cardinality constraint k, the goal is to find a set S of at most k elements that maximizes f(S). Submodular maximization under cardinality constraints is NP-hard; nevertheless, efficient approximation algorithms exist for this task in both the centralized and the streaming setting [Nemhauser et al., 1978, Badanidiyuru et al., 2014, Kazemi et al., 2019]. One aspect of efficient approximation algorithms for submodular maximization that has received little attention so far is the stability of the solution. In fact, for some of the known algorithms, even adding a single element to the universe of elements V may completely change the final output (see Appendix A for some examples). Unfortunately, this is problematic in many real-world applications where consistency is a fundamental system requirement.
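For reference, a minimal Python sketch (function names are our own) of the classic greedy algorithm of Nemhauser et al. (1978) for maximizing a monotone submodular function under a cardinality constraint, which achieves a (1 - 1/e)-approximation in the centralized setting:

```python
def greedy_submodular_max(f, universe, k):
    """Greedily pick at most k elements, each time taking the largest marginal gain.

    f: set function assumed monotone and submodular, evaluated as f(S) for a set S.
    universe: iterable of candidate elements.
    k: cardinality constraint.
    """
    S = set()
    remaining = set(universe)
    for _ in range(k):
        best, best_gain = None, 0.0
        for e in remaining:
            gain = f(S | {e}) - f(S)
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:  # no remaining element improves the objective
            break
        S.add(best)
        remaining.remove(best)
    return S

# Example objective: a coverage function, f(S) = number of items covered by S.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}
f = lambda S: len(set().union(*(covers[e] for e in S))) if S else 0
print(greedy_submodular_max(f, covers, k=2))  # e.g. {'a', 'c'}
```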


A Scalable Algorithm for Individually Fair K-means Clustering

arXiv.org Artificial Intelligence

We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al. Given $n$ points $P$ in a metric space, let $\delta(x)$ for $x\in P$ be the radius of the smallest ball around $x$ containing at least $n / k$ points. A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x\in P$. While good approximation algorithms are known for this problem, no efficient practical algorithms with good theoretical guarantees have been presented. We design the first fast local-search algorithm that runs in $\tilde{O}(nk^2)$ time and obtains a bicriteria $(O(1), 6)$-approximation. Then we show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
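A minimal Python sketch (names are illustrative) of the fairness radius $\delta(x)$ and the individual-fairness check described above, for points in Euclidean space:

```python
import math

def fairness_radius(points, x, k):
    """delta(x): radius of the smallest ball around x containing at least n/k points."""
    n = len(points)
    need = math.ceil(n / k)
    dists = sorted(math.dist(x, p) for p in points)
    return dists[need - 1]  # x itself is among the points, at distance 0

def is_individually_fair(points, centers, k):
    """Every point must have some center within its fairness radius delta(x)."""
    return all(
        min(math.dist(x, c) for c in centers) <= fairness_radius(points, x, k)
        for x in points
    )

points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.2)]
centers = [(0.1, 0.0), (5.0, 5.1)]
print(is_individually_fair(points, centers, k=2))  # True for this toy instance
```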


A quasi-polynomial time algorithm for Multi-Dimensional Scaling via LP hierarchies

arXiv.org Machine Learning

Multi-dimensional Scaling (MDS) is a family of methods for embedding pairwise dissimilarities between $n$ objects into low-dimensional space. MDS is widely used as a data visualization tool in the social and biological sciences, statistics, and machine learning. We study the Kamada-Kawai formulation of MDS: given a set of non-negative dissimilarities $\{d_{i,j}\}_{i , j \in [n]}$ over $n$ points, the goal is to find an embedding $\{x_1,\dots,x_n\} \subset \mathbb{R}^k$ that minimizes \[ \text{OPT} = \min_{x} \mathbb{E}_{i,j \in [n]} \left[ \left(1-\frac{\|x_i - x_j\|}{d_{i,j}}\right)^2 \right]. \] Despite its popularity, our theoretical understanding of MDS is extremely limited. Recently, Demaine, Hesterberg, Koehler, Lynch, and Urschel (arXiv:2109.11505) gave the first approximation algorithm with provable guarantees for Kamada-Kawai, which achieves an embedding with cost $\text{OPT} +\epsilon$ in $n^2 \cdot 2^{\tilde{\mathcal{O}}(k \Delta^4 / \epsilon^2)}$ time, where $\Delta$ is the aspect ratio of the input dissimilarities. In this work, we give the first approximation algorithm for MDS with quasi-polynomial dependency on $\Delta$: for target dimension $k$, we achieve a solution with cost $\mathcal{O}(\text{OPT}^{ \hspace{0.04in}1/k } \cdot \log(\Delta/\epsilon) )+ \epsilon$ in time $n^{ \mathcal{O}(1)} \cdot 2^{\tilde{\mathcal{O}}( k^2 (\log(\Delta)/\epsilon)^{k/2 + 1} ) }$. Our approach is based on a novel analysis of a conditioning-based rounding scheme for the Sherali-Adams LP Hierarchy. Crucially, our analysis exploits the geometry of low-dimensional Euclidean space, allowing us to avoid an exponential dependence on the aspect ratio $\Delta$. We believe our geometry-aware treatment of the Sherali-Adams Hierarchy is an important step towards developing general-purpose techniques for efficient metric optimization algorithms.
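A minimal Python sketch (names are our own) that evaluates the Kamada-Kawai objective above for a candidate embedding:

```python
import math

def kamada_kawai_cost(X, d):
    """Average squared relative error between embedded and target distances.

    X: list of points in R^k (the embedding x_1, ..., x_n).
    d: dict mapping (i, j) with i < j to the target dissimilarity d_{i,j} > 0.
    """
    n = len(X)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            emb = math.dist(X[i], X[j])
            total += (1.0 - emb / d[(i, j)]) ** 2
            pairs += 1
    return total / pairs  # expectation over pairs i != j

# Toy instance: three points whose target dissimilarities form a unit triangle.
d = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
X = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
print(kamada_kawai_cost(X, d))  # ~0: the embedding matches the dissimilarities
```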


Multi-Swap $k$-Means++

arXiv.org Artificial Intelligence

The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' choice algorithm for optimizing the popular $k$-means clustering objective and is known to give an $O(\log k)$-approximation in expectation. To obtain higher-quality solutions, Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps obtained through the $k$-means++ sampling distribution to yield a $c$-approximation to the $k$-means clustering problem, where $c$ is a large absolute constant. Here we generalize and extend their local search algorithm by considering larger and more sophisticated local search neighborhoods, hence allowing multiple centers to be swapped at the same time. Our algorithm achieves a $9 + \varepsilon$ approximation ratio, which is the best possible for local search. Importantly, our approach also yields substantial practical improvements: we show significant quality improvements over the approach of Lattanzi and Sohler (ICML 2019) on several datasets.
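A minimal Python sketch (our own naming) of the $k$-means++ $D^2$-sampling step that both the original seeding and the local-search variants build on:

```python
import random

def d2_sample_centers(points, k, seed=0):
    """k-means++ seeding: pick each next center with probability
    proportional to its squared distance to the nearest chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        # Squared distance from each point to its closest current center.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        total = sum(d2)
        if total == 0:
            break  # all points coincide with chosen centers
        r = rng.uniform(0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0)]
print(d2_sample_centers(points, k=3))
```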


TF-GNN: Graph Neural Networks in TensorFlow

arXiv.org Artificial Intelligence

TensorFlow-GNN (TF-GNN) is a scalable library for Graph Neural Networks in TensorFlow. It is designed from the bottom up to support the kinds of rich heterogeneous graph data that occur in today's information ecosystems. In addition to enabling machine learning researchers and advanced developers, TF-GNN offers low-code solutions to empower the broader developer community in graph learning. Many production models at Google use TF-GNN, and it has recently been released as an open-source project. In this paper we describe the TF-GNN data model, its Keras message passing API, and relevant capabilities such as graph sampling and distributed training.
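To illustrate the core idea behind message passing that such libraries implement, here is a plain-NumPy sketch of one round of mean-aggregation message passing (this is not the TF-GNN API; all names are our own):

```python
import numpy as np

def message_passing_round(node_states, edges):
    """One round of mean-aggregation message passing.

    node_states: (num_nodes, dim) array of node feature vectors.
    edges: list of (source, target) pairs; each target averages the states
           of its incoming sources and mixes them into its own state.
    """
    num_nodes, _ = node_states.shape
    agg = np.zeros_like(node_states)
    counts = np.zeros(num_nodes)
    for src, dst in edges:
        agg[dst] += node_states[src]
        counts[dst] += 1
    counts = np.maximum(counts, 1)  # avoid division by zero for isolated nodes
    neighbor_mean = agg / counts[:, None]
    # Simple update: average the previous state with the aggregated messages.
    return 0.5 * node_states + 0.5 * neighbor_mean

states = np.eye(4, dtype=np.float32)     # 4 nodes with one-hot features
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # a directed cycle
print(message_passing_round(states, edges))
```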


Fully Dynamic Submodular Maximization over Matroids

arXiv.org Artificial Intelligence

Maximizing monotone submodular functions under a matroid constraint is a classic algorithmic problem with multiple applications in data mining and machine learning. We study this classic problem in the fully dynamic setting, where elements can be both inserted and deleted in real-time. Our main result is a randomized algorithm that maintains an efficient data structure with an $\tilde{O}(k^2)$ amortized update time (in the number of additions and deletions) and yields a $4$-approximate solution, where $k$ is the rank of the matroid.
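As background for the static version of the problem, a minimal Python sketch (our own naming) of the greedy algorithm for monotone submodular maximization subject to a matroid constraint, given an independence oracle:

```python
def matroid_greedy(f, universe, is_independent):
    """Greedy for monotone submodular maximization over a matroid.

    f: monotone submodular set function.
    is_independent: oracle returning True iff a set is independent in the matroid.
    The static greedy is a 1/2-approximation; the paper's algorithm maintains a
    comparable solution while elements are inserted and deleted over time.
    """
    S = set()
    candidates = set(universe)
    while candidates:
        best, best_gain = None, 0.0
        for e in candidates:
            if not is_independent(S | {e}):
                continue
            gain = f(S | {e}) - f(S)
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:
            break
        S.add(best)
        candidates.remove(best)
    return S

# Partition matroid example: pick at most one element from each group.
groups = {"a": 0, "b": 0, "c": 1, "d": 1}
is_independent = lambda S: all(
    sum(1 for e in S if groups[e] == g) <= 1 for g in set(groups.values()))
values = {"a": 3.0, "b": 1.0, "c": 2.0, "d": 5.0}
f = lambda S: sum(values[e] for e in S)  # modular (hence submodular) objective
print(matroid_greedy(f, groups, is_independent))  # {'a', 'd'}
```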


On Margin-Based Cluster Recovery with Oracle Queries

arXiv.org Machine Learning

We study an active cluster recovery problem where, given a set of $n$ points and an oracle answering queries like "are these two points in the same cluster?", the task is to recover exactly all clusters using as few queries as possible. We begin by introducing a simple but general notion of margin between clusters that captures, as special cases, the margins used in previous work, the classic SVM margin, and standard notions of stability for center-based clusterings. Then, under our margin assumptions we design algorithms that, in a variety of settings, recover all clusters exactly using only $O(\log n)$ queries. For the Euclidean case, $\mathbb{R}^m$, we give an algorithm that recovers arbitrary convex clusters, in polynomial time, and with a number of queries that is lower than the best existing algorithm by $\Theta(m^m)$ factors. For general pseudometric spaces, where clusters might not be convex or might not have any notion of shape, we give an algorithm that achieves the $O(\log n)$ query bound, and is provably near-optimal as a function of the packing number of the space. Finally, for clusterings realized by binary concept classes, we give a combinatorial characterization of recoverability with $O(\log n)$ queries, and we show that, for many concept classes in Euclidean spaces, this characterization is equivalent to our margin condition. Our results show a deep connection between cluster margins and active cluster recoverability.
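To make the query model concrete, here is a minimal Python sketch (names are illustrative) of exact cluster recovery with a same-cluster oracle; this naive strategy spends up to one query per discovered cluster for each point, whereas the paper's algorithms achieve exact recovery with only $O(\log n)$ queries under margin assumptions:

```python
def recover_clusters(points, same_cluster):
    """Recover the exact clustering by comparing each point to one
    representative of every cluster discovered so far.

    same_cluster(x, y): oracle answering "are x and y in the same cluster?".
    """
    clusters = []  # each cluster is a list of points; clusters[i][0] is its representative
    queries = 0
    for x in points:
        placed = False
        for cluster in clusters:
            queries += 1
            if same_cluster(x, cluster[0]):
                cluster.append(x)
                placed = True
                break
        if not placed:
            clusters.append([x])
    return clusters, queries

# Toy ground truth used only to simulate the oracle.
truth = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
oracle = lambda x, y: truth[x] == truth[y]
clusters, queries = recover_clusters(list(truth), oracle)
print(clusters, queries)  # [['a', 'b'], ['c', 'd'], ['e']] 6
```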