Collaborating Authors

 Lattanzi, Silvio


The Cost of Consistency: Submodular Maximization with Constant Recourse

arXiv.org Machine Learning

In this work, we study online submodular maximization and how the requirement of maintaining a stable solution impacts the approximation. In particular, we seek bounds on the best-possible approximation ratio that is attainable when the algorithm is allowed to make at most a constant number of updates per step. We show a tight information-theoretic bound of $\tfrac{2}{3}$ for general monotone submodular functions, and an improved (also tight) bound of $\tfrac{3}{4}$ for coverage functions. Since both of these bounds are attained by algorithms that do not run in polynomial time, we also give a poly-time randomized algorithm that achieves a $0.51$-approximation. Combined with an information-theoretic hardness of $\tfrac{1}{2}$ for deterministic algorithms from prior work, our work thus shows a separation between deterministic and randomized algorithms, both information-theoretically and for poly-time algorithms.
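For context, coverage functions are the canonical example of monotone submodular functions; the following minimal Python sketch (data and names are purely illustrative) defines one and checks the diminishing-returns property:

```python
# A coverage function: each ground-set element covers a set of items, and
# f(S) is the number of items covered by the union of the chosen elements.
covers = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
}

def f(S):
    """Coverage value of a set S of ground-set elements."""
    covered = set()
    for e in S:
        covered |= covers[e]
    return len(covered)

def marginal(e, S):
    """Marginal gain of adding element e to set S."""
    return f(S | {e}) - f(S)

# Diminishing returns: for S contained in T and e outside T, the gain of e
# with respect to S is at least its gain with respect to T.
S, T, e = {"a"}, {"a", "b"}, "c"
assert marginal(e, S) >= marginal(e, T)
print(f({"a", "b", "c"}))  # 6 distinct items covered in total
```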


Dynamic Correlation Clustering in Sublinear Update Time

arXiv.org Artificial Intelligence

Clustering is a cornerstone of contemporary machine learning and data analysis. A successful clustering algorithm partitions data elements so that similar items reside within the same group, while dissimilar items are separated. Introduced in 2004 by Bansal, Blum, and Chawla (Bansal et al., 2004), the correlation clustering objective offers a natural approach to model this problem. Due to its concise and elegant formulation, this problem has drawn significant interest from researchers and practitioners, leading to applications across diverse domains. These include ensemble clustering identification (Bonchi et al., 2013), duplicate detection (Arasu et al., 2009), community mining (Chen et al., 2012), disambiguation tasks (Kalashnikov et al., 2008), automated labeling (Agrawal et al., 2009; Chakrabarti et al., 2008), and many more. In the correlation clustering problem we are given a graph where each edge carries either a positive or a negative label: a positive edge (u, v) indicates that u and v are similar elements, and a negative edge (u, v) indicates that they are dissimilar. The objective is to compute a partition of the graph that minimizes the number of negative edges within clusters plus the number of positive edges between clusters. Since the problem is NP-hard, researchers have focused on designing approximation algorithms. The algorithm proposed by Cao et al. (2024) achieves an approximation ratio of 1.43 + ϵ, improving upon the previous 1.73 + ϵ and 1.994 + ϵ achieved by Cohen-Addad et al. (2023, 2022b). Prior to these developments, the best approximation guarantee of 2.06 was attained by the algorithm of Chawla et al. (2015).
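As a concrete illustration of the objective, here is a minimal Python sketch (function and variable names are our own) that evaluates the correlation clustering cost of a given partition:

```python
def correlation_clustering_cost(edges, labels, cluster_of):
    """Disagreement cost of a clustering.

    edges: iterable of (u, v) pairs.
    labels: dict mapping (u, v) to +1 (similar) or -1 (dissimilar).
    cluster_of: dict mapping each vertex to its cluster id.
    """
    cost = 0
    for (u, v) in edges:
        same_cluster = cluster_of[u] == cluster_of[v]
        if labels[(u, v)] == -1 and same_cluster:
            cost += 1  # negative edge inside a cluster
        elif labels[(u, v)] == +1 and not same_cluster:
            cost += 1  # positive edge cut between clusters
    return cost

# Tiny example: two positive edges and one negative edge on three vertices.
edges = [("a", "b"), ("b", "c"), ("a", "c")]
labels = {("a", "b"): +1, ("b", "c"): +1, ("a", "c"): -1}
print(correlation_clustering_cost(edges, labels, {"a": 0, "b": 0, "c": 0}))  # 1
print(correlation_clustering_cost(edges, labels, {"a": 0, "b": 0, "c": 1}))  # 1
```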


Multi-View Stochastic Block Models

arXiv.org Machine Learning

Graph clustering is a central topic in unsupervised learning with a multitude of practical applications. In recent years, multi-view graph clustering has gained a lot of attention for its applicability to real-world instances where one has access to multiple data sources. In this paper we formalize a new family of models, called \textit{multi-view stochastic block models}, that captures this setting. For this model, we first study efficient algorithms that naively work on the union of multiple graphs. Then, we introduce a new efficient algorithm that provably outperforms previous approaches by analyzing the structure of each graph separately. Furthermore, we complement our results with an information-theoretic lower bound on the limits of what can be done in this model. Finally, we corroborate our results with experimental evaluations.
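A minimal sketch (parameter names are illustrative, not the paper's notation) of sampling a two-community multi-view stochastic block model, where every view is an independent graph drawn over the same planted partition with its own edge probabilities:

```python
import random

def sample_multi_view_sbm(n, num_views, p_in, p_out, seed=0):
    """Sample `num_views` graphs sharing one planted two-community partition.

    Within-community edges appear with probability p_in[v] in view v,
    across-community edges with probability p_out[v].
    """
    rng = random.Random(seed)
    community = [i % 2 for i in range(n)]  # planted balanced partition
    views = []
    for v in range(num_views):
        edges = []
        for i in range(n):
            for j in range(i + 1, n):
                p = p_in[v] if community[i] == community[j] else p_out[v]
                if rng.random() < p:
                    edges.append((i, j))
        views.append(edges)
    return community, views

community, views = sample_multi_view_sbm(
    n=100, num_views=3, p_in=[0.8, 0.6, 0.7], p_out=[0.1, 0.2, 0.15])
print([len(e) for e in views])  # edge counts per view
```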


Consistent Submodular Maximization

arXiv.org Machine Learning

Submodular optimization is a powerful framework for modeling and solving problems that exhibit the widespread diminishing returns property. Thanks to its effectiveness, it has been applied across diverse domains, including video analysis [Zheng et al., 2014], data summarization [Lin and Bilmes, 2011, Bairi et al., 2015], sparse reconstruction [Bach, 2010, Das and Kempe, 2011], and active learning [Golovin and Krause, 2011, Amanatidis et al., 2022]. In this paper, we focus on submodular maximization under cardinality constraints: given a submodular function f, a universe of elements V, and a cardinality constraint k, the goal is to find a set S of at most k elements that maximizes f(S). Submodular maximization under cardinality constraints is NP-hard; nevertheless, efficient approximation algorithms exist for this task in both the centralized and the streaming setting [Nemhauser et al., 1978, Badanidiyuru et al., 2014, Kazemi et al., 2019]. One aspect of efficient approximation algorithms for submodular maximization that has received little attention so far is the stability of the solution. In fact, for some of the known algorithms, even adding a single element to the universe of elements V may completely change the final output (see Appendix A for some examples). Unfortunately, this is problematic in many real-world applications where consistency is a fundamental system requirement.
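For reference, a minimal Python sketch (function names are our own) of the classic greedy algorithm of Nemhauser et al. (1978) for maximizing a monotone submodular function under a cardinality constraint, which achieves a (1 - 1/e)-approximation in the centralized setting:

```python
def greedy_submodular_max(f, universe, k):
    """Greedily pick at most k elements, each time taking the largest marginal gain.

    f: set function assumed monotone and submodular, evaluated as f(S) for a set S.
    universe: iterable of candidate elements.
    k: cardinality constraint.
    """
    S = set()
    remaining = set(universe)
    for _ in range(k):
        best, best_gain = None, 0.0
        for e in remaining:
            gain = f(S | {e}) - f(S)
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:  # no remaining element improves the objective
            break
        S.add(best)
        remaining.remove(best)
    return S

# Example objective: a coverage function, f(S) = number of items covered by S.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}
f = lambda S: len(set().union(*(covers[e] for e in S))) if S else 0
print(greedy_submodular_max(f, covers, k=2))  # e.g. {'a', 'c'}
```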


A Scalable Algorithm for Individually Fair K-means Clustering

arXiv.org Artificial Intelligence

We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al. Given $n$ points $P$ in a metric space, let $\delta(x)$ for $x\in P$ be the radius of the smallest ball around $x$ containing at least $n / k$ points. A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x\in P$. While good approximation algorithms are known for this problem, no efficient practical algorithms with good theoretical guarantees have been presented. We design the first fast local-search algorithm that runs in $\tilde{O}(nk^2)$ time and obtains a bicriteria $(O(1), 6)$-approximation. Then we show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
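A minimal Python sketch (names are illustrative) of the fairness radius $\delta(x)$ and the individual-fairness check described above, for points in Euclidean space:

```python
import math

def fairness_radius(points, x, k):
    """delta(x): radius of the smallest ball around x containing at least n/k points."""
    n = len(points)
    need = math.ceil(n / k)
    dists = sorted(math.dist(x, p) for p in points)
    return dists[need - 1]  # x itself is among the points, at distance 0

def is_individually_fair(points, centers, k):
    """Every point must have some center within its fairness radius delta(x)."""
    return all(
        min(math.dist(x, c) for c in centers) <= fairness_radius(points, x, k)
        for x in points
    )

points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.2)]
centers = [(0.1, 0.0), (5.0, 5.1)]
print(is_individually_fair(points, centers, k=2))  # True for this toy instance
```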


A quasi-polynomial time algorithm for Multi-Dimensional Scaling via LP hierarchies

arXiv.org Machine Learning

Multi-dimensional Scaling (MDS) is a family of methods for embedding pairwise dissimilarities between $n$ objects into low-dimensional space. MDS is widely used as a data visualization tool in the social and biological sciences, statistics, and machine learning. We study the Kamada-Kawai formulation of MDS: given a set of non-negative dissimilarities $\{d_{i,j}\}_{i , j \in [n]}$ over $n$ points, the goal is to find an embedding $\{x_1,\dots,x_n\} \subset \mathbb{R}^k$ that minimizes \[ \text{OPT} = \min_{x} \mathbb{E}_{i,j \in [n]} \left[ \left(1-\frac{\|x_i - x_j\|}{d_{i,j}}\right)^2 \right]. \] Despite its popularity, our theoretical understanding of MDS is extremely limited. Recently, Demaine, Hesterberg, Koehler, Lynch, and Urschel (arXiv:2109.11505) gave the first approximation algorithm with provable guarantees for Kamada-Kawai, which achieves an embedding with cost $\text{OPT} +\epsilon$ in $n^2 \cdot 2^{\tilde{\mathcal{O}}(k \Delta^4 / \epsilon^2)}$ time, where $\Delta$ is the aspect ratio of the input dissimilarities. In this work, we give the first approximation algorithm for MDS with quasi-polynomial dependency on $\Delta$: for target dimension $k$, we achieve a solution with cost $\mathcal{O}(\text{OPT}^{ \hspace{0.04in}1/k } \cdot \log(\Delta/\epsilon) )+ \epsilon$ in time $n^{ \mathcal{O}(1)} \cdot 2^{\tilde{\mathcal{O}}( k^2 (\log(\Delta)/\epsilon)^{k/2 + 1} ) }$. Our approach is based on a novel analysis of a conditioning-based rounding scheme for the Sherali-Adams LP Hierarchy. Crucially, our analysis exploits the geometry of low-dimensional Euclidean space, allowing us to avoid an exponential dependence on the aspect ratio $\Delta$. We believe our geometry-aware treatment of the Sherali-Adams Hierarchy is an important step towards developing general-purpose techniques for efficient metric optimization algorithms.
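A minimal Python sketch (names are our own) that evaluates the Kamada-Kawai objective above for a candidate embedding:

```python
import math

def kamada_kawai_cost(X, d):
    """Average squared relative error between embedded and target distances.

    X: list of points in R^k (the embedding x_1, ..., x_n).
    d: dict mapping (i, j) with i < j to the target dissimilarity d_{i,j} > 0.
    """
    n = len(X)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            emb = math.dist(X[i], X[j])
            total += (1.0 - emb / d[(i, j)]) ** 2
            pairs += 1
    return total / pairs  # expectation over pairs i != j

# Toy instance: three points whose target dissimilarities form a unit triangle.
d = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
X = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
print(kamada_kawai_cost(X, d))  # ~0: the embedding matches the dissimilarities
```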


Multi-Swap $k$-Means++

arXiv.org Artificial Intelligence

The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' choice algorithm for optimizing the popular $k$-means clustering objective and is known to give an $O(\log k)$-approximation in expectation. To obtain higher-quality solutions, Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps obtained through the $k$-means++ sampling distribution to yield a $c$-approximation to the $k$-means clustering problem, where $c$ is a large absolute constant. Here we generalize and extend their local search algorithm by considering larger and more sophisticated local search neighborhoods, hence allowing multiple centers to be swapped at the same time. Our algorithm achieves a $9 + \varepsilon$ approximation ratio, which is the best possible for local search. Importantly, our approach also yields substantial practical improvements: we show significant quality improvements over the approach of Lattanzi and Sohler (ICML 2019) on several datasets.
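A minimal Python sketch (our own naming) of the $k$-means++ $D^2$-sampling step that both the original seeding and the local-search variants build on:

```python
import random

def d2_sample_centers(points, k, seed=0):
    """k-means++ seeding: pick each next center with probability
    proportional to its squared distance to the nearest chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        # Squared distance from each point to its closest current center.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        total = sum(d2)
        if total == 0:
            break  # all points coincide with chosen centers
        r = rng.uniform(0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0)]
print(d2_sample_centers(points, k=3))
```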


TF-GNN: Graph Neural Networks in TensorFlow

arXiv.org Artificial Intelligence

TensorFlow-GNN (TF-GNN) is a scalable library for Graph Neural Networks in TensorFlow. It is designed from the bottom up to support the kinds of rich heterogeneous graph data that occur in today's information ecosystems. In addition to enabling machine learning researchers and advanced developers, TF-GNN offers low-code solutions to empower the broader developer community in graph learning. Many production models at Google use TF-GNN, and it has recently been released as an open-source project. In this paper we describe the TF-GNN data model, its Keras message passing API, and relevant capabilities such as graph sampling and distributed training.
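To illustrate the core idea behind message passing that such libraries implement, here is a plain-NumPy sketch of one round of mean-aggregation message passing (this is not the TF-GNN API; all names are our own):

```python
import numpy as np

def message_passing_round(node_states, edges):
    """One round of mean-aggregation message passing.

    node_states: (num_nodes, dim) array of node feature vectors.
    edges: list of (source, target) pairs; each target averages the states
           of its incoming sources and mixes them into its own state.
    """
    num_nodes, _ = node_states.shape
    agg = np.zeros_like(node_states)
    counts = np.zeros(num_nodes)
    for src, dst in edges:
        agg[dst] += node_states[src]
        counts[dst] += 1
    counts = np.maximum(counts, 1)  # avoid division by zero for isolated nodes
    neighbor_mean = agg / counts[:, None]
    # Simple update: average the previous state with the aggregated messages.
    return 0.5 * node_states + 0.5 * neighbor_mean

states = np.eye(4, dtype=np.float32)     # 4 nodes with one-hot features
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # a directed cycle
print(message_passing_round(states, edges))
```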


Fully Dynamic Submodular Maximization over Matroids

arXiv.org Artificial Intelligence

Maximizing monotone submodular functions under a matroid constraint is a classic algorithmic problem with multiple applications in data mining and machine learning. We study this classic problem in the fully dynamic setting, where elements can be both inserted and deleted in real-time. Our main result is a randomized algorithm that maintains an efficient data structure with an $\tilde{O}(k^2)$ amortized update time (in the number of additions and deletions) and yields a $4$-approximate solution, where $k$ is the rank of the matroid.
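As background for the static version of the problem, a minimal Python sketch (our own naming) of the greedy algorithm for monotone submodular maximization subject to a matroid constraint, given an independence oracle:

```python
def matroid_greedy(f, universe, is_independent):
    """Greedy for monotone submodular maximization over a matroid.

    f: monotone submodular set function.
    is_independent: oracle returning True iff a set is independent in the matroid.
    The static greedy is a 1/2-approximation; the paper's algorithm maintains a
    comparable solution while elements are inserted and deleted over time.
    """
    S = set()
    candidates = set(universe)
    while candidates:
        best, best_gain = None, 0.0
        for e in candidates:
            if not is_independent(S | {e}):
                continue
            gain = f(S | {e}) - f(S)
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:
            break
        S.add(best)
        candidates.remove(best)
    return S

# Partition matroid example: pick at most one element from each group.
groups = {"a": 0, "b": 0, "c": 1, "d": 1}
is_independent = lambda S: all(
    sum(1 for e in S if groups[e] == g) <= 1 for g in set(groups.values()))
values = {"a": 3.0, "b": 1.0, "c": 2.0, "d": 5.0}
f = lambda S: sum(values[e] for e in S)  # modular (hence submodular) objective
print(matroid_greedy(f, groups, is_independent))  # {'a', 'd'}
```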


On Margin-Based Cluster Recovery with Oracle Queries

arXiv.org Machine Learning

We study an active cluster recovery problem where, given a set of $n$ points and an oracle answering queries like "are these two points in the same cluster?", the task is to recover exactly all clusters using as few queries as possible. We begin by introducing a simple but general notion of margin between clusters that captures, as special cases, the margins used in previous work, the classic SVM margin, and standard notions of stability for center-based clusterings. Then, under our margin assumptions we design algorithms that, in a variety of settings, recover all clusters exactly using only $O(\log n)$ queries. For the Euclidean case, $\mathbb{R}^m$, we give an algorithm that recovers arbitrary convex clusters, in polynomial time, and with a number of queries that is lower than the best existing algorithm by $\Theta(m^m)$ factors. For general pseudometric spaces, where clusters might not be convex or might not have any notion of shape, we give an algorithm that achieves the $O(\log n)$ query bound, and is provably near-optimal as a function of the packing number of the space. Finally, for clusterings realized by binary concept classes, we give a combinatorial characterization of recoverability with $O(\log n)$ queries, and we show that, for many concept classes in Euclidean spaces, this characterization is equivalent to our margin condition. Our results show a deep connection between cluster margins and active cluster recoverability.
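To make the query model concrete, here is a minimal Python sketch (names are illustrative) of exact cluster recovery with a same-cluster oracle; this naive strategy spends up to one query per discovered cluster for each point, whereas the paper's algorithms achieve exact recovery with only $O(\log n)$ queries under margin assumptions:

```python
def recover_clusters(points, same_cluster):
    """Recover the exact clustering by comparing each point to one
    representative of every cluster discovered so far.

    same_cluster(x, y): oracle answering "are x and y in the same cluster?".
    """
    clusters = []  # each cluster is a list of points; clusters[i][0] is its representative
    queries = 0
    for x in points:
        placed = False
        for cluster in clusters:
            queries += 1
            if same_cluster(x, cluster[0]):
                cluster.append(x)
                placed = True
                break
        if not placed:
            clusters.append([x])
    return clusters, queries

# Toy ground truth used only to simulate the oracle.
truth = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
oracle = lambda x, y: truth[x] == truth[y]
clusters, queries = recover_clusters(list(truth), oracle)
print(clusters, queries)  # [['a', 'b'], ['c', 'd'], ['e']] 6
```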