Goto

Collaborating Authors

 Clustering


Non-Negative Spherical Relaxations for Universe-Free Multi-Matching and Clustering

arXiv.org Machine Learning

We propose a novel non-negative spherical relaxation for optimization problems over binary matrices with injectivity constraints, which in particular has applications in multi-matching and clustering. We relax respective binary matrix constraints to the (high-dimensional) non-negative sphere. To optimize our relaxed problem, we use a conditional power iteration method to iteratively improve the objective function, while at same time sweeping over a continuous scalar parameter that is (indirectly) related to the universe size (or number of clusters). Opposed to existing procedures that require to fix the integer universe size before optimization, our method automatically adjusts the analogous continuous parameter. Furthermore, while our approach shares similarities with spectral multi-matching and spectral clustering, our formulation has the strong advantage that we do not rely on additional post-processing procedures to obtain binary results. Our method shows compelling results in various multi-matching and clustering settings, even when compared to methods that use the ground truth universe size (or number of clusters).


Powerset multi-class cross entropy loss for neural speaker diarization

arXiv.org Artificial Intelligence

Since its introduction in 2019, the whole end-to-end neural diarization (EEND) line of work has been addressing speaker diarization as a frame-wise multi-label classification problem with permutation-invariant training. Despite EEND showing great promise, a few recent works took a step back and studied the possible combination of (local) supervised EEND diarization with (global) unsupervised clustering. Yet, these hybrid contributions did not question the original multi-label formulation. We propose to switch from multi-label (where any two speakers can be active at the same time) to powerset multi-class classification (where dedicated classes are assigned to pairs of overlapping speakers). Through extensive experiments on 9 different benchmarks, we show that this formulation leads to significantly better performance (mostly on overlapping speech) and robustness to domain mismatch, while eliminating the detection threshold hyperparameter, critical for the multi-label formulation.


Schema First! Learn Versatile Knowledge Graph Embeddings by Capturing Semantics with MASCHInE

arXiv.org Artificial Intelligence

Knowledge graph embedding models (KGEMs) have gained considerable traction in recent years. These models learn a vector representation of knowledge graph entities and relations, a.k.a. knowledge graph embeddings (KGEs). Learning versatile KGEs is desirable as it makes them useful for a broad range of tasks. However, KGEMs are usually trained for a specific task, which makes their embeddings task-dependent. In parallel, the widespread assumption that KGEMs actually create a semantic representation of the underlying entities and relations (e.g., project similar entities closer than dissimilar ones) has been challenged. In this work, we design heuristics for generating protographs -- small, modified versions of a KG that leverage RDF/S information. The learnt protograph-based embeddings are meant to encapsulate the semantics of a KG, and can be leveraged in learning KGEs that, in turn, also better capture semantics. Extensive experiments on various evaluation benchmarks demonstrate the soundness of this approach, which we call Modular and Agnostic SCHema-based Integration of protograph Embeddings (MASCHInE). In particular, MASCHInE helps produce more versatile KGEs that yield substantially better performance for entity clustering and node classification tasks. For link prediction, using MASCHinE substantially increases the number of semantically valid predictions with equivalent rank-based performance.


DCSI -- An improved measure of cluster separability based on separation and connectedness

arXiv.org Machine Learning

Whether class labels in a given data set correspond to meaningful clusters is crucial for the evaluation of clustering algorithms using real-world data sets. This property can be quantified by separability measures. A review of the existing literature shows that neither classification-based complexity measures nor cluster validity indices (CVIs) adequately incorporate the central aspects of separability for density-based clustering: between-class separation and within-class connectedness. A newly developed measure (density cluster separability index, DCSI) aims to quantify these two characteristics and can also be used as a CVI. Extensive experiments on synthetic data indicate that DCSI correlates strongly with the performance of DBSCAN measured via the adjusted rand index (ARI) but lacks robustness when it comes to multi-class data sets with overlapping classes that are ill-suited for density-based hard clustering. Detailed evaluation on frequently used real-world data sets shows that DCSI can correctly identify touching or overlapping classes that do not form meaningful clusters.


Seeking the Truth Beyond the Data. An Unsupervised Machine Learning Approach

arXiv.org Machine Learning

Clustering is an unsupervised machine learning methodology where unlabeled elements/objects are grouped together aiming to the construction of well-established clusters that their elements are classified according to their similarity. The goal of this process is to provide a useful aid to the researcher that will help her/him to identify patterns among the data. Dealing with large databases, such patterns may not be easily detectable without the contribution of a clustering algorithm. This article provides a deep description of the most widely used clustering methodologies accompanied by useful presentations concerning suitable parameter selection and initializations. Simultaneously, this article not only represents a review highlighting the major elements of examined clustering techniques but emphasizes the comparison of these algorithms' clustering efficiency based on 3 datasets, revealing their existing weaknesses and capabilities through accuracy and complexity, during the confrontation of discrete and continuous observations. The produced results help us extract valuable conclusions about the appropriateness of the examined clustering techniques in accordance with the dataset's size.


Interpretable Spectral Variational AutoEncoder (ISVAE) for time series clustering

arXiv.org Machine Learning

The best encoding is the one that is interpretable in nature. In this work, we introduce a novel model that incorporates an interpretable bottleneck-termed the Filter Bank (FB)-at the outset of a Variational Autoencoder (VAE). This arrangement compels the VAE to attend on the most informative segments of the input signal, fostering the learning of a novel encoding ${f_0}$ which boasts enhanced interpretability and clusterability over traditional latent spaces. By deliberately constraining the VAE with this FB, we intentionally constrict its capacity to access broad input domain information, promoting the development of an encoding that is discernible, separable, and of reduced dimensionality. The evolutionary learning trajectory of ${f_0}$ further manifests as a dynamic hierarchical tree, offering profound insights into cluster similarities. Additionally, for handling intricate data configurations, we propose a tailored decoder structure that is symmetrically aligned with FB's architecture. Empirical evaluations highlight the superior efficacy of ISVAE, which compares favorably to state-of-the-art results in clustering metrics across real-world datasets.


Accurate prediction of international trade flows: Leveraging knowledge graphs and their embeddings

arXiv.org Artificial Intelligence

As a result, KR is critical to offering a simple strategy for defining relevant and contextual information within a finite number of facts from a specific domain of interest; these facts are referred to as a knowledge base (KB). In the past years, Knowledge Graph (KG), as a form of KR, has gained attention because it provides a contextual, natural, and human-like form of representing knowledge in specific domains and common sense. KG is formed in statements called triples on the T = (h, r, t) form, where h (head) and t (tail) represent objects in real life, and r, the relation is the connection between those entities. Internet companies like Google, Wikipedia, and Facebook have found a simple but powerful unified tool in the KG field to describe their multi-structured and multi-dimensional knowledge base, capturing user data to transform it into vast KBs [3]. The KG approach is particularly relevant to studying international trade, a significant cornerstone of economic and social development in the globalized economy [4, 5]. International trade is complex and interconnected, with multiple entities (commodities, companies, and countries) interacting in multiple ways [6]. This method helps to understand those complex interactions in a structured and intuitive way. In international economics, the gravity model, a fundamental part of the current method, is widely used to predict trade relations between entities based on factors like size (GDP, population) and distance or other factors [7, 8, 9].


Latent class analysis with weighted responses

arXiv.org Machine Learning

The latent class model has been proposed as a powerful tool for cluster analysis of categorical data in various fields such as social, psychological, behavioral, and biological sciences. However, one important limitation of the latent class model is that it is only suitable for data with binary responses, making it fail to model real-world data with continuous or negative responses. In many applications, ignoring the weights throws out a lot of potentially valuable information contained in the weights. To address this limitation, we propose a novel generative model, the weighted latent class model (WLCM). Our model allows data's response matrix to be generated from an arbitrary distribution with a latent class structure. In comparison to the latent class model, our WLCM is more realistic and more general. To our knowledge, our WLCM is the first model for latent class analysis with weighted responses. We investigate the identifiability of the model and propose an efficient algorithm for estimating the latent classes and other model parameters. We show that the proposed algorithm enjoys consistent estimation. The performance of the proposed algorithm is investigated using both computer-generated and real-world weighted response data.


Fast and Simple Spectral Clustering in Theory and Practice

arXiv.org Artificial Intelligence

Spectral clustering is a popular and effective algorithm designed to find $k$ clusters in a graph $G$. In the classical spectral clustering algorithm, the vertices of $G$ are embedded into $\mathbb{R}^k$ using $k$ eigenvectors of the graph Laplacian matrix. However, computing this embedding is computationally expensive and dominates the running time of the algorithm. In this paper, we present a simple spectral clustering algorithm based on a vertex embedding with $O(\log(k))$ vectors computed by the power method. The vertex embedding is computed in nearly-linear time with respect to the size of the graph, and the algorithm provably recovers the ground truth clusters under natural assumptions on the input graph. We evaluate the new algorithm on several synthetic and real-world datasets, finding that it is significantly faster than alternative clustering algorithms, while producing results with approximately the same clustering accuracy.


G-SPEED: General SParse Efficient Editing MoDel

arXiv.org Artificial Intelligence

Large Language Models~(LLMs) have demonstrated incredible capabilities in understanding, generating, and manipulating languages. Through human-model interactions, LLMs can automatically understand human-issued instructions and output the expected contents, which can significantly increase working efficiency. In various types of real-world demands, editing-oriented tasks account for a considerable proportion, which involves an interactive process that entails the continuous refinement of existing texts to meet specific criteria. Due to the need for multi-round human-model interaction and the generation of complicated editing tasks, there is an emergent need for efficient general editing models. In this paper, we propose \underline{\textbf{G}}eneral \underline{\textbf{SP}}arse \underline{\textbf{E}}fficient \underline{\textbf{E}}diting Mo\underline{\textbf{D}}el~(\textbf{G-SPEED}), which can fulfill diverse editing requirements through a single model while maintaining low computational costs. Specifically, we first propose a novel unsupervised text editing data clustering algorithm to deal with the data scarcity problem. Subsequently, we introduce a sparse editing model architecture to mitigate the inherently limited learning capabilities of small language models. The experimental outcomes indicate that G-SPEED, with its 508M parameters, can surpass LLMs equipped with 175B parameters. Our code and model checkpoints are available at \url{https://github.com/Banner-Z/G-SPEED}.