Existing work (Ji and Grishman, we consider the attention weight between 2008; McClosky et al., 2011; Li et al., 2013; two event mentions as a learned similarity, and we Chen et al., 2015; Du and Cardie, 2020; Li et al., ensure that the attention mechanism learns to align 2021a) traditionally uses a predefined list of event similar events using a semi-supervised contrastive types and their respective annotations to learn an loss. By doing this, we are able to leverage the event extraction model. However, these annotations large variety of semantic information in pretrained are both expensive and time-consuming to language models for clustering unseen types by using create. This problem is amplified when considering a trained attention head. Unlike (Huang and specialization-intensive domains such as scientific Ji, 2020), we are able to separate clustering from literature, which requires years of specialized experience learning, allowing specific task-suited clustering to understand even a specific niche. For algorithms to be selected.
BioBERT and BioMegatron are Transformers models adapted for the biomedical domain based on publicly available biomedical corpora. As such, they have the potential to encode large-scale biological knowledge. We investigate the encoding and representation of biological knowledge in these models, and its potential utility to support inference in cancer precision medicine - namely, the interpretation of the clinical significance of genomic alterations. We compare the performance of different transformer baselines; we use probing to determine the consistency of encodings for distinct entities; and we use clustering methods to compare and contrast the internal properties of the embeddings for genes, variants, drugs and diseases. We show that these models do indeed encode biological knowledge, although some of this is lost in fine-tuning for specific tasks. Finally, we analyse how the models behave with regard to biases and imbalances in the dataset.
We introduce a fast and explainable clustering method called CLASSIX. It consists of two phases, namely a greedy aggregation phase of the sorted data into groups of nearby data points, followed by the merging of groups into clusters. The algorithm is controlled by two scalar parameters, namely a distance parameter for the aggregation and another parameter controlling the minimal cluster size. Extensive experiments are conducted to give a comprehensive evaluation of the clustering performance on synthetic and real-world datasets, with various cluster shapes and low to high feature dimensionality. Our experiments demonstrate that CLASSIX competes with state-of-the-art clustering algorithms. The algorithm has linear space complexity and achieves near linear time complexity on a wide range of problems. Its inherent simplicity allows for the generation of intuitive explanations of the computed clusters.
Sparse linear prediction methods suffer from decreased prediction accuracy when the predictor variables have cluster structure (e.g. there are highly correlated groups of variables). To improve prediction accuracy, various methods have been proposed to identify variable clusters from the data and integrate cluster information into a sparse modeling process. But none of these methods achieve satisfactory performance for prediction, variable selection and variable clustering simultaneously. This paper presents Variable Cluster Principal Component Regression (VC-PCR), a prediction method that supervises variable selection and variable clustering in order to solve this problem. Experiments with real and simulated data demonstrate that, compared to competitor methods, VC-PCR achieves better prediction, variable selection and clustering performance when cluster structure is present.
We propose a general approach for distance based clustering, using the gradient of the cost function that measures clustering quality with respect to cluster assignments and cluster center positions. The approach is an iterative two step procedure (alternating between cluster assignment and cluster center updates) and is applicable to a wide range of functions, satisfying some mild assumptions. The main advantage of the proposed approach is a simple and computationally cheap update rule. Unlike previous methods that specialize to a specific formulation of the clustering problem, our approach is applicable to a wide range of costs, including non-Bregman clustering methods based on the Huber loss. We analyze the convergence of the proposed algorithm, and show that it converges to the set of appropriately defined fixed points, under arbitrary center initialization. In the special case of Bregman cost functions, the algorithm converges to the set of centroidal Voronoi partitions, which is consistent with prior works. Numerical experiments on real data demonstrate the effectiveness of the proposed method.
A major archetype of artificial intelligence is developing algorithms facilitating temporal efficiency and accuracy while boosting the generalization performance. Even with the latest developments in machine learning, a key limitation has been the inefficient feature extraction from the initial data, which is essential in performance optimization. Here, we introduce a feature extraction method inspired by sensory cortical networks in the brain. Dubbed as bioinspired cortex, the algorithm provides convergence to orthogonal features from streaming signals with superior computational efficiency while processing data in compressed form. We demonstrate the performance of the new algorithm using artificially created complex data by comparing it with the commonly used traditional clustering algorithms, such as Birch, GMM, and K-means. While the data processing time is significantly reduced, seconds versus hours, encoding distortions remain essentially the same in the new algorithm providing a basis for better generalization. Although we show herein the superior performance of the cortex model in clustering and vector quantization, it also provides potent implementation opportunities for machine learning fundamental components, such as reasoning, anomaly detection and classification in large scope applications, e.g., finance, cybersecurity, and healthcare.
In multivariate time series systems, it has been observed that certain groups of variables partially lead the evolution of the system, while other variables follow this evolution with a time delay; the result is a lead-lag structure amongst the time series variables. In this paper, we propose a method for the detection of lead-lag clusters of time series in multivariate systems. We demonstrate that the web of pairwise lead-lag relationships between time series can be helpfully construed as a directed network, for which there exist suitable algorithms for the detection of pairs of lead-lag clusters with high pairwise imbalance. Within our framework, we consider a number of choices for the pairwise lead-lag metric and directed network clustering components. Our framework is validated on both a synthetic generative model for multivariate lead-lag time series systems and daily real-world US equity prices data. We showcase that our method is able to detect statistically significant lead-lag clusters in the US equity market. We study the nature of these clusters in the context of the empirical finance literature on lead-lag relations and demonstrate how these can be used for the construction of predictive financial signals.
Topic taxonomies, which represent the latent topic (or category) structure of document collections, provide valuable knowledge of contents in many applications such as web search and information filtering. Recently, several unsupervised methods have been developed to automatically construct the topic taxonomy from a text corpus, but it is challenging to generate the desired taxonomy without any prior knowledge. In this paper, we study how to leverage the partial (or incomplete) information about the topic structure as guidance to find out the complete topic taxonomy. We propose a novel framework for topic taxonomy completion, named TaxoCom, which recursively expands the topic taxonomy by discovering novel sub-topic clusters of terms and documents. To effectively identify novel topics within a hierarchical topic structure, TaxoCom devises its embedding and clustering techniques to be closely-linked with each other: (i) locally discriminative embedding optimizes the text embedding space to be discriminative among known (i.e., given) sub-topics, and (ii) novelty adaptive clustering assigns terms into either one of the known sub-topics or novel sub-topics. Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom not only generates the high-quality topic taxonomy in terms of term coherency and topic coverage but also outperforms all other baselines for a downstream task.
We address general-shaped clustering problems under very weak parametric assumptions with a two-step hybrid robust clustering algorithm based on trimmed k-means and hierarchical agglomeration. The algorithm has low computational complexity and effectively identifies the clusters also in presence of data contamination. We also present natural generalizations of the approach as well as an adaptive procedure to estimate the amount of contamination in a data-driven fashion. Our proposal outperforms state-of-the-art robust, model-based methods in our numerical simulations and real-world applications related to color quantization for image analysis, human mobility patterns based on GPS data, biomedical images of diabetic retinopathy, and functional data across weather stations.
We introduce an algorithm for computing geodesics on sampled manifolds that relies on simulation of quantum dynamics on a graph embedding of the sampled data. Our approach exploits classic results in semiclassical analysis and the quantum-classical correspondence, and forms a basis for techniques to learn the manifold from which a dataset is sampled, and subsequently for nonlinear dimensionality reduction of high-dimensional datasets. We illustrate the new algorithm with data sampled from model manifolds and also by a clustering demonstration based on COVID-19 mobility data. Finally, our method reveals interesting connections between the discretization provided by data sampling and quantization.