 Clustering


Within-Document Event Coreference with BERT-Based Contextualized Representations

arXiv.org Artificial Intelligence

Event coreference continues to be a challenging problem in information extraction. In the absence of any external knowledge bases for events, coreference becomes a clustering task that relies on effective representations of the context in which event mentions appear. Recent advances in contextualized language representations have proven successful in many tasks; however, their use in event linking has been limited. Here we present a three-part approach that (1) uses representations derived from a pretrained BERT model to (2) train a neural classifier to (3) drive a simple clustering algorithm to create coreference chains. We achieve state-of-the-art results with this model on two standard datasets for the within-document event coreference task and establish a new standard on a third, newer dataset.
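
The last step of such a pipeline, turning pairwise classifier scores into coreference chains, can be sketched as follows. The abstract does not say which clustering algorithm is used, so this is only an assumed stand-in: mention pairs scoring above a threshold are linked, and chains are the connected components. The names `coreference_chains`, `pair_scores`, and `threshold` are hypothetical.

```python
def coreference_chains(num_mentions, pair_scores, threshold=0.5):
    """Group mentions into chains by linking pairs whose score >= threshold.

    pair_scores maps (i, j) mention-index pairs to a classifier score.
    Assumed approach (transitive closure via union-find), not the paper's.
    """
    parent = list(range(num_mentions))

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), score in pair_scores.items():
        if score >= threshold:
            parent[find(i)] = find(j)  # merge the two chains

    chains = {}
    for mention in range(num_mentions):
        chains.setdefault(find(mention), []).append(mention)
    return sorted(chains.values())


# Hypothetical scores for four event mentions.
scores = {(0, 1): 0.9, (1, 2): 0.2, (2, 3): 0.8, (0, 3): 0.1}
chains = coreference_chains(4, scores)
```

A single high-scoring link is enough to merge two chains under this rule, which is why practical systems often use a stricter merge criterion.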


Structured Graph Learning for Scalable Subspace Clustering: From Single-view to Multi-view

arXiv.org Artificial Intelligence

Graph-based subspace clustering methods have exhibited promising performance. However, they still suffer from several drawbacks: they incur expensive time overhead, fail to explore explicit clusters, and cannot generalize to unseen data points. In this work, we propose a scalable graph learning framework, seeking to address the above three challenges simultaneously. Specifically, it is based on the ideas of anchor points and a bipartite graph. Rather than building an $n\times n$ graph, where $n$ is the number of samples, we construct a bipartite graph to depict the relationship between samples and anchor points. Meanwhile, a connectivity constraint is employed to ensure that the connected components indicate clusters directly. We further establish the connection between our method and K-means clustering. Moreover, a model to process multi-view data is also proposed, which scales linearly with respect to $n$. Extensive experiments demonstrate the efficiency and effectiveness of our approach with respect to many state-of-the-art clustering methods.
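
To illustrate why a sample-to-anchor graph is cheaper than an $n\times n$ graph, here is a minimal sketch that builds an $n\times m$ bipartite affinity matrix for $m$ anchors. It is not the learned graph from the paper: the weighting rule (each sample linked to its `s` nearest anchors with normalized Gaussian-style weights) is an assumption made for illustration, and `bipartite_graph` and `s` are hypothetical names.

```python
import math


def bipartite_graph(samples, anchors, s=2):
    """Return an n x m matrix linking each sample to its s nearest anchors.

    Each row sums to 1. Assumed heuristic weights, not a learned graph.
    """
    graph = []
    for x in samples:
        # Squared Euclidean distance from this sample to every anchor.
        d = [sum((a - b) ** 2 for a, b in zip(x, z)) for z in anchors]
        nearest = sorted(range(len(anchors)), key=d.__getitem__)[:s]
        d_min = d[nearest[0]]
        # Subtracting d_min keeps the largest weight at 1 and avoids underflow.
        weights = {j: math.exp(-(d[j] - d_min)) for j in nearest}
        total = sum(weights.values())
        graph.append([weights.get(j, 0.0) / total for j in range(len(anchors))])
    return graph


samples = [(0.0, 0.0), (5.0, 5.0)]
anchors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
B = bipartite_graph(samples, anchors, s=2)
```

The matrix has $n \cdot m$ entries instead of $n^2$, so storage grows linearly in $n$ when the number of anchors is fixed.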


HDMI: High-order Deep Multiplex Infomax

arXiv.org Artificial Intelligence

Networks have been widely used to represent the relations between objects, such as academic networks and social networks, and learning embeddings for networks has thus garnered plenty of research attention. Self-supervised network representation learning aims at extracting node embeddings without external supervision. Recently, maximizing the mutual information between the local node embedding and the global summary (e.g., Deep Graph Infomax, or DGI for short) has shown promising results on many downstream tasks such as node classification. However, there are two major limitations of DGI. Firstly, DGI merely considers the extrinsic supervision signal (i.e., the mutual information between node embedding and global summary) while ignoring the intrinsic signal (i.e., the mutual dependence between node embedding and node attributes). Secondly, nodes in a real-world network are usually connected by multiple edges with different relations, while DGI does not fully explore the various relations among nodes. To address the above-mentioned problems, we propose a novel framework, called High-order Deep Multiplex Infomax (HDMI), for learning node embeddings on multiplex networks in a self-supervised way. To be more specific, we first design a joint supervision signal containing both extrinsic and intrinsic mutual information by high-order mutual information, and we propose a High-order Deep Infomax (HDI) to optimize the proposed supervision signal. Then we propose an attention-based fusion module to combine node embeddings from different layers of the multiplex network. Finally, we evaluate the proposed HDMI on various downstream tasks such as unsupervised clustering and supervised classification. The experimental results show that HDMI achieves state-of-the-art performance on these tasks.


DAC: Deep Autoencoder-based Clustering, a General Deep Learning Framework of Representation Learning

arXiv.org Artificial Intelligence

Clustering plays an essential role in many real-world applications, such as market research, pattern recognition, data analysis, and image processing. However, due to the high dimensionality of the input feature values, the data being fed to clustering algorithms usually contains noise and thus could lead to inaccurate clustering results. While traditional dimension reduction and feature selection algorithms could be used to address this problem, the simple heuristic rules used in those algorithms are based on some particular assumptions. When those assumptions do not hold, these algorithms might not work. In this paper, we propose DAC, Deep Autoencoder-based Clustering, a generalized data-driven framework to learn clustering representations using deep neural networks. Experimental results show that our approach could effectively boost the performance of the K-Means clustering algorithm on a variety of dataset types.


ThetA -- fast and robust clustering via a distance parameter

arXiv.org Artificial Intelligence

Clustering is a fundamental problem in machine learning where distance-based approaches have dominated the field for many decades. This set of problems is often tackled by partitioning the data into K clusters where the number of clusters is chosen a priori. While significant progress has been made on these lines over the years, it is well established that as the number of clusters or dimensions increases, current approaches dwell in local minima, resulting in suboptimal solutions.

Based on this, one can further divide distance-based methods into three categories: 1) assuming the number of clusters as known in advance, 2) assuming a distance threshold as known, or 3) assuming a limiting number of data points belonging to each particular cluster. While clustering algorithms primarily focus on accurately partitioning the data, they are also aimed at inferring information from a data exploration standpoint. In this work, we primarily focus on distance-based clustering given its broad adoption and propose a new framework, ThetA, which uses ...


Practical Guide To K-Means Clustering

#artificialintelligence

Clustering is one of the most popular and widespread unsupervised machine learning methods used for data analysis and pattern mining. At its core, clustering is the grouping of similar observations based on their characteristics. There are multiple approaches for generating clusters of similar objects. However, in this section, you will learn how to build groups based on the k-means algorithm. In simple terms, k-means clustering is a technique that aims to divide the data into k clusters.
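
As a rough illustration of the idea, the sketch below implements the standard Lloyd iteration for k-means in plain Python: assign each point to its nearest center, then move each center to the mean of its points. It is not taken from the guide itself; the function names, the random initialization, and the toy data are assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed init: k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, labels


# Toy data: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(data, k=2)
```

The result depends on initialization in general, which is why library implementations typically run several restarts and keep the best one.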


Clustered Hierarchical Anomaly and Outlier Detection Algorithms

arXiv.org Machine Learning

Anomaly and outlier detection in datasets is a long-standing problem in machine learning. In some cases, anomaly detection is easy, such as when data are drawn from well-characterized distributions such as the Gaussian. However, when data occupy high-dimensional spaces, anomaly detection becomes more difficult. We present CLAM (Clustered Learning of Approximate Manifolds), a fast hierarchical clustering technique that learns a manifold in a Banach space defined by a distance metric. CLAM induces a graph from the cluster tree, based on overlapping clusters determined by several geometric and topological features. On these graphs, we implement CHAODA (Clustered Hierarchical Anomaly and Outlier Detection Algorithms), exploring various properties of the graphs and their constituent clusters to compute scores of anomalousness. On 24 publicly available datasets, we compare the performance of CHAODA (by measure of ROC AUC) to a variety of state-of-the-art unsupervised anomaly-detection algorithms. Six of the datasets are used for training. CHAODA outperforms other approaches on 14 of the remaining 18 datasets.


Notebook -- Machine Learning, Statistics, and Data Mining for Heliophysics

#artificialintelligence

The space between the Sun and the Earth is not empty. Instead, it is filled with streams of plasma (ions and electrons) called the solar wind, which travels nearly radially out from the Sun. Since the earliest spacecraft measurements, the solar wind has broadly been classified into two types, fast and slow, based solely on its speed (Neugebauer and Snyder, 1966; Stakhiv et al., 2015). This duality has also been observed in measurements of the elemental composition and ion charge states of the solar wind, suggesting that the fast and slow wind originate from different solar source structures (von Steiger et al., 2000; Geiss, Gloeckler, and Von Steiger, 1995). Fast wind is found to originate from coronal holes (Sheeley, Harvey, and Feldman, 1976). These are magnetically open regions of the corona where the plasma can freely escape, meaning that coronal holes appear dark in EUV emission (since there is less time for the plasma to be heated). The formation and release of the slow wind is a ...


Nature-Inspired Optimization Algorithms: Research Direction and Survey

arXiv.org Artificial Intelligence

Nature-inspired algorithms are commonly used for solving various optimization problems. In the past few decades, researchers have proposed a large number of nature-inspired algorithms. Some of these algorithms have proved to be very efficient compared to classical optimization methods. A young researcher attempting to solve a problem using nature-inspired algorithms is bogged down by the plethora of proposals that exist today. Not every algorithm is suited to all kinds of problems; some score over others. In this paper, an attempt has been made to summarize various leading research proposals, which shall pave the way for any new entrant to easily understand the journey so far. Here, we classify the nature-inspired algorithms as natural evolution based, swarm intelligence based, biological based, science based and others. In this survey, widely acknowledged nature-inspired algorithms, namely ACO, ABC, EAM, FA, FPA, GA, GSA, JAYA, PSO, SFLA, TLBO and WCA, have been studied. The purpose of this review is to present an exhaustive analysis of various nature-inspired algorithms based on their source of inspiration, basic operators, control parameters, features, variants and areas of application where these algorithms have been successfully applied. It shall also assist in identifying and shortlisting the methodologies that are best suited to the problem.


A Constant Approximation Algorithm for Sequential No-Substitution k-Median Clustering under a Random Arrival Order

arXiv.org Machine Learning

Clustering is a fundamental unsupervised learning task used for various applications, such as anomaly detection (Leung and Leckie, 2005), recommender systems (Shepitsen et al., 2008) and cancer diagnosis (Zheng et al., 2014). In recent years, sequential clustering has been actively studied, motivated by applications in which data arrives sequentially, such as online recommender systems (Nasraoui et al., 2007) and online community detection (Aggarwal, 2003). In this work, we study k-median clustering in the sequential no-substitution setting, a term first introduced in Hess and Sabato (2020). In this setting, a stream of data points is sequentially observed, and some of these points are selected by the algorithm as cluster centers. However, a point can be selected as a center only immediately after it is observed, before observing the next point. In addition, a selected center cannot be substituted later. This setting is motivated by applications in which center selection is mapped to a real-world irreversible action, such as providing users with promotional gifts or recruiting participants to a clinical trial. The goal in the no-substitution k-median setting is to obtain a near-optimal k-median risk value, while selecting a number of centers that is as close as possible to k.
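
The setting can be made concrete with a small sketch. The first function computes the k-median risk, taken here as the sum of distances from each point to its nearest selected center (some formulations average instead). The second shows the no-substitution constraint with a naive threshold rule: each point is either chosen as a center at the moment it is observed or never. The threshold rule and the names `k_median_risk` and `select_centers_online` are hypothetical and are not the algorithm proposed in the paper.

```python
import math


def k_median_risk(points, centers):
    """Sum over points of the Euclidean distance to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)


def select_centers_online(stream, threshold):
    """Naive illustrative rule: pick a point if it is far from all centers.

    The decision is made once per point, in arrival order, and a selected
    center is never removed or replaced (no substitution).
    """
    centers = []
    for p in stream:
        if all(math.dist(p, c) > threshold for c in centers):
            centers.append(p)  # irreversible selection
    return centers


stream = [(0, 0), (0, 1), (10, 0), (10, 1)]
centers = select_centers_online(stream, threshold=3)
risk = k_median_risk(stream, centers)
```

With a rule like this, the number of selected centers and the resulting risk both depend heavily on the arrival order, which is why the random-arrival assumption matters in this line of work.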