Goto

Collaborating Authors

 Clustering


Fine-grained Prediction of Political Leaning on Social Media with Unsupervised Deep Learning

Journal of Artificial Intelligence Research

Predicting the political leaning of social media users is an increasingly popular task, given its usefulness for electoral forecasts, opinion dynamics models and for studying the political dimension of polarization and disinformation. Here, we propose a novel unsupervised technique for learning fine-grained political leaning from the textual content of social media posts. Our technique leverages a deep neural network for learning latent political ideologies in a representation learning task. Then, users are projected in a low-dimensional ideology space where they are subsequently clustered. The political leaning of a user is automatically derived from the cluster to which the user is assigned. We evaluated our technique in two challenging classification tasks and we compared it to baselines and other state-of-the-art approaches. Our technique obtains the best results among all unsupervised techniques, with micro F1 = 0.426 in the 8-class task and micro F1 = 0.772 in the 3-class task. Other than being interesting on their own, our results also pave the way for the development of new and better unsupervised approaches for the detection of fine-grained political leaning.


RVN Algorithm

#artificialintelligence

When we need to cluster a data set, the first couple of algorithms we might look into are K means, DB scan, or hierarchical clustering algorithm. Those classic clustering algorithms always treat each data point as a dot. However, those data points usually have size or boundary(bounding box) in real life. Ignoring the edge of points might cause further bias. RVN algorithm is a method that considers points and the bounding box of each point.


Towards Continuous Consistency Axiom

arXiv.org Artificial Intelligence

Development of new algorithms in the area of machine learning, especially clustering, comparative studies of such algorithms as well as testing according to software engineering principles requires availability of labeled data sets. While standard benchmarks are made available, a broader range of such data sets is necessary in order to avoid the problem of overfitting. In this context, theoretical works on axiomatization of clustering algorithms, especially axioms on clustering preserving transformations are quite a cheap way to produce labeled data sets from existing ones. However, the frequently cited axiomatic system of Kleinberg:2002, as we show in this paper, is not applicable for finite dimensional Euclidean spaces, in which many algorithms like $k$-means, operate. In particular, the so-called outer-consistency axiom fails upon making small changes in datapoint positions and inner-consistency axiom is valid only for identity transformation in general settings. Hence we propose an alternative axiomatic system, in which Kleinberg's inner consistency axiom is replaced by a centric consistency axiom and outer consistency axiom is replaced by motion consistency axiom. We demonstrate that the new system is satisfiable for a hierarchical version of $k$-means with auto-adjusted $k$, hence it is not contradictory. Additionally, as $k$-means creates convex clusters only, we demonstrate that it is possible to create a version detecting concave clusters and still the axiomatic system can be satisfied. The practical application area of such an axiomatic system may be the generation of new labeled test data from existent ones for clustering algorithm testing. %We propose the gravitational consistency as a replacement which does not have this deficiency.


Semi-supervised New Event Type Induction and Description via Contrastive Loss-Enforced Batch Attention

arXiv.org Artificial Intelligence

Existing work (Ji and Grishman, we consider the attention weight between 2008; McClosky et al., 2011; Li et al., 2013; two event mentions as a learned similarity, and we Chen et al., 2015; Du and Cardie, 2020; Li et al., ensure that the attention mechanism learns to align 2021a) traditionally uses a predefined list of event similar events using a semi-supervised contrastive types and their respective annotations to learn an loss. By doing this, we are able to leverage the event extraction model. However, these annotations large variety of semantic information in pretrained are both expensive and time-consuming to language models for clustering unseen types by using create. This problem is amplified when considering a trained attention head. Unlike (Huang and specialization-intensive domains such as scientific Ji, 2020), we are able to separate clustering from literature, which requires years of specialized experience learning, allowing specific task-suited clustering to understand even a specific niche. For algorithms to be selected.


Inference of Multiscale Gaussian Graphical Model

arXiv.org Machine Learning

Gaussian Graphical Models (GGMs) are widely used for exploratory data analysis in various fields such as genomics, ecology, psychometry. In a high-dimensional setting, when the number of variables exceeds the number of observations by several orders of magnitude, the estimation of GGM is a difficult and unstable optimization problem. Clustering of variables or variable selection is often performed prior to GGM estimation. We propose a new method allowing to simultaneously infer a hierarchical clustering structure and the graphs describing the structure of independence at each level of the hierarchy. This method is based on solving a convex optimization problem combining a graphical lasso penalty with a fused type lasso penalty. Results on real and synthetic data are presented.


Improving performance of aircraft detection in satellite imagery while limiting the labelling effort: Hybrid active learning

arXiv.org Artificial Intelligence

The earth observation industry provides satellite imagery with high spatial resolution and short revisit time. To allow efficient operational employment of these images, automating certain tasks has become necessary. In the defense domain, aircraft detection on satellite imagery is a valuable tool for analysts. Obtaining high performance detectors on such a task can only be achieved by leveraging deep learning and thus us-ing a large amount of labeled data. To obtain labels of a high enough quality, the knowledge of military experts is needed.We propose a hybrid clustering active learning method to select the most relevant data to label, thus limiting the amount of data required and further improving the performances. It combines diversity- and uncertainty-based active learning selection methods. For aircraft detection by segmentation, we show that this method can provide better or competitive results compared to other active learning methods.


FCM-DNN: diagnosing coronary artery disease by deep accuracy Fuzzy C-Means clustering model

arXiv.org Artificial Intelligence

Cardiovascular disease is one of the most challenging diseases in middle-aged and older people, which causes high mortality. Coronary artery disease (CAD) is known as a common cardiovascular disease. A standard clinical tool for diagnosing CAD is angiography. The main challenges are dangerous side effects and high angiography costs. Today, the development of artificial intelligence-based methods is a valuable achievement for diagnosing disease. Hence, in this paper, artificial intelligence methods such as neural network (NN), deep neural network (DNN), and Fuzzy C-Means clustering combined with deep neural network (FCM-DNN) are developed for diagnosing CAD on a cardiac magnetic resonance imaging (CMRI) dataset. The original dataset is used in two different approaches. First, the labeled dataset is applied to the NN and DNN to create the NN and DNN models. Second, the labels are removed, and the unlabeled dataset is clustered via the FCM method, and then, the clustered dataset is fed to the DNN to create the FCM-DNN model. By utilizing the second clustering and modeling, the training process is improved, and consequently, the accuracy is increased. As a result, the proposed FCM-DNN model achieves the best performance with a 99.91% accuracy specifying 10 clusters, i.e., 5 clusters for healthy subjects and 5 clusters for sick subjects, through the 10-fold cross-validation technique compared to the NN and DNN models reaching the accuracies of 92.18% and 99.63%, respectively. To the best of our knowledge, no study has been conducted for CAD diagnosis on the CMRI dataset using artificial intelligence methods. The results confirm that the proposed FCM-DNN model can be helpful for scientific and research centers.


Optimal Clustering with Bandit Feedback

arXiv.org Machine Learning

This paper considers the problem of online clustering with bandit feedback. A set of arms (or items) can be partitioned into various groups that are unknown. Within each group, the observations associated to each of the arms follow the same distribution with the same mean vector. At each time step, the agent queries or pulls an arm and obtains an independent observation from the distribution it is associated to. Subsequent pulls depend on previous ones as well as the previously obtained samples. The agent's task is to uncover the underlying partition of the arms with the least number of arm pulls and with a probability of error not exceeding a prescribed constant $\delta$. The problem proposed finds numerous applications from clustering of variants of viruses to online market segmentation. We present an instance-dependent information-theoretic lower bound on the expected sample complexity for this task, and design a computationally efficient and asymptotically optimal algorithm, namely Bandit Online Clustering (BOC). The algorithm includes a novel stopping rule for adaptive sequential testing that circumvents the need to exactly solve any NP-hard weighted clustering problem as its subroutines. We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.


Zhou

AAAI Conferences

Clustering ensemble has emerged as an important extension of the classical clustering problem. It provides a framework for combining multiple base clusterings of a data set to generate a final consensus result. Most existing clustering methods simply combine clustering results without taking into account the noises, which may degrade the clustering performance. In this paper, we propose a novel robust clustering ensemble method. To improve the robustness, we capture the sparse and symmetric errors and integrate them into our robust and consensus framework to learn a low-rank matrix. Since the optimization of the objective function is difficult to solve, we develop a block coordinate descent algorithm which is theoretically guaranteed to converge. Experimental results on real world data sets demonstrate the effectiveness of our method.


Zhang

AAAI Conferences

Multi-task clustering and multi-view clustering have severally found wide applications and received much attention in recent years. Nevertheless, there are many clustering problems that involve both multi-task clustering and multi-view clustering, i.e., the tasks are closely related and each task can be analyzed from multiple views. In this paper, for non-negative data (e.g., documents), we introduce a multi-task multi-view clustering (MTMVC) framework which integrates within-view-task clustering, multi-view relationship learning and multi-task relationship learning. We then propose a specific algorithm to optimize the MTMVC framework. Experimental results show the superiority of the proposed algorithm over either multi-task clustering algorithms or multi-view clustering algorithms for multi-task clustering of multi-view data.