AITopics | sigclust

Collaborating Authors

sigclust

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Gaussian Mixture Clustering Using Relative Tests of Fit

Chakravarti, Purvasha, Balakrishnan, Sivaraman, Wasserman, Larry

arXiv.org Machine LearningOct-6-2019

We consider clustering based on significance tests for Gaussian Mixture Models (GMMs). Our starting point is the SigClust method developed by Liu et al. (2008), which introduces a test based on the k-means objective (with k = 2) to decide whether the data should be split into two clusters. When applied recursively, this test yields a method for hierarchical clustering that is equipped with a significance guarantee. We study the limiting distribution and power of this approach in some examples and show that there are large regions of the parameter space where the power is low. We then introduce a new test based on the idea of relative fit. Unlike prior work, we test for whether a mixture of Gaussians provides a better fit relative to a single Gaussian, without assuming that either model is correct. The proposed test has a simple critical value and provides provable error control. One version of our test provides exact, finite sample control of the type I error. We show how our tests can be used for hierarchical clustering as well as in a sequential manner for model selection. We conclude with an extensive simulation study and a cluster analysis of a gene expression dataset.

ab 1 2, equation, sigclust, (16 more...)

arXiv.org Machine Learning

1910.02566

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > Experimental Study (0.87)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Non-Parametric Cluster Significance Testing with Reference to a Unimodal Null Distribution

Helgeson, Erika S., Bair, Eric

arXiv.org Machine LearningOct-5-2016

Cluster analysis is an unsupervised learning strategy that can be employed to identify subgroups of observations in data sets of unknown structure. This strategy is particularly useful for analyzing high-dimensional data such as microarray gene expression data. Many clustering methods are available, but it is challenging to determine if the identified clusters represent distinct subgroups. We propose a novel strategy to investigate the significance of identified clusters by comparing the within- cluster sum of squares from the original data to that produced by clustering an appropriate unimodal null distribution. The null distribution we present for this problem uses kernel density estimation and thus does not require that the data follow any particular distribution. We find that our method can accurately test for the presence of clustering even when the number of features is high.

artificial intelligence, cluster index, machine learning, (12 more...)

arXiv.org Machine Learning

1610.01424

Genre: Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.90)

Add feedback

Significance Analysis of High-Dimensional, Low-Sample Size Partially Labeled Data

Lu, Qiyi, Qiao, Xingye

arXiv.org Machine LearningSep-20-2015

Classification and clustering are both important topics in statistical learning. A natural question herein is whether predefined classes are really different from one another, or whether clusters are really there. Specifically, we may be interested in knowing whether the two classes defined by some class labels (when they are provided), or the two clusters tagged by a clustering algorithm (where class labels are not provided), are from the same underlying distribution. Although both are challenging questions for the high-dimensional, low-sample size data, there has been some recent development for both. However, when it is costly to manually place labels on observations, it is often that only a small portion of the class labels is available. In this article, we propose a significance analysis approach for such type of data, namely partially labeled data. Our method makes use of the whole data and tries to test the class difference as if all the labels were observed. Compared to a testing method that ignores the label information, our method provides a greater power, meanwhile, maintaining the size, illustrated by a comprehensive simulation study. Theoretical properties of the proposed method are studied with emphasis on the high-dimensional, low-sample size setting. Our simulated examples help to understand when and how the information extracted from the labeled data can be effective. A real data example further illustrates the usefulness of the proposed method.

artificial intelligence, machine learning, sigclust, (17 more...)

arXiv.org Machine Learning

1509.06088

Country: North America > United States > New York (0.14)

Genre: Research Report > Experimental Study (0.71)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback