Selective inference for k-means clustering

Mar-29-2022–arXiv.org Machine Learning

If the groups under investigation are pre-specified, i.e., not a function of the observed data, then classical hypothesis tests will control the Type I error rate. However, it is increasingly common to want to test for a difference in means between groups that are defined through the observed data, e.g., via the output of a clustering algorithm. For instance, in single-cell RNA-sequencing analysis, researchers often first cluster the cells, and then test for a difference in the expected gene expression levels between the clusters to quantify up- or down-regulation of genes, annotate known cell types, and identify new cell types (Grün et al., 2015; Aizarani et al., 2019; Lähnemann et al., 2020; Zhang et al., 2019; Doughty & Kerkhoven, 2020). In fact, the inferential challenges resulting from testing data-guided hypotheses have been described as a "grand challenge" in the field of genomics (Lähnemann et al., 2020), and papers in the field continue to overlook this issue: as an example, Seurat (Stuart et al., 2019), the state-of-the-art single-cell RNA-sequencing analysis tool, tests for differential gene expression between groups obtained via clustering, with a note that "p-values [from these hypotheses] should be interpreted cautiously, as the genes used for clustering are the same genes tested for differential expression." Testing data-guided hypotheses also arises in the fields of neuroscience (Kriegeskorte et al., 2009; Button, 2019), social psychology (Hung & Fithian, 2020), and physical sciences (Friederich et al., 2020; Pollice
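The contrast drawn above between pre-specified groups and groups defined by clustering can be illustrated with a small simulation. The sketch below is not taken from the paper and does not implement its selective inference procedure; it only demonstrates the problem. It assumes one-dimensional Gaussian data drawn from a single population (so there is no true difference in means), a hand-written 2-means routine, and a normal-approximation two-sample test at nominal level 0.05. All function names (`two_means_1d`, `rejects`, `rejection_rates`) are hypothetical.

```python
import math
import random
import statistics


def two_means_1d(x, n_iter=50):
    """Lloyd's algorithm for k = 2 on one-dimensional data (illustrative only)."""
    c0, c1 = min(x), max(x)  # deterministic initialisation at the two extremes
    for _ in range(n_iter):
        mid = (c0 + c1) / 2  # in 1-D, nearest-centroid assignment is a threshold
        g0 = [v for v in x if v <= mid]
        g1 = [v for v in x if v > mid]
        new_c0, new_c1 = statistics.fmean(g0), statistics.fmean(g1)
        if (new_c0, new_c1) == (c0, c1):
            break  # centroids stopped moving
        c0, c1 = new_c0, new_c1
    return g0, g1


def rejects(a, b, crit=1.96):
    """Two-sided unequal-variance z-test at nominal level 0.05."""
    if len(a) < 2 or len(b) < 2:
        return False  # sample variance undefined; count as no rejection
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return abs(statistics.fmean(a) - statistics.fmean(b)) / se > crit


def rejection_rates(n_sim=500, n=100, seed=0):
    """Fraction of simulated null data sets in which each test rejects."""
    rng = random.Random(seed)
    fixed = clustered = 0
    for _ in range(n_sim):
        # One population only: the null hypothesis of equal means is true.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        fixed += rejects(x[: n // 2], x[n // 2:])   # groups chosen before seeing the data
        clustered += rejects(*two_means_1d(x))      # groups chosen by clustering the data
    return fixed / n_sim, clustered / n_sim


rate_fixed, rate_clustered = rejection_rates()
print(f"pre-specified groups: {rate_fixed:.3f}")
print(f"k-means groups:       {rate_clustered:.3f}")
```

Under these assumptions, the rejection rate for the pre-specified split should sit near the nominal 0.05, while the rate for the k-means split should be far higher, because the clustering step itself separates small values from large ones before the test is run.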
