 witten


Selective inference for k-means clustering

Chen, Yiqun T., Witten, Daniela M.

arXiv.org Machine Learning

If the groups under investigation are pre-specified, i.e., not a function of the observed data, then classical hypothesis tests will control the Type I error rate. However, it is increasingly common to want to test for a difference in means between groups that are defined through the observed data, e.g., via the output of a clustering algorithm. For instance, in single-cell RNA-sequencing analysis, researchers often first cluster the cells, and then test for a difference in the expected gene expression levels between the clusters to quantify up- or down-regulation of genes, annotate known cell types, and identify new cell types (Grün et al., 2015; Aizarani et al., 2019; Lähnemann et al., 2020; Zhang et al., 2019; Doughty & Kerkhoven, 2020). In fact, the inferential challenges resulting from testing data-guided hypotheses have been described as a "grand challenge" in the field of genomics (Lähnemann et al., 2020), and papers in the field continue to overlook this issue: as an example, Seurat (Stuart et al., 2019), the state-of-the-art single-cell RNA sequencing analysis tool, tests for differential gene expression between groups obtained via clustering, with a note that "p-values [from these hypotheses] should be interpreted cautiously, as the genes used for clustering are the same genes tested for differential expression." Testing data-guided hypotheses also arises in the fields of neuroscience (Kriegeskorte et al., 2009; Button, 2019), social psychology (Hung & Fithian, 2020), and physical sciences (Friederich et al., 2020; Pollice
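The problem described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' selective test; it only shows why the naive approach fails. It assumes one-dimensional data drawn from a single N(0, 1) population (so no true clusters exist), a hand-written two-cluster Lloyd's algorithm, and a large-sample two-sided z-test in place of a t-test. All function names are hypothetical and chosen for illustration.

```python
import random
from statistics import NormalDist, mean, variance


def two_means_1d(xs, iters=50):
    """Lloyd's algorithm for k=2 in one dimension; returns a list of 0/1 labels."""
    c0, c1 = min(xs), max(xs)  # deterministic initial centers (an assumption)
    labels = []
    for _ in range(iters):
        new_labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        c0 = mean(x for x, lab in zip(xs, labels) if lab == 0)
        c1 = mean(x for x, lab in zip(xs, labels) if lab == 1)
    return labels


def z_test_pvalue(a, b):
    """Two-sided two-sample z-test (normal approximation, unequal variances)."""
    if len(a) < 2 or len(b) < 2:
        return 1.0  # too few points to estimate a variance
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rates(n_sims=500, n=100, alpha=0.05, seed=0):
    """Fraction of null datasets rejected: (data-defined groups, pre-specified groups)."""
    rng = random.Random(seed)
    naive = fixed = 0
    for _ in range(n_sims):
        xs = [rng.gauss(0, 1) for _ in range(n)]  # one population, no real clusters
        labels = two_means_1d(xs)
        a = [x for x, lab in zip(xs, labels) if lab == 0]
        b = [x for x, lab in zip(xs, labels) if lab == 1]
        naive += z_test_pvalue(a, b) < alpha  # groups chosen by clustering
        fixed += z_test_pvalue(xs[: n // 2], xs[n // 2:]) < alpha  # groups fixed in advance
    return naive / n_sims, fixed / n_sims


if __name__ == "__main__":
    naive_rate, fixed_rate = rejection_rates()
    print(f"rejection rate, clusters as groups:   {naive_rate:.3f}")
    print(f"rejection rate, pre-specified groups: {fixed_rate:.3f}")
```

Under these assumptions, the test on pre-specified groups should reject at a rate close to the nominal level, while the test on cluster-defined groups rejects far more often, because the clustering step has already separated the observations by value.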


Introduction to Statistical Learning Second Edition - KDnuggets

#artificialintelligence

An Introduction to Statistical Learning, with Applications in R, written by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, is an absolute classic in the space. The book, a staple of statistical learning texts, is accessible to readers of all levels, and can be read without much existing foundational knowledge in the area. While the original has been around since 2013, the second edition was published very recently, and is now freely available as a PDF on the book's website. As the scale and scope of data collection continue to increase across virtually all fields, statistical learning has become a critical toolkit for anyone who wishes to understand data. An Introduction to Statistical Learning provides a broad and less technical treatment of key topics in statistical learning.


The Essence of Machine Learning

#artificialintelligence

I thought I would end off the year with a not-so-serious post about capturing the essence of machine learning. In the past, you have undoubtedly explored a variety of in-depth and semi in-depth offerings on what machine learning is, and explored its relationships to numerous other topics. Starting from some initial common point of reference when discussing such complex concepts is always a good idea; the problem is, there exist innumerable initial common points of reference for a topic such as machine learning. So I thought, why not examine some of these points of reference? And now, without further ado, as an exercise in what may seem to be semantics, let's explore some 30,000-foot definitions of what machine learning is.


The Most Important Machine Learning Books

#artificialintelligence

This list is constantly updated. Didn't find the book you think is great? Let us know and we will consider adding this book to the list. Read our previous post "Glossary of Machine Learning Terms" or subscribe to our RSS feed.


Data Mining: Practical Machine Learning Tools and Techniques: Eibe Frank & Mark A. Hall Ian H. Witten: 9789380501864: Amazon.com: Books

@machinelearnbot

Data mining is not an intuitive activity. It requires skills and techniques which can be honed by using books like Data Mining: Practical Machine Learning Tools and Techniques. This book is an ultimate guide for applying machine learning techniques and tools in real-world situations of data mining. The book will help you interpret outputs, evaluate results and prepare inputs and will provide the algorithmic methods for efficient data mining. You will not only find the explanation of concepts in this book but will also come across practical advice for successful data mining.


Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems): Ian H. Witten, Eibe Frank: 9780120884070: Amazon.com: Books

@machinelearnbot

I chose this book after looking at a number of options. The text is clearly written for individuals with a bachelor-level education in computer science. The author prefers pseudocode and text explanations of algorithms to equations, and when he does use equations they use clear, commonly understandable notation rather than the terse Greek alphabet soup preferred by many of the more mathematically oriented authors. It should be pointed out that about 10% of the text of this book is devoted simply to serving as a user manual for an open source MLA package called Weka. When I first realized this I almost flipped; I really didn't want a book that was devoted to gaining a surface understanding of a particular implementation of a set of algorithms.



Daniela Witten: Using artificial intelligence to study genomes

AITopics Original Links

Raw scientific data is something like gold ore -- tons of rock containing a few precious nuggets. Daniela Witten, an assistant professor of biostatistics at the University of Washington in Seattle, is developing artificial intelligence programs to sort the slurry, helping researchers develop more personalized and effective treatments for cancer and other diseases. "In the last 10 years, the field of biology has totally transformed," says Witten, who, at 27, made the Forbes list of 30 under 30 last year with time to spare. While a biologist a generation ago might have spent a career studying a single protein, leaps in technology now make it possible to measure thousands of proteins or map the DNA sequence of a cancer cell. "A single experiment can generate a gigabyte of data -- if not more," says Witten.


Amazon.com: Data Mining: (Morgan Kaufmann Series in Data Management Systems) eBook: Ian H. Witten, Eibe Frank, Mark A. Hall: Kindle Store

@machinelearnbot

First of all, I would advise thinking of this as a 400-page book with a WEKA appendix. Its price is about right for a 400-page machine learning textbook, and you don't even need to know that WEKA exists for the first 400 pages. I never read any of the WEKA stuff and got tons out of the textbook part. The average explanation amounts to "There's a technique called X, where you do this... it has a couple problems, but you could try fixing them in these ways." It's great for getting a lot of machine learning and data mining ideas in your head without having to get confused by learning the math behind them.


Amazon.com: Data Mining: (Morgan Kaufmann Series in Data Management Systems) eBook: Ian H. Witten, Eibe Frank, Mark A. Hall: Kindle Store

@machinelearnbot

There exist a couple of classics of machine learning, with various strengths and weaknesses. I'd say this is the most practical of the three books. The other two I mentioned are oriented towards theoretical underpinnings, and cataloging the rich zoology of machine learning techniques. This one tells you how to get stuff done. It even has practical advice on things you really need an expert opinion on: for example, when using data folding techniques for cross validation ... what is a good number of folds to use? This book will tell you.