Collaborating Authors


A tutorial on MDL hypothesis testing for graph analysis Machine Learning

When analysing graph structure, it can be difficult to determine whether patterns found are due to chance, or due to structural aspects of the process that generated the data. Hypothesis tests are often used to support such analyses. These allow us to make statistical inferences about which null models are responsible for the data, and they can be used as a heuristic in searching for meaningful patterns. The minimum description length (MDL) principle [6, 4] allows us to build such hypothesis tests, based on efficient descriptions of the data. Broadly: we translate the regularity we are interested in into a code for the data, and if this code describes the data more efficiently than a code corresponding to the null model, by a sufficient margin, we may reject the null model. This is a frequentist approach to MDL, based on hypothesis testing. Bayesian approaches to MDL for model selection rather than model rejection are more common, but for the purposes of pattern analysis, a hypothesis testing approach provides a more natural fit with existing literature. 1 We provide a brief illustration of this principle based on the running example of analysing the size of the largest clique in a graph. We illustrate how a code can be constructed to efficiently represent graphs with large cliques, and how the description length of the data under this code can be compared to the description length under a code corresponding to a null model to show that the null model is highly unlikely to have generated the data.

The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective


Many people have found the table above to be useful for understanding two conceptual distinctions in the practice of data analysis. The article that discusses the table, and many other issues, is now in press. The in-press version can be found at OSF and at SSRN. Abstract: In the practice of data analysis, there is a conceptual distinction between hypothesis testing, on the one hand, and estimation with quantified uncertainty, on the other hand. Among frequentists in psychology a shift of emphasis from hypothesis testing to estimation has been dubbed "the New Statistics" (Cumming, 2014).

Accelerating Science: A Computing Research Agenda Artificial Intelligence

The emergence of "big data" offers unprecedented opportunities for not only accelerating scientific advances but also enabling new modes of discovery. Scientific progress in many disciplines is increasingly enabled by our ability to examine natural phenomena through the computational lens, i.e., using algorithmic or information processing abstractions of the underlying processes; and our ability to acquire, share, integrate and analyze disparate types of data. However, there is a huge gap between our ability to acquire, store, and process data and our ability to make effective use of the data to advance discovery. Despite successful automation of routine aspects of data management and analytics, most elements of the scientific process currently require considerable human expertise and effort. Accelerating science to keep pace with the rate of data acquisition and data processing calls for the development of algorithmic or information processing abstractions, coupled with formal methods and tools for modeling and simulation of natural processes as well as major innovations in cognitive tools for scientists, i.e., computational tools that leverage and extend the reach of human intellect, and partner with humans on a broad range of tasks in scientific discovery (e.g., identifying, prioritizing formulating questions, designing, prioritizing and executing experiments designed to answer a chosen question, drawing inferences and evaluating the results, and formulating new questions, in a closed-loop fashion). This calls for concerted research agenda aimed at: Development, analysis, integration, sharing, and simulation of algorithmic or information processing abstractions of natural processes, coupled with formal methods and tools for their analyses and simulation; Innovations in cognitive tools that augment and extend human intellect and partner with humans in all aspects of science.

Notes on a New Philosophy of Empirical Science Machine Learning

This book presents a methodology and philosophy of empirical science based on large scale lossless data compression. In this view a theory is scientific if it can be used to build a data compression program, and it is valuable if it can compress a standard benchmark database to a small size, taking into account the length of the compressor itself. This methodology therefore includes an Occam principle as well as a solution to the problem of demarcation. Because of the fundamental difficulty of lossless compression, this type of research must be empirical in nature: compression can only be achieved by discovering and characterizing empirical regularities in the data. Because of this, the philosophy provides a way to reformulate fields such as computer vision and computational linguistics as empirical sciences: the former by attempting to compress databases of natural images, the latter by attempting to compress large text databases. The book argues that the rigor and objectivity of the compression principle should set the stage for systematic progress in these fields. The argument is especially strong in the context of computer vision, which is plagued by chronic problems of evaluation. The book also considers the field of machine learning. Here the traditional approach requires that the models proposed to solve learning problems be extremely simple, in order to avoid overfitting. However, the world may contain intrinsically complex phenomena, which would require complex models to understand. The compression philosophy can justify complex models because of the large quantity of data being modeled (if the target database is 100 Gb, it is easy to justify a 10 Mb model). The complex models and abstractions learned on the basis of the raw data (images, language, etc) can then be reused to solve any specific learning problem, such as face recognition or machine translation.