Scientific Discovery


Robust Hypothesis Testing Using Wasserstein Uncertainty Sets

Neural Information Processing Systems

We develop a novel, computationally efficient, and general framework for robust hypothesis testing. The framework features a new way to construct uncertainty sets under the null and the alternative distributions: sets centered around the empirical distribution and defined via the Wasserstein metric, so our approach is data-driven and free of distributional assumptions. We develop a convex safe approximation of the minimax formulation and show that this approximation yields a nearly optimal detector among the family of all possible tests. By exploiting the structure of the least favorable distribution, we also develop a tractable reformulation of the approximation, with complexity independent of the dimension of the observation space and, in general, nearly independent of the sample size. A real-data example using human activity data demonstrates the excellent performance of the new robust detector.
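As a rough illustration of the idea (not the paper's convex safe approximation), the sketch below represents each hypothesis by a Wasserstein ball around an empirical sample and assigns a new batch of observations to the nearest center; the radius `eps`, the Gaussian data, and the decision margin are all assumptions made for the demo.

```python
# Minimal 1-D illustration of Wasserstein-ball uncertainty sets around
# empirical distributions; a nearest-center rule stands in for the
# paper's minimax-optimal detector.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
null_sample = rng.normal(0.0, 1.0, size=500)   # empirical center under H0
alt_sample = rng.normal(1.0, 1.0, size=500)    # empirical center under H1
eps = 0.1                                      # Wasserstein radius (an assumed choice)

def detect(batch):
    """Assign a batch to the hypothesis whose Wasserstein ball is clearly closer."""
    d0 = wasserstein_distance(batch, null_sample)
    d1 = wasserstein_distance(batch, alt_sample)
    if d0 + 2 * eps < d1:
        return "H0"
    if d1 + 2 * eps < d0:
        return "H1"
    return "ambiguous"  # batch sits near both uncertainty sets

print(detect(rng.normal(0.9, 1.0, size=200)))  # typically prints "H1"
```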


Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

Neural Information Processing Systems

Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely challenging to quantify the uncertainty associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance, such as confidence intervals or p-values. We consider here a broad class of regression problems and propose an efficient algorithm for constructing confidence intervals and p-values.
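One concrete route to such intervals is a de-biasing correction on top of a regularized estimate; the abstract does not spell out the paper's exact construction, so the sketch below is only in that spirit. The tuning value `alpha`, the plug-in noise level, and the naive precision-matrix estimate (which requires n > p, whereas the high-dimensional regime has p >> n) are all simplifying assumptions.

```python
# Hedged sketch of a de-biased lasso confidence interval for one coordinate.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[0] = 1.5              # one active coefficient
sigma = 1.0
y = X @ beta + sigma * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)             # alpha is an assumed tuning choice
Theta = np.linalg.inv(X.T @ X / n)             # crude precision estimate (n > p only)
residual = y - X @ lasso.coef_
beta_debiased = lasso.coef_ + Theta @ X.T @ residual / n

# 95% interval for beta_0, plugging in the true noise level for simplicity
se = sigma * np.sqrt(Theta[0, 0] / n)
lo, hi = beta_debiased[0] - 1.96 * se, beta_debiased[0] + 1.96 * se
print(f"beta_0 in [{lo:.2f}, {hi:.2f}]")       # should cover the true value 1.5
```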


Nonzero-sum Adversarial Hypothesis Testing Games

Neural Information Processing Systems

We study nonzero-sum hypothesis testing games that arise in the context of adversarial classification, in both the Bayesian as well as the Neyman-Pearson frameworks. We first show that these games admit mixed strategy Nash equilibria, and then we examine some interesting concentration phenomena of these equilibria. Our main results are on the exponential rates of convergence of classification errors at equilibrium, which are analogous to the well-known Chernoff-Stein lemma and Chernoff information that describe the error exponents in the classical binary hypothesis testing problem, but with parameters derived from the adversarial model. The results are validated through numerical experiments.
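For reference, the classical (non-adversarial) quantities the abstract invokes are the following; the paper's contribution is to derive analogous exponents whose parameters come from the adversarial model.

```latex
% Chernoff-Stein lemma: with the type-I error held below a fixed \alpha,
% the optimal type-II error \beta_n over n i.i.d. observations satisfies
\lim_{n \to \infty} \frac{1}{n} \log \beta_n = -D(P_0 \,\|\, P_1).
% Chernoff information: the best achievable exponent of the Bayesian
% probability of error is
C(P_0, P_1) = -\min_{0 \le \lambda \le 1} \log \sum_x P_0(x)^{\lambda} P_1(x)^{1-\lambda}.
```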


Adaptive Active Hypothesis Testing under Limited Information

Neural Information Processing Systems

We consider the problem of active sequential hypothesis testing, where a Bayesian decision maker must infer the true hypothesis from a set of hypotheses. The decision maker may choose from a set of actions, where the outcome of an action is corrupted by independent noise. In this paper we consider a special case in which the decision maker has limited knowledge about the distribution of observations for each action, in that only a binary value is observed. Our objective is to infer the true hypothesis with low error while minimizing the number of actions sampled. Our main results include the derivation of a lower bound on the sample size for our system under limited knowledge, and the design of an active learning policy that matches this lower bound and outperforms similar known algorithms.
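A toy version of the setting might look like the sketch below, with a greedy "most discriminating action" rule standing in for the paper's policy, and with the outcome probabilities `q[h, a]` assumed known to the simulator (the paper's point is precisely that the decision maker only sees binary values).

```python
# Toy active sequential hypothesis testing with binary observations.
import numpy as np

rng = np.random.default_rng(2)
q = np.array([[0.9, 0.6],    # P(observe 1 | hypothesis 0, action a)
              [0.4, 0.2]])   # P(observe 1 | hypothesis 1, action a)
true_h = 0
post = np.array([0.5, 0.5])  # uniform prior over the two hypotheses
samples = 0

while post.max() < 0.99:
    a = int(np.argmax(np.abs(q[0] - q[1])))    # greedy: most discriminating action
    obs = rng.random() < q[true_h, a]          # noisy binary outcome
    like = q[:, a] if obs else 1.0 - q[:, a]   # per-hypothesis likelihood of outcome
    post = like * post
    post /= post.sum()                         # Bayes update
    samples += 1

print(f"declared H{post.argmax()} after {samples} samples")
```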


Hypothesis Testing in Unsupervised Domain Adaptation with Applications in Alzheimer's Disease

Neural Information Processing Systems

Our goal is to perform a statistical test checking if $P_{\rm source} = P_{\rm target}$ while removing the distortions induced by the transformations. This problem is closely related to concepts underlying numerous domain adaptation algorithms, and in our case is motivated by the need to combine clinical and imaging-based biomarkers from multiple sites and/or batches, where this problem is fairly common and an impediment to conducting analyses with much larger sample sizes. We develop a framework that addresses this problem using ideas from hypothesis testing on the transformed measurements, wherein the distortions need to be estimated {\it in tandem} with the testing. We derive a simple algorithm and study its convergence and consistency properties in detail, and we also provide lower-bound strategies based on recent work in continuous optimization. On a dataset of individuals at risk for neurological disease, our results are competitive with alternative procedures that are twice as expensive and in some cases operationally infeasible to implement.
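A much-simplified two-stage caricature of the problem is sketched below: fit a distortion (here just an affine mean/scale alignment, far weaker than the paper's class of transformations), then run an off-the-shelf two-sample test on the aligned data. Note that the naive p-value ignores that the transformation was estimated from the same data, which is exactly the difficulty the paper's in-tandem estimation addresses.

```python
# Two-stage stand-in: align source to target, then a KS two-sample test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
src = rng.normal(0.0, 1.0, size=300)               # site A measurements
tgt = 2.0 + 1.5 * rng.normal(0.0, 1.0, size=300)   # site B: affinely distorted copy

# remove the estimated affine distortion
aligned = (src - src.mean()) / src.std() * tgt.std() + tgt.mean()

stat, p = ks_2samp(aligned, tgt)
print(f"KS statistic {stat:.3f}, naive p-value {p:.3f}")
# large p: no evidence against P_source = P_target once distortions are removed
```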


These are the top 20 scientific discoveries of the decade

#artificialintelligence

To understand the natural world, scientists must measure it, but how do we define our units? Over the decades, scientists have gradually redefined classic units in terms of universal constants, such as using the speed of light to help define the length of a meter. But the scientific unit of mass, the kilogram, remained pegged to "Le Grand K," a metallic cylinder stored at a facility in France. If that ingot's mass varied for whatever reason, scientists would have to recalibrate their instruments. No more: in 2019, scientists agreed to adopt a new kilogram definition based on a fundamental constant of physics, Planck's constant, along with improved definitions for the units of electrical current, temperature, and the number of particles in a given substance.


Bitlattice - the new paradigm

#artificialintelligence

Bitlattice has, or can have, the instrumentation needed to act as a neural network. The idea is wild, but ultimately possible and potentially beneficial. While a globe-wide network operating in this mode won't be fast (owing to physical limits on signal speed and network delays), the fact that the middle layer contains far fewer nodes than the actual number of participating devices makes the idea at least feasible to implement. One practical application could be, for instance, a "feeling planet"-style project.


Data Discovery and Lineage Simplified for Cloud Analytics

#artificialintelligence

Findings show that data practitioners spend a majority (up to 80%) of their time on data wrangling instead of mining data for analytics and machine learning projects. Organizations want to find trusted datasets so they can gain visibility into workloads across data sources, as well as their upstream and downstream impact. Take the first step towards successful cloud modernization with Databricks and Informatica. The partnership provides end-to-end data discovery and lineage, enabled by Informatica's AI-powered Enterprise Data Catalog, which helps enterprises be highly strategic about data engineering with complete visibility into their data stack. Register now to see an in-depth demo of the Databricks and Informatica joint solution for data lineage.


Hypothesis Testing in Machine Learning: What for and Why

#artificialintelligence

Suppose you are working on a machine learning project in which you want to predict whether a set of patients has a fatal disease, based on several features in your dataset such as blood pressure, heart rate, and pulse. Sounds like a serious project, one for which you'll need to really trust your model and its predictions, right? That's why you collected hundreds of samples, which your local hospital very kindly allowed you to gather, given the importance and seriousness of the topic. But how do you know if your sample is representative of the whole population? And how can you know how much difference is reasonable?
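The kind of check the article is gesturing at can be as simple as a one-sample t-test of the sample mean against a known population value; the 120 mmHg reference and the synthetic patient data below are stand-ins for illustration.

```python
# Is our hospital sample's mean blood pressure consistent with the population?
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(4)
sample_bp = rng.normal(123.0, 15.0, size=300)   # synthetic patient sample (mmHg)

t_stat, p_value = ttest_1samp(sample_bp, popmean=120.0)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# small p => the sample mean differs from the population mean by more
# than sampling variability alone would explain
```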


A data scientist calls for caution in trusting AI discoveries (Science News)

#artificialintelligence

We live in a golden age of scientific data, with larger stockpiles of genetic information, medical images and astronomical observations than ever before. Artificial intelligence can pore over these troves to uncover potential new scientific discoveries much quicker than people ever could. But we should not blindly trust AI's scientific insights, argues data scientist Genevera Allen, until these computer programs can better gauge how certain they are in their own results. AI systems that use machine learning -- programs that learn what to do by studying data rather than following explicit instructions -- can be entrusted with some decisions, says Allen, of Rice University in Houston. Namely, AI is reliable for making decisions in areas where humans can easily check their work, like counting craters on the moon or predicting earthquake aftershocks (SN: 12/22/18, p. 25).