# Scientific Discovery

### How We Improved Data Discovery for Data Scientists at Spotify

Not only does this provide useful information to users in the moment, but it has also helped raise awareness and increase the adoption of Lexikon. Since launching the Lexikon Slack Bot, we've seen a sustained 25% increase in the number of Lexikon links shared on Slack per week. You just listened to a track by a new artist on your Discover Weekly and you're hooked. You want to hear more and learn about the artist. So, you go to the artist page on Spotify where you can check out the most popular tracks across different albums, read an artist bio, check out playlists where people tend to discover the artist, and explore similar artists.

### Nonzero-sum Adversarial Hypothesis Testing Games

We study nonzero-sum hypothesis testing games that arise in the context of adversarial classification, in both the Bayesian and the Neyman-Pearson frameworks. We first show that these games admit mixed strategy Nash equilibria, and then we examine some interesting concentration phenomena of these equilibria. Our main results are on the exponential rates of convergence of classification errors at equilibrium, which are analogous to the well-known Chernoff-Stein lemma and Chernoff information that describe the error exponents in the classical binary hypothesis testing problem, but with parameters derived from the adversarial model. The results are validated through numerical experiments. Papers published at the Neural Information Processing Systems Conference.

### Chance discovery brings quantum computing using standard microchips a step closer

A study to prod an antimony nucleus (buried in the middle of this device) with magnetic fields became one with electric fields when a key wire melted a gap in it. An accidental innovation has given a dark-horse approach to quantum computing a boost. For decades, scientists have dreamed of using atomic nuclei embedded in silicon--the familiar stuff of microchips--as quantum bits, or qubits, in a superpowerful quantum computer, manipulating them with magnetic fields. Now, researchers in Australia have stumbled across a way to control such a nucleus with more-manageable electric fields, raising the prospect of controlling the qubits in much the same way as transistors in an ordinary microchip. "That's incredibly important," says Thaddeus Ladd, a research physicist at HRL Laboratories LLC, a private research company.

### Why Philip Pullman Is Obsessed with Panpsychism - Facts So Romantic

Philip Pullman is once again having a moment, thanks to the new blockbuster adaptation of His Dark Materials by the BBC and HBO. His fantasy classic--filled with witches, talking bears and "daemons" (people's alter-egos that take animal form)--is rendered in glorious steampunk detail. Pullman has also returned to the fictional world of his heroine, Lyra Belacqua, with a new trilogy, The Book of Dust, which probes more deeply into the central question of his earlier books: What is the nature of consciousness? Pullman loves to write about big ideas, and recent scientific discoveries about dark matter and the Higgs boson have inspired certain plot elements in his novels. The biggest mystery in these books--an enigmatic substance called Dust--comes right out of current debates among scientists and philosophers about the origins of consciousness and the provocative theory of panpsychism.

### PAPRIKA: Private Online False Discovery Rate Control

In the modern era of big data, data analyses play an important role in decision-making in healthcare, information technology, and government agencies. The growing availability of large-scale datasets and ease of data analysis, while beneficial to society, has created a severe crisis of reproducibility in science. In 2011, Bayer HealthCare reviewed 67 in-house projects and found that they could replicate fewer than 25 percent, and found that over two-thirds of the projects had major inconsistencies [oSEM19]. One major reason is that random noise in the data can often be mistaken for interesting signals, which does not lead to valid and reproducible results. This problem is particularly relevant when testing multiple hypotheses, when there is an increased chance of false discoveries based on noise in the data. For example, an analyst may conduct 250 hypothesis tests and find that 11 are significant at the 5% level. This may be exciting to the researcher who publishes a paper based on these findings, but elementary statistics suggests that (in expectation) 12.5 of those tests should be significant at that level purely by chance, even if the null hypotheses were all true. To avoid such problems, statisticians have developed tools for controlling overall error rates when performing multiple hypothesis tests. In hypothesis testing problems, the null hypothesis of no interesting scientific discovery (e.g., a drug has no effect), is tested against the alternative hypothesis of a particular scientific theory being true (e.g., a drug
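The arithmetic behind the 250-test example above can be checked with a short sketch. It assumes the tests are independent and that every null hypothesis is true, so the number of significant results follows a Binomial(250, 0.05) distribution. The function names are illustrative, and the Bonferroni threshold is shown only as one familiar correction, not as the method this paper proposes.

```python
from math import comb


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of significant tests when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least(k: int, n_tests: int, alpha: float) -> float:
    """P(X >= k) for X ~ Binomial(n_tests, alpha), assuming independent tests."""
    return sum(
        comb(n_tests, i) * alpha**i * (1 - alpha) ** (n_tests - i)
        for i in range(k, n_tests + 1)
    )


def bonferroni_threshold(n_tests: int, alpha: float) -> float:
    """Per-test level that keeps the family-wise error rate at most alpha."""
    return alpha / n_tests


n_tests, alpha, observed = 250, 0.05, 11

print("expected by chance:", expected_false_positives(n_tests, alpha))
print("P(at least 11 by chance):", prob_at_least(observed, n_tests, alpha))
print("Bonferroni per-test level:", bonferroni_threshold(n_tests, alpha))
```

Under these assumptions, seeing 11 significant results is below the chance expectation, which is why an uncorrected count of significant tests says little on its own.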

### Robust Hypothesis Testing Using Wasserstein Uncertainty Sets

We develop a novel, computationally efficient, and general framework for robust hypothesis testing. The framework features a new way to construct uncertainty sets under the null and the alternative distributions: the sets are centered around the empirical distribution and defined via the Wasserstein metric, so our approach is data-driven and free of distributional assumptions. We develop a convex safe approximation of the minimax formulation and show that this approximation yields a nearly optimal detector among the family of all possible tests. By exploiting the structure of the least favorable distribution, we also develop a tractable reformulation of this approximation, whose complexity is independent of the dimension of the observation space and can be nearly independent of the sample size in general. A real-data example using human activity data demonstrates the excellent performance of the new robust detector.

### Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely challenging to quantify the uncertainty associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance such as confidence intervals or p-values. We consider here a broad class of regression problems, and propose an efficient algorithm for constructing confidence intervals and p-values.

### Adaptive Active Hypothesis Testing under Limited Information

We consider the problem of active sequential hypothesis testing where a Bayesian decision maker must infer the true hypothesis from a set of hypotheses. The decision maker may choose from a set of actions, where the outcome of an action is corrupted by independent noise. In this paper we consider a special case where the decision maker has limited knowledge about the distribution of observations for each action, in that only a binary value is observed. Our objective is to infer the true hypothesis with low error, while minimizing the number of actions sampled. Our main results include the derivation of a lower bound on sample size for our system under limited knowledge and the design of an active learning policy that matches this lower bound and outperforms similar known algorithms.

### Hypothesis Testing in Unsupervised Domain Adaptation with Applications in Alzheimer's Disease

Our goal is to perform a statistical test checking if $P_{\rm source} = P_{\rm target}$ while removing the distortions induced by the transformations. This problem is closely related to concepts underlying numerous domain adaptation algorithms, and in our case, is motivated by the need to combine clinical and imaging based biomarkers from multiple sites and/or batches, where this problem is fairly common and an impediment in the conduct of analyses with much larger sample sizes. We develop a framework that addresses this problem using ideas from hypothesis testing on the transformed measurements, wherein the distortions need to be estimated *in tandem* with the testing. We derive a simple algorithm and study its convergence and consistency properties in detail, and we also provide lower-bound strategies based on recent work in continuous optimization. On a dataset of individuals at risk for neurological disease, our results are competitive with alternative procedures that are twice as expensive and in some cases operationally infeasible to implement.

### A Novel Kuhnian Ontology for Epistemic Classification of STM Scholarly Articles

Thomas Kuhn proposed his paradigmatic view of scientific discovery five decades ago. The concept of paradigm has not only explained the progress of science, but has also become the central epistemic concept among STM scientists. Here, we adopt the principles of Kuhnian philosophy to construct a novel ontology aimed at classifying and evaluating the impact of STM scholarly articles. First, we explain how the Kuhnian cycle of science describes research at different epistemic stages. Second, we show how the Kuhnian cycle could be reconstructed into modular ontologies which classify scholarly articles according to their contribution to paradigm-centred knowledge. The proposed ontology and its scenarios are discussed. To the best of the authors' knowledge, this is the first attempt at creating an ontology for describing scholarly articles based on the Kuhnian paradigmatic view of science.