 Scientific Discovery


AI Hilbert: A New Paradigm for Scientific Discovery by Unifying Data and Background Knowledge

arXiv.org Artificial Intelligence

The discovery of scientific formulae that parsimoniously explain natural phenomena and align with existing background theory is a key goal in science. Historically, scientists have derived natural laws by manipulating equations based on existing knowledge, forming new equations, and verifying them experimentally. In recent years, data-driven scientific discovery has emerged as a viable competitor in settings with large amounts of experimental data. Unfortunately, data-driven methods often fail to discover valid laws when data is noisy or scarce. Accordingly, recent works combine regression and reasoning to eliminate formulae inconsistent with background theory. However, the problem of searching over the space of formulae consistent with background theory to find one that fits the data best is not well solved. We propose a solution to this problem when all axioms and scientific laws are expressible via polynomial equalities and inequalities and argue that our approach is widely applicable. We further model notions of minimal complexity using binary variables and logical constraints, solve polynomial optimization problems via mixed-integer linear or semidefinite optimization, and prove the validity of our scientific discoveries in a principled manner using Positivstellensatz certificates. Remarkably, the optimization techniques leveraged in this paper allow our approach to run in polynomial time with fully correct background theory, or non-deterministic polynomial (NP) time with partially correct background theory. We demonstrate that some famous scientific laws, including Kepler's Third Law of Planetary Motion, the Hagen-Poiseuille Equation, and the Radiated Gravitational Wave Power equation, can be derived in a principled manner from background axioms and experimental data.
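To make the "polynomial equality" framing concrete, here is a minimal sketch that only checks Kepler's Third Law against data; it is not the optimization method the abstract describes. It assumes approximate textbook values for six planets (semi-major axis in AU, period in years) and a hypothetical helper, `fit_power_law`, that fits the exponent by ordinary least squares in log-log space. In these units the law reads T^2 - C * a^3 = 0 with C close to 1.

```python
import math

# Approximate semi-major axis (AU) and orbital period (years); assumed values.
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def fit_power_law(points):
    """Least-squares fit of log(T) = k * log(a) + c; returns (k, c)."""
    xs = [math.log(a) for a, _ in points]
    ys = [math.log(t) for _, t in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    return k, mean_y - k * mean_x


exponent, intercept = fit_power_law(list(PLANETS.values()))

# Kepler's Third Law as a polynomial equality: T^2 - C * a^3 = 0.
ratios = [t ** 2 / a ** 3 for a, t in PLANETS.values()]

print(f"fitted exponent: {exponent:.3f}")
print(f"T^2 / a^3 range: {min(ratios):.3f} .. {max(ratios):.3f}")
```

A purely data-driven fit like this recovers the exponent but says nothing about consistency with background axioms, which is the gap the abstract says the proposed approach targets.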


Gender-specific warning signs of cardiac arrest are revealed in study: 'New paradigm for prevention'

FOX News

Dr. Craig Basman discusses new life-saving technology and the variables that can predict sudden cardiac events. Half of those who suffer cardiac arrest experience a telling symptom 24 hours before the incident, according to a study recently published in The Lancet Digital Health journal. This warning symptom was different in men and in women, researchers from Smidt Heart Institute found; the institute is located in the Cedars Sinai Medical Center in Los Angeles. For women, shortness of breath was the symptom that preceded an impending cardiac arrest, while for men, chest pain was the prominent complaint. Sweating and seizure-like activity occurred in smaller subgroups of both genders, the researchers noted.


Kochi dementia care center aims to set new paradigm in Japan

The Japan Times

Shinobu Yamanaka apologized the moment this reporter arrived for an interview at a day care facility in the city of Konan, Kochi Prefecture, one muggy morning in July. "Sorry, I had completely forgotten about it," she said with a smile at Day Service Happy, a traditional Japanese-style house converted into a day care center for people with dementia and other health conditions in need of nursing care. "I must leave for another appointment at a local elementary school soon." Yamanaka, a vivacious 46-year-old woman who has her short hair dyed ash blonde, has early-onset Alzheimer's. She often has memory lapses like the one that morning, she later confided.


Kernel Robust Hypothesis Testing

arXiv.org Artificial Intelligence

The problem of robust hypothesis testing is studied, where under the null and the alternative hypotheses, the data-generating distributions are assumed to be in some uncertainty sets, and the goal is to design a test that performs well under the worst-case distributions over the uncertainty sets. In this paper, uncertainty sets are constructed in a data-driven manner using the kernel method, i.e., they are centered around empirical distributions of training samples from the null and alternative hypotheses, respectively; and are constrained via the distance between kernel mean embeddings of distributions in the reproducing kernel Hilbert space, i.e., maximum mean discrepancy (MMD). The Bayesian setting and the Neyman-Pearson setting are investigated. For the Bayesian setting where the goal is to minimize the worst-case error probability, an optimal test is first obtained when the alphabet is finite. When the alphabet is infinite, a tractable approximation is proposed to quantify the worst-case average error probability, and a kernel smoothing method is further applied to design a test that generalizes to unseen samples. A direct robust kernel test is also proposed and proved to be exponentially consistent. For the Neyman-Pearson setting, where the goal is to minimize the worst-case probability of miss detection subject to a constraint on the worst-case probability of false alarm, an efficient robust kernel test is proposed and is shown to be asymptotically optimal. Numerical results are provided to demonstrate the performance of the proposed robust tests. Hypothesis testing is a fundamental problem in statistical inference where the goal is to distinguish among different hypotheses with a small probability of error [3]-[5]. The likelihood ratio test is known to be optimal under different settings, e.g., the Neyman-Pearson setting and the Bayesian setting [3], [5].
For example, for binary hypothesis testing, we compare the likelihood ratio between the two hypotheses with a pre-specified threshold to make the decision.
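The thresholded likelihood ratio described above can be sketched as follows. This is a minimal illustration of the classical, non-robust test only, assuming both hypotheses are fully known Gaussians with a shared standard deviation; the function names, the means, and the threshold are hypothetical choices for the example and do not come from the paper.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def log_likelihood_ratio(samples, mean0, mean1, std):
    """Sum over i.i.d. samples of log p1(x) - log p0(x)."""
    return sum(
        gaussian_log_pdf(x, mean1, std) - gaussian_log_pdf(x, mean0, std)
        for x in samples
    )


def likelihood_ratio_test(samples, mean0=0.0, mean1=1.0, std=1.0,
                          log_threshold=0.0):
    """Return 1 (decide H1) if the log-likelihood ratio exceeds the
    pre-specified threshold, otherwise 0 (decide H0)."""
    llr = log_likelihood_ratio(samples, mean0, mean1, std)
    return 1 if llr > log_threshold else 0


# Toy observations clustered near each hypothesised mean (assumed data).
near_h0 = [-0.2, 0.1, 0.3, -0.4]
near_h1 = [0.9, 1.2, 0.7, 1.4]

print(likelihood_ratio_test(near_h0), likelihood_ratio_test(near_h1))
```

Working in the log domain turns the product of per-sample ratios into a sum, which avoids underflow for long sample sequences. The robust setting in the abstract replaces the two known distributions with uncertainty sets, which this sketch does not attempt.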


Chance discovery helps fight against malaria

BBC News

Scientists at a research facility in Spain, run by the GSK pharmaceutical company, made the discovery after noticing that a colony of mosquitoes being used for drug development had stopped carrying malaria.


The Paradigm Shifts in Artificial Intelligence

arXiv.org Artificial Intelligence

Kuhn's framework of scientific progress (Kuhn, 1962) provides a useful framing of the paradigm shifts that have occurred in Artificial Intelligence over the last 60 years. The framework is also useful in understanding what is arguably a new paradigm shift in AI, signaled by the emergence of large pre-trained systems such as GPT-3, on which conversational agents such as ChatGPT are based. Such systems make intelligence a commoditized general purpose technology that is configurable to applications. In this paper, I summarize the forces that led to the rise and fall of each paradigm, and discuss the pressing issues and risks associated with the current paradigm shift in AI.


Interpretable Machine Learning for Discovery: Statistical Challenges & Opportunities

arXiv.org Artificial Intelligence

Machine learning systems have gained widespread use in science, technology, and society. Given the increasing number of high-stakes machine learning applications and the growing complexity of machine learning models, many have advocated for interpretability and explainability to promote understanding and trust in machine learning results (Rasheed et al., 2022, Toreini et al., 2020, Broderick et al., 2023). In response, there has been a recent explosion of research on Interpretable Machine Learning (IML), mostly focusing on new techniques to interpret black-box systems; see Molnar (2022), Lipton (2018), Guidotti et al. (2018), Doshi-Velez & Kim (2017), Du et al. (2019), Murdoch et al. (2019), Carvalho et al. (2019) for recent reviews of the IML and explainable artificial intelligence literature. While most of these interpretability techniques were not necessarily designed for this purpose, they are increasingly being used to mine large and complex data sets to generate new insights (Roscher et al., 2020). These so-called data-driven discoveries are especially important to advance data-rich fields in science, technology, and medicine. While prior reviews focus mainly on IML techniques, we primarily review how IML methods promote data-driven discoveries, challenges associated with this task, and related new research opportunities at the intersection of machine learning and statistics. In the sciences and beyond, IML techniques are routinely employed to make new discoveries from large and complex data sets; to motivate our review on this topic, we highlight several examples. First, feature importance and feature selection in supervised learning are popular forms of interpretation that have led to major discoveries like discovering new genomic biomarkers of diseases (Guyon et al., 2002), discovering physical laws governing dynamical systems (Brunton et al., 2016), and discovering lesions and other abnormalities in radiology (Borjali et al., 2020, Reyes et al., 2020). 
While most of the IML literature focuses on supervised learning (Molnar, 2022, Lipton, 2018, Guidotti et al., 2018, Doshi-Velez & Kim, 2017), there have been many major scientific discoveries made via unsupervised techniques and we argue that these approaches


A Computational Inflection for Scientific Discovery

Communications of the ACM

We leverage research in natural language processing (NLP), information retrieval, data mining, and human-computer interaction (HCI) and draw concepts from multiple disciplines. For example, efforts in metascience focus on sociological factors that influence the evolution of science [17], such as analyses of information silos that impede mutual understanding and interaction [38], of macro-scale ramifications of the rapid growth in scholarly publications [4], and of current metrics for measuring impact [5], work enabled by digitization of scholarly corpora. Metascience research makes important observations about human biases (desideratum 2) but generally does not engage in building computational interventions to augment researchers (desideratum 1). Conversely, work in literature-based discovery [33] mines information from literature to generate new predictions (for example, functions of materials or drug targets) but is typically done in isolation from cognitive considerations; however, these techniques have great promise in being used as part of human-augmentation systems. Other work uses machines to automate aspects of science.


Cross Modal Data Discovery over Structured and Unstructured Data Lakes

arXiv.org Artificial Intelligence

Organizations are collecting increasingly large amounts of data for data-driven decision making. These data are often dumped into a centralized repository, e.g., a data lake, consisting of thousands of structured and unstructured datasets. Perversely, such a mixture of datasets makes it very challenging to discover elements (e.g., tables or documents) that are relevant to a user's query or an analytical task. Despite recent efforts in data discovery, the problem remains wide open, especially on two fronts: (1) discovering relationships and relatedness across structured and unstructured datasets, where existing techniques either scale poorly, are customized for a specific problem type (e.g., entity matching or data integration), or destroy the structural properties of the data along the way; and (2) developing a holistic system that integrates various similarity measurements and sketches effectively to boost discovery accuracy. In this paper, we propose a new data discovery system, named CMDL, to address these two limitations. CMDL supports the data discovery process over both structured and unstructured data while retaining the structural properties of tables.


LakeBench: Benchmarks for Data Discovery over Data Lakes

arXiv.org Artificial Intelligence

Within enterprises, there is a growing need to intelligently navigate data lakes, specifically focusing on data discovery. Of particular importance to enterprises is the ability to find related tables in data repositories. These tables can be unionable, joinable, or subsets of each other. There is a dearth of benchmarks for these tasks in the public domain, with related work targeting private datasets. In LakeBench, we develop multiple benchmarks for these tasks by using the tables that are drawn from a diverse set of data sources such as government data from CKAN, Socrata, and the European Central Bank. We compare the performance of 4 publicly available tabular foundational models on these tasks. None of the existing models had been trained on the data discovery tasks that we developed for this benchmark; not surprisingly, their performance shows significant room for improvement. The results suggest that the establishment of such benchmarks may be useful to the community to build tabular models usable for data discovery in data lakes.
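The three table relationships named above (unionable, joinable, subset) can be illustrated with a toy sketch. The definitions below are simplified assumptions for illustration and are not LakeBench's benchmark definitions: tables are plain dicts mapping column names to value lists, "unionable" means identical column names, "joinable" means high value containment between two columns, and all function names and the 0.8 threshold are hypothetical.

```python
def column_containment(query_values, candidate_values):
    """Fraction of distinct query values that also occur in the candidate."""
    query = set(query_values)
    if not query:
        return 0.0
    return len(query & set(candidate_values)) / len(query)


def is_joinable(query_col, candidate_col, threshold=0.8):
    """Toy rule: columns are joinable if containment meets the threshold."""
    return column_containment(query_col, candidate_col) >= threshold


def is_unionable(table_a, table_b):
    """Toy rule: tables are unionable if they share the same column names."""
    return set(table_a) == set(table_b)


def table_rows(table):
    """Set of row tuples, with columns taken in sorted-name order."""
    columns = sorted(table)
    return set(zip(*(table[c] for c in columns)))


def is_subset(table_a, table_b):
    """Toy rule: same columns and every row of table_a appears in table_b."""
    return is_unionable(table_a, table_b) and table_rows(table_a) <= table_rows(table_b)


# Hypothetical toy tables.
customers = {"customer_id": [1, 2, 3], "name": ["a", "b", "c"]}
more_customers = {"customer_id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]}
orders = {"order_id": [10, 11, 12, 13], "customer_id": [1, 1, 2, 3]}

print(is_joinable(orders["customer_id"], customers["customer_id"]))
print(is_unionable(customers, more_customers), is_subset(customers, more_customers))
```

Exact set comparisons like these do not scale to a data lake with thousands of tables and ignore semantic matches between differently named columns, which is part of why learned tabular models are being benchmarked for this task.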