AITopics | Scientific Discovery

Collaborating Authors

Scientific Discovery

"The problem of giving rules for producing true scientific statements has been replaced by the problem of finding efficient heuristic rules for culling the reasonable candidates for an explanation from an appropriate set of possible candidates [and finding methods for constructing the candidates]."
– B. Buchanan, quoted in Lindley Darden. Recent Work in Computational Scientific Discovery.

News Overviews Instructional Materials AI-Alerts Classics

Hypothesis Testing in Unsupervised Domain Adaptation with Applications in Alzheimer's Disease †

Neural Information Processing SystemsMar-12-2024, 14:58:35 GMT

This problem is closely related to domain adaptation, and in our case, is motivated by the need to combine clinical and imaging based biomarkers from multiple sites and/or batches - a fairly common impediment in conducting analyses with much larger sample sizes. We address this problem using ideas from hypothesis testing on the transformed measurements, wherein the distortions need to be estimated in tandem with the testing. We derive a simple algorithm and study its convergence and consistency properties in detail, and provide lower-bound strategies based on recent work in continuous optimization. On a dataset of individuals at risk for Alzheimer's disease, our framework is competitive with alternative procedures that are twice as expensive and in some cases operationally infeasible to implement.

adaptation, domain adaptation, transformation, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > New York (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Experimental Study (0.68)

Industry: Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (1.00)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (0.61)

Add feedback

Statistical Inference for Pairwise Graphical Models Using Score Matching

Neural Information Processing SystemsMar-12-2024, 10:29:59 GMT

Probabilistic graphical models have been widely used to model complex systems and aid scientific discoveries. As a result, there is a large body of literature focused on consistent model selection. However, scientists are often interested in understanding uncertainty associated with the estimated parameters, which current literature has not addressed thoroughly. In this paper, we propose a novel estimator for edge parameters for pairwise graphical models based on Hyvärinen scoring rule. Hyvärinen scoring rule is especially useful in cases where the normalizing constant cannot be obtained efficiently in a closed form.

estimator, graphical model, procedure, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Illinois > Cook County > Chicago (0.05)
North America > United States > New York (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre: Research Report (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.52)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (0.34)

Add feedback

Data-driven Discovery with Large Generative Models

Majumder, Bodhisattwa Prasad, Surana, Harshit, Agarwal, Dhruv, Hazra, Sanchaita, Sabharwal, Ashish, Clark, Peter

arXiv.org Artificial IntelligenceFeb-21-2024

With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery -- a paradigm encompassing the search and verification of hypotheses purely from a set of provided datasets, without the need for additional data collection or physical experiments. We first outline several desiderata for an ideal data-driven discovery system. Then, through DATAVOYAGER, a proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of these desiderata -- a feat previously unattainable -- while also highlighting important limitations in the current system that open up opportunities for novel ML research. We contend that achieving accurate, reliable, and robust end-to-end discovery systems solely through the current capabilities of LGMs is challenging. We instead advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms, to foster data-driven scientific discoveries with efficiency and reproducibility.

corpusid, data-driven discovery, semanticscholar, (14 more...)

arXiv.org Artificial Intelligence

2402.1361

Country:

North America > United States > Utah (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Europe > Monaco (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.46)
Law > Intellectual Property & Technology Law (0.46)
Information Technology > Services (0.46)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
(3 more...)

Add feedback

A unified Bayesian framework for interval hypothesis testing in clinical trials

Chakraborty, Abhisek, Murray, Megan H., Lipkovich, Ilya, Du, Yu

arXiv.org Machine LearningFeb-21-2024

The American Statistical Association (ASA) statement on statistical significance and P-values \cite{wasserstein2016asa} cautioned statisticians against making scientific decisions solely on the basis of traditional P-values. The statement delineated key issues with P-values, including a lack of transparency, an inability to quantify evidence in support of the null hypothesis, and an inability to measure the size of an effect or the importance of a result. In this article, we demonstrate that the interval null hypothesis framework (instead of the point null hypothesis framework), when used in tandem with Bayes factor-based tests, is instrumental in circumnavigating the key issues of P-values. Further, we note that specifying prior densities for Bayes factors is challenging and has been a reason for criticism of Bayesian hypothesis testing in existing literature. We address this by adapting Bayes factors directly based on common test statistics. We demonstrate, through numerical experiments and real data examples, that the proposed Bayesian interval hypothesis testing procedures can be calibrated to ensure frequentist error control while retaining their inherent interpretability. Finally, we illustrate the improved flexibility and applicability of the proposed methods by providing coherent frameworks for competitive landscape analysis and end-to-end Bayesian hypothesis tests in the context of reporting clinical trial outcomes.

hypothesis, insulin glargine, posterior probability, (14 more...)

arXiv.org Machine Learning

2402.1389

Country:

North America > United States > Washington > King County > Seattle (0.14)
North America > United States > Texas > Brazos County > College Station (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > Indiana > Marion County > Indianapolis (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (0.81)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.50)

Add feedback

Quantitative causality, causality-guided scientific discovery, and causal machine learning

Liang, X. San, Chen, Dake, Zhang, Renhe

arXiv.org Artificial IntelligenceFeb-20-2024

It has been said, arguably, that causality analysis should pave a promising way to interpretable deep learning and generalization. Incorporation of causality into artificial intelligence (AI) algorithms, however, is challenged with its vagueness, non-quantitiveness, computational inefficiency, etc. During the past 18 years, these challenges have been essentially resolved, with the establishment of a rigorous formalism of causality analysis initially motivated from atmospheric predictability. This not only opens a new field in the atmosphere-ocean science, namely, information flow, but also has led to scientific discoveries in other disciplines, such as quantum mechanics, neuroscience, financial economics, etc., through various applications. This note provides a brief review of the decade-long effort, including a list of major theoretical results, a sketch of the causal deep learning framework, and some representative real-world applications in geoscience pertaining to this journal, such as those on the anthropogenic cause of global warming, the decadal prediction of El Niño Modoki, the forecasting of an extreme drought in China, among others. Keywords: Causality, Liang-Kleeman information flow, Causal artificial intelligence, Fuzzy cognitive map, Interpretability, Frobenius-Perron operator, Weather/Climate forecasting 1. Introduction Causality analysis is a fundamental problem in scientific research, as commented by Einstein in 1953 in response to a question on the status quo of science in China at that time (cf. the historical record in Hu, 2005).The recent rush in artificial intelligence (AI) has stimulated enormous interest in causal inference, partly due to the realization that it may take the field to the next level to approach human intelligence (see Pearl, 2018; Bengio, 2019; Schölkopf, 2022). In the fields pertaining to this journal, assessment of the cause-effect relations between dynamic events makes a natural objective for the corresponding researches.

causality, information flow, liang, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.34133/olar.0026

2402.13427

Country:

Pacific Ocean > North Pacific Ocean > South China Sea (0.05)
Antarctica (0.04)
North America > United States > New York > New York County > New York City (0.04)
(6 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.89)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.75)
Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (0.61)

Add feedback

Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data

Liu, Haoyang, Li, Yijiang, Jian, Jinglin, Cheng, Yuxuan, Lu, Jianrong, Guo, Shuyi, Zhu, Jinglei, Zhang, Mianchen, Zhang, Miantong, Wang, Haohan

arXiv.org Artificial IntelligenceFeb-20-2024

Machine learning has emerged as a powerful tool for scientific discovery, enabling researchers to extract meaningful insights from complex datasets. For instance, it has facilitated the identification of disease-predictive genes from gene expression data, significantly advancing healthcare. However, the traditional process for analyzing such datasets demands substantial human effort and expertise for the data selection, processing, and analysis. To address this challenge, we introduce a novel framework, a Team of AI-made Scientists (TAIS), designed to streamline the scientific discovery pipeline. TAIS comprises simulated roles, including a project manager, data engineer, and domain expert, each represented by a Large Language Model (LLM). These roles collaborate to replicate the tasks typically performed by data scientists, with a specific focus on identifying disease-predictive genes. Furthermore, we have curated a benchmark dataset to assess TAIS's effectiveness in gene identification, demonstrating our system's potential to significantly enhance the efficiency and scope of scientific exploration. Our findings represent a solid step towards automating scientific discovery through large language models.

arxiv preprint arxiv, cancer, data engineer, (13 more...)

arXiv.org Artificial Intelligence

2402.12391

Country:

North America > United States > Illinois (0.04)
Europe > Poland > Greater Poland Province > Poznań (0.04)
Europe > Netherlands (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Gastroenterology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

A New Paradigm for Counterfactual Reasoning in Fairness and Recourse

Bynum, Lucius E. J., Loftus, Joshua R., Stoyanovich, Julia

arXiv.org Artificial IntelligenceJan-24-2024

Counterfactuals and counterfactual reasoning underpin numerous techniques for auditing and understanding artificial intelligence (AI) systems. The traditional paradigm for counterfactual reasoning in this literature is the interventional counterfactual, where hypothetical interventions are imagined and simulated. For this reason, the starting point for causal reasoning about legal protections and demographic data in AI is an imagined intervention on a legally-protected characteristic, such as ethnicity, race, gender, disability, age, etc. We ask, for example, what would have happened had your race been different? An inherent limitation of this paradigm is that some demographic interventions -- like interventions on race -- may not translate into the formalisms of interventional counterfactuals. In this work, we explore a new paradigm based instead on the backtracking counterfactual, where rather than imagine hypothetical interventions on legally-protected characteristics, we imagine alternate initial conditions while holding these characteristics fixed. We ask instead, what would explain a counterfactual outcome for you as you actually are or could be? This alternate framework allows us to address many of the same social concerns, but to do so while asking fundamentally different questions that do not rely on demographic interventions.

counterfactual, fairness, intervention, (15 more...)

arXiv.org Artificial Intelligence

2401.13935

Country: North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.64)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (0.60)
Information Technology > Artificial Intelligence > Cognitive Science > Creativity & Intelligence (0.60)

Add feedback

When Graph Data Meets Multimodal: A New Paradigm for Graph Understanding and Reasoning

Ai, Qihang, Zhou, Jianwu, Jiang, Haiyun, Liu, Lemao, Shi, Shuming

arXiv.org Artificial IntelligenceDec-16-2023

Graph data is ubiquitous in the physical world, and it has always been a challenge to efficiently model graph structures using a unified paradigm for the understanding and reasoning on various graphs. Moreover, in the era of large language models, integrating complex graph information into text sequences has become exceptionally difficult, which hinders the ability to interact with graph data through natural language instructions.The paper presents a new paradigm for understanding and reasoning about graph data by integrating image encoding and multimodal technologies. This approach enables the comprehension of graph data through an instruction-response format, utilizing GPT-4V's advanced capabilities. The study evaluates this paradigm on various graph types, highlighting the model's strengths and weaknesses, particularly in Chinese OCR performance and complex reasoning tasks. The findings suggest new direction for enhancing graph data processing and natural language interaction.

gpt-4v, graph, instruction, (15 more...)

arXiv.org Artificial Intelligence

2312.10372

Country:

North America > United States > Nevada > Clark County > Las Vegas (0.04)
North America > United States > California (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Genre: Research Report > New Finding (0.88)

Industry: Information Technology (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (0.60)
Information Technology > Artificial Intelligence > Cognitive Science > Creativity & Intelligence (0.60)
(2 more...)

Add feedback

Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints

Pensia, Ankit, Asadi, Amir R., Jog, Varun, Loh, Po-Ling

arXiv.org Machine LearningDec-15-2023

We study simple binary hypothesis testing under both local differential privacy (LDP) and communication constraints. We qualify our results as either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hypothesis testing under pure LDP constraints, we establish instance-optimal bounds for distributions with binary support; minimax-optimal bounds for general distributions; and (approximately) instance-optimal, computationally efficient algorithms for general distributions. When both privacy and communication constraints are present, we develop instance-optimal, computationally efficient algorithms that achieve the minimum possible sample complexity (up to universal constants). Our results on instance-optimal algorithms hinge on identifying the extreme points of the joint range set $\mathcal A$ of two distributions $p$ and $q$, defined as $\mathcal A := \{(\mathbf T p, \mathbf T q) | \mathbf T \in \mathcal C\}$, where $\mathcal C$ is the set of channels characterizing the constraints.

constraint, extreme point, sample complexity, (14 more...)

arXiv.org Machine Learning

2301.03566

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (0.82)

Add feedback

Communication-constrained hypothesis testing: Optimality, robustness, and reverse data processing inequalities

Pensia, Ankit, Jog, Varun, Loh, Po-Ling

arXiv.org Machine LearningDec-15-2023

We study hypothesis testing under communication constraints, where each sample is quantized before being revealed to a statistician. Without communication constraints, it is well known that the sample complexity of simple binary hypothesis testing is characterized by the Hellinger distance between the distributions. We show that the sample complexity of simple binary hypothesis testing under communication constraints is at most a logarithmic factor larger than in the unconstrained setting and this bound is tight. We develop a polynomial-time algorithm that achieves the aforementioned sample complexity. Our framework extends to robust hypothesis testing, where the distributions are corrupted in the total variation distance. Our proofs rely on a new reverse data processing inequality and a reverse Markov inequality, which may be of independent interest. For simple $M$-ary hypothesis testing, the sample complexity in the absence of communication constraints has a logarithmic dependence on $M$. We show that communication constraints can cause an exponential blow-up leading to $\Omega(M)$ sample complexity even for adaptive algorithms.

artificial intelligence, inequality, scientific discovery, (16 more...)

arXiv.org Machine Learning

2206.02765

Country:

North America > United States > New York > New York County > New York City (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Middle East > Jordan (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry: Information Technology > Software (0.61)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (1.00)

Add feedback