Scientific Discovery
Reviews: Nonzero-sum Adversarial Hypothesis Testing Games
The paper proposes a new adversarial framework for hypothesis testing in a game-theoretic setup. The main positives are that the formulation bridges many fields, including statistics, property testing, and game theory, and has the potential to inspire much future work. The theoretical results are reasonable but somewhat unsurprising.
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
Yan, Yibo, Wang, Shen, Huo, Jiahao, Ye, Jingheng, Chu, Zhendong, Hu, Xuming, Yu, Philip S., Gomes, Carla, Selman, Bart, Wen, Qingsong
Scientific reasoning, the process through which humans apply logic, evidence, and critical thinking to explore and interpret scientific phenomena, is essential in advancing knowledge reasoning across diverse fields. However, despite significant progress, current scientific reasoning models still struggle with generalization across domains and often fall short of multimodal perception. Multimodal Large Language Models (MLLMs), which integrate text, images, and other modalities, present an exciting opportunity to overcome these limitations and enhance scientific reasoning. Therefore, this position paper argues that MLLMs can significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology. First, we propose a four-stage research roadmap of scientific reasoning capabilities, and highlight the current state of MLLM applications in scientific reasoning, noting their ability to integrate and reason over diverse data types. Second, we summarize the key challenges that remain obstacles to achieving MLLM's full potential. To address these challenges, we propose actionable insights and suggestions for the future. Overall, our work offers a novel perspective on MLLM integration with scientific reasoning, providing the LLM community with a valuable vision for achieving Artificial General Intelligence (AGI).
The Query/Hit Model for Sequential Hypothesis Testing
Shariatnasab, Mahshad, Rini, Stefano, Shirani, Farhad, Iyengar, S. Sitharama
This work introduces the Query/Hit (Q/H) learning model. The setup consists of two agents. One agent, Alice, has access to a streaming source, while the other, Bob, does not have direct access to the source. Communication occurs through sequential Q/H pairs: Bob sends a sequence of source symbols (queries), and Alice responds with the waiting time until each query appears in the source stream (hits). This model is motivated by scenarios with communication, computation, and privacy constraints that limit real-time access to the source. The error exponent for sequential hypothesis testing under the Q/H model is characterized, and a querying strategy, the Dynamic Scout-Sentinel Algorithm (DSSA), is proposed. The strategy employs a mutual information neural estimator to compute the error exponent associated with each query and to select the query with the highest efficiency. Extensive empirical evaluations on both synthetic and real-world datasets -- including mouse movement trajectories, typesetting patterns, and touch-based user interactions -- are provided to evaluate the performance of the proposed strategy in comparison with baselines, in terms of probability of error, query choice, and time-to-detection.
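The Q/H exchange described above can be illustrated with a minimal sketch. It assumes that queries are served one after another and that the waiting-time clock restarts after each hit; the abstract does not spell out these details, and the function name `hit_times` is hypothetical.

```python
from typing import Iterable, Iterator, List, Optional


def hit_times(stream: Iterable[str], queries: List[str]) -> List[Optional[int]]:
    """Answer each query with the waiting time until it next appears in the stream.

    Assumption (not from the abstract): queries are served sequentially and the
    clock restarts after each hit. None means the stream ended before the
    queried symbol appeared.
    """
    source: Iterator[str] = iter(stream)  # Alice's view of the streaming source
    hits: List[Optional[int]] = []
    for query in queries:  # Bob's queries, in the order he sent them
        wait = 0
        found = False
        for symbol in source:  # resumes where the previous query stopped
            wait += 1
            if symbol == query:
                found = True
                break
        hits.append(wait if found else None)
    return hits


if __name__ == "__main__":
    # Bob never sees the stream itself, only the waiting times.
    print(hit_times("abcabc", ["c", "a", "c"]))
```

Bob observes only the returned waiting times, which is the information a sequential test in this model would have to work from.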
International AI Safety Report
Bengio, Yoshua, Mindermann, Sören, Privitera, Daniel, Besiroglu, Tamay, Bommasani, Rishi, Casper, Stephen, Choi, Yejin, Fox, Philip, Garfinkel, Ben, Goldfarb, Danielle, Heidari, Hoda, Ho, Anson, Kapoor, Sayash, Khalatbari, Leila, Longpre, Shayne, Manning, Sam, Mavroudis, Vasilios, Mazeika, Mantas, Michael, Julian, Newman, Jessica, Ng, Kwan Yee, Okolo, Chinasa T., Raji, Deborah, Sastry, Girish, Seger, Elizabeth, Skeadas, Theodora, South, Tobin, Strubell, Emma, Tramèr, Florian, Velasco, Lucia, Wheeler, Nicole, Acemoglu, Daron, Adekanmbi, Olubayo, Dalrymple, David, Dietterich, Thomas G., Felten, Edward W., Fung, Pascale, Gourinchas, Pierre-Olivier, Heintz, Fredrik, Hinton, Geoffrey, Jennings, Nick, Krause, Andreas, Leavy, Susan, Liang, Percy, Ludermir, Teresa, Marda, Vidushi, Margetts, Helen, McDermid, John, Munga, Jane, Narayanan, Arvind, Nelson, Alondra, Neppel, Clara, Oh, Alice, Ramchurn, Gopal, Russell, Stuart, Schaake, Marietje, Schölkopf, Bernhard, Song, Dawn, Soto, Alvaro, Tiedrich, Lee, Varoquaux, Gaël, Yao, Andrew, Zhang, Ya-Qin, Albalawi, Fahad, Alserkal, Marwan, Ajala, Olubunmi, Avrin, Guillaume, Busch, Christian, de Carvalho, André Carlos Ponce de Leon Ferreira, Fox, Bronwyn, Gill, Amandeep Singh, Hatip, Ahmet Halit, Heikkilä, Juha, Jolly, Gill, Katzir, Ziv, Kitano, Hiroaki, Krüger, Antonio, Johnson, Chris, Khan, Saif M., Lee, Kyoung Mu, Ligot, Dominic Vincent, Molchanovskyi, Oleksii, Monti, Andrea, Mwamanzi, Nusu, Nemer, Mona, Oliver, Nuria, Portillo, José Ramón López, Ravindran, Balaraman, Rivera, Raquel Pezoa, Riza, Hammam, Rugege, Crystal, Seoighe, Ciarán, Sheehan, Jerry, Sheikh, Haroon, Wong, Denise, Zeng, Yi
I am honoured to present the International AI Safety Report. It is the work of 96 international AI experts who collaborated in an unprecedented effort to establish an internationally shared scientific understanding of risks from advanced AI and methods for managing them. We embarked on this journey just over a year ago, shortly after the countries present at the Bletchley Park AI Safety Summit agreed to support the creation of this report. Since then, we published an Interim Report in May 2024, which was presented at the AI Seoul Summit. We are now pleased to publish the present, full report ahead of the AI Action Summit in Paris in February 2025. Since the Bletchley Summit, the capabilities of general-purpose AI, the type of AI this report focuses on, have increased further. For example, new models have shown markedly better performance at tests of programming and scientific reasoning.
Reviews: Hypothesis Testing in Unsupervised Domain Adaptation with Applications in Alzheimer's Disease
The paper presents an interesting and smart way of handling covariate shift: it aims to make the two distributions indistinguishable by minimizing MMD. However, the paper could benefit from more clarity and completeness in order to have an impact. In terms of the applicability of this approach, the authors talk about the importance of being able to perform statistical tests and not just optimize the performance of a classifier. The bounds that they derive for their statistical test are useful for knowing how large the sample size should be to perform an appropriate shift. However, this doesn't say much about how it affects the scientific questions asked in the experiment, which need different statistical tests.
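For readers unfamiliar with the quantity being minimized, the following is a minimal sketch of the biased (V-statistic) estimate of squared MMD between two samples. The Gaussian RBF kernel, the fixed bandwidth, and all function names are assumptions made for illustration; the review does not say which kernel or estimator the paper uses.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf(a: Vector, b: Vector, bandwidth: float) -> float:
    """Gaussian RBF kernel value between two points (assumed kernel choice)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector], bandwidth: float) -> float:
    """Average kernel value over all pairs (x, y), including x == y pairs."""
    total = sum(rbf(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2_biased(xs: Sequence[Vector], ys: Sequence[Vector], bandwidth: float = 1.0) -> float:
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


if __name__ == "__main__":
    source = [(0.0,), (1.0,), (2.0,)]
    target = [(5.0,), (6.0,), (7.0,)]
    print(mmd2_biased(source, source))  # identical samples
    print(mmd2_biased(source, target))  # well-separated samples
```

The estimate is zero when the two samples coincide and grows as they separate, which is why driving it down makes the source and target distributions harder to tell apart.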
Reviews: Robust Hypothesis Testing Using Wasserstein Uncertainty Sets
The rebuttal addressed my technical concerns, and I also seem to have misjudged the size of the contributions at first. My score has been updated. This paper studies the two-sample non-parametric hypothesis testing problem. Given two collections of probability distributions, the paper studies approximating the best detector against the worst distributions from both collections. A standard surrogate loss approximation is used to upper bound the worst-case risk (the maximum of the type I and type II errors) with a convex surrogate function, which is known to yield a good solution.
Computational Discovery of Chiasmus in Ancient Religious Text
McGovern, Hope, Sirin, Hale, Lippincott, Tom
Chiasmus, a debated literary device in Biblical texts, has captivated mystics while sparking ongoing scholarly discussion. In this paper, we introduce the first computational approach to systematically detect chiasmus within Biblical passages. Our method leverages neural embeddings to capture lexical and semantic patterns associated with chiasmus, applied at multiple levels of textual granularity (half-verses, verses). We also involve expert annotators to review a subset of the detected patterns. Despite its computational efficiency, our method achieves robust results, with high inter-annotator agreement and system precision@k of 0.80 at the verse level and 0.60 at the half-verse level. We further provide a qualitative analysis of the distribution of detected chiasmi, along with selected examples that highlight the effectiveness of our approach.
A unified framework for bandit multiple testing
In bandit multiple hypothesis testing, each arm corresponds to a different null hypothesis that we wish to test, and the goal is to design adaptive algorithms that correctly identify a large set of interesting arms (true discoveries), while only mistakenly identifying a few uninteresting ones (false discoveries). One common metric in non-bandit multiple testing is the false discovery rate (FDR). We propose a unified, modular framework for bandit FDR control that emphasizes the decoupling of exploration and summarization of evidence. We utilize the powerful martingale-based concept of "e-processes" to ensure FDR control for arbitrary composite nulls, exploration rules and stopping times in generic problem settings. In particular, valid FDR control holds even if the reward distributions of the arms could be dependent, multiple arms may be queried simultaneously, and multiple (cooperating or competing) agents may be querying arms, covering combinatorial semi-bandit type settings as well.
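To make the "summarization of evidence" step concrete, here is a minimal sketch of the e-BH procedure, a Benjamini-Hochberg analogue that turns one e-value per arm into a rejection set at target FDR level alpha. The abstract does not state which rejection rule the framework applies, so treating e-BH as that rule is an assumption, and the function name `e_bh` is hypothetical.

```python
from typing import List, Sequence


def e_bh(e_values: Sequence[float], alpha: float) -> List[int]:
    """Return indices of rejected nulls under the e-BH rule.

    With K hypotheses and e-values sorted in decreasing order, find the largest
    rank k whose k-th largest e-value is at least K / (alpha * k), then reject
    the hypotheses holding the k largest e-values.
    """
    k_total = len(e_values)
    # Arm indices ordered from strongest to weakest evidence against the null.
    order = sorted(range(k_total), key=lambda i: e_values[i], reverse=True)
    num_rejected = 0
    for rank, idx in enumerate(order, start=1):
        if e_values[idx] >= k_total / (alpha * rank):
            num_rejected = rank  # keep the largest rank that passes
    return sorted(order[:num_rejected])


if __name__ == "__main__":
    # Hypothetical e-values for four arms at the moment sampling stops.
    print(e_bh([50.0, 1.0, 25.0, 0.5], alpha=0.1))
```

In a bandit setting the inputs would be the current values of each arm's e-process at the stopping time; the numbers above are made up for illustration.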
Blob-Headed Fish, Meat-Eating Squirrels, and Other Fascinating Science Stories From 2024
So much of this year felt like a fever dream: The attempted assassination of Donald Trump. Which is why, this year, I'm leaning into my nerdish tendencies and rounding up some good, interesting, or inspiring news stories from the science world--promising discoveries, exciting new data, historic events, and unsung heroes. In the hope of providing relief from the hell that has been 2024, here's a non-comprehensive list of the year's coolest science stories, both big and small: Wildlife filmmaker Carlos Gauna and University of California, Riverside, PhD student Phillip Sternes spotted what appears to be a baby great white shark off the coast of California last year. In January, the team published the photos in the journal Environmental Biology of Fishes. "Where white sharks give birth is one of the holy grails of shark science. No one has ever been able to pinpoint where they are born, nor has anyone seen a newborn baby shark alive," Gauna said in a UC Riverside press release.
Towards Strong AI: Transformational Beliefs and Scientific Creativity
Eschker, Samuel J., Liu, Chuanhai
Strong artificial intelligence (AI) is envisioned to possess general cognitive abilities and scientific creativity comparable to human intelligence, encompassing both knowledge acquisition and problem-solving. While remarkable progress has been made in weak AI, the realization of strong AI remains a topic of intense debate and critical examination. In this paper, we explore pivotal innovations in the history of astronomy and physics, focusing on the discovery of Neptune and the concept of scientific revolutions as perceived by philosophers of science. Building on these insights, we introduce a simple theoretical and statistical framework of weak beliefs, termed the Transformational Belief (TB) framework, designed as a foundation for modeling scientific creativity. Through selected illustrative examples in statistical science, we demonstrate the TB framework's potential as a promising foundation for understanding, analyzing, and even fostering creativity -- paving the way toward the development of strong AI. We conclude with reflections on future research directions and potential advancements.