Symbolic Doomsday Clock moves closer to midnight amid 'catastrophic risks'

Al Jazeera

The world is closer than ever to destruction, scientists have said, as the Doomsday Clock was set at 85 seconds to midnight for 2026, the gloomiest assessment of humanity's prospects since the beginning of the tradition in 1947. The Bulletin of the Atomic Scientists, a not-for-profit organisation founded by Albert Einstein and other scientists, warned in its annual assessment on Tuesday that international cooperation is going backwards on nuclear weapons, climate change and biotechnology, while artificial intelligence poses new threats. "The Doomsday Clock's message cannot be clearer. Catastrophic risks are on the rise, cooperation is on the decline, and we are running out of time," said Alexandra Bell, the president and CEO of the Bulletin of the Atomic Scientists. In a more detailed statement explaining the reasoning for moving the clock closer to midnight, the bulletin expressed concerns that countries including Russia, China, and the United States were becoming "increasingly aggressive, adversarial, and nationalistic".


Trump Warned of a Tren de Aragua 'Invasion.' US Intel Told a Different Story

WIRED

Hundreds of records obtained by WIRED show thin intelligence on the Venezuelan gang in the United States, describing fragmented, low-level crime rather than a coordinated terrorist threat. Alleged members of Tren de Aragua sit handcuffed during a preliminary hearing on July 9, 2025, in Santiago, Chile, where they faced homicide charges. As the Trump administration publicly cast Venezuela's Tren de Aragua (TdA) as a unified terrorist force tied to President Nicolás Maduro and operating inside the United States, hundreds of internal US government records obtained by WIRED tell a far less certain story. Intelligence taskings, law-enforcement bulletins, and drug-task-force assessments show that agencies spent much of 2025 struggling to determine whether TdA even functioned as an organized entity in the US at all, let alone as a coordinated national security threat.


LOVA3: Learning to Visual Question Answering, Asking and Assessment

Neural Information Processing Systems

Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. By enhancing these capabilities, humans can more effectively utilize data, leading to better comprehension and learning outcomes. However, current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills. In this study, we introduce LOVA3, an innovative framework named "Learning tO Visual Question Answering, Asking and Assessment," designed to equip MLLMs with these additional capabilities. Our approach involves the creation of two supplementary training tasks, GenQA and EvalQA, aimed at fostering the skills of asking and assessing questions in the context of images.


Bias and Volatility: A Statistical Framework for Evaluating Large Language Model's Stereotypes and the Associated Generation Inconsistency

Neural Information Processing Systems

We present a novel statistical framework for analyzing stereotypes in large language models (LLMs) by systematically estimating the bias and variation in their generation. Current evaluation metrics in the alignment literature often overlook the randomness of stereotypes caused by the inconsistent generative behavior of LLMs. For example, this inconsistency can result in LLMs displaying contradictory stereotypes, including those related to gender or race, for identical professions across varied contexts. Neglecting such inconsistency could lead to misleading conclusions in alignment evaluations and hinder the accurate assessment of the risk of LLM applications perpetuating or amplifying social stereotypes and unfairness. This work proposes a Bias-Volatility Framework (BVF) that estimates the probability distribution function of LLM stereotypes. Specifically, since the stereotype distribution fully captures an LLM's generation variation, BVF enables the assessment of both the likelihood and extent to which its outputs are against vulnerable groups, thereby allowing for the quantification of the LLM's aggregated discrimination risk. Furthermore, we introduce a mathematical framework to decompose an LLM's aggregated discrimination risk into two components: bias risk and volatility risk, originating from the mean and variation of the LLM's stereotype distribution, respectively. We apply BVF to assess 12 commonly adopted LLMs and compare their risk levels. Our findings reveal that: i) Bias risk is the primary cause of discrimination risk in LLMs; ii) Most LLMs exhibit significant pro-male stereotypes for nearly all careers; iii) Alignment with reinforcement learning from human feedback lowers discrimination by reducing bias, but increases volatility; iv) Discrimination risk in LLMs correlates with key socio-economic factors like professional salaries.
Finally, we emphasize that BVF can also be used to assess the impact of generation inconsistency on LLM behavior beyond stereotypes, such as knowledge mastery.
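The mean/variation decomposition described in the abstract can be sketched numerically. This is a minimal illustration, not the paper's actual estimator: the stereotype scores below are invented, and the aggregation rule (mean squared score) is chosen only because it decomposes exactly into a squared-mean term plus a variance term.

```python
import numpy as np

# Hypothetical stereotype scores in [-1, 1] for one profession, sampled by
# regenerating the LLM's output across varied contexts. The values and the
# sign convention (positive = pro-male) are made up for illustration.
scores = np.array([0.6, 0.4, 0.7, -0.1, 0.5, 0.3, 0.8, 0.2])

# Bias risk: driven by the mean of the stereotype distribution.
bias_risk = scores.mean()

# Volatility risk: driven by the variation around that mean.
volatility_risk = scores.std(ddof=0)

# One simple aggregation: the mean squared score, which decomposes exactly
# into bias^2 + volatility^2 (the standard mean-variance decomposition).
aggregated_risk = np.mean(scores ** 2)

assert np.isclose(aggregated_risk, bias_risk ** 2 + volatility_risk ** 2)
print(f"bias={bias_risk:.3f} volatility={volatility_risk:.3f} "
      f"aggregated={aggregated_risk:.3f}")
```

Under this toy aggregation, an aligned model that drives the mean toward zero can still carry risk through a wide spread of scores, matching the paper's observation that RLHF reduces bias but increases volatility.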


BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity

Neural Information Processing Systems

We introduce BioTrove, the largest publicly accessible dataset designed to advance AI applications in biodiversity. Curated from the iNaturalist platform and vetted to include only research-grade data, BioTrove contains 161.9 million images, offering unprecedented scale and diversity from three primary kingdoms: Animalia (animals), Fungi (fungi), and Plantae (plants), spanning approximately 366.6K species. Each image is annotated with scientific names, taxonomic hierarchies, and common names, providing rich metadata to support accurate AI model development across diverse species and ecosystems. We demonstrate the value of BioTrove by releasing a suite of CLIP models trained using a subset of 40 million captioned images, known as BioTrove-Train. This subset focuses on seven categories within the dataset that are underrepresented in standard image recognition models, selected for their critical role in biodiversity and agriculture: Aves (birds), Arachnida (spiders/ticks/mites), Insecta (insects), Plantae (plants), Fungi (fungi), Mollusca (snails), and Reptilia (snakes/lizards). To support rigorous assessment, we introduce several new benchmarks and report model accuracy for zero-shot learning across life stages, rare species, confounding species, and multiple taxonomic levels. We anticipate that BioTrove will spur the development of AI models capable of supporting digital tools for pest control, crop monitoring, biodiversity assessment, and environmental conservation. These advancements are crucial for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. BioTrove is publicly available, easily accessible, and ready for immediate use.
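The zero-shot evaluation the abstract refers to follows the standard CLIP recipe: embed the image and one text prompt per candidate taxon, then pick the label whose prompt is most similar. The sketch below illustrates only that similarity step; the 4-dimensional embeddings are stand-ins for real encoder outputs, and the prompt template is an assumption, not BioTrove's actual one.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """CLIP-style zero-shot step: L2-normalize both sides, compute cosine
    similarities, and return the best-scoring label with all scores."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img
    return labels[int(np.argmax(sims))], sims

# Toy embeddings; in practice each row would come from encoding a prompt
# such as "a photo of {scientific name}" for every candidate taxon.
labels = ["Aves", "Insecta", "Fungi"]
text_embs = np.array([[1.0, 0.1, 0.0, 0.0],
                      [0.0, 1.0, 0.2, 0.0],
                      [0.0, 0.0, 1.0, 0.3]])
image_emb = np.array([0.1, 0.9, 0.3, 0.0])  # closest to the Insecta prompt

best, sims = zero_shot_classify(image_emb, text_embs, labels)
print(best)  # Insecta
```

Because the label set is just a list of prompts, the same code evaluates any taxonomic level (kingdom, family, species) by swapping in different prompt lists, which is what makes the multi-level benchmarks in the abstract possible without retraining.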


Is a Modular Architecture Enough?

Neural Information Processing Systems

Inspired by human cognition, machine learning systems are gradually revealing advantages of sparser and more modular architectures. Recent work demonstrates that not only do some modular architectures generalize well, but they also lead to better out-of-distribution generalization, scaling properties, learning speed, and interpretability. A key intuition behind the success of such systems is that the data-generating process for most real-world settings is considered to consist of sparse modular connections, and endowing models with similar inductive biases will be helpful. However, the field has lacked a rigorous quantitative assessment of such systems because these real-world data distributions are complex and unknown. In this work, we provide a thorough assessment of common modular architectures through the lens of simple and known modular data distributions. We highlight the benefits of modularity and sparsity and reveal insights on the challenges faced while optimizing modular systems. In doing so, we propose evaluation metrics that highlight the benefits of modularity, the regimes in which these benefits are substantial, and the sub-optimality of current end-to-end learned modular systems as opposed to their claimed potential.
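A "simple and known modular data distribution" of the kind the abstract describes can be made concrete with a toy generator. This is a hedged illustration of the general idea, not the paper's actual benchmark: a context variable selects which of K ground-truth linear rules produces each output, so a perfectly modular model would specialize one module per rule and route on the context.

```python
import numpy as np

rng = np.random.default_rng(0)

# K ground-truth "modules": one linear rule each, over D-dimensional inputs.
K, D = 3, 4
W = rng.normal(size=(K, D))

def sample(n):
    """Draw n points: a context c picks the module, which maps x to y."""
    x = rng.normal(size=(n, D))
    c = rng.integers(K, size=n)          # which rule generated each point
    y = np.einsum("nd,nd->n", x, W[c])   # y_i = <x_i, w_{c_i}>
    return x, c, y

x, c, y = sample(5)
# Since the generator is fully known, a model's routing and per-module fit
# can be scored exactly against (c, W) -- the key advantage over real data.
assert np.allclose(y, np.einsum("nd,nd->n", x, W[c]))
```

Because both the true routing `c` and the true rules `W` are observable here, metrics like "did the learned model assign one expert per rule" are directly computable, which is exactly what unknown real-world distributions prevent.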


PEACE: A Dataset of Pharmaceutical Care for Cancer Pain Analgesia Evaluation and Medication Decision

Neural Information Processing Systems

Over half of cancer patients experience long-term pain management challenges. Recently, interest has grown in systems for cancer pain treatment effectiveness assessment (TEA) and medication recommendation (MR) to optimize pharmacological care. These systems aim to improve treatment effectiveness by recommending personalized medication plans based on comprehensive patient information. Despite progress, current systems lack multidisciplinary treatment (MDT) team assessments of treatment and the patient's perception of medication, crucial for effective cancer pain management. Moreover, managing cancer pain medication requires multiple adjustments to the treatment plan based on the patient's evolving condition, a detail often missing in existing datasets.


BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

Neural Information Processing Systems

AI models are increasingly prevalent in high-stakes environments, necessitating thorough assessment of their capabilities and risks. Benchmarks are popular for measuring these attributes and for comparing model performance, tracking progress, and identifying weaknesses in foundation and non-foundation models. They can inform model selection for downstream tasks and influence policy initiatives. However, not all benchmarks are the same: their quality depends on their design and usability. In this paper, we develop an assessment framework considering 40 best practices across a benchmark's life cycle and evaluate 25 AI benchmarks against it. We find large quality differences and show that commonly used benchmarks suffer from significant issues. We further find that most benchmarks do not report the statistical significance of their results, nor can their results be easily replicated. To support benchmark developers in aligning with best practices, we provide a checklist for minimum quality assurance based on our assessment. We also develop a living repository of benchmark assessments to support benchmark comparability.
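A checklist-based assessment of this kind reduces mechanically to scoring a benchmark against a fixed set of criteria. The sketch below shows that mechanism only; the four criteria paraphrase themes mentioned in the abstract and are not the paper's actual 40 best practices.

```python
# Hypothetical minimum quality-assurance criteria (not BetterBench's real
# items), recorded as pass/fail for one benchmark under review.
CHECKLIST = {
    "reports_statistical_significance": False,
    "results_easily_replicable": False,
    "design_rationale_documented": True,
    "evaluation_code_released": True,
}

def quality_score(checks):
    """Fraction of quality-assurance criteria the benchmark satisfies."""
    return sum(checks.values()) / len(checks)

print(f"{quality_score(CHECKLIST):.0%}")  # 2 of 4 criteria met -> 50%
```

Scoring many benchmarks with the same criteria dictionary is what makes the "living repository" comparable across entries: every benchmark is measured on an identical axis.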