trustworthiness
Bridging the Gap Between Climate Science and Machine Learning in Climate Model Emulation
Schmidt, Luca; Effenberger, Nina
While climate models provide insights for climate decision-making, their use is constrained by significant computational and technical demands. Although machine learning (ML) emulators offer a way to bypass the high computational costs, their effective use remains challenging. The hurdles are diverse, ranging from limited accessibility and a lack of specialized knowledge to a general mistrust of ML methods that are perceived as insufficiently physical. Here, we introduce a framework to overcome these barriers by integrating both climate science and machine learning perspectives. We find that designing easy-to-adopt emulators that address a clearly defined task and demonstrating their reliability offers a promising path for bridging the gap between our two fields.
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models
Artificial intelligence has significantly impacted medical applications, particularly with the advent of Medical Large Vision Language Models (Med-LVLMs), sparking optimism for the future of automated and personalized healthcare. However, the trustworthiness of Med-LVLMs remains unverified, posing significant risks for future model deployment. In this paper, we introduce CARES and aim to comprehensively evaluate the trustworthiness of Med-LVLMs across the medical domain. We assess the trustworthiness of Med-LVLMs across five dimensions: trustfulness, fairness, safety, privacy, and robustness. CARES comprises about 41K question-answer pairs in both closed and open-ended formats, covering 16 medical image modalities and 27 anatomical regions. Our analysis reveals that the models consistently exhibit concerns regarding trustworthiness, often displaying factual inaccuracies and failing to maintain fairness across different demographic groups. Furthermore, they are vulnerable to attacks and demonstrate a lack of privacy awareness.
Semantic Density: Uncertainty Quantification for Large Language Models through Confidence Measurement in Semantic Space
With the widespread application of Large Language Models (LLMs) to various domains, concerns regarding the trustworthiness of LLMs in safety-critical scenarios have been raised, due to their unpredictable tendency to hallucinate and generate misinformation. Existing LLMs do not have inherent functionality to provide users with an uncertainty/confidence metric for each response they generate, making it difficult to evaluate trustworthiness. Although several studies aim to develop uncertainty quantification methods for LLMs, they have fundamental limitations, such as being restricted to classification tasks, requiring additional training and data, considering only lexical instead of semantic information, and being prompt-wise but not response-wise. A new framework is proposed in this paper to address these issues.
Reports of the Association for the Advancement of Artificial Intelligence's 2025 Fall Symposium Series
The Association for the Advancement of Artificial Intelligence's 2025 Fall Symposium Series was held November 6-8, 2025, at the Westin Arlington Gateway in Arlington, Virginia. There were six symposia in the program: AI for Social Good: Emerging Methods, Measures, Data, and Ethics; AI Trustworthiness and Risk Assessment for Challenged Contexts; Engineering Safety-Critical AI Systems; First AAAI Symposium on Quantum Information and Machine Learning: Bridging Quantum Computing and Artificial Intelligence; Safe, Ethical, Certified, Uncertainty-aware, Robust, and Explainable AI for Health; and Unifying Representations for Robot Application Development. This report contains summaries of the symposia, which were submitted by most, but not all, of the symposium organizers. AI has demonstrated transformative potential across sectors such as aging, combating information manipulation, disaster response, education, environmental sustainability, government, healthcare, social care, transportation, and urban planning. Yet, the systematic development of AI for Social Good remains fragmented across those many research communities, with limited convergence around effective methodologies, equitable impact measurement, or access to important data and long-term engagement with targeted populations. The main objective for this symposium was to convene across disciplines and engage researchers, practitioners, and policymakers, with a particular focus on finding methods, measures, and data that could be used in multiple settings. There were roughly 30 participants.