AITopics

Genre: Research Report > New Finding (0.96)

Technology: Information Technology > Artificial Intelligence (0.56)

Neural Information Processing SystemsFeb-12-2026, 06:26:31 GMT

46b5405a720a99db4c758cff43c8b4d3-Paper-Datasets_and_Benchmarks_Track.pdf

large language model, machine learning, natural language, (20 more...)

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > China > Shanghai > Shanghai (0.04)
South America > Peru > Lima Department > Lima Province > Lima (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Information Technology (1.00)
Leisure & Entertainment > Sports > Track & Field (0.92)
Health & Medicine > Consumer Health (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

Neural Information Processing SystemsDec-27-2025, 07:04:36 GMT

GAIA: Delving into Gradient-based Attribution Abnormality for Out-of-distribution Detection

Detecting out-of-distribution (OOD) examples is crucial to guarantee the reliability and safety of deep neural networks in real-world settings. In this paper, we offer an innovative perspective on quantifying the disparities between in-distribution (ID) and OOD data---analyzing the uncertainty that arises when models attempt to explain their predictive decisions. This perspective is motivated by our observation that gradient-based attribution methods encounter challenges in assigning feature importance to OOD data, thereby yielding divergent explanation patterns. Consequently, we investigate how attribution gradients lead to uncertain explanation outcomes and introduce two forms of abnormalities for OOD detection: the zero-deflation abnormality and the channel-wise average abnormality. We then propose GAIA, a simple and effective approach that incorporates Gradient Abnormality Inspection and Aggregation. The effectiveness of GAIA is validated on both commonly utilized (CIFAR) and large-scale (ImageNet-1k) benchmarks. Specifically, GAIA reduces the average FPR95 by 23.10% on CIFAR10 and by 45.41% on CIFAR100 compared to advanced post-hoc methods.

gradient-based attribution abnormality, name change, out-of-distribution detection, (4 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.60)

arXiv.org Artificial IntelligenceDec-9-2025

Stochasticity in Agentic Evaluations: Quantifying Inconsistency with Intraclass Correlation

Mustahsan, Zairah, Lim, Abel, Anand, Megna, Jain, Saahil, McCann, Bryan

As large language models become components of larger agentic systems, evaluation reliability becomes critical: unreliable sub-agents introduce brittleness into downstream system behavior. Yet current evaluation practice, reporting a single accuracy number from a single run, obscures the variance underlying these results, making it impossible to distinguish genuine capability improvements from lucky sampling. We propose adopting Intraclass Correlation Coefficient (ICC), a metric from measurement science, to characterize this variance. ICC decomposes observed variance into between-query variance (task difficulty) and within-query variance (agent inconsistency), highlighting whether reported results reflect true capability or measurement noise. We evaluated on GAIA (Levels 1-3, measuring agentic capabilities across varying reasoning complexity) and FRAMES (measuring retrieval and factuality across multiple documents). We found that ICC varies dramatically with task structure, with reasoning and retrieval tasks (FRAMES) exhibit ICC=0.4955-0.7118 across models, and agentic tasks (GAIA) exhibiting ICC=0.304-0.774 across models. For sub-agent replacement decisions in agentic systems, accuracy improvements are only trustworthy if ICC also improves. We demonstrate that ICC converges by n=8-16 trials for structured tasks and n>=32 for complex reasoning, enabling practitioners to set evidence-based resampling budgets. We recommend reporting accuracy alongside ICC and within-query variance as standard practice, and propose updated Evaluation Cards capturing these metrics. By making evaluation stability visible, we aim to transform agentic benchmarking from opaque leaderboard competition to trustworthy experimental science. Our code is open-sourced at https://github.com/youdotcom-oss/stochastic-agent-evals.

large language model, machine learning, variance, (20 more...)

2512.0671

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceNov-3-2025

GAIA: A Foundation Model for Operational Atmospheric Dynamics

Asanjan, Ata Akbari, Alexander, Olivia, Berg, Tom, Peng, Stephen, Makki, Jad, Zhang, Clara, Yang, Matt, Shidham, Disha, Chakraborty, Srija, Bender, William, Crawford, Cara, Ravindran, Arun, Raiman, Olivier, Potere, David, Bell, David

We introduce GAIA (Geospatial Artificial Intelligence for Atmospheres), a hybrid self-supervised geospatial foundation model that fuses Masked Autoencoders (MAE) with self-distillation with no labels (DINO) to generate semantically rich representations from global geostationary satellite imagery. Pre-trained on 15 years of globally-merged infrared observations (2001-2015), GAIA learns disentangled representations that capture atmospheric dynamics rather than trivial diurnal patterns, as evidenced by distributed principal component structure and temporal coherence analysis. We demonstrate robust reconstruction capabilities across varying data availability (30-95% masking), achieving superior gap-filling performance on real missing data patterns. When transferred to downstream tasks, GAIA consistently outperforms an MAE-only baseline: improving atmospheric river segmentation (F1: 0.58 vs 0.52), enhancing tropical cyclone detection (storm-level recall: 81% vs 75%, early detection: 29% vs 17%), and maintaining competitive precipitation estimation performance. Analysis reveals that GAIA's hybrid objectives encourage learning of spatially coherent, object-centric features distributed across multiple principal components rather than concentrated representations focused on reconstruction. This work demonstrates that combining complementary self-supervised objectives yields more transferable representations for diverse atmospheric modeling tasks. Model weights and code are available at: https://huggingface.co/bcg-usra-nasa-gaia/GAIA-v1.

gaia, machine learning, natural language, (17 more...)

2505.18179

Country: North America > United States (1.00)

Genre: Research Report > New Finding (0.46)

Industry:

Government > Regional Government > North America Government > United States Government (0.88)
Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.36)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(3 more...)

New ScientistOct-31-2025, 09:30:31 GMT

If you could upload your mind to a virtual utopia, would you?

"What does it really mean to upload your consciousness into intangible space?" In, the characters face an impossible choice: upload your mind into a virtual utopia, or crumble away in the abandoned physical world. Mind-uploading is familiar to us as a science fiction trope, often anchoring relationship dramas and philosophical inquiry. But what does it really mean to upload your consciousness into intangible space? Can the mechanics be extrapolated from our present-day science?

consciousness, upload, virtual utopia, (12 more...)

New Scientist

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence (1.00)

Neural Information Processing SystemsOct-10-2025, 01:05:51 GMT

GAIA: Rethinking Action Quality Assessment for AI-Generated Videos Zijian Chen 1, Wei Sun

Based on GAIA, we evaluate a suite of popular text-to-video (T2V) models on their ability to generate visually rational actions, revealing their pros and cons on different categories of actions.

assessment, dataset, video, (14 more...)

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > China > Shanghai > Shanghai (0.04)
South America > Peru > Lima Department > Lima Province > Lima (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Information Technology (1.00)
Leisure & Entertainment > Sports > Track & Field (0.92)
Health & Medicine > Consumer Health (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

Handa, Gunmay, Wu, Zekun, Koshiyama, Adriano, Treleaven, Philip

Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects

arXiv.org Artificial IntelligenceSep-8-2025

Personality manipulation in large language models (LLMs) is increasingly applied in customer service and agentic scenarios, yet its mechanisms and trade-offs remain unclear. We present a systematic study of personality control using the Big Five traits, comparing in-context learning (ICL), parameter-efficient fine-tuning (PEFT), and mechanistic steering (MS). Our contributions are fourfold. First, we construct a contrastive dataset with balanced high/low trait responses, enabling effective steering vector computation and fair cross-method evaluation. Second, we introduce a unified evaluation framework based on within-run $Δ$ analysis that disentangles, reasoning capability, agent performance, and demographic bias across MMLU, GAIA, and BBQ benchmarks. Third, we develop trait purification techniques to separate openness from conscientiousness, addressing representational overlap in trait encoding. Fourth, we propose a three-level stability framework that quantifies method-, trait-, and combination-level robustness, offering practical guidance under deployment constraints. Experiments on Gemma-2-2B-IT and LLaMA-3-8B-Instruct reveal clear trade-offs: ICL achieves strong alignment with minimal capability loss, PEFT delivers the highest alignment at the cost of degraded task performance, and MS provides lightweight runtime control with competitive effectiveness. Trait-level analysis shows openness as uniquely challenging, agreeableness as most resistant to ICL, and personality encoding consolidating around intermediate layers. Taken together, these results establish personality manipulation as a multi-level probe into behavioral representation, linking surface conditioning, parameter encoding, and activation-level steering, and positioning mechanistic steering as a lightweight alternative to fine-tuning for both deployment and interpretability.

large language model, machine learning, personality trait, (18 more...)

2509.04794

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceAug-15-2025

MAPS: A Multilingual Benchmark for Global Agent Performance and Security

Hofman, Omer, Brokman, Jonathan, Rachmil, Oren, Bose, Shamik, Pahuja, Vikas, Shimizu, Toshiya, Starostina, Trisha, Marchisio, Kelly, Goldfarb-Tarrant, Seraphina, Vainshtein, Roman

Agentic AI systems, which build on Large Language Models (LLMs) and interact with tools and memory, have rapidly advanced in capability and scope. Yet, since LLMs have been shown to struggle in multilingual settings, typically resulting in lower performance and reduced safety, agentic systems risk inheriting these limitations. This raises concerns about the accessibility of such systems, as users interacting in languages other than English may encounter unreliable or security-critical agent behavior. Despite growing interest in evaluating agentic AI, existing benchmarks focus exclusively on English, leaving multilingual settings unexplored. To address this gap, we propose MAPS, a multilingual benchmark suite designed to evaluate agentic AI systems across diverse languages and tasks. MAPS builds on four widely used agentic benchmarks - GAIA (real-world tasks), SWE-bench (code generation), MATH (mathematical reasoning), and the Agent Security Benchmark (security). We translate each dataset into eleven diverse languages, resulting in 805 unique tasks and 9,660 total language-specific instances - enabling a systematic analysis of the multilingual effect on AI agents' performance and robustness. Empirically, we observe degradation in both performance and security when transitioning from English to other languages, with severity varying by task and correlating with the amount of translated input. Building on these findings, we provide actionable recommendations to guide agentic AI systems development and assessment under multilingual settings. This work establishes the first standardized evaluation framework for multilingual agentic AI, encouraging future research towards equitable, reliable, and accessible agentic AI. MAPS benchmark suite is publicly available at https://huggingface.co/datasets/Fujitsu-FRE/MAPS

benchmark, large language model, machine learning, (18 more...)

2505.15935

Country:

North America > Mexico (0.28)
Europe > Austria (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

arXiv.org Artificial IntelligenceJun-4-2025

Coding Agents with Multimodal Browsing are Generalist Problem Solvers

Soni, Aditya Bharat, Li, Boxuan, Wang, Xingyao, Chen, Valerie, Neubig, Graham

Modern human labor is characterized by specialization; we train for years and develop particular tools that allow us to perform well across a variety of tasks. In addition, AI agents have been specialized for domains such as software engineering, web navigation, and workflow automation. However, this results in agents that are good for one thing but fail to generalize beyond their intended scope. One reason for this is that agent developers provide a highly specialized set of tools or make architectural decisions optimized for a specific use case or benchmark. In this work, we ask the question: what is the minimal set of general tools that can be used to achieve high performance across a diverse set of tasks? Our answer is OpenHands-Versa, a generalist agent built with a modest number of general tools: code editing and execution, web search, as well as multimodal web browsing and file access. Importantly, OpenHands-Versa demonstrates superior or competitive performance over leading specialized agents across three diverse and challenging benchmarks: SWE-Bench Multimodal, GAIA, and The Agent Company, outperforming the best-performing previously published results with absolute improvements in success rate of 9.1, 1.3, and 9.1 points respectively. Further, we show how existing state-of-the-art multi-agent systems fail to generalize beyond their target domains. These results demonstrate the feasibility of developing a generalist agent to solve diverse tasks and establish OpenHands-Versa as a strong baseline for future research.

agent, artificial intelligence, openhand-v ersa, (15 more...)

2506.03011

Genre: Research Report > New Finding (0.34)

Industry: Information Technology > Security & Privacy (0.46)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)