Goto

Collaborating Authors

 mini


Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs

Ma, Julian, Wang, Jun, Fountas, Zafeirios

arXiv.org Artificial Intelligence

Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.


LLM-as-a-Judge for Scalable Test Coverage Evaluation: Accuracy, Operational Reliability, and Cost

Huang, Donghao, Chew, Shila, Dutkiewicz, Anna, Wang, Zhaoxia

arXiv.org Artificial Intelligence

Assessing software test coverage at scale remains a bottleneck in QA pipelines. We present LLM-as-a-Judge (LAJ), a production-ready, rubric-driven framework for evaluating Gherkin acceptance tests with structured JSON outputs. Across 20 model configurations (GPT-4, GPT-5 with varying reasoning effort, and open-weight models) on 100 expert-annotated scripts over 5 runs (500 evaluations), we provide the first comprehensive analysis spanning accuracy, operational reliability, and cost. We introduce the Evaluation Completion Rate (ECR@1) to quantify first-attempt success, revealing reliability from 85.4% to 100.0% with material cost implications via retries. Results show that smaller models can outperform larger ones: GPT-4o Mini attains the best accuracy (6.07 MAAE), high reliability (96.6% ECR@1), and low cost ($1.01 per 1K), yielding a 78x cost reduction vs. GPT-5 (high reasoning) while improving accuracy. Reasoning effort is model-family dependent: GPT-5 benefits from increased reasoning (with predictable accuracy-cost tradeoffs), whereas open-weight models degrade across all dimensions as reasoning increases. Overall, cost spans 175x ($0.45-$78.96 per 1K). We release the dataset, framework, and code to support reproducibility and deployment.


Who is Afraid of Minimal Revision?

Baccini, Edoardo, Christoff, Zoé, Gierasimczuk, Nina, Verbrugge, Rineke

arXiv.org Artificial Intelligence

The principle of minimal change in belief revision theory requires that, when accepting new information, one keeps one's belief state as close to the initial belief state as possible. This is precisely what the method known as minimal revision does. However, unlike less conservative belief revision methods, minimal revision falls short in learning power: It cannot learn everything that can be learned by other learning methods. We begin by showing that, despite this limitation, minimal revision is still a successful learning method in a wide range of situations. Firstly, it can learn any problem that is finitely identifiable. Secondly, it can learn with positive and negative data, as long as one considers finitely many possibilities. We then characterize the prior plausibility assignments (over finitely many possibilities) that enable one to learn via minimal revision, and do the same for conditioning and lexicographic upgrade. Finally, we show that not all of our results still hold when learning from possibly erroneous information.


Supplementary Material: The Role of Global Labels in Few-Shot Classification and How to Infer Them

Neural Information Processing Systems

The supplementary material is organized as follows: Appendix A contains the proofs accompanying our theoretical analysis. Appendix B presents additional experiment results. For a dataset D, let π (D) be the set of class labels from D . In (A.4), the inequality is formed because the denominator MeLa is compatible with different meta-learning algorithms. Table 7 suggests that MeLa obtains test performance comparable to RFS, which is the oracle setting.



A Hybrid Search for Complex Table Question Answering in Securities Report

Shirafuji, Daiki, Tanaka, Koji, Saito, Tatsuhiko

arXiv.org Artificial Intelligence

Recently, Large Language Models (LLMs) are gaining increased attention in the domain of Table Question Answering (TQA), particularly for extracting information from tables in documents. However, directly entering entire tables as long text into LLMs often leads to incorrect answers because most LLMs cannot inherently capture complex table structures. In this paper, we propose a cell extraction method for TQA without manual identification, even for complex table headers. Our approach estimates table headers by computing similarities between a given question and individual cells via a hybrid retrieval mechanism that integrates a language model and TF-IDF. We then select as the answer the cells at the intersection of the most relevant row and column. Furthermore, the language model is trained using contrastive learning on a small dataset of question-header pairs to enhance performance. We evaluated our approach in the TQA dataset from the U4 shared task at NTCIR-18. The experimental results show that our pipeline achieves an accuracy of 74.6\%, outperforming existing LLMs such as GPT-4o mini~(63.9\%). In the future, although we used traditional encoder models for retrieval in this study, we plan to incorporate more efficient text-search models to improve performance and narrow the gap with human evaluation results.


Latent Planning via Embedding Arithmetic: A Contrastive Approach to Strategic Reasoning

Hamara, Andrew, Hamerly, Greg, Rivas, Pablo, Freeman, Andrew C.

arXiv.org Artificial Intelligence

Planning in high-dimensional decision spaces is increasingly being studied through the lens of learned representations. Rather than training policies or value heads, we investigate whether planning can be carried out directly in an evaluation-aligned embedding space. We introduce SOLIS, which learns such a space using supervised contrastive learning. In this representation, outcome similarity is captured by proximity, and a single global advantage vector orients the space from losing to winning regions. Candidate actions are then ranked according to their alignment with this direction, reducing planning to vector operations in latent space. We demonstrate this approach in chess, where SOLIS uses only a shallow search guided by the learned embedding to reach competitive strength under constrained conditions. More broadly, our results suggest that evaluation-aligned latent planning offers a lightweight alternative to traditional dynamics models or policy learning.



Schema Generation for Large Knowledge Graphs Using Large Language Models

Zhang, Bohui, He, Yuan, Pintscher, Lydia, Peñuela, Albert Meroño, Simperl, Elena

arXiv.org Artificial Intelligence

Schemas play a vital role in ensuring data quality and supporting usability in the Semantic Web and natural language processing. Traditionally, their creation demands substantial involvement from knowledge engineers and domain experts. Leveraging the impressive capabilities of large language models (LLMs) in tasks like ontology engineering, we explore schema generation using LLMs. To bridge the resource gap, we introduce two datasets: YAGO Schema and Wikidata EntitySchema, along with novel evaluation metrics. The LLM-based pipelines utilize local and global information from knowledge graphs (KGs) to generate schemas in Shape Expressions (ShEx). Experiments demonstrate LLMs' strong potential in producing high-quality ShEx schemas, paving the way for scalable, automated schema generation for large KGs. Furthermore, our benchmark introduces a new challenge for structured generation, pushing the limits of LLMs on syntactically rich formalisms.


Machine Learning for Campus Energy Resilience: Clustering and Time-Series Forecasting in Intelligent Load Shedding

Oyinlola, Salim, Oluseyi, Peter Olabisi

arXiv.org Artificial Intelligence

The growing demand for reliable electricity in universities necessitates intelligent energy management. This study proposes a machine learning-based load shedding framework for the University of Lagos, designed to optimize distribution and reduce waste. The methodology followed three main stages. First, a dataset of 3,648 hourly records from 55 buildings was compiled to develop building-level consumption models. Second, Principal Component Analysis was applied for dimensionality reduction, and clustering validation techniques were used to determine the optimal number of demand groups. Mini-Batch K-Means was then employed to classify buildings into high-, medium-, and low-demand clusters. Finally, short-term load forecasting was performed at the cluster level using multiple statistical and deep learning models, including ARIMA, SARIMA, Prophet, LSTM, and GRU. Results showed Prophet offered the most reliable forecasts, while Mini-Batch K-Means achieved stable clustering performance. By integrating clustering with forecasting, the framework enabled a fairer, data-driven load shedding strategy that reduces inefficiencies and supports climate change mitigation through sustainable energy management.