Goto

Collaborating Authors

 Croatia


On the Worst Prompt Performance of Large Language Models

Neural Information Processing Systems

The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts, which raises significant concerns about their reliability in real-world scenarios. Existing studies often divide prompts into task-level instructions and case-level inputs and primarily focus on evaluating and improving robustness against variations in tasks-level instructions. However, this setup fails to fully address the diversity of real-world user queries and assumes the existence of task-specific datasets.


ConvoGen: Enhancing Conversational AI with Synthetic Data: A Multi-Agent Approach

arXiv.org Artificial Intelligence

In this paper, we present ConvoGen: an innovative framework for generating synthetic conversational data using multi-agent systems. Our method leverages few-shot learning and introduces iterative sampling from a dynamically updated few-shot hub to create diverse and realistic conversational scenarios. The generated data has numerous applications, including training and evaluating conversational AI models, and augmenting existing datasets for tasks like conversational intent classification or conversation summarization. Our experiments demonstrate the effectiveness of this method in producing high-quality diverse synthetic conversational data, highlighting its potential to enhance the development and evaluation of conversational AI systems.


Graph-based Uncertainty Metrics for Long-form Language Model Outputs

Neural Information Processing Systems

Recent advancements in Large Language Models (LLMs) have significantly improved text generation capabilities, but these systems are still known to hallucinate, and granular uncertainty estimation for long-form LLM generations remains challenging. In this work, we propose Graph Uncertainty - which represents the relationship between LLM generations and claims within them as a bipartite graph and estimates the claim-level uncertainty with a family of graph centrality metrics. Under this view, existing uncertainty estimation methods based on the concept of self-consistency can be viewed as using degree centrality as an uncertainty measure, and we show that more sophisticated alternatives such as closeness centrality provide consistent gains at claim-level uncertainty estimation. Moreover, we present uncertaintyaware decoding techniques that leverage both the graph structure and uncertainty estimates to improve the factuality of LLM generations by preserving only the most reliable claims. Compared to existing methods, our graph-based uncertainty metrics lead to an average of 6.8% relative gains on AUPRC across various long-form generation settings, and our end-to-end system provides consistent 2-4% gains in factuality over existing decoding techniques while significantly improving the informativeness of generated responses


Beyond Aesthetics: Cultural Competence in Text-to-Image Models

Neural Information Processing Systems

Text-to-Image (T2I) models are being increasingly adopted in diverse global communities where they create visual representations of their unique cultures. Current T2I benchmarks primarily focus on faithfulness, aesthetics, and realism of generated images, overlooking the critical dimension of cultural competence. In this work, we introduce a framework to evaluate cultural competence of T2I models along two crucial dimensions: cultural awareness and cultural diversity, and present a scalable approach using a combination of structured knowledge bases and large language models to build a large dataset of cultural artifacts to enable this evaluation. In particular, we apply this approach to build CUBE (CUltural BEnchmark for Text-to-Image models), a first-of-its-kind benchmark to evaluate cultural competence of T2I models.


XAI-Driven Client Selection for Federated Learning in Scalable 6G Network Slicing

arXiv.org Artificial Intelligence

In recent years, network slicing has embraced artificial intelligence (AI) models to manage the growing complexity of communication networks. In such a situation, AI-driven zero-touch network automation should present a high degree of flexibility and viability, especially when deployed in live production networks. However, centralized controllers suffer from high data communication overhead due to the vast amount of user data, and most network slices are reluctant to share private data. In federated learning systems, selecting trustworthy clients to participate in training is critical for ensuring system performance and reliability. The present paper proposes a new approach to client selection by leveraging an XAI method to guarantee scalable and fast operation of federated learning based analytic engines that implement slice-level resource provisioning at the RAN-Edge in a non-IID scenario. Attributions from XAI are used to guide the selection of devices participating in training. This approach enhances network trustworthiness for users and addresses the black-box nature of neural network models. The simulations conducted outperformed the standard approach in terms of both convergence time and computational cost, while also demonstrating high scalability.


Ergodic exploration of dynamic distribution

arXiv.org Artificial Intelligence

This research addresses the challenge of performing search missions in dynamic environments, particularly for drifting targets whose movement is dictated by a flow field. This is accomplished through a dynamical system that integrates two partial differential equations: one governing the dynamics and uncertainty of the probability distribution, and the other regulating the potential field for ergodic multi-agent search. The target probability field evolves in response to the target dynamics imposed by the environment and accomplished sensing efforts, while being explored by multiple robot agents guided by the potential field gradient. The proposed methodology was tested on two simulated search scenarios, one of which features a synthetically generated domain and showcases better performance when compared to the baseline method with static target probability over a range of agent to flow field velocity ratios. The second search scenario represents a realistic sea search and rescue mission where the search start is delayed, the search is performed in multiple robot flight missions, and the procedure for target drift uncertainty compensation is demonstrated. Furthermore, the proposed method provides an accurate survey completion metric, based on the known detection/sensing parameters, that correlates with the actual number of targets found independently.


ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions

arXiv.org Artificial Intelligence

Modern Large Language Models (LLMs) have shown astounding capabilities of code understanding and synthesis. In order to assess such capabilities, several benchmarks have been devised (e.g., HumanEval). However, most benchmarks focus on code synthesis from natural language instructions. Hence, such benchmarks do not test for other forms of code understanding. Moreover, there have been concerns about contamination and leakage. That is, benchmark problems (or closely related problems) may appear in training set, strongly biasing benchmark results. In this work we investigate whether large language models can correctly predict runtime program behavior. To this end, we introduce ThrowBench, a benchmark consisting of over 2,400 short user-written programs written in four different programming languages. The majority of these programs throw an exception during runtime (due to a bug). LLMs are asked to predict whether a presented program throws an exception and, if so, which one. Evaluating our benchmark on six state-of-the-art code LLMs we see modest performance ranging from 19 to 38% (F1 score). Benchmarking a wider set of code capabilities could improve the assessment of code LLMs and help identify weak points in current models. Moreover, as ground-truth answers have been determined through program execution, leakage is not a concern. We release ThrowBench as well as all of our results together with this work.


Grouped Sequential Optimization Strategy -- the Application of Hyperparameter Importance Assessment in Deep Learning

arXiv.org Artificial Intelligence

In recent years, the rapid advancement of deep learning has led to significant breakthroughs across a wide range of applications, from computer vision to natural language processing, where hyperparameter optimization (HPO) has become increasingly vital in constructing models that achieve optimal performance. As the demand for HPO has been growing, the computational and time costs associated with it have become a significant bottleneck [1]. In this context, Hyperparameter Importance Assessment (HIA) has emerged as a promising solution. By evaluating the importance weights of individual hyperparameters and their combinations within specific models, HIA provides valuable insights into which hyperparameters most significantly impact model performance [2]. With this understanding, deep learning practitioners can focus on optimizing only those hyperparameters that have a more pronounced effect on performance. For less critical hyperparameters, users can reduce the search space during optimization or even fix them at certain values, thereby saving time in the model optimization process [3]. Although there has been considerable exploration of HIA, most existing studies have primarily focused on introducing new HIA methods or determining the importance rankings of hyperparameters for specific models within certain application scenarios. However, there has been limited exploration of how these insights can be strategically applied to enhance the efficiency of the optimization process. To address the challenges in the current research landscape, this paper aims to use Convolutional Neural Networks (CNNs) as the research case to introduce HIA into the deep learning pipeline, demonstrating that the insights gained from HIA can effectively enhance the efficiency of hyper-Second Conference on Parsimony and Learning (CPAL 2025).


EnigmaToM: Improve LLMs' Theory-of-Mind Reasoning Capabilities with Neural Knowledge Base of Entity States

arXiv.org Artificial Intelligence

Theory-of-Mind (ToM), the ability to infer others' perceptions and mental states, is fundamental to human interaction but remains a challenging task for Large Language Models (LLMs). While existing ToM reasoning methods show promise with reasoning via perceptual perspective-taking, they often rely excessively on LLMs, reducing their efficiency and limiting their applicability to high-order ToM reasoning, which requires multi-hop reasoning about characters' beliefs. To address these issues, we present EnigmaToM, a novel neuro-symbolic framework that enhances ToM reasoning by integrating a Neural Knowledge Base of entity states (Enigma) for (1) a psychology-inspired iterative masking mechanism that facilitates accurate perspective-taking and (2) knowledge injection that elicits key entity information. Enigma generates structured representations of entity states, which construct spatial scene graphs -- leveraging spatial information as an inductive bias -- for belief tracking of various ToM orders and enhancing events with fine-grained entity state details. Experimental results on multiple benchmarks, including ToMi, HiToM, and FANToM, show that EnigmaToM significantly improves ToM reasoning across LLMs of varying sizes, particularly excelling in high-order reasoning scenarios.


OptMetaOpenFOAM: Large Language Model Driven Chain of Thought for Sensitivity Analysis and Parameter Optimization based on CFD

arXiv.org Artificial Intelligence

Merging natural language interfaces with computational fluid dynamics (CFD) workflows presents transformative opportunities for both industry and research. In this study, we introduce OptMetaOpenFOAM - a novel framework that bridges MetaOpenFOAM with external analysis and optimization tool libraries through a large language model (LLM)-driven chain-of-thought (COT) methodology. By automating complex CFD tasks via natural language inputs, the framework empowers non-expert users to perform sensitivity analyses and parameter optimizations with markedly improved efficiency. The test dataset comprises 11 distinct CFD analysis or optimization tasks, including a baseline simulation task derived from an OpenFOAM tutorial covering fluid dynamics, combustion, and heat transfer. Results confirm that OptMetaOpenFOAM can accurately interpret user requirements expressed in natural language and effectively invoke external tool libraries alongside MetaOpenFOAM to complete the tasks. Furthermore, validation on a non-OpenFOAM tutorial case - namely, a hydrogen combustion chamber - demonstrates that a mere 200-character natural language input can trigger a sequence of simulation, postprocessing, analysis, and optimization tasks spanning over 2,000 lines of code. These findings underscore the transformative potential of LLM-driven COT methodologies in linking external tool for advanced analysis and optimization, positioning OptMetaOpenFOAM as an effective tool that streamlines CFD simulations and enhances their convenience and efficiency for both industrial and research applications. Code is available at https://github.com/Terry-cyx/MetaOpenFOAM.