Goto

Collaborating Authors

 Overview


A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions

arXiv.org Artificial Intelligence

The remarkable performance of large language models (LLMs) in content generation, coding, and common-sense reasoning has spurred widespread integration into many facets of society. However, integration of LLMs raises valid questions on their reliability and trustworthiness, given their propensity to generate hallucinations: plausible, factually-incorrect responses, which are expressed with striking confidence. Previous work has shown that hallucinations and other non-factual responses generated by LLMs can be detected by examining the uncertainty of the LLM in its response to the pertinent prompt, driving significant research efforts devoted to quantifying the uncertainty of LLMs. This survey seeks to provide an extensive review of existing uncertainty quantification methods for LLMs, identifying their salient features, along with their strengths and weaknesses. We present existing methods within a relevant taxonomy, unifying ostensibly disparate methods to aid understanding of the state of the art. Furthermore, we highlight applications of uncertainty quantification methods for LLMs, spanning chatbot and textual applications to embodied artificial intelligence applications in robotics. We conclude with open research challenges in uncertainty quantification of LLMs, seeking to motivate future research.


MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

arXiv.org Artificial Intelligence

As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. Building upon pre-trained LLMs, this family of models further develops multimodal perception and reasoning capabilities that are impressive, such as writing code given a flow chart or creating stories based on an image. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. Distinct from the traditional train-eval-test paradigm that only favors a single task like image classification, the versatility of MLLMs has spurred the rise of various new benchmarks and evaluation methods. In this paper, we aim to present a comprehensive survey of MLLM evaluation, discussing four key aspects: 1) the summarised benchmarks types divided by the evaluation capabilities, including foundation capabilities, model self-analysis, and extented applications; 2) the typical process of benchmark counstruction, consisting of data collection, annotation, and precautions; 3) the systematic evaluation manner composed of judge, metric, and toolkit; 4) the outlook for the next benchmark. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods, thereby driving the progress of MLLM research.


A Scoping Review of ChatGPT Research in Accounting and Finance

arXiv.org Artificial Intelligence

This paper provides a review of recent publications and working papers on ChatGPT and related Large Language Models (LLMs) in accounting and finance. The aim is to understand the current state of research in these two areas and identify potential research opportunities for future inquiry. We identify three common themes from these earlier studies. The first theme focuses on applications of ChatGPT and LLMs in various fields of accounting and finance. The second theme utilizes ChatGPT and LLMs as a new research tool by leveraging their capabilities such as classification, summarization, and text generation. The third theme investigates implications of LLM adoption for accounting and finance professionals, as well as for various organizations and sectors. While these earlier studies provide valuable insights, they leave many important questions unanswered or partially addressed. We propose venues for further exploration and provide technical guidance for researchers seeking to employ ChatGPT and related LLMs as a tool for their research.


Human-Calibrated Automated Testing and Validation of Generative Language Models

arXiv.org Artificial Intelligence

This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging due to open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, where generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction. In addition, the framework includes robustness testing to evaluate model performance against adversarial, out-of-distribution, and varied input conditions, as well as targeted weakness identification using marginal and bivariate analysis to pinpoint specific areas for improvement. This human-calibrated, multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, providing a practical and reliable solution for deploying GLMs in applications where accuracy, transparency, and regulatory compliance are paramount.


Efficient Distributed Training through Gradient Compression with Sparsification and Quantization Techniques

arXiv.org Artificial Intelligence

This study investigates the impact of gradient compression on distributed training performance, focusing on sparsification and quantization techniques, including top-k, DGC, and QSGD. In baseline experiments, random-k compression results in severe performance degradation, highlighting its inefficacy. In contrast, using top-k and DGC at 50 times compression yields performance improvements, reducing perplexity by up to 0.06 compared to baseline. Experiments across 1, 2, and 4 workers demonstrate that conservative sparsification can have a regularizing effect, especially for smaller models, while compression ratios above 5000 times impair performance, particularly for DGC. Communication times are reduced across all compression methods, with top-k and DGC decreasing communication to negligible levels at high compression ratios. However, increased computation times offset this efficiency for top-k due to sorting demands, making it less scalable than DGC or QSGD. In convergence tests, sparsification techniques show accelerated convergence, requiring fewer epochs than the baseline, which has implications for computational savings. Although precision trade-offs emerge, floating point errors are mitigated by compression. This study's findings underscore the need to tune hyperparameters specifically for each compression technique to achieve optimal model performance, especially in distributed training systems.


Reinforcement Learning: An Overview

arXiv.org Artificial Intelligence

This manuscript gives a big-picture, up-to-date overview of the field of (deep) reinforcement learning and sequential decision making, covering value-based RL, policy-gradient methods, model-based methods, and various other topics (including a very brief discussion of RL+LLMs).


The Prompt Canvas: A Literature-Based Practitioner Guide for Creating Effective Prompts in Large Language Models

arXiv.org Artificial Intelligence

The rise of large language models (LLMs) has highlighted the importance of prompt engineering as a crucial technique for optimizing model outputs. While experimentation with various prompting methods, such as Few-shot, Chain-of-Thought, and role-based techniques, has yielded promising results, these advancements remain fragmented across academic papers, blog posts and anecdotal experimentation. The lack of a single, unified resource to consolidate the field's knowledge impedes the progress of both research and practical application. This paper argues for the creation of an overarching framework that synthesizes existing methodologies into a cohesive overview for practitioners. Using a design-based research approach, we present the Prompt Canvas, a structured framework resulting from an extensive literature review on prompt engineering that captures current knowledge and expertise. By combining the conceptual foundations and practical strategies identified in prompt engineering, the Prompt Canvas provides a practical approach for leveraging the potential of Large Language Models. It is primarily designed as a learning resource for pupils, students and employees, offering a structured introduction to prompt engineering. This work aims to contribute to the growing discourse on prompt engineering by establishing a unified methodology for researchers and providing guidance for practitioners.


Power Plant Detection for Energy Estimation using GIS with Remote Sensing, CNN & Vision Transformers

arXiv.org Artificial Intelligence

In this research, we propose a hybrid model for power plant detection to assist energy estimation applications, by pipelining GIS (Geographical Information Systems) having Remote Sensing capabilities with CNN (Convolutional Neural Networks) and ViT (Vision Transformers). Our proposed approach enables real-time analysis with multiple data types on a common map via the GIS, entails feature-extraction abilities due to the CNN, and captures long-range dependencies through the ViT. This hybrid approach is found to enhance classification, thus helping in the monitoring and operational management of power plants; hence assisting energy estimation and sustainable energy planning in the future. It exemplifies adequate deployment of machine learning methods in conjunction with domain-specific approaches to enhance performance.


A Survey of Sustainability in Large Language Models: Applications, Economics, and Challenges

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have transformed numerous domains by providing advanced capabilities in natural language understanding, generation, and reasoning. Despite their groundbreaking applications across industries such as research, healthcare, and creative media, their rapid adoption raises critical concerns regarding sustainability. This survey paper comprehensively examines the environmental, economic, and computational challenges associated with LLMs, focusing on energy consumption, carbon emissions, and resource utilization in data centers. By synthesizing insights from existing literature, this work explores strategies such as resource-efficient training, sustainable deployment practices, and lifecycle assessments to mitigate the environmental impacts of LLMs. Key areas of emphasis include energy optimization, renewable energy integration, and balancing performance with sustainability. The findings aim to guide researchers, practitioners, and policymakers in developing actionable strategies for sustainable AI systems, fostering a responsible and environmentally conscious future for artificial intelligence.


Entity-based Reinforcement Learning for Autonomous Cyber Defence

arXiv.org Artificial Intelligence

A significant challenge for autonomous cyber defence is ensuring a defensive agent's ability to generalise across diverse network topologies and configurations. This capability is necessary for agents to remain effective when deployed in dynamically changing environments, such as an enterprise network where devices may frequently join and leave. Standard approaches to deep reinforcement learning, where policies are parameterised using a fixed-input multi-layer perceptron (MLP) expect fixed-size observation and action spaces. In autonomous cyber defence, this makes it hard to develop agents that generalise to environments with network topologies different from those trained on, as the number of nodes affects the natural size of the observation and action spaces. To overcome this limitation, we reframe the problem of autonomous network defence using entity-based reinforcement learning, where the observation and action space of an agent are decomposed into a collection of discrete entities. This framework enables the use of policy parameterisations specialised in compositional generalisation. We train a Transformer-based policy on the Yawning Titan cyber-security simulation environment and test its generalisation capabilities across various network topologies. We demonstrate that this approach significantly outperforms an MLP-based policy when training across fixed-size networks of varying topologies, and matches performance when training on a single network. We also demonstrate the potential for zero-shot generalisation to networks of a different size to those seen in training. These findings highlight the potential for entity-based reinforcement learning to advance the field of autonomous cyber defence by providing more generalisable policies capable of handling variations in real-world network environments.