Law
A Scoping Review of ChatGPT Research in Accounting and Finance
Dong, Mengming Michael, Stratopoulos, Theophanis C., Wang, Victor Xiaoqi
This paper provides a review of recent publications and working papers on ChatGPT and related Large Language Models (LLMs) in accounting and finance. The aim is to understand the current state of research in these two areas and identify potential research opportunities for future inquiry. We identify three common themes from these earlier studies. The first theme focuses on applications of ChatGPT and LLMs in various fields of accounting and finance. The second theme utilizes ChatGPT and LLMs as a new research tool by leveraging their capabilities such as classification, summarization, and text generation. The third theme investigates implications of LLM adoption for accounting and finance professionals, as well as for various organizations and sectors. While these earlier studies provide valuable insights, they leave many important questions unanswered or partially addressed. We propose venues for further exploration and provide technical guidance for researchers seeking to employ ChatGPT and related LLMs as a tool for their research.
Fine-Tuning LLMs with Noisy Data for Political Argument Generation and Post Guidance
Churina, Svetlana, Jaidka, Kokil
The incivility in social media discourse complicates the deployment of automated text generation models for politically sensitive content. Fine-tuning and prompting strategies are critical, but underexplored, solutions to mitigate toxicity in such contexts. This study investigates the fine-tuning and prompting effects on GPT-3.5 Turbo using subsets of the CLAPTON dataset of political discussion posts, comprising Twitter and Reddit data labeled for their justification, reciprocity and incivility. Fine-tuned models on Reddit data scored highest on discussion quality, while combined noisy data led to persistent toxicity. Prompting strategies reduced specific toxic traits, such as personal attacks, but had limited broader impact. The findings emphasize that high-quality data and well-crafted prompts are essential to reduce incivility and improve rhetorical quality in automated political discourse generation.
Human-Calibrated Automated Testing and Validation of Generative Language Models
Sudjianto, Agus, Zhang, Aijun, Neppalli, Srinivas, Joshi, Tarun, Malohlava, Michal
This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging due to open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, where generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction. In addition, the framework includes robustness testing to evaluate model performance against adversarial, out-of-distribution, and varied input conditions, as well as targeted weakness identification using marginal and bivariate analysis to pinpoint specific areas for improvement. This human-calibrated, multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, providing a practical and reliable solution for deploying GLMs in applications where accuracy, transparency, and regulatory compliance are paramount.
Designing Domain-Specific Large Language Models: The Critical Role of Fine-Tuning in Public Opinion Simulation
Large language models (LLMs) have transformed natural language processing, yet face challenges in specialized tasks such as simulating opinions on environmental policies. This paper introduces a novel fine-tuning approach that integrates socio-demographic data from the UK Household Longitudinal Study, uniquely using profiling factors, such as age, gender, income, education, and region. This method enhances the accuracy and representation of generated views. By emulating diverse synthetic profiles, the fine-tuned models significantly outperform pre-trained counterparts, achieving measurable improvements in capturing demographic nuances. Evaluation metrics, including Chi-Squared, Cosine Similarity, Jaccard Index, and KL-divergence, reveal a strong alignment between synthetic and real-world opinions. This work demonstrates the potential of fine-tuned LLMs tailored to societal contexts to enable more ethical and precise policy simulations. Its broader implications include deploying LLMs in domains like healthcare and education, fostering inclusive and data-driven decision-making in both research and practice.
6 key data points NYPD will use to get the UnitedHealthcare CEO shooter
Surveillance video shows the suspect in the shooting of UnitedHealthcare CEO Brian Thompson on a bicycle near West 85th Street in Manhattan after the killing. The speculation regarding the shooting of UnitedHealthCare CEO Brian Thompson continues to run rampant. While this can be interesting, the truth is that the on-the-ground investigation will be far more prosaic than glamorous. It can be a daunting amount of information. As such, let's look at some hard data points that are likely jumping-off points for investigators who have to play the percentages (and some that are not): The idea that someone off the street can walk into a social club or call-a-guy-who-knows-a-guy who kills for a living is essentially a myth โ I cannot recall one in my experience.
LIAR: Leveraging Alignment (Best-of-N) to Jailbreak LLMs in Seconds
Beetham, James, Chakraborty, Souradip, Wang, Mengdi, Huang, Furong, Bedi, Amrit Singh, Shah, Mubarak
Many existing jailbreak techniques rely on solving discrete combinatorial optimization, while more recent approaches involve training LLMs to generate multiple adversarial prompts. However, both approaches require significant computational resources to produce even a single adversarial prompt. We hypothesize that the inefficiency of current approaches stems from an inadequate characterization of the jailbreak problem. To address this gap, we formulate the jailbreak problem in terms of alignment. By starting from an available safety-aligned model, we leverage an unsafe reward to guide the safe model towards generating unsafe outputs using alignment techniques (e.g., reinforcement learning from human feedback), effectively performing jailbreaking via alignment. We propose a novel jailbreak method called LIAR (LeveragIng Alignment to jailbReak). To demonstrate the simplicity and effectiveness of our approach, we employ a best-of-N method to solve the alignment problem. LIAR offers significant advantages: lower computational requirements without additional training, fully black-box operation, competitive attack success rates, and more human-readable prompts. We provide theoretical insights into the possibility of jailbreaking a safety-aligned model, revealing inherent vulnerabilities in current alignment strategies for LLMs. We also provide sub-optimality guarantees for the proposed \algo. Experimentally, we achieve ASR comparable to the SoTA with a 10x improvement to perplexity and a Time-to-Attack measured in seconds rather than tens of hours.
Promoting Cooperation in the Public Goods Game using Artificial Intelligent Agents
Hintze, Arend, Adami, Christoph
The tragedy of the commons illustrates a fundamental social dilemma where individual rational actions lead to collectively undesired outcomes, threatening the sustainability of shared resources. Strategies to escape this dilemma, however, are in short supply. In this study, we explore how artificial intelligence (AI) agents can be leveraged to enhance cooperation in public goods games, moving beyond traditional regulatory approaches to using AI as facilitators of cooperation. We investigate three scenarios: (1) Mandatory Cooperation Policy for AI Agents, where AI agents are institutionally mandated always to cooperate; (2) Player-Controlled Agent Cooperation Policy, where players evolve control over AI agents' likelihood to cooperate; and (3) Agents Mimic Players, where AI agents copy the behavior of players. Using a computational evolutionary model with a population of agents playing public goods games, we find that only when AI agents mimic player behavior does the critical synergy threshold for cooperation decrease, effectively resolving the dilemma. This suggests that we can leverage AI to promote collective well-being in societal dilemmas by designing AI agents to mimic human players.
Upcycling Noise for Federated Unlearning
Chen, Jianan, Hu, Qin, Zhong, Fangtian, Zhuang, Yan, Xu, Minghui
In Federated Learning (FL), multiple clients collaboratively train a model without sharing raw data. This paradigm can be further enhanced by Differential Privacy (DP) to protect local data from information inference attacks and is thus termed DPFL. An emerging privacy requirement, ``the right to be forgotten'' for clients, poses new challenges to DPFL but remains largely unexplored. Despite numerous studies on federated unlearning (FU), they are inapplicable to DPFL because the noise introduced by the DP mechanism compromises their effectiveness and efficiency. In this paper, we propose Federated Unlearning with Indistinguishability (FUI) to unlearn the local data of a target client in DPFL for the first time. FUI consists of two main steps: local model retraction and global noise calibration, resulting in an unlearning model that is statistically indistinguishable from the retrained model. Specifically, we demonstrate that the noise added in DPFL can endow the unlearning model with a certain level of indistinguishability after local model retraction, and then fortify the degree of unlearning through global noise calibration. Additionally, for the efficient and consistent implementation of the proposed FUI, we formulate a two-stage Stackelberg game to derive optimal unlearning strategies for both the server and the target client. Privacy and convergence analyses confirm theoretical guarantees, while experimental results based on four real-world datasets illustrate that our proposed FUI achieves superior model performance and higher efficiency compared to mainstream FU schemes. Simulation results further verify the optimality of the derived unlearning strategies.
From Defects to Demands: A Unified, Iterative, and Heuristically Guided LLM-Based Framework for Automated Software Repair and Requirement Realization
Alex, null, Liu, null, Vivian, null, Chi, null
As software systems evolve, developers face a dual challenge: maintaining correctness by fixing bugs and continuously adapting functionality to meet new user demands. Traditional software engineering processes rely heavily on human developers to interpret requirements, fix errors, and ensure correctness against specifications. With the advancement of Large Language Models (LLMs) adept at code generation, the opportunity arises to shift portions of these responsibilities onto machine-driven processes. However, simply prompting an LLM to solve a complex programming task--be it eliminating a subtle bug or implementing a new feature--often falls short. Complex codebases exceed the model's context window, specification details are not always fully captured in a single prompt, and correctness requires iterative refinement guided by tests, analysis, and verification. This paper proposes a holistic, iterative framework that enables an LLM to evolve a codebase from an initial, potentially buggy state to one that satisfies not only pre-existing correctness criteria but also newly introduced feature demands. Key contributions include: 1. Unified Framework for Bug-to-Demand Resolution: We present a method by which the LLM iteratively refines code, starting from an imperfect state (with known or unknown bugs) and incrementally adjusting the codebase to meet a set of evolving functional and nonfunctional requirements introduced over time.
Multi-Objective Alignment of Large Language Models Through Hypervolume Maximization
Mukherjee, Subhojyoti, Lalitha, Anusha, Sengupta, Sailik, Deshmukh, Aniket, Kveton, Branislav
Multi-objective alignment from human feedback (MOAHF) in large language models (LLMs) is a challenging problem as human preferences are complex, multifaceted, and often conflicting. Recent works on MOAHF considered a-priori multi-objective optimization (MOO), where human preferences are known at training or inference time. In contrast, when human preferences are unknown or difficult to quantify, a natural approach is to cover the Pareto front by multiple diverse solutions. We propose an algorithm HaM for learning diverse LLM policies that maximizes their hypervolume. This is the first application of a-posteriori MOO to MOAHF. HaM is computationally and space efficient, and empirically superior across objectives such as harmlessness, helpfulness, humor, faithfulness, and hallucination, on various datasets.