Goto

Collaborating Authors

 Law


Towards Automated Regulatory Compliance Verification in Financial Auditing with Large Language Models

arXiv.org Artificial Intelligence

The auditing of financial documents, historically a labor-intensive process, stands on the precipice of transformation. AI-driven solutions have made inroads into streamlining this process by recommending pertinent text passages from financial reports to align with the legal requirements of accounting standards. However, a glaring limitation remains: these systems commonly fall short in verifying if the recommended excerpts indeed comply with the specific legal mandates. Hence, in this paper, we probe the efficiency of publicly available Large Language Models (LLMs) in the realm of regulatory compliance across different model configurations. We place particular emphasis on comparing cutting-edge open-source LLMs, such as Llama-2, with their proprietary counterparts like OpenAI's GPT models. This comparative analysis leverages two custom datasets provided by our partner PricewaterhouseCoopers (PwC) Germany. We find that the open-source Llama-2 70 billion model demonstrates outstanding performance in detecting non-compliance or true negative occurrences, beating all their proprietary counterparts. Nevertheless, proprietary models such as GPT-4 perform the best in a broad variety of scenarios, particularly in non-English contexts.


A Comprehensive Data-centric Overview of Federated Graph Learning

arXiv.org Artificial Intelligence

In the era of big data applications, Federated Graph Learning (FGL) has emerged as a prominent solution that reconcile the tradeoff between optimizing the collective intelligence between decentralized datasets holders and preserving sensitive information to maximum. Existing FGL surveys have contributed meaningfully but largely focus on integrating Federated Learning (FL) and Graph Machine Learning (GML), resulting in early stage taxonomies that emphasis on methodology and simulated scenarios. Notably, a data centric perspective, which systematically examines FGL methods through the lens of data properties and usage, remains unadapted to reorganize FGL research, yet it is critical to assess how FGL studies manage to tackle data centric constraints to enhance model performances. This survey propose a two-level data centric taxonomy: Data Characteristics, which categorizes studies based on the structural and distributional properties of datasets used in FGL, and Data Utilization, which analyzes the training procedures and techniques employed to overcome key data centric challenges. Each taxonomy level is defined by three orthogonal criteria, each representing a distinct data centric configuration. Beyond taxonomy, this survey examines FGL integration with Pretrained Large Models, showcases realistic applications, and highlights future direction aligned with emerging trends in GML.


Disability Across Cultures: A Human-Centered Audit of Ableism in Western and Indic LLMs

arXiv.org Artificial Intelligence

People with disabilities (PwD) experience disproportionately high levels of discrimination and hate online, particularly in India, where entrenched stigma and limited resources intensify these challenges. Large language models (LLMs) are increasingly used to identify and mitigate online hate, yet most research on online ableism focuses on Western audiences with Western AI models. Are these models adequately equipped to recognize ableist harm in non-Western places like India? Do localized, Indic language models perform better? To investigate, we adopted and translated a publicly available ableist speech dataset to Hindi, and prompted eight LLMs--four developed in the U.S. (GPT-4, Gemini, Claude, Llama) and four in India (Krutrim, Nanda, Gajendra, Airavata)--to score and explain ableism. In parallel, we recruited 175 PwD from both the U.S. and India to perform the same task, revealing stark differences between groups. Western LLMs consistently overestimated ableist harm, while Indic LLMs underestimated it. Even more concerning, all LLMs were more tolerant of ableism when it was expressed in Hindi and asserted Western framings of ableist harm. In contrast, Indian PwD interpreted harm through intention, relationality, and resilience--emphasizing a desire to inform and educate perpetrators. This work provides groundwork for global, inclusive standards of ableism, demonstrating the need to center local disability experiences in the design and evaluation of AI systems.


TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task

arXiv.org Artificial Intelligence

Can AI file your taxes? Not yet. Calculating US personal income taxes is a task that requires building an understanding of vast amounts of English text and using that knowledge to carefully compute results. We propose TaxCalcBench, a benchmark for determining models' abilities to calculate personal income tax returns given all of the necessary information. Our experiment shows that state-of-the-art models succeed in calculating less than a third of federal income tax returns even on this simplified sample set. Our analysis concludes that models consistently misuse tax tables, make errors in tax calculation, and incorrectly determine eligibility. Our findings point to the need for additional infrastructure to apply LLMs to the personal income tax calculation task.


"Just a strange pic": Evaluating 'safety' in GenAI Image safety annotation tasks from diverse annotators' perspectives

arXiv.org Artificial Intelligence

Understanding what constitutes safety in AI-generated content is complex. While developers often rely on predefined taxonomies, real-world safety judgments also involve personal, social, and cultural perceptions of harm. This paper examines how annotators evaluate the safety of AI-generated images, focusing on the qualitative reasoning behind their judgments. Analyzing 5,372 open-ended comments, we find that annotators consistently invoke moral, emotional, and contextual reasoning that extends beyond structured safety categories. Many reflect on potential harm to others more than to themselves, grounding their judgments in lived experience, collective risk, and sociocultural awareness. Beyond individual perceptions, we also find that the structure of the task itself -- including annotation guidelines -- shapes how annotators interpret and express harm. Guidelines influence not only which images are flagged, but also the moral judgment behind the justifications. Annotators frequently cite factors such as image quality, visual distortion, and mismatches between prompt and output as contributing to perceived harm dimensions, which are often overlooked in standard evaluation frameworks. Our findings reveal that existing safety pipelines miss critical forms of reasoning that annotators bring to the task. We argue for evaluation designs that scaffold moral reflection, differentiate types of harm, and make space for subjective, context-sensitive interpretations of AI-generated content.


Chabria: 3 things that should scare us about Trump's fake video of Obama

Los Angeles Times

On Sunday, our thoughtful and reserved president reposted on his Truth Social site a video generated by artificial intelligence that falsely showed former President Obama being arrested and imprisoned. There are those among you who think this is high humor; those among you who who find it as tiresome as it is offensive; and those among you blissfully unaware of the mental morass that is Truth Social. Whatever camp you fall into, the video crosses all demographics by being expected -- just another crazy Trump stunt in a repetitive cycle of division and diversion so frequent it makes Groundhog Day seem fresh. But there are three reasons why this particular video -- not made by the president but amplified to thousands -- is worth noting, and maybe even worth fearing. First, it is flat-out racist. In it, Obama is ripped out of a chair in the Oval Office and forced onto his knees, almost bowing, to a laughing Trump.


LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra

arXiv.org Artificial Intelligence

We present the LLM Economist, a novel framework that uses agent-based modeling to design and assess economic policies in strategic environments with hierarchical decision-making. At the lower level, bounded rational worker agents -- instantiated as persona-conditioned prompts sampled from U.S. Census-calibrated income and demographic statistics -- choose labor supply to maximize text-based utility functions learned in-context. At the upper level, a planner agent employs in-context reinforcement learning to propose piecewise-linear marginal tax schedules anchored to the current U.S. federal brackets. This construction endows economic simulacra with three capabilities requisite for credible fiscal experimentation: (i) optimization of heterogeneous utilities, (ii) principled generation of large, demographically realistic agent populations, and (iii) mechanism design -- the ultimate nudging problem -- expressed entirely in natural language. Experiments with populations of up to one hundred interacting agents show that the planner converges near Stackelberg equilibria that improve aggregate social welfare relative to Saez solutions, while a periodic, persona-level voting procedure furthers these gains under decentralized governance. These results demonstrate that large language model-based agents can jointly model, simulate, and govern complex economic systems, providing a tractable test bed for policy evaluation at the societal scale to help build better civilizations.


Challenges of Trustworthy Federated Learning: What's Done, Current Trends and Remaining Work

arXiv.org Artificial Intelligence

In recent years, the development of Trustworthy Artificial Intelligence (TAI) has emerged as a critical objective in the deployment of AI systems across sensitive and high-risk domains. TAI frameworks articulate a comprehensive set of ethical, legal, and technical requirements to ensure that AI technologies are aligned with human values, rights, and societal expectations. Among the various AI paradigms, Federated Learning (FL) presents a promising solution to pressing privacy concerns. However, aligning FL with the rest of the requirements of TAI presents a series of challenges, most of which arise from its inherently distributed nature. In this work, we adopt the requirements TAI as a guiding structure to systematically analyze the challenges of adapting FL to TAI. Specifically, we classify and examine the key obstacles to aligning FL with TAI, providing a detailed exploration of what has been done, the trends, and the remaining work within each of the identified challenges.


Applying the Chinese Wall Reverse Engineering Technique to Large Language Model Code Editing

arXiv.org Artificial Intelligence

This work does not provide legal advice, and do not claims that any legal opinion provided are correct 1 and block the model's output accordingly. This technique might not completely block partial matches and does not work for open source development as reproduction of the original code is expected. Some models address this issue by using curated datasets with appropriate licensed contents. For example, the Stack v2 dataset [2] and Starcoder2 model limits data to permissively licensed sources and contents with unknown license. The Common Pile dataset [3] and the accompanied Comma model improves on this by limiting the dataset to permissive licensed contents only. Most permissive licenses only have attribution as its primary sole licensing condition and may be easier to comply with than the GPLv2 license. Ideally, models that are trained on public domain contents may be the best in terms of legal compliance as they have no restrictions or requirements, but to our knowledge no such text generation models exist today with reasonable quality.


The Constitutional Controller: Doubt-Calibrated Steering of Compliant Agents

arXiv.org Artificial Intelligence

Ensuring reliable and rule-compliant behavior of autonomous agents in uncertain environments remains a fundamental challenge in modern robotics. Our work shows how neuro-symbolic systems, which integrate probabilistic, symbolic white-box reasoning models with deep learning methods, offer a powerful solution to this challenge. This enables the simultaneous consideration of explicit rules and neural models trained on noisy data, combining the strength of structured reasoning with flexible representations. To this end, we introduce the Constitutional Controller (CoCo), a novel framework designed to enhance the safety and reliability of agents by reasoning over deep probabilistic logic programs representing constraints such as those found in shared traffic spaces. Furthermore, we propose the concept of self-doubt, implemented as a probability density conditioned on doubt features such as travel velocity, employed sensors, or health factors. In a real-world aerial mobility study, we demonstrate CoCo's advantages for intelligent autonomous systems to learn appropriate doubts and navigate complex and uncertain environments safely and compliantly.