Law
Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models
Costa, Davi Bastos, Alves, Felippe, Vicente, Renato
Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, prompting a LLM to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties: moral susceptibility and moral robustness, defined from the variability of MFQ scores across and within personas, respectively. We find that, for moral robustness, model family accounts for most of the variance, while model size shows no systematic effect. The Claude family is, by a significant margin, the most robust, followed by Gemini and GPT-4 models, with other families exhibiting lower robustness. In contrast, moral susceptibility exhibits a mild family effect but a clear within-family size effect, with larger variants being more susceptible. Moreover, robustness and susceptibility are positively correlated, an association that is more pronounced at the family level. Additionally, we present moral foundation profiles for models without persona role-play and for personas averaged across models. Together, these analyses provide a systematic view of how persona conditioning shapes moral behavior in large language models.
ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech
Koniaris, Marios, Tsipi, Argyro, Tsanakas, Panayiotis
Parliamentary speech generation presents specific challenges for large language models beyond standard text generation tasks. Unlike general text generation, parliamentary speeches require not only linguistic quality but also political authenticity and ideological consistency. Current language models lack specialized training for parliamentary contexts, and existing evaluation methods focus on standard NLP metrics rather than political authenticity. To address this, we present ParliaBench, a benchmark for parliamentary speech generation. We constructed a dataset of speeches from UK Parliament to enable systematic model training. We introduce an evaluation framework combining computational metrics with LLM-as-a-judge assessments for measuring generation quality across three dimensions: linguistic quality, semantic coherence, and political authenticity. We propose two novel embedding-based metrics, Political Spectrum Alignment and Party Alignment, to quantify ideological positioning. We fine-tuned five large language models (LLMs), generated 28k speeches, and evaluated them using our framework, comparing baseline and fine-tuned models. Results show that fine-tuning produces statistically significant improvements across the majority of metrics and our novel metrics demonstrate strong discriminative power for political dimensions.
Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents
AlShikh, Waseem, Ali, Muayad Sayed, Kennedy, Brian, Mozolevskyi, Dmytro
As AI agents proliferate across industries and applications, evaluating their performance based solely on infrastructural metrics such as latency, time-to-first-token, or token throughput is proving insufficient. These metrics fail to capture the quality of an agent's decisions, its operational autonomy, or its ultimate business value. This white paper proposes a novel, comprehensive framework of eleven outcome-based, task-agnostic performance metrics for AI agents that transcend domain boundaries. These metrics are designed to enable organizations to evaluate agents based on the quality of their decisions, their degree of autonomy, their adaptability to new challenges, and the tangible business value they deliver, regardless of the underlying model architecture or specific use case. We introduce metrics such as Goal Completion Rate (GCR), Autonomy Index (AIx), Multi-Step Task Resilience (MTR), and Business Impact Efficiency (BIE). Through a large-scale simulated experiment involving four distinct agent architectures (ReAct, Chain-of-Thought, Tool-Augmented, Hybrid) across five diverse domains (Healthcare, Finance, Marketing, Legal, and Customer Service), we demonstrate the framework's efficacy. Our results reveal significant performance trade-offs between different agent designs, highlighting the Hybrid Agent as the most consistently high-performing model across the majority of our proposed metrics, achieving an average Goal Completion Rate of 88.8\% and the highest Return on Investment (ROI). This work provides a robust, standardized methodology for the holistic evaluation of AI agents, paving the way for more effective development, deployment, and governance.
FedPoP: Federated Learning Meets Proof of Participation
ฤฐลler, Devriล, van Kempen, Elina, Hwang, Seoyeon, Laoutaris, Nikolaos
Abstract--Federated learning (FL) offers privacy preserving, distributed machine learning, allowing clients to contribute to a global model without revealing their local data. As models increasingly serve as monetizable digital assets, the ability to prove participation in their training becomes essential for establishing ownership. In this paper, we address this emerging need by introducing FedPoP, a novel FL framework that allows non-linkable proof of participation while preserving client anonymity and privacy without requiring either extensive computations or a public ledger . FedPoP is designed to seamlessly integrate with existing secure aggregation protocols to ensure compatibility with real-world FL deployments. We provide a proof of concept implementation and an empirical evaluation under realistic client dropouts. In our prototype, FedPoP introduces 0.97 seconds of per-round overhead atop securely aggregated FL and enables a client to prove its participation/contribution to a model held by a third party in 0.0612 seconds. These results indicate FedPoP is practical for real-world deployments that require auditable participation without sacrificing privacy. Federated learning (FL) [1] has become one of the innovative distributed machine learning structures wherein private data holders (a.k.a. The most common FL setting involves three parties: a server who initiates a model and aggregates training data (local models) from clients, a large number of clients who collaboratively train the model, and a service provider who deploys the model to provide services to its users. In a nutshell, an FL system consists of iterative aggregation rounds where 1) the server sends global model parameters to clients; 2) each client trains the model using its own private data and transmits updated parameters to the server; and 3) the server aggregates the updated parameters sent by the clients into a new global model using an aggregation procedure (e.g., FedAvg [1], FedQV [2]). The final global model is delivered to the service provider when the training is completed.
oboro: Text-to-Image Synthesis on Limited Data using Flow-based Diffusion Transformer with MMH Attention
Mizutani, Ryusuke, Matano, Kazuaki, Kadowaki, Tsugumi, Tenya, Haruki, Layris, null, nuigurumi, null, Hashimoto, Koki, Tanaka, Yu
This project was conducted as a 2nd-term adopted project of the "Post-5G Information and Communication System Infrastructure Enhancement R&D Project Development of Competitive Generative AI Foundation Models (GENIAC)," a business of the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO). To address challenges such as labor shortages in Japan's anime production industry, this project aims to develop an image generation model from scratch. This report details the technical specifications of the developed image generation model, "oboro:." We have developed "oboro:," a new image generation model built from scratch, using only copyright-cleared images for training. A key characteristic is its architecture, designed to generate high-quality images even from limited datasets. The foundation model weights and inference code are publicly available alongside this report. This project marks the first release of an open-source, commercially-oriented image generation AI fully developed in Japan. AiHUB originated from the OSS community; by maintaining transparency in our development process, we aim to contribute to Japan's AI researcher and engineer community and promote the domestic AI development ecosystem.
A robust methodology for long-term sustainability evaluation of Machine Learning models
Paz-Ruza, Jorge, Gama, Joรฃo, Alonso-Betanzos, Amparo, Guijarro-Berdiรฑas, Bertha
Among the many desirable properties of Artificial Intelligence systems, sustainability and efficiency have become increasingly important in the context of worsening climate change, massive water use in data centres, and the need for simpler, faster models in IoT settings. Consequently, there have been not few attempts to both promote and regulate the sustainability of Machine Learning models; the EU's AI Act indicates that the sustainability of AI - in terms of its environmental and social footprint-should be considered when developing and deploying AI pipelines [1], and manifests like that of UNESCO highlight sustainability as one of the core principles of the broader Responsible AI paradigm [2]. However, this seemingly consensual agreement on the importance of sustainability and efficiency for real-world AI systems and the social and regulatory efforts heavily contrasts with the practical applicability of such regulations; without looking further, the AI Act itself defines the requirement for sustainability, but does not indicate what metrics and evaluation pipelines should be considered for a robust, reliable, and practically relevant assessment of the environmental impact of a model. We argue that this lack of comprehensiveness in sustainability recommendations for AI systems does not stem from a careless or sloppy construction of the regulations themselves, but rather from an actual absence of suitable evaluation protocols that are formal, model-agnostic, reproducible, and grounded in real-life usage protocols for the ML lifecycle. The authors of this preprint are aware of a single regulatory standard for measuring AI sustainability, namely UNE 0086 [3], which limits evaluation to the epoch-batch training paradigm of supervised learning systems, rendering it useless for any task or type of learning that deviates from that standard. Although many researchers and companies have made it a habit to report efficiency figures and comparisons (e.g., in terms of emitted CO
Continual Unlearning for Text-to-Image Diffusion Models: A Regularization Perspective
Lee, Justin, Mai, Zheda, Yoo, Jinsu, Fan, Chongyu, Zhang, Cheng, Chao, Wei-Lun
Machine unlearning--the ability to remove designated concepts from a pre-trained model--has advanced rapidly, particularly for text-to-image diffusion models. However, existing methods typically assume that unlearning requests arrive all at once, whereas in practice they often arrive sequentially. We present the first systematic study of continual unlearning in text-to-image diffusion models and show that popular unlearning methods suffer from rapid utility collapse: after only a few requests, models forget retained knowledge and generate degraded images. We trace this failure to cumulative parameter drift from the pre-training weights and argue that regularization is crucial to addressing it. To this end, we study a suite of add-on regularizers that (1) mitigate drift and (2) remain compatible with existing unlearning methods. Beyond generic regularizers, we show that semantic awareness is essential for preserving concepts close to the unlearning target, and propose a gradient-projection method that constrains parameter drift orthogonal to their subspace. This substantially improves continual unlearning performance and is complementary to other regularizers for further gains. Taken together, our study establishes continual unlearning as a fundamental challenge in text-to-image generation and provides insights, baselines, and open directions for advancing safe and accountable generative AI.
Judging by the Rules: Compliance-Aligned Framework for Modern Slavery Statement Monitoring
Xu, Wenhao, Arodi, Akshatha, Nie, Jian-Yun, Tchango, Arsene Fansi
Modern slavery affects millions of people worldwide, and regulatory frameworks such as Modern Slavery Acts now require companies to publish detailed disclosures. However, these statements are often vague and inconsistent, making manual review time-consuming and difficult to scale. While NLP offers a promising path forward, high-stakes compliance tasks require more than accurate classification: they demand transparent, rule-aligned outputs that legal experts can verify. Existing applications of large language models (LLMs) often reduce complex regulatory assessments to binary decisions, lacking the necessary structure for robust legal scrutiny. We argue that compliance verification is fundamentally a rule-matching problem: it requires evaluating whether textual statements adhere to well-defined regulatory rules. To this end, we propose a novel framework that harnesses AI for rule-level compliance verification while preserving expert oversight. At its core is the Compliance Alignment Judge (CA-Judge), which evaluates model-generated justifications based on their fidelity to statutory requirements. Using this feedback, we train the Compliance Alignment LLM (CALLM), a model that produces rule-consistent, human-verifiable outputs. CALLM improves predictive performance and generates outputs that are both transparent and legally grounded, offering a more verifiable and actionable solution for real-world compliance analysis.
Design, Results and Industry Implications of the World's First Insurance Large Language Model Evaluation Benchmark
Zhou, Hua, Ma, Bing, Zhang, Yufei, Zhao, Yi
This paper comprehensively elaborates on the construction methodology, multi-dimensional evaluation system, and underlying design philosophy of CUFEInse v1.0. Adhering to the principles of "quantitative-oriented, expert-driven, and multi-validation," the benchmark establishes an evaluation framework covering 5 core dimensions, 54 sub-indicators, and 14,430 high-quality questions, encompassing insurance theoretical knowledge, industry understanding, safety and compliance, intelligent agent application, and logical rigor. Based on this benchmark, a comprehensive evaluation was conducted on 11 mainstream large language models. The evaluation results reveal that general-purpose models suffer from common bottlenecks such as weak actuarial capabilities and inadequate compliance adaptation. High-quality domain-specific training demonstrates significant advantages in insurance vertical scenarios but exhibits shortcomings in business adaptation and compliance. The evaluation also accurately identifies the common bottlenecks of current large models in professional scenarios such as insurance actuarial, underwriting and claim settlement reasoning, and compliant marketing copywriting. The establishment of CUFEInse not only fills the gap in professional evaluation benchmarks for the insurance field, providing academia and industry with a professional, systematic, and authoritative evaluation tool, but also its construction concept and methodology offer important references for the evaluation paradigm of large models in vertical fields, serving as an authoritative reference for academic model optimization and industrial model selection. Finally, the paper looks forward to the future iteration direction of the evaluation benchmark and the core development direction of "domain adaptation + reasoning enhancement" for insurance large models.
Stress Testing Factual Consistency Metrics for Long-Document Summarization
Mujahid, Zain Muhammad, Wright, Dustin, Augenstein, Isabelle
Evaluating the factual consistency of abstractive text summarization remains a significant challenge, particularly for long documents, where conventional metrics struggle with input length limitations and long-range dependencies. In this work, we systematically evaluate the reliability of six widely used reference-free factuality metrics, originally proposed for short-form summarization, in the long-document setting. We probe metric robustness through seven factuality-preserving perturbations applied to summaries, namely paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion, and further analyze their sensitivity to retrieval context and claim information density. Across three long-form benchmark datasets spanning science fiction, legal, and scientific domains, our results reveal that existing short-form metrics produce inconsistent scores for semantically equivalent summaries and exhibit declining reliability for information-dense claims whose content is semantically similar to many parts of the source document. While expanding the retrieval context improves stability in some domains, no metric consistently maintains factual alignment under long-context conditions. Finally, our results highlight concrete directions for improving factuality evaluation, including multi-span reasoning, context-aware calibration, and training on meaning-preserving variations to enhance robustness in long-form summarization. We release all code, perturbed data, and scripts required to reproduce our results at https://github.com/zainmujahid/metricEval-longSum.