Law
Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
Prandi, Matteo, Suriani, Vincenzo, Pierucci, Federico, Galisai, Marcello, Nardi, Daniele, Bisconti, Piercosma
The rapid advancement of General Purpose AI (GPAI) models necessitates robust evaluation frameworks, especially with emerging regulations like the EU AI Act and its associated Code of Practice (CoP). Current AI evaluation practices depend heavily on established benchmarks, but these tools were not designed to measure the systemic risks that are the focus of the new regulatory landscape. This research addresses the urgent need to quantify this "benchmark-regulation gap." We introduce Bench-2-CoP, a novel, systematic framework that uses validated LLM-as-judge analysis to map the coverage of 194,955 questions from widely-used benchmarks against the EU AI Act's taxonomy of model capabilities and propensities. Our findings reveal a profound misalignment: the evaluation ecosystem dedicates the vast majority of its focus to a narrow set of behavioral propensities. On average, benchmarks devote 61.6% of their regulatory-relevant questions to "Tendency to hallucinate" and 31.2% to "Lack of performance reliability", while critical functional capabilities are dangerously neglected. Crucially, capabilities central to loss-of-control scenarios, including evading human oversight, self-replication, and autonomous AI development, receive zero coverage in the entire benchmark corpus. This study provides the first comprehensive, quantitative analysis of this gap, demonstrating that current public benchmarks are insufficient, on their own, for providing the evidence of comprehensive risk assessment required for regulatory compliance and offering critical insights for the development of next-generation evaluation tools.
Moment Estimate and Variational Approach for Learning Generalized Diffusion with Non-gradient Structures
Kong, Fanze, Lai, Chen-Chih, Lu, Yubin
This paper proposes a data-driven learning framework for identifying governing laws of generalized diffusions with non-gradient components. By combining energy dissipation laws with a physically consistent penalty and first-moment evolution, we design a two-stage method to recover the pseudo-potential and rotation in the pointwise orthogonal decomposition of a class of non-gradient drifts in generalized diffusions. Our two-stage method is applied to complex generalized diffusion processes including dissipation-rotation dynamics, rough pseudo-potentials and noisy data. Representative numerical experiments demonstrate the effectiveness of our approach for learning physical laws in non-gradient generalized diffusions.
Nyay-Darpan: Enhancing Decision Making Through Summarization and Case Retrieval for Consumer Law in India
Bhattacharyya, Swapnil, Kashid, Harshvivek, Ganatra, Shrey, Anaokar, Spandan, Nair, Shruti, Sekhar, Reshma, Manohar, Siddharth, Hemrajani, Rahul, Bhattacharyya, Pushpak
AI-based judicial assistance and case prediction have been extensively studied in criminal and civil domains, but remain largely unexplored in consumer law, especially in India. In this paper, we present Nyay-Darpan, a novel two-in-one framework that (i) summarizes consumer case files and (ii) retrieves similar case judgements to aid decision-making in consumer dispute resolution. Our methodology not only addresses the gap in consumer law AI tools but also introduces an innovative approach to evaluate the quality of the summary. The term 'Nyay-Darpan' translates into 'Mirror of Justice', symbolizing the ability of our tool to reflect the core of consumer disputes through precise summarization and intelligent case retrieval. Our system achieves over 75 percent accuracy in similar case prediction and approximately 70 percent accuracy across material summary evaluation metrics, demonstrating its practical effectiveness. We will publicly release the Nyay-Darpan framework and dataset to promote reproducibility and facilitate further research in this underexplored yet impactful domain.
Solving Copyright Infringement on Short Video Platforms: Novel Datasets and an Audio Restoration Deep Learning Pipeline
Oh, Minwoo, Park, Minsu, Park, Eunil
Short video platforms like YouTube Shorts and TikTok face significant copyright compliance challenges, as infringers frequently embed arbitrary background music (BGM) to obscure original soundtracks (OST) and evade content originality detection. To tackle this issue, we propose a novel pipeline that integrates Music Source Separation (MSS) and cross-modal video-music retrieval (CMVMR). Our approach effectively separates arbitrary BGM from the original OST, enabling the restoration of authentic video audio tracks. To support this work, we introduce two domain-specific datasets: OASD-20K for audio separation and OSVAR-160 for pipeline evaluation. OASD-20K contains 20,000 audio clips featuring mixed BGM and OST pairs, while OSVAR-160 is a unique benchmark dataset comprising 1,121 video and mixed-audio pairs, specifically designed for short video restoration tasks. Experimental results demonstrate that our pipeline not only removes arbitrary BGM with high accuracy but also restores OSTs, ensuring content integrity. This approach provides an ethical and scalable solution to copyright challenges in user-generated content on short video platforms.
Staff at UK's top AI institute complain to watchdog about its internal culture
Staff at the UK's leading artificial intelligence institute have raised concerns about the organisation's governance and internal culture in a whistleblowing complaint to the charity watchdog. The Alan Turing Institute (ATI), a registered charity with substantial state funding, is under government pressure to overhaul its strategic focus and leadership after an intervention last month from the technology secretary, Peter Kyle. In a complaint to the Charity Commission, a group of current ATI staff raise eight points of concern and say the institute is in danger of collapse due to government threats over its funding. The complaint alleges that the board of trustees, chaired by the former Amazon UK boss Doug Gurr, has failed to fulfil core legal duties such as providing strategic direction and ensuring accountability, with staff alleging a letter of no confidence was delivered last year and not acted upon. A spokesperson for ATI said the Charity Commission had not been in touch with the institute about any complaints that may have been sent to the organisation.
Major Japan newspaper sues 'free-riding' AI firm Perplexity
Japan's Yomiuri Shimbun newspaper, one of the world's biggest by circulation, is suing U.S.-based AI firm Perplexity for allegedly "free-riding" on its content on its search engine. The lawsuit filed Thursday is one of a slew by media companies worldwide against AI firms using their material and is the first by a major Japanese news organization, Yomiuri said. It accuses Perplexity of "free-riding on the results of the activities of news organizations, which have invested a great deal of effort and expense." A spokesman for the paper added that this "could have a negative impact on accurate journalism ... and shake the foundations of democracy." The lawsuit filed in Tokyo seeks damages of 2.2 billion ( 14.7 million), equivalent to 120,000 Yomuiri articles used "without permission" between February and June.
CoCoLex: Confidence-guided Copy-based Decoding for Grounded Legal Text Generation
S, Santosh T. Y. S., Elkhayat, Youssef Tarek, Ichim, Oana, Shetty, Pranav, Wang, Dongsheng, Ma, Zhiqiang, Nourbakhsh, Armineh, Liu, Xiaomo
Due to their ability to process long and complex contexts, LLMs can offer key benefits to the Legal domain, but their adoption has been hindered by their tendency to generate unfaithful, ungrounded, or hallucinatory outputs. While Retrieval-Augmented Generation offers a promising solution by grounding generations in external knowledge, it offers no guarantee that the provided context will be effectively integrated. To address this, context-aware decoding strategies have been proposed to amplify the influence of relevant context, but they usually do not explicitly enforce faithfulness to the context. In this work, we introduce Confidence-guided Copy-based Decoding for Legal Text Generation (CoCoLex)-a decoding strategy that dynamically interpolates the model produced vocabulary distribution with a distribution derived based on copying from the context. CoCoLex encourages direct copying based on the model's confidence, ensuring greater fidelity to the source. Experimental results on five legal benchmarks demonstrate that CoCoLex outperforms existing context-aware decoding methods, particularly in long-form generation tasks.
Competing Risks: Impact on Risk Estimation and Algorithmic Fairness
Jeanselme, Vincent, Tom, Brian, Barrett, Jessica
Accurate time-to-event prediction is integral to decision-making, informing medical guidelines, hiring decisions, and resource allocation. Survival analysis, the quantitative framework used to model time-to-event data, accounts for patients who do not experience the event of interest during the study period, known as censored patients. However, many patients experience events that prevent the observation of the outcome of interest. These competing risks are often treated as censoring, a practice frequently overlooked due to a limited understanding of its consequences. Our work theoretically demonstrates why treating competing risks as censoring introduces substantial bias in survival estimates, leading to systematic overestimation of risk and, critically, amplifying disparities. First, we formalize the problem of misclassifying competing risks as censoring and quantify the resulting error in survival estimates. Specifically, we develop a framework to estimate this error and demonstrate the associated implications for predictive performance and algorithmic fairness. Furthermore, we examine how differing risk profiles across demographic groups lead to group-specific errors, potentially exacerbating existing disparities. Our findings, supported by an empirical analysis of cardiovascular management, demonstrate that ignoring competing risks disproportionately impacts the individuals most at risk of these events, potentially accentuating inequity. By quantifying the error and highlighting the fairness implications of the common practice of considering competing risks as censoring, our work provides a critical insight into the development of survival models: practitioners must account for competing risks to improve accuracy, reduce disparities in risk assessment, and better inform downstream decisions.
NomicLaw: Emergent Trust and Strategic Argumentation in LLMs During Collaborative Law-Making
Hota, Asutosh, Jokinen, Jussi P. P.
Recent advancements in large language models (LLMs) have extended their capabilities from basic text processing to complex reasoning tasks, including legal interpretation, argumentation, and strategic interaction. However, empirical understanding of LLM behavior in open-ended, multi-agent settings especially those involving deliberation over legal and ethical dilemmas remains limited. We introduce Nomi-cLaw, a structured multi-agent simulation where LLMs engage in collaborative law-making, responding to complex legal vignettes by proposing rules, justifying them, and voting on peer proposals. We quantitatively measure trust and reciprocity via voting patterns and qualitatively assess how agents use strategic language to justify proposals and influence outcomes. Experiments involving homogeneous and heterogeneous LLM groups demonstrate how agents spontaneously form alliances, betray trust, and adapt their rhetoric to shape collective decisions. Our results highlight the latent social reasoning and persuasive capabilities of ten open-source LLMs and provide insights into the design of future AI systems capable of autonomous negotiation, coordination and drafting legislation in legal settings.
I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations
Kharchenko, Julia, Roosta, Tanya, Chadha, Aman, Shah, Chirag
This paper introduces a comprehensive benchmark for evaluating how Large Language Models (LLMs) respond to linguistic shibboleths: subtle linguistic markers that can inadvertently reveal demographic attributes such as gender, social class, or regional background. Through carefully constructed interview simulations using 100 validated question-response pairs, we demonstrate how LLMs systematically penalize certain linguistic patterns, particularly hedging language, despite equivalent content quality. Our benchmark generates controlled linguistic variations that isolate specific phenomena while maintaining semantic equivalence, which enables the precise measurement of demographic bias in automated evaluation systems. We validate our approach along multiple linguistic dimensions, showing that hedged responses receive 25.6% lower ratings on average, and demonstrate the benchmark's effectiveness in identifying model-specific biases. This work establishes a foundational framework for detecting and measuring linguistic discrimination in AI systems, with broad applications to fairness in automated decision-making contexts.