 Law


Engineering the Law-Machine Learning Translation Problem: Developing Legally Aligned Models

arXiv.org Artificial Intelligence

Organizations developing machine learning-based (ML) technologies face the complex challenge of achieving high predictive performance while respecting the law. This intersection between ML and the law creates new complexities. As ML model behavior is inferred from training data, legal obligations cannot be operationalized in source code directly. Rather, legal obligations require "indirect" operationalization. However, choosing context-appropriate operationalizations presents two compounding challenges: (1) laws often permit multiple valid operationalizations for a given legal obligation, each with varying degrees of legal adequacy; and (2) each operationalization creates unpredictable trade-offs among the different legal obligations and with predictive performance. Evaluating these trade-offs requires metrics (or heuristics), which are in turn difficult to validate against legal obligations. Current methodologies fail to fully address these interwoven challenges as they either focus on legal compliance for traditional software or on ML model development without adequately considering legal complexities. In response, we introduce a five-stage interdisciplinary framework that integrates legal and ML-technical analysis during ML model development. This framework facilitates designing ML models in a legally aligned way and identifying high-performing models that are legally justifiable. Legal reasoning guides choices for operationalizations and evaluation metrics, while ML experts ensure technical feasibility, performance optimization and an accurate interpretation of metric values. This framework bridges the gap between more conceptual analysis of law and ML models' need for deterministic specifications. We illustrate its application using a case study in the context of anti-money laundering.


Exploring How LLMs Capture and Represent Domain-Specific Knowledge

arXiv.org Artificial Intelligence

We study whether Large Language Models (LLMs) inherently capture domain-specific nuances in natural language. Our experiments probe the domain sensitivity of LLMs by examining their ability to distinguish queries from different domains using hidden states generated during the prefill phase. We reveal latent domain-related trajectories that indicate the model's internal recognition of query domains. We also study the robustness of these domain representations to variations in prompt styles and sources. Our approach leverages these representations for model selection, mapping the LLM that best matches the domain trace of the input query (i.e., the model with the highest performance on similar traces). Our findings show that LLMs can differentiate queries for related domains, and that the fine-tuned model is not always the most accurate. Unlike previous work, our interpretations apply to both closed and open-ended generative tasks.

Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet the internal mechanisms driving these capabilities remain poorly understood. Different domains require distinct knowledge and reasoning patterns, so LLMs must adjust their decision-making on the fly for each input query. This is crucial for applications demanding high reliability, such as legal and medical fields, where errors can lead to significant consequences. The research question of how LLMs adapt their decision-making and reasoning patterns across different domains is distinct from a growing body of work on locating factual associations from language model behavior (Meng et al., 2024; Hernandez et al., 2024a;b; Mitchell et al., 2022; Meng et al., 2023; Dai et al., 2022; Belrose et al., 2023). While these studies aim to identify the modules and computations that recall specific facts, primarily monitoring and controlling language generation, they often fall short in addressing the complexities of generative tasks. Understanding how LLMs adapt their reasoning across generative tasks is important for enhancing transparency in their decision-making processes.
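The model-selection idea in this abstract (pick the model that performed best on similar domain traces) can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the abstract does not specify how traces are compared, so the sketch uses cosine similarity against one stored centroid per domain, and the names `select_model`, `domain_profiles`, `model_a`, and `model_b` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_model(query_vec, domain_profiles):
    """Return (domain, model) for the stored profile closest to the query vector.

    query_vec: a hidden-state summary of the input query (assumed to be
        extracted elsewhere, e.g. during the prefill phase).
    domain_profiles: list of dicts with hypothetical keys "domain",
        "centroid" (mean hidden-state vector for that domain), and
        "model" (the model that scored best on that domain).
    """
    best = max(domain_profiles, key=lambda p: cosine(query_vec, p["centroid"]))
    return best["domain"], best["model"]


# Toy two-dimensional profiles; real hidden states have thousands of dimensions.
profiles = [
    {"domain": "legal", "centroid": [1.0, 0.0], "model": "model_a"},
    {"domain": "medical", "centroid": [0.0, 1.0], "model": "model_b"},
]
choice = select_model([0.9, 0.2], profiles)
```

In this toy setup the query vector lies closer to the "legal" centroid, so `choice` holds the legal domain and its associated model. A nearest-centroid rule is only one plausible way to match traces and is not claimed to be the paper's method.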


Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown strong reasoning capabilities, particularly when enhanced through Reinforcement Learning (RL). While prior work has successfully applied RL to mathematical reasoning -- where rules and correctness are well-defined -- generalizing these methods to broader reasoning domains remains challenging due to limited data, the lack of verifiable reward structures, and diverse task requirements. In this work, we propose NEMOTRON-CROSSTHINK, a framework that systematically incorporates multi-domain corpora, including both synthetic and real-world question-answer pairs, into RL training to improve generalization across diverse reasoning tasks. NEMOTRON-CROSSTHINK addresses key challenges by (1) incorporating data from varied sources spanning STEM, humanities, social sciences, etc.; (2) applying structured templates (e.g., multiple-choice and open-ended) to control answer-space complexity; (3) filtering for verifiable answers; and (4) optimizing data blending strategies that utilize data from multiple sources effectively. Our approach enables scalable and verifiable reward modeling beyond mathematics and demonstrates improved accuracies on both math (MATH-500: +30.1%, AMC23: +27.5%) and non-math reasoning benchmarks (MMLU-PRO: +12.8%, GPQA-DIAMOND: +11.3%, AGIEVAL: +15.1%, SUPERGPQA: +3.8%). Moreover, NEMOTRON-CROSSTHINK exhibits significantly improved response efficiency -- using 28% fewer tokens for correct answers -- highlighting more focused and effective reasoning. Through NEMOTRON-CROSSTHINK, we demonstrate that integrating multi-domain, multi-format data in RL leads to more accurate, efficient, and generalizable LLMs.
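To make the notion of a verifiable reward on a multiple-choice template concrete, here is a minimal sketch. The abstract does not describe the actual reward function, so the "Answer: X" output convention, the binary 1.0/0.0 reward, and the name `mcq_reward` are all assumptions for illustration.

```python
import re

# Assumed convention: the response states its choice as "Answer: X" or "Answer: (X)".
# The lookahead rejects ordinary words such as "Answer: because ...".
_ANSWER_PATTERN = re.compile(r"answer\s*:\s*\(?([A-J])(?![A-Za-z])", re.IGNORECASE)


def mcq_reward(response, gold):
    """Return 1.0 if the last stated choice matches the gold letter, else 0.0.

    A response with no parseable "Answer: X" statement earns 0.0, which
    also penalizes outputs that ignore the template.
    """
    choices = _ANSWER_PATTERN.findall(response)
    if not choices:
        return 0.0
    return 1.0 if choices[-1].upper() == gold.upper() else 0.0


r_correct = mcq_reward("The clause limits liability. Answer: B", "B")
r_wrong = mcq_reward("Answer: A. On reflection, Answer: D", "A")
r_unparsed = mcq_reward("I think it is B", "B")
```

Restricting the answer space to a single letter is what makes the check mechanical: no judge model is needed to decide whether a response is right, which is the property that math problems provide naturally.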


California Supreme Court demands State Bar answer questions on AI exam controversy

Los Angeles Times

The California Supreme Court urged the State Bar of California Thursday to explain how and why it utilized artificial intelligence to develop multiple-choice questions for its botched February bar exams. California's highest court, which oversees the State Bar, disclosed Tuesday that its justices were not informed before the exam that the State Bar had allowed its independent psychometrician to use AI to develop a small subset of questions. The Court on Thursday upped its public pressure on the State Bar, demanding it explain how it used AI to develop questions -- and what actions it took to ensure the reliability of the questions. The demand comes as the State Bar petitions the court to adjust test scores for hundreds of prospective California lawyers who complained of multiple technical problems and irregularities during the February exams. Using AI-developed questions written by non-legally-trained psychometricians represents 'an obvious conflict of interest,' critics say.


Trump Wants to Blame Fed Chair Powell for Economic Downturn

Slate

This week, Emily Bazelon and David Plotz are joined by Henry Blodget to discuss the financial and political fallout from the President's threats to fire Fed Chair Powell and subsequent retreat; a Supreme Court case over free exercise of religion that could have broad implications; and why Trump stands by Hegseth after Signalgate Part 2.

Here are some notes and references from this week's show:
Colby Smith for The New York Times: Trump Says He Won't Fire Powell. His Fed Battle May Not Be Over Yet.
America's economy is collateral damage
Nicole Narea for Vox: Trump's tariffs are driving a gold rush
Megan K. Stack for the New York Times (Opinion: Guest Essay): My School District Could Have Avoided This Supreme Court Case
Neal McCluskey for Reason: The Supreme Court Is About To Hear 2 Education Cases.
Ian Millhiser for Vox: The Supreme Court's "Don't Say Gay" argument went disastrously for public schools
Aaron Blake for The Washington Post (Analysis): Even on the gravest of issues, GOP can't summon the will to question Trump
Michael Crowley for The New York Times: Critics Call Rubio's Overhaul Plan a Blow to U.S. Values

Here are this week's chatters:
Henry: Christopher Lamb, Alicia Johnson, Jhasua Razo, and Sarah-Grace Mankarious for CNN: Who will be the next pope?


Google to report earnings amid justice department lawsuits and Trump tariffs

The Guardian

Google's parent company Alphabet will report its first quarter earnings on Thursday, which come as the tech giant is embroiled in antitrust lawsuits brought by the US government and a 17% drop in its stock price since the beginning of the year. It is also the company's first earnings report since Donald Trump levied tariffs on trade partners around the world. Despite the upheaval, analysts appear optimistic on Alphabet's outlook, projecting first quarter revenue of $89.2bn, up 11% since the same time last year, and earnings of $2.01 per share, up 7%, according to consensus estimates. Analysts do not expect the global tariffs to create much of an impact for Alphabet, since they were mostly instituted after the end of the quarter. Alphabet is one of the world's most valuable companies, worth nearly $2trn.


OpenAI Wants to Go For-Profit. Experts Say Regulators Should Step In

TIME - Tech

In the latest development in an ongoing struggle over OpenAI's future direction--and potentially the future of artificial intelligence itself--dozens of prominent figures are urging the Attorneys General of California and Delaware to block OpenAI's controversial plan to convert from its unique nonprofit-controlled structure to a for-profit company. In a letter made public April 23, signatories including "AI Godfather" Geoffrey Hinton, Harvard legal professor Lawrence Lessig, and several former OpenAI researchers argue the move represents a fundamental betrayal of OpenAI's founding mission. "The proposed restructuring would eliminate essential safeguards, effectively handing control of, and profits from, what could be the most powerful technology ever created to a for-profit entity with legal duties to prioritize shareholder returns," the letter's authors write. It lands as OpenAI faces immense pressure from the other side: failing to implement the restructure by the end of the year could cost the company $20 billion and hamstring future fundraising. OpenAI was founded in 2015 as a non-profit, with its stated mission being to ensure that artificial general intelligence (AGI) "benefits all of humanity" rather than advancing "the private gain of any person."


James Bulger's mum seeks AI law to curb clips of murder victims

BBC News

There were plans to include measures to force social media companies to remove some "legal-but-harmful" content in the Online Safety Act, before it became law. But the proposals were scrapped over censorship concerns. Online safety campaigners argue the rules around removing harmful content needed tightening to close loopholes in the act. In January this year, Technology Secretary Peter Kyle told the BBC he had "inherited an unsatisfactory legislative settlement" in the Online Safety Act. "I'm very open-minded and I've said publicly, I think we'll have to legislate into the future again," Kyle said.


How Effective are Generative Large Language Models in Performing Requirements Classification?

arXiv.org Artificial Intelligence

In recent years, transformer-based large language models (LLMs) have revolutionised natural language processing (NLP), with generative models opening new possibilities for tasks that require context-aware text generation. Requirements engineering (RE) has also seen a surge in the experimentation of LLMs for different tasks, including trace-link detection, regulatory compliance, and others. Requirements classification is a common task in RE. While non-generative LLMs like BERT have been successfully applied to this task, there has been limited exploration of generative LLMs. This gap raises an important question: how well can generative LLMs, which produce context-aware outputs, perform in requirements classification? In this study, we explore the effectiveness of three generative LLMs (Bloom, Gemma, and Llama) in performing both binary and multi-class requirements classification. We design an extensive experimental study involving over 400 experiments across three widely used datasets (PROMISE NFR, Functional-Quality, and SecReq). Our study concludes that while factors like prompt design and LLM architecture are universally important, others, such as dataset variations, have a more situational impact, depending on the complexity of the classification task. This insight can guide future model development and deployment strategies, focusing on optimising prompt structures and aligning model architectures with task-specific needs for improved performance.


A Unified Retrieval Framework with Document Ranking and EDU Filtering for Multi-document Summarization

arXiv.org Artificial Intelligence

In the field of multi-document summarization (MDS), transformer-based models have demonstrated remarkable success, yet they suffer from an input length limitation. Current methods apply truncation after the retrieval process to fit the context length; however, they depend heavily on carefully hand-crafted queries, which are impractical to create for each document set for MDS. Additionally, these methods retrieve information at a coarse granularity, leading to the inclusion of irrelevant content. To address these issues, we propose a novel retrieval-based framework that integrates query selection, document ranking, and document shortening into a unified process. Our approach identifies the most salient elementary discourse units (EDUs) from input documents and utilizes them as latent queries. These queries guide the document ranking by calculating relevance scores. Instead of traditional truncation, our approach filters out irrelevant EDUs to fit the context length, ensuring that only critical information is preserved for summarization. We evaluate our framework on multiple MDS datasets, demonstrating consistent improvements in ROUGE metrics while confirming its scalability and flexibility across diverse model architectures. Additionally, we validate its effectiveness through an in-depth analysis, emphasizing its ability to dynamically select appropriate queries and accurately rank documents based on their relevance scores. These results demonstrate that our framework effectively addresses context-length constraints, establishing it as a robust and reliable solution for MDS.
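The contrast between truncation and EDU filtering can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how relevance is scored or how salient EDUs are chosen, so the sketch takes the latent queries as given, scores each EDU by bag-of-words cosine similarity, and counts whitespace tokens against the budget. The name `filter_edus` and all parameters are hypothetical.

```python
import math
from collections import Counter


def _cosine(tokens_a, tokens_b):
    """Bag-of-words cosine similarity between two token lists."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def filter_edus(docs, queries, budget):
    """Keep the most query-relevant EDUs that fit within a token budget.

    docs: list of documents, each a list of EDU strings (segmentation is
        assumed to have been done elsewhere).
    queries: list of latent-query strings.
    budget: maximum number of whitespace tokens to keep.
    Returns kept EDUs, ordered by document rank and then original position.
    """
    query_tokens = [q.lower().split() for q in queries]
    scored = []  # (score, doc index, EDU index, EDU text)
    for d, doc in enumerate(docs):
        for e, edu in enumerate(doc):
            toks = edu.lower().split()
            score = max((_cosine(toks, q) for q in query_tokens), default=0.0)
            scored.append((score, d, e, edu))

    # Rank documents by the summed relevance of their EDUs.
    doc_score = [0.0] * len(docs)
    for score, d, _, _ in scored:
        doc_score[d] += score
    rank = {d: r for r, d in enumerate(sorted(range(len(docs)), key=lambda i: -doc_score[i]))}

    # Greedily keep the highest-scoring EDUs; zero-score EDUs are dropped.
    kept, used = [], 0
    for score, d, e, edu in sorted(scored, key=lambda s: -s[0]):
        n = len(edu.split())
        if score > 0 and used + n <= budget:
            kept.append((rank[d], e, edu))
            used += n
    return [edu for _, _, edu in sorted(kept)]


docs = [
    ["the storm hit the coast", "tickets are on sale"],
    ["coast towns flooded after the storm", "a recipe for soup"],
]
summary_input = filter_edus(docs, ["storm coast"], budget=12)
```

With this toy input, the two storm-related EDUs survive and the unrelated ones are dropped, whereas plain truncation at 12 tokens would have kept "tickets are on sale" simply because it appears early.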