Law
Extracting memorized pieces of (copyrighted) books from open-weight language models
Cooper, A. Feder, Gokaslan, Aaron, Ahmed, Ahmed, Cyphert, Amy B., De Sa, Christopher, Lemley, Mark A., Ho, Daniel E., Liang, Percy
Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expression in their training data. Drawing on both machine learning and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. To do so, we extend a recent probabilistic extraction technique to measure memorization of 50 books in 17 open-weight LLMs. Through thousands of experiments, we show that the extent of memorization varies both by model and by book. With respect to our specific extraction methodology, we find that most LLMs do not memorize most books -- either in whole or in part. However, we also find that Llama 3.1 70B entirely memorizes some books, like the first Harry Potter book and 1984. In fact, the first Harry Potter is so memorized that, using a seed prompt consisting of just the first few tokens of the first chapter, we can deterministically generate the entire book near-verbatim. We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.
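The abstract does not spell out the probabilistic extraction technique, so what follows is only a minimal sketch of one common framing, assumed here for illustration: the chance that a model emits a specific suffix after a prefix is the product of its per-token conditional probabilities, and the chance of seeing that suffix at least once in n independent samples follows from it. The function names and input format are hypothetical and are not the authors' code.

```python
import math

def sequence_probability(token_logprobs):
    """Probability of one exact continuation, given the log-probability
    the model assigns to each of its tokens in turn (assumed input)."""
    return math.exp(sum(token_logprobs))

def prob_extracted_within(p, n):
    """Probability of generating the continuation at least once
    in n independent sampling attempts, each succeeding with probability p."""
    return 1.0 - (1.0 - p) ** n

# Toy example: a two-token suffix, each token assigned probability 0.5.
p = sequence_probability([math.log(0.5), math.log(0.5)])
at_least_once = prob_extracted_within(p, 2)
```

Under this framing, a passage counts as more strongly memorized when p is large, since fewer samples are then needed to reproduce it.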
How Ukraine turned into the world's drone testing lab
The use of drones in the Russia-Ukraine war has revolutionised an industry of death and destruction. The rapid development of drone technology has changed how wars are fought.
Jorja Smith's record label hits out at 'AI clone' song
Brit Award-winning singer Jorja Smith's record label has said it wants a share of the royalties for a song it claims was created using an artificial intelligence clone of the singer's voice. I Run by British dance act Haven went viral on TikTok in October thanks, in part, to smooth soul vocals by an uncredited female singer. Although I Run has now been re-released with new vocals, Smith's label FAMM said it believes the track was made with AI trained on her work, and is seeking compensation. "It's bigger than one artist or one song," FAMM wrote in a statement on Instagram. The label said it believes both versions of the track infringe on Jorja's rights and unfairly take advantage of the work of all the songwriters with whom she collaborates.
From 'dinosaur tartare' to seaweed butter - would you try any of these dishes created by the world's first AI chef?
Are LLMs Good Safety Agents or a Propaganda Engine?
Yadav, Neemesh, Ortu, Francesco, Liu, Jiarui, Yook, Joeun, Schölkopf, Bernhard, Mihalcea, Rada, Cazzaniga, Alberto, Jin, Zhijing
Large Language Models (LLMs) are trained to refuse to respond to harmful content. However, a systematic analysis of whether this behavior truly reflects their safety policies or instead indicates political censorship, as practiced by countries worldwide, is lacking. Differentiating between safety-influenced refusals and politically motivated censorship is hard. For this purpose we introduce PSP, a dataset built specifically to probe the refusal behaviors of LLMs in an explicitly political context. PSP is built by formatting existing censored content from two data sources, openly available on the internet: sensitive prompts in China generalized to multiple countries, and tweets that have been censored in various countries. We study: 1) the impact of political sensitivity in seven LLMs through data-driven (making PSP implicit) and representation-level (erasing the concept of politics) approaches; and 2) the vulnerability of models on PSP through prompt injection attacks (PIAs). Associating censorship with refusals on content with masked implicit intent, we find that most LLMs perform some form of censorship. We conclude by summarizing major attributes that can cause a shift in refusal distributions across models and contexts of different countries.
Machine learning for violence prediction: a systematic review and critical appraisal
Kozhevnikova, Stefaniya, Yukhnenko, Denis, Scola, Giulio, Fazel, Seena
Purpose: To conduct a systematic review of machine learning models for predicting violent behaviour by synthesising and appraising their validity, usefulness, and performance. Methods: We systematically searched nine bibliographic databases and Google Scholar up to September 2025 for development and/or validation studies on machine learning methods for predicting all forms of violent behaviour. We synthesised the results by summarising discrimination and calibration performance statistics and evaluated study quality by examining risk of bias and clinical utility. Results: We identified 38 studies reporting the development and validation of 40 models. Most studies reported Area Under the Curve (AUC) as the discrimination statistic with a range of 0.68-0.99. Only eight studies reported calibration performance, and three studies reported external validation. 31 studies had a high risk of bias, mainly in the analysis domain, and three studies had low risk of bias. The overall clinical utility of violence prediction models is poor, as indicated by risks of overfitting due to small samples, lack of transparent reporting, and low generalisability. Conclusion: Although black box machine learning models currently have limited applicability in clinical settings, they may show promise for identifying high-risk individuals. We recommend five key considerations for violence prediction modelling: (i) ensuring methodological quality (e.g. following guidelines) and interdisciplinary collaborations; (ii) using black box algorithms only for highly complex data; (iii) incorporating dynamic predictions to allow for risk monitoring; (iv) developing more trustworthy algorithms using explainable methods; and (v) applying causal machine learning approaches where appropriate.
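For readers unfamiliar with the discrimination statistic mentioned above, the AUC can be read as the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case, with ties counted as half. The sketch below computes it directly from that definition. The scores are made-up illustrative values, not data from the review.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison:
    fraction of (positive, negative) pairs ranked correctly, ties count 0.5."""
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Illustrative predicted risks (hypothetical values).
violent = [0.9, 0.8, 0.4]
non_violent = [0.5, 0.3]
value = auc(violent, non_violent)  # 5 of 6 pairs are ranked correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The statistic says nothing about calibration, which is why the review tracks calibration separately.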
JBE-QA: Japanese Bar Exam QA Dataset for Assessing Legal Domain Knowledge
Cao, Zhihan, Nishino, Fumihito, Yamada, Hiroaki, Thanh, Nguyen Ha, Miyao, Yusuke, Satoh, Ken
We introduce JBE-QA, a Japanese Bar Exam Question-Answering dataset to evaluate large language models' legal knowledge. Derived from the multiple-choice (tanto-shiki) section of the Japanese bar exam (2015-2024), JBE-QA provides the first comprehensive benchmark for Japanese legal-domain evaluation of LLMs. It covers the Civil Code, the Penal Code, and the Constitution, extending beyond the Civil Code focus of prior Japanese resources. Each question is decomposed into independent true/false judgments with structured contextual fields. The dataset contains 3,464 items with balanced labels. We evaluate 26 LLMs, including proprietary, open-weight, Japanese-specialised, and reasoning models. Our results show that proprietary models with reasoning enabled perform best, and the Constitution questions are generally easier than the Civil Code or the Penal Code questions.
RAG System for Supporting Japanese Litigation Procedures: Faithful Response Generation Complying with Legal Norms
Ishihara, Yuya, Keyaki, Atsushi, Yamada, Hiroaki, Ohara, Ryutaro, Sumida, Mihoko
This study discusses the essential components that a Retrieval-Augmented Generation (RAG)-based LLM system should possess in order to support Japanese medical litigation procedures complying with legal norms. In litigation, expert commissioners, such as physicians, architects, accountants, and engineers, provide specialized knowledge to help judges clarify points of dispute. When considering the substitution of these expert roles with a RAG-based LLM system, the constraint of strict adherence to legal norms is imposed. Specifically, three requirements arise: (1) the retrieval module must retrieve appropriate external knowledge relevant to the disputed issues in accordance with the principle prohibiting the use of private knowledge, (2) the responses generated must originate from the context provided by the RAG and remain faithful to that context, and (3) the retrieval module must reference external knowledge with appropriate timestamps corresponding to the issues at hand. This paper discusses the design of a RAG-based LLM system that satisfies these requirements.
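Requirement (3) can be illustrated with a small sketch. It assumes, purely for illustration, that each retrievable document carries a validity interval and that the disputed issue has a single reference date. The `Document` fields and the `filter_by_issue_date` helper are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class Document:
    doc_id: str
    text: str
    valid_from: date                 # when this knowledge came into effect (assumed field)
    valid_to: Optional[date] = None  # None means it is still current (assumed field)

def filter_by_issue_date(docs: List[Document], issue_date: date) -> List[Document]:
    """Keep only documents whose validity interval covers the date of the disputed issue."""
    return [
        d for d in docs
        if d.valid_from <= issue_date
        and (d.valid_to is None or issue_date <= d.valid_to)
    ]

docs = [
    Document("guideline-2010", "older guidance", date(2010, 1, 1), date(2017, 12, 31)),
    Document("guideline-2018", "revised guidance", date(2018, 1, 1)),
]
# Only knowledge that applied on the date at issue is passed on to generation.
relevant = filter_by_issue_date(docs, date(2015, 6, 1))
```

Filtering before generation, rather than asking the model to disregard out-of-date sources, keeps this constraint in the retrieval module, which is where the abstract places it.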
RefineBench: Evaluating Refinement Capability of Language Models via Checklists
Lee, Young-Jun, Kim, Seungone, Lee, Byung-Kwan, Moon, Minkyeong, Hwang, Yechan, Kim, Jong Myoung, Neubig, Graham, Welleck, Sean, Choi, Ho-Jin
Can language models (LMs) self-refine their own responses? This question is increasingly relevant as a wide range of real-world user interactions involve refinement requests. However, prior studies have largely tested LMs' refinement abilities on verifiable tasks such as competition math or symbolic reasoning with simplified scaffolds, whereas users often pose open-ended queries and provide varying degrees of feedback on what they desire. The recent advent of reasoning models that exhibit self-reflection patterns in their chains-of-thought further motivates this question. To analyze this, we introduce RefineBench, a benchmark of 1,000 challenging problems across 11 domains paired with a checklist-based evaluation framework. We evaluate two refinement modes: (1) guided refinement, where an LM is provided natural language feedback, and (2) self-refinement, where LMs attempt to improve without guidance. In the self-refinement setting, even frontier LMs such as Gemini 2.5 Pro and GPT-5 achieve modest baseline scores of 31.3% and 29.1%, respectively, and most models fail to consistently improve across iterations (e.g., Gemini 2.5 Pro gains only +1.8%, while DeepSeek-R1 declines by 0.1%). By contrast, in guided refinement, both proprietary LMs and large open-weight LMs (>70B) can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses, and that RefineBench provides a valuable testbed for tracking progress.
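The abstract does not say how checklist results are aggregated into a score. As a rough sketch under one simple assumption, a response's score is taken to be the fraction of checklist items it satisfies, and refinement gain is the change in that fraction between the first and last turn. Both helper names are hypothetical.

```python
from typing import List

def checklist_score(item_results: List[bool]) -> float:
    """Fraction of checklist items a response satisfies (assumed aggregation)."""
    if not item_results:
        raise ValueError("checklist must contain at least one item")
    return sum(item_results) / len(item_results)

def refinement_gain(turns: List[List[bool]]) -> float:
    """Score change from the first turn to the last turn of refinement."""
    return checklist_score(turns[-1]) - checklist_score(turns[0])

# Toy trajectory: one response judged against a four-item checklist over three turns.
turns = [
    [True, False, False, False],
    [True, True, False, False],
    [True, True, True, False],
]
gain = refinement_gain(turns)  # 0.75 - 0.25
```

A gain close to zero across turns would correspond to the flat self-refinement behaviour described above, while guided refinement would show the score approaching 1.0.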