Law
Societal Alignment Frameworks Can Improve LLM Alignment
Stańczak, Karolina, Meade, Nicholas, Bhatia, Mehar, Zhou, Hattie, Böttinger, Konstantin, Barnes, Jeremy, Stanley, Jason, Montgomery, Jessica, Zemel, Richard, Papernot, Nicolas, Chapados, Nicolas, Therien, Denis, Lillicrap, Timothy P., Marasović, Ana, Delacroix, Sylvie, Hadfield, Gillian K., Reddy, Siva
Recent progress in large language models (LLMs) has focused on producing responses that meet human expectations and align with shared values, a process termed alignment. However, aligning LLMs remains challenging due to the inherent disconnect between the complexity of human values and the narrow nature of the technological approaches designed to address them. Current alignment methods often lead to misspecified objectives, reflecting the broader issue of incomplete contracts: the impracticality of specifying a contract between a model developer and the model that accounts for every scenario in LLM alignment. In this paper, we argue that improving LLM alignment requires incorporating insights from societal alignment frameworks, including social, economic, and contractual alignment, and discuss potential solutions drawn from these domains. Given the role of uncertainty within societal alignment frameworks, we then investigate how it manifests in LLM alignment. We end our discussion by offering an alternative view on LLM alignment, framing the underspecified nature of its objectives as an opportunity rather than seeking to perfect their specification. Beyond technical improvements in LLM alignment, we discuss the need for participatory alignment interface designs.
LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation
Li, Haitao, Chen, Yifan, Hu, Yiran, Ai, Qingyao, Chen, Junjie, Yang, Xiaoyu, Yang, Jianhui, Wu, Yueyue, Liu, Zeyang, Liu, Yiqun
Retrieval-augmented generation (RAG) has proven highly effective in improving large language models (LLMs) across various domains. However, there is no benchmark specifically designed to assess the effectiveness of RAG in the legal domain, which restricts progress in this area. To fill this gap, we propose LexRAG, the first benchmark to evaluate RAG systems for multi-turn legal consultations. LexRAG consists of 1,013 multi-turn dialogue samples and 17,228 candidate legal articles. Each sample is annotated by legal experts and consists of five rounds of progressive questioning. LexRAG includes two key tasks: (1) Conversational knowledge retrieval, requiring accurate retrieval of relevant legal articles based on multi-turn context. (2) Response generation, focusing on producing legally sound answers. To ensure reliable reproducibility, we develop LexiT, a legal RAG toolkit that provides a comprehensive implementation of RAG system components tailored for the legal domain. Additionally, we introduce an LLM-as-a-judge evaluation pipeline to enable detailed and effective assessment. Through experimental analysis of various LLMs and retrieval methods, we reveal the key limitations of existing RAG systems in handling legal consultation conversations. LexRAG establishes a new benchmark for the practical application of RAG systems in the legal domain, with its code and data available at https://github.com/CSHaitao/LexRAG.
Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization
Barron, Ryan C., Eren, Maksim E., Serafimova, Olga M., Matuszek, Cynthia, Alexandrov, Boian S.
Agentic Generative AI, powered by Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Vector Stores (VSs), represents a transformative technology applicable to specialized domains such as legal systems, research, recommender systems, cybersecurity, and global security, including proliferation research. This technology excels at inferring relationships within vast unstructured or semi-structured datasets. The legal domain here comprises complex data characterized by extensive, interrelated, and semi-structured knowledge systems with complex relations. It comprises constitutions, statutes, regulations, and case law. Extracting insights and navigating the intricate networks of legal documents and their relations is crucial for effective legal research. Here, we introduce a generative AI system that integrates RAG, VS, and KG, constructed via Non-Negative Matrix Factorization (NMF), to enhance legal information retrieval and AI reasoning and minimize hallucinations. In the legal system, these technologies empower AI agents to identify and analyze complex connections among cases, statutes, and legal precedents, uncovering hidden relationships and predicting legal trends: challenging tasks that are essential for ensuring justice and improving operational efficiency. Our system employs web scraping techniques to systematically collect legal texts, such as statutes, constitutional provisions, and case law, from publicly accessible platforms like Justia. It bridges the gap between traditional keyword-based searches and contextual understanding by leveraging advanced semantic representations, hierarchical relationships, and latent topic discovery. This framework supports legal document clustering, summarization, and cross-referencing, enabling scalable, interpretable, and accurate retrieval over semi-structured data while advancing computational law and AI.
How Much is Enough? The Diminishing Returns of Tokenization Training Data
Reddy, Varshini, Schmidt, Craig W., Pinter, Yuval, Tanner, Chris
Tokenization, a crucial initial step in natural language processing, is often assumed to benefit from larger training datasets. This paper investigates the impact of tokenizer training data sizes ranging from 1GB to 900GB. Our findings reveal diminishing returns as the data size increases, highlighting a practical limit on how much further scaling the training data can improve tokenization quality. We analyze this phenomenon and attribute the saturation effect to the constraints imposed by the pre-tokenization stage of tokenization. These results offer valuable insights for optimizing the tokenization process and highlight potential avenues for future research in tokenization algorithms.
Prioritise artists over tech in AI copyright debate, MPs say
Two cross-party committees of MPs have urged the government to prioritise ensuring that creators are fairly remunerated for their creative work over making it easy to train artificial intelligence models. The MPs argued there needed to be more transparency around the vast amounts of data used to train generative AI models, and urged the government not to press ahead with plans to require creators to opt out of having their data used. The chair of the culture, media and sport committee, Caroline Dinenage, said there had been a "groundswell of concern from across the creative industries" in response to the proposals, which "illustrates the scale of the threat artists face from artificial intelligence pilfering the fruits of their hard-earned success without permission". She added that making creative works "fair game unless creators say so" was akin to "burglars being allowed into your house unless there's a big sign on your front door expressly telling them that thievery isn't allowed". The letter warned that without this, "the biggest impact would be felt by the long tail of creators and journalists already operating under financial constraints".
Apple AI tool transcribed the word 'racist' as 'Trump'
Videos shared online show people speaking the word "racist" into the Dictation tool. Sometimes it is transcribed correctly - but on other occasions it is turned into "Trump", before being quickly restored to the correct word. The BBC has not been able to replicate the mistake, suggesting Apple's fix is already taking effect. Prof Bell said Apple's explanation of phonetic overlap did not make sense because the two words were not similar enough to confuse an artificial intelligence (AI) system. Speech-to-text recognition models are trained by inputting clips of real people speaking alongside an accurate transcript of what they say.
iPhone voice recognition controversy: 'Racist' converts to 'Trump'
Have you ever stumbled upon a video on social media that made you question the technology you use every day? That's exactly what happened to me recently, and it led me down a rabbit hole of unexpected discoveries about my iPhone's voice-to-text feature. It all began when I came across a TikTok video claiming that when using Apple's voice-to-text feature, saying the word "racist" would initially result in the word "Trump" being typed before quickly correcting itself. Intrigued and somewhat skeptical, I felt compelled to investigate this claim myself.
Apple to fix iPhone dictation bug that replaces word 'racist' with 'Trump'
Apple has promised to fix a bug in its iPhone automatic dictation tool after some users reported it had suggested to them "Trump" when they said the word "racist". The glitch was first highlighted in a viral post on TikTok, when the speech-to-text tool sometimes briefly flashed up the word "Trump" when they said "racist", and was later repeated by others on social media. "We are aware of an issue with the speech recognition model that powers dictation and we are rolling out a fix," an Apple spokesperson said. The company blamed the bug on its tool displaying words that have "phonetic overlap" before the "intended word" is identified, which in this case included words with the "r" consonant. However, the glitch caused outrage among some conservative commentators in the US, who have long accused big tech companies of political bias against those on the right.
Agentic Mixture-of-Workflows for Multi-Modal Chemical Search
Callahan, Tiffany J., Park, Nathaniel H., Capponi, Sara
The vast and complex materials design space demands innovative strategies to integrate multidisciplinary scientific knowledge and optimize materials discovery. While large language models (LLMs) have demonstrated promising reasoning and automation capabilities across various domains, their application in materials science remains limited due to a lack of benchmarking standards and practical implementation frameworks. To address these challenges, we introduce Mixture-of-Workflows for Self-Corrective Retrieval-Augmented Generation (CRAG-MoW) - a novel paradigm that orchestrates multiple agentic workflows employing distinct CRAG strategies using open-source LLMs. Unlike prior approaches, CRAG-MoW synthesizes diverse outputs through an orchestration agent, enabling direct evaluation of multiple LLMs across the same problem domain. We benchmark CRAG-MoWs across small molecules, polymers, and chemical reactions, as well as multi-modal nuclear magnetic resonance (NMR) spectral retrieval. Our results demonstrate that CRAG-MoWs achieve performance comparable to GPT-4o while being preferred more frequently in comparative evaluations, highlighting the advantage of structured retrieval and multi-agent synthesis. By revealing performance variations across data types, CRAG-MoW provides a scalable, interpretable, and benchmark-driven approach to optimizing AI architectures for materials discovery. These insights are pivotal in addressing fundamental gaps in benchmarking LLMs and autonomous AI agents for scientific applications.
(Mis)Fitting: A Survey of Scaling Laws
Li, Margaret, Kudugunta, Sneha, Zettlemoyer, Luke
Modern foundation models rely heavily on using scaling laws to guide crucial training decisions. Researchers often extrapolate the optimal architecture and hyperparameter settings from smaller training runs by describing the relationship between loss, or task performance, and scale. All components of this process vary, from the specific equation being fit, to the training setup, to the optimization method. Each of these factors may affect the fitted law and, therefore, the conclusions of a given study. We discuss discrepancies in the conclusions that several prior works reach on questions such as the optimal token-to-parameter ratio. We augment this discussion with our own analysis of the critical impact that changes in specific details may have on a scaling study, and the resulting altered conclusions. Additionally, we survey over 50 papers that study scaling trends: while 45 of these papers quantify these trends using a power law, most under-report crucial details needed to reproduce their findings. To mitigate this, we propose a checklist for authors to consider while contributing to scaling law research.
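As a rough illustration of what fitting a power law to small training runs can look like, here is a minimal sketch. It is not taken from the survey. It assumes the simplest form, loss = a * scale^(-b), fitted by ordinary least squares in log-log space. The function name `fit_power_law` and the data points are hypothetical and synthetic. As the abstract stresses, the choice of equation and fitting method is exactly the kind of detail that can change the conclusions, so this should be read as one possible setup rather than a recommended one.

```python
import math


def fit_power_law(scales, losses):
    """Fit loss ~= a * scale**(-b) by least squares on (log scale, log loss).

    Assumption: a pure power law with no irreducible-loss term.
    Returns the pair (a, b).
    """
    xs = [math.log(s) for s in scales]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form slope and intercept of the best-fit line in log-log space.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic "small runs": generated from a known law so the fit can be checked.
scales = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * s ** -0.1 for s in scales]

a, b = fit_power_law(scales, losses)

# Extrapolate to a larger scale that was never observed.
predicted_loss = a * 1e10 ** -b
print(f"a={a:.3f}, b={b:.3f}, predicted loss at 1e10: {predicted_loss:.4f}")
```

Fitting in log space weights relative errors equally across scales. Fitting the raw losses directly, or adding a constant offset term, would be a different modeling choice and may give different extrapolations.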