Law
Finance Language Model Evaluation (FLaME)
Matlin, Glenn, Okamoto, Mika, Pardawala, Huzaifa, Yang, Yang, Chava, Sudheer
Language Models (LMs) have demonstrated impressive capabilities on core Natural Language Processing (NLP) tasks. The effectiveness of LMs for highly specialized, knowledge-intensive tasks in finance remains difficult to assess because of major gaps in the methodologies of existing evaluation frameworks, which have led to the erroneous belief that LMs' performance on common Finance NLP (FinNLP) tasks is far lower than it actually is. To demonstrate the potential of LMs for these FinNLP tasks, we present the first holistic benchmarking suite for Financial Language Model Evaluation (FLaME). This is the first research paper to comprehensively study LMs against 'reasoning-reinforced' LMs, with an empirical study of 23 foundation LMs over 20 core NLP tasks in finance. We open-source our framework software along with all data and results.
'Wall-E With a Gun': Midjourney Generates Videos of Disney Characters Amid Massive Copyright Lawsuit
It's been a busy month for Midjourney. This week, the generative AI startup released its sophisticated new video tool, V1, which lets users make short animated clips from images they generate or upload. The current version of Midjourney's AI video tool requires an image as a starting point; generating videos using text-only prompts is not supported. Midjourney did not immediately respond to requests for comment. Disney and Universal reiterated statements made by their executives about the lawsuit, including Disney's legal head Horacio Gutierrez alleging that Midjourney's output amounts to "piracy."
BBC threatens AI firm with legal action over unauthorised content use
The BBC's legal threat has been made in a letter to Perplexity's boss Aravind Srinivas. The BBC also cited its research published earlier this year that found four popular AI chatbots - including Perplexity AI - were inaccurately summarising news stories, including some BBC content. Pointing to findings of significant issues with representation of BBC content in some Perplexity AI responses analysed, it said such output fell short of BBC Editorial Guidelines around the provision of impartial and accurate news. "It is therefore highly damaging to the BBC, injuring the BBC's reputation with audiences - including UK licence fee payers who fund the BBC - and undermining their trust in the BBC," it added.
This May Be Trump's Most Consequential Decision Yet
This week, Emily Bazelon, John Dickerson, and David Plotz discuss whether the US should join Israel's war on Iran, the tragic Minnesota assassinations and why US political violence is surging now, and the Supreme Court's unsurprising but willfully obtuse decision to uphold Tennessee's youth transgender care ban. Here are some notes and references from this week's show: Alexander Ward, Lara Seligman, and Dustin Volz for The Wall Street Journal (Exclusive): Israel Built Its Case for War With Iran on New Intelligence. The U.S. Didn't Buy It. Thomas L. Friedman for The New York Times (Opinion): The Smart Way for Trump to End the Israel-Iran War; Oren Cass for Understanding America (Substack): Is Israel the Ideal "America First" Ally?; Warren P. Strobel, Alex Horton, and Abigail Hauslohner for the Washington Post: Navigating Iran crisis, Trump relies on experience over star power; Amy Howe for SCOTUSblog: Court upholds Tennessee's ban on certain medical treatments for transgender minors; Abbie VanSickle for The New York Times: Sotomayor Writes the Court 'Abandons' Transgender Children to 'Political Whims'; Ella Lee for The Hill: Clarence Thomas urges courts to end deferring to 'experts' on gender-affirming care; Ian Millhiser for Vox: The Supreme Court's incoherent new attack on trans rights, explained. Here are this week's chatters: Emily: A Family Matter by Claire Lynch; The Fall of Affirmative Action: Race, the Supreme Court, and the Future of Higher Education by Justin Driver; A Flower Traveled in My Blood: The Incredible True Story of the Grandmothers Who Fought to Find a Stolen Generation of Children by Haley Cohen Gilliland. John: Mary Cunningham for CBS News: Federal Reserve holds its benchmark interest rate steady at today's FOMC meeting; ABA Banking Journal: Fed's Powell says some areas of U.S. may be 'uninsurable' in next decade. David: Trip Gabriel for the New York Times: William Langewiesche, the 'Steve McQueen of Journalism,' Dies at 70. For this week's Slate Plus bonus episode, Emily, John, and David discuss the exciting possibilities and likely limitations of using AI tools for historical research and writing.
How 3D-printed guns are spreading online
We did not proceed with the transaction to test Jessy's claims. While his casual attitude suggested he might have been a scammer, his ability to advertise on Meta and operate on Telegram highlights apparent loopholes that real gun dealers could exploit. When contacted, Meta told the BBC that the adverts we highlighted had been "automatically disabled in line with our policies", and that inclusion in its ad library "doesn't necessarily mean the ad is still live or visible". Telegram said that Jessy's account had been proactively removed for breaching its policies. A spokesperson added: "The sale of weapons is explicitly forbidden by Telegram's terms of service and is removed whenever discovered. Moderators empowered with custom AI and machine learning tools proactively monitor public parts of the platform and accept reports in order to remove millions of pieces of harmful content each day, including the sale of weapons."
Identifying social isolation themes in NVDRS text narratives using topic modeling and text-classification methods
Walker, Drew, Rajwal, Swati, Das, Sudeshna, Peddireddy, Snigdha, Sarker, Abeed
Social isolation and loneliness, which have been increasing in recent years, strongly contribute toward suicide rates. Although social isolation and loneliness are not currently recorded within the US National Violent Death Reporting System's (NVDRS) structured variables, natural language processing (NLP) techniques can be used to identify these constructs in law enforcement and coroner/medical examiner narratives. Using topic modeling for lexicon development and supervised learning classifiers, we developed high-quality classifiers (average F1: .86, accuracy: .82). Evaluating over 300,000 suicides from 2002 to 2020, we identified 1,198 mentioning chronic social isolation. Decedents had higher odds of chronic social isolation classification if they were men (OR = 1.44; CI: 1.24, 1.69, p<.0001), gay (OR = 3.68; 1.97, 6.33, p<.0001), or divorced (OR = 3.34; 2.68, 4.19, p<.0001). We found significant predictors for other social isolation topics of recent or impending divorce, child custody loss, eviction or recent move, and break-up. Our methods can improve surveillance and prevention of social isolation and loneliness in the United States.
Preparing for the Intelligence Explosion
MacAskill, William, Moorhouse, Fin
AI that can accelerate research could drive a century of technological progress over just a few years. During such a period, new technological or political developments will raise consequential and hard-to-reverse decisions, in rapid succession. We call these developments grand challenges. These challenges include new weapons of mass destruction, AI-enabled autocracies, races to grab offworld resources, and digital beings worthy of moral consideration, as well as opportunities to dramatically improve quality of life and collective decision-making. We argue that these challenges cannot always be delegated to future AI systems, and suggest things we can do today to meaningfully improve our prospects. AGI preparedness is therefore not just about ensuring that advanced AI systems are aligned: we should be preparing, now, for the disorienting range of developments an intelligence explosion would bring.
Position Paper: Rethinking Privacy in RL for Sequential Decision-making in the Age of LLMs
Fan, Flint Xiaofeng, Tan, Cheston, Wattenhofer, Roger, Ong, Yew-Soon
The rise of reinforcement learning (RL) in critical real-world applications demands a fundamental rethinking of privacy in AI systems. Traditional privacy frameworks, designed to protect isolated data points, fall short for sequential decision-making systems where sensitive information emerges from temporal patterns, behavioral strategies, and collaborative dynamics. Modern RL paradigms, such as federated RL (FedRL) and RL with human feedback (RLHF) in large language models (LLMs), exacerbate these challenges by introducing complex, interactive, and context-dependent learning environments that traditional methods do not address. In this position paper, we argue for a new privacy paradigm built on four core principles: multi-scale protection, behavioral pattern protection, collaborative privacy preservation, and context-aware adaptation. These principles expose inherent tensions between privacy, utility, and interpretability that must be navigated as RL systems become more pervasive in high-stakes domains like healthcare, autonomous vehicles, and decision support systems powered by LLMs. To tackle these challenges, we call for the development of new theoretical frameworks, practical mechanisms, and rigorous evaluation methodologies that collectively enable effective privacy protection in sequential decision-making systems.
Unlocking Post-hoc Dataset Inference with Synthetic Data
Zhao, Bihe, Maini, Pratyush, Boenisch, Franziska, Dziedzic, Adam
The remarkable capabilities of Large Language Models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners' intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set, known to be absent from training, that closely matches the compromised dataset's distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method's reliability for real-world litigations. Our code is available at https://github.com/sprintml/PostHocDatasetInference.
A Comparative Study of Task Adaptation Techniques of Large Language Models for Identifying Sustainable Development Goals
Cadeddu, Andrea, Chessa, Alessandro, De Leo, Vincenzo, Fenu, Gianni, Motta, Enrico, Osborne, Francesco, Recupero, Diego Reforgiato, Salatino, Angelo, Secchi, Luca
In 2015, the United Nations adopted 17 Sustainable Development Goals (SDGs) aimed at creating a more sustainable and improved future by 2030. However, tracking progress toward these goals is difficult because of the extensive scale and complexity of the data involved. Text classification models have become vital tools in this area, automating the analysis of vast amounts of text from a variety of sources. Additionally, large language models (LLMs) have recently proven indispensable for many natural language processing tasks, including text classification, thanks to their ability to recognize complex linguistic patterns and semantics. This study analyzes various proprietary and open-source LLMs for a single-label, multi-class text classification task focused on the SDGs. It also evaluates the effectiveness of task adaptation techniques (i.e., in-context learning approaches), namely Zero-Shot and Few-Shot Learning, as well as Fine-Tuning within this domain. The results reveal that smaller models, when optimized through prompt engineering, can perform on par with larger models like OpenAI's GPT (Generative Pre-trained Transformer).