
Senators Urge Top Regulator to Stay Out of Prediction Market Lawsuits

WIRED

As prediction market platforms like Polymarket and Kalshi battle regulators in court, Senate Democrats are urging the CFTC to avoid weighing in, escalating a broader fight over the burgeoning industry. Senator Adam Schiff, a Democrat from California, is leading the group of lawmakers urging the CFTC to stay out of state prediction market lawsuits. A group of 23 Democratic US senators sent a letter Friday to the top federal regulator overseeing prediction markets, urging the agency to avoid weighing in on pending court cases over the legality of offerings on the platforms tied to "sports, war, and other prohibited events." Prediction markets, which sell contracts tied to the outcome of real-world developments, have exploded in popularity over the past year, attracting an increasingly mainstream fanbase eager to wager on everything from geopolitical conflicts to fashion choices to the Super Bowl. As they expanded, the platforms have become a magnet for ethical and legal controversies.


Mother of Elon Musk's child sues his AI company over Grok deepfake images

Al Jazeera

The mother of one of Elon Musk's children is suing his artificial intelligence company, saying its Grok chatbot allowed users to generate sexually exploitative deepfake images of her that have caused her humiliation and emotional distress. The lawsuit was filed just before California Attorney General Rob Bonta sent a cease-and-desist letter to Musk's xAI company demanding that it stop the creation and distribution of Grok-generated nonconsensual sexualised imagery. Ashley St Clair, a writer and political commentator, alleges in a lawsuit filed on Thursday in New York City against xAI that she was the victim of sexualised deepfake images generated by Grok. St Clair, who is the mother of Musk's 16-month-old son, Romulus, said she reported the images to Musk's X social media platform, which hosts Grok, after they began appearing last year and asked that they be removed. The platform replied that the images did not violate its policies, she said.


PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

Akyürek, Afra Feyza, Gosai, Advait, Zhang, Chen Bo Calvin, Gupta, Vipul, Jeong, Jaehwan, Gunjal, Anisha, Rabbani, Tahseen, Mazzone, Maria, Randolph, David, Meymand, Mohammad Mahmoudi, Chattha, Gurshaan, Rodriguez, Paula, Mares, Diego, Singh, Pavit, Liu, Michael, Chawla, Subodh, Cline, Pete, Ogaz, Lucy, Hernandez, Ernesto, Wang, Zihao, Bhatter, Pavi, Ayestaran, Marcos, Liu, Bing, He, Yunzhong

arXiv.org Artificial Intelligence

Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. Existing evaluations often fail to assess open-ended, economically consequential tasks in high-stakes domains like Legal and Finance, where practical returns are paramount. To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public rubric-based benchmark for both the legal and finance domains. We recruited 182 qualified professionals, each holding a JD, a CFA, or 6+ years of experience, who contributed tasks inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog the associated economic impacts of the prompts and analyze performance using human-annotated rubric categories. Our analysis shows that models with similar overall scores can diverge significantly on specific capabilities. Common failure modes include inaccurate judgments, a lack of process transparency, and incomplete reasoning, highlighting critical gaps in model reliability for professional adoption.
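Rubric-based evaluation of this kind can be pictured as a weighted checklist: each expert-authored criterion gets a weight and a per-response judgment, and the score is the weighted fraction satisfied. The sketch below is an illustrative assumption about how such scoring might be computed; the field names and weights are hypothetical, not PRBench's actual schema.

```python
# Minimal sketch of rubric-based scoring in the spirit of PRBench's
# expert-curated criteria. The Criterion fields and example weights are
# illustrative assumptions, not the benchmark's real format.
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # what a response must (or must not) contain
    weight: float      # relative importance assigned by the expert
    satisfied: bool    # judgment for a given model response

def rubric_score(criteria: list[Criterion]) -> float:
    """Weighted fraction of criteria satisfied, in [0, 1]."""
    total = sum(c.weight for c in criteria)
    if total == 0:
        return 0.0
    return sum(c.weight for c in criteria if c.satisfied) / total

criteria = [
    Criterion("Cites the controlling statute", 2.0, True),
    Criterion("Identifies the correct jurisdiction", 1.0, True),
    Criterion("Avoids fabricated case citations", 2.0, False),
]
print(rubric_score(criteria))  # 3.0 / 5.0 = 0.6
```

A score of 0.39 on the Hard subset would then mean that, on average, well under half of the expert-weighted criteria are met.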


Policy Cards: Machine-Readable Runtime Governance for Autonomous AI Agents

Mavračić, Juraj

arXiv.org Artificial Intelligence

Policy Cards are introduced as a machine-readable, deployment-layer standard for expressing operational, regulatory, and ethical constraints for AI agents. The Policy Card sits with the agent and enables it to follow required constraints at runtime. It tells the agent what it must and must not do. As such, it becomes an integral part of the deployed agent. Policy Cards extend existing transparency artifacts such as Model, Data, and System Cards by defining a normative layer that encodes allow/deny rules, obligations, evidentiary requirements, and crosswalk mappings to assurance frameworks including NIST AI RMF, ISO/IEC 42001, and the EU AI Act. Each Policy Card can be validated automatically, version-controlled, and linked to runtime enforcement or continuous-audit pipelines. The framework enables verifiable compliance for autonomous agents, forming a foundation for distributed assurance in multi-agent ecosystems. Policy Cards provide a practical mechanism for integrating high-level governance with hands-on engineering practice and enabling accountable autonomy at scale.
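The abstract's core idea, a normative artifact the agent consults at runtime, can be sketched as a small allow/deny check with attached obligations. The card structure and field names below are assumptions for illustration only; the paper defines its own schema.

```python
# Illustrative sketch of an agent consulting a Policy Card at runtime.
# Field names ("deny", "allow", "obligations") are hypothetical stand-ins,
# not the schema defined by the Policy Cards paper.
POLICY_CARD = {
    "version": "1.0",
    "deny": ["execute_payment", "delete_records"],    # actions always refused
    "allow": ["search", "summarize", "draft_email"],  # actions permitted
    "obligations": {"draft_email": ["log_to_audit_trail"]},  # duties an action triggers
}

def check_action(card: dict, action: str) -> tuple[bool, list[str]]:
    """Return (permitted, obligations) for a proposed agent action.
    Deny rules take precedence; unknown actions are refused by default."""
    if action in card["deny"]:
        return False, []
    if action in card["allow"]:
        return True, card["obligations"].get(action, [])
    return False, []  # default-deny anything not explicitly allowed

print(check_action(POLICY_CARD, "draft_email"))      # (True, ['log_to_audit_trail'])
print(check_action(POLICY_CARD, "execute_payment"))  # (False, [])
```

Because such a card is plain structured data, it can be schema-validated, version-controlled, and diffed in the same pipelines as code, which is what makes the continuous-audit linkage the abstract describes feasible.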


Ask What Your Country Can Do For You: Towards a Public Red Teaming Model

Kennedy, Wm. Matthew, Patlak, Cigdem, Dave, Jayraj, Chambers, Blake, Dhanotiya, Aayush, Ramiah, Darshini, Schwartz, Reva, Hagen, Jack, Kundu, Akash, Pendharkar, Mouni, Baisley, Liam, Skeadas, Theodora, Chowdhury, Rumman

arXiv.org Artificial Intelligence

AI systems have the potential to produce both benefits and harms, but without rigorous and ongoing adversarial evaluation, AI actors will struggle to assess the breadth and magnitude of the AI risk surface. Researchers from the field of systems design have developed several effective sociotechnical AI evaluation and red-teaming techniques targeting bias, hate speech, mis/disinformation, and other documented harm classes. However, as increasingly sophisticated AI systems are released into high-stakes sectors (such as education, healthcare, and intelligence-gathering), our current evaluation and monitoring methods are proving less and less capable of delivering effective oversight. To actually deliver responsible AI, and to ensure AI's harms are fully understood and its security vulnerabilities mitigated, pioneering new approaches to close this "responsibility gap" is now more urgent than ever. In this paper, we propose one such approach, the cooperative public AI red-teaming exercise, and discuss early results of its prior pilot implementations. This approach is intertwined with CAMLIS itself: the first in-person public demonstrator exercise was held in conjunction with CAMLIS 2024. We review the operational design and results of this exercise, the prior National Institute of Standards and Technology (NIST) Assessing the Risks and Impacts of AI (ARIA) pilot exercise, and another similar exercise conducted with the Singapore Infocomm Media Development Authority (IMDA). Ultimately, we argue that this approach is capable of delivering meaningful results and is also scalable to many AI-developing jurisdictions.



L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search

Wang, Ziqi, Yuan, Boqin

arXiv.org Artificial Intelligence

We present L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a system that reduces hallucination and uncertainty in legal question answering through coordinated multi-agent reasoning and retrieval. Unlike single-pass retrieval-augmented generation (RAG), L-MARS decomposes queries into subproblems, issues targeted searches across heterogeneous sources (Serper web search, local RAG, CourtListener case law), and employs a Judge Agent to verify sufficiency, jurisdiction, and temporal validity before answer synthesis. This iterative reasoning-search-verification loop maintains coherence, filters noisy evidence, and grounds answers in authoritative law. We evaluate L-MARS on LegalSearchQA, a new benchmark of 200 up-to-date multiple-choice legal questions from 2025. Results show that L-MARS substantially improves factual accuracy, reduces uncertainty, and achieves higher preference scores from both human experts and LLM-based judges. Our work demonstrates that multi-agent reasoning with agentic search offers a scalable and reproducible blueprint for deploying LLMs in high-stakes domains requiring precise legal retrieval and deliberation.
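The loop the abstract describes, decompose, search, judge sufficiency, and iterate until the Judge Agent is satisfied, can be sketched as plain control flow. Every function below is an illustrative stand-in injected as a parameter, not an actual L-MARS component.

```python
# Hedged sketch of an iterative reasoning-search-verification loop in the
# spirit of L-MARS. decompose/search/judge/synthesize are hypothetical
# stand-ins for the paper's agents, passed in as callables.
def answer_legal_query(query, decompose, search, judge, synthesize, max_rounds=3):
    evidence = []
    subproblems = decompose(query)            # break the query into subproblems
    for _ in range(max_rounds):
        for sub in subproblems:
            evidence.extend(search(sub))      # e.g. web, local RAG, case law
        verdict = judge(query, evidence)      # sufficiency / jurisdiction / recency
        if verdict["sufficient"]:
            return synthesize(query, evidence)
        subproblems = verdict["follow_up"]    # targeted re-search next round
    return synthesize(query, evidence)        # best effort after max rounds

# Toy stand-ins to show the control flow:
result = answer_legal_query(
    "Is X permitted?",
    decompose=lambda q: ["statute for X"],
    search=lambda s: [f"source on {s}"],
    judge=lambda q, ev: {"sufficient": len(ev) >= 2, "follow_up": ["case law on X"]},
    synthesize=lambda q, ev: f"answer from {len(ev)} sources",
)
print(result)  # answer from 2 sources
```

The key difference from single-pass RAG is the judge step: insufficient evidence triggers another targeted search round instead of forcing synthesis from whatever was retrieved first.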


Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

Henderson, Peter

Neural Information Processing Systems

Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material.