Senators Urge Top Regulator to Stay Out of Prediction Market Lawsuits

WIRED

As prediction market platforms like Polymarket and Kalshi battle regulators in court, Senate Democrats are urging the Commodity Futures Trading Commission (CFTC) to avoid weighing in, escalating a broader fight over the burgeoning industry. Senator Adam Schiff, a Democrat from California, is leading the group of lawmakers urging the CFTC to stay out of state prediction market lawsuits. A group of 23 Democratic US senators sent a letter Friday to the top federal regulator overseeing prediction markets, urging the agency to avoid weighing in on pending court cases over the legality of offerings on the platforms tied to "sports, war, and other prohibited events." Prediction markets, which sell contracts tied to the outcome of real-world developments, have exploded in popularity over the past year, attracting an increasingly mainstream fanbase eager to wager on everything from geopolitical conflicts to fashion choices to the Super Bowl. As they have expanded, the platforms have become a magnet for ethical and legal controversies.


PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

Akyürek, Afra Feyza, Gosai, Advait, Zhang, Chen Bo Calvin, Gupta, Vipul, Jeong, Jaehwan, Gunjal, Anisha, Rabbani, Tahseen, Mazzone, Maria, Randolph, David, Meymand, Mohammad Mahmoudi, Chattha, Gurshaan, Rodriguez, Paula, Mares, Diego, Singh, Pavit, Liu, Michael, Chawla, Subodh, Cline, Pete, Ogaz, Lucy, Hernandez, Ernesto, Wang, Zihao, Bhatter, Pavi, Ayestaran, Marcos, Liu, Bing, He, Yunzhong

arXiv.org Artificial Intelligence

Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. Existing evaluations often fail to assess open-ended, economically consequential tasks in high-stakes domains like Legal and Finance, where practical returns are paramount. To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public, rubric-based benchmark for both the legal and finance domains. We recruit 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contribute tasks inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog the associated economic impacts of the prompts and analyze performance using human-annotated rubric categories. Our analysis shows that models with similar overall scores can diverge significantly on specific capabilities. Common failure modes include inaccurate judgments, a lack of process transparency, and incomplete reasoning, highlighting critical gaps in their reliability for professional adoption.
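The rubric-based scoring the abstract describes can be pictured as: each task carries expert-curated criteria, a response is graded per criterion, and per-task scores are averaged into a benchmark score. The sketch below is a minimal illustration of that shape; the `Task` class, the string-match stand-in judge, and all names are assumptions, not PRBench's actual pipeline (which uses expert validation).

```python
# Illustrative sketch of rubric-based scoring: grade a response against
# each expert-curated criterion, then average. The keyword-match judge
# below is a toy stand-in for PRBench's expert/LLM grading.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    criteria: list  # expert-curated rubric items (strings here)

def grade_criterion(response: str, criterion: str) -> bool:
    # Toy judge: real grading would assess substantive satisfaction.
    return criterion.lower() in response.lower()

def score_task(response: str, task: Task) -> float:
    hits = sum(grade_criterion(response, c) for c in task.criteria)
    return hits / len(task.criteria)

def benchmark_score(responses, tasks):
    return sum(score_task(r, t) for r, t in zip(responses, tasks)) / len(tasks)
```

A per-criterion breakdown, rather than a single pass/fail, is what lets such benchmarks show that models with similar overall scores diverge on specific capabilities.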


Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

Henderson, Peter

Neural Information Processing Systems

Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material.


Policy Cards: Machine-Readable Runtime Governance for Autonomous AI Agents

Mavračić, Juraj

arXiv.org Artificial Intelligence

Policy Cards are introduced as a machine-readable, deployment-layer standard for expressing operational, regulatory, and ethical constraints for AI agents. The Policy Card sits with the agent and enables it to follow required constraints at runtime. It tells the agent what it must and must not do. As such, it becomes an integral part of the deployed agent. Policy Cards extend existing transparency artifacts such as Model, Data, and System Cards by defining a normative layer that encodes allow/deny rules, obligations, evidentiary requirements, and crosswalk mappings to assurance frameworks including NIST AI RMF, ISO/IEC 42001, and the EU AI Act. Each Policy Card can be validated automatically, version-controlled, and linked to runtime enforcement or continuous-audit pipelines. The framework enables verifiable compliance for autonomous agents, forming a foundation for distributed assurance in multi-agent ecosystems. Policy Cards provide a practical mechanism for integrating high-level governance with hands-on engineering practice and enabling accountable autonomy at scale.
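The abstract's core idea, a machine-readable normative layer of allow/deny rules and obligations checked at runtime, can be sketched minimally as below. The schema, field names, and default-deny choice are all invented for illustration and are not the actual Policy Card standard.

```python
# Hypothetical sketch of a deployment-layer policy check, loosely
# modeled on the allow/deny rules and obligations the paper describes.
# The card schema here is invented, not the Policy Card spec.
POLICY_CARD = {
    "version": "0.1",
    "deny": ["execute_trade", "delete_records"],      # hard prohibitions
    "allow": ["summarize_document", "draft_email"],   # permitted actions
    "obligations": {"draft_email": ["log_output"]},   # attached duties
}

def runtime_gate(action: str, card: dict):
    """Return (permitted, obligations) for a proposed agent action."""
    if action in card["deny"]:
        return False, []
    if action in card["allow"]:
        return True, card["obligations"].get(action, [])
    return False, []  # default-deny anything unlisted
```

Because such a card is plain structured data, it can be schema-validated, version-controlled, and diffed like any other artifact, which is what makes the continuous-audit pipelines the paper mentions plausible.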


Ask What Your Country Can Do For You: Towards a Public Red Teaming Model

Kennedy, Wm. Matthew, Patlak, Cigdem, Dave, Jayraj, Chambers, Blake, Dhanotiya, Aayush, Ramiah, Darshini, Schwartz, Reva, Hagen, Jack, Kundu, Akash, Pendharkar, Mouni, Baisley, Liam, Skeadas, Theodora, Chowdhury, Rumman

arXiv.org Artificial Intelligence

AI systems have the potential to produce both benefits and harms, but without rigorous and ongoing adversarial evaluation, AI actors will struggle to assess the breadth and magnitude of the AI risk surface. Researchers from the field of systems design have developed several effective sociotechnical AI evaluation and red teaming techniques targeting bias, hate speech, mis/disinformation, and other documented harm classes. However, as increasingly sophisticated AI systems are released into high-stakes sectors (such as education, healthcare, and intelligence-gathering), our current evaluation and monitoring methods are proving less and less capable of delivering effective oversight. In order to actually deliver responsible AI and to ensure AI's harms are fully understood and its security vulnerabilities mitigated, pioneering new approaches to close this "responsibility gap" is now more urgent than ever. In this paper, we propose one such approach, the cooperative public AI red-teaming exercise, and discuss early results of its prior pilot implementations. This approach is intertwined with CAMLIS itself: the first in-person public demonstrator exercise was held in conjunction with CAMLIS 2024. We review the operational design and results of this exercise, the prior Assessing the Risks and Impacts of AI (ARIA) pilot exercise run by the National Institute of Standards and Technology (NIST), and another similar exercise conducted with the Singapore Infocomm Media Development Authority (IMDA). Ultimately, we argue that this approach is both capable of delivering meaningful results and scalable to many AI-developing jurisdictions.



L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search

Wang, Ziqi, Yuan, Boqin

arXiv.org Artificial Intelligence

We present L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a system that reduces hallucination and uncertainty in legal question answering through coordinated multi-agent reasoning and retrieval. Unlike single-pass retrieval-augmented generation (RAG), L-MARS decomposes queries into subproblems, issues targeted searches across heterogeneous sources (Serper web, local RAG, CourtListener case law), and employs a Judge Agent to verify sufficiency, jurisdiction, and temporal validity before answer synthesis. This iterative reasoning-search-verification loop maintains coherence, filters noisy evidence, and grounds answers in authoritative law. We evaluated L-MARS on LegalSearchQA, a new benchmark of 200 up-to-date multiple choice legal questions in 2025. Results show that L-MARS substantially improves factual accuracy, reduces uncertainty, and achieves higher preference scores from both human experts and LLM-based judges. Our work demonstrates that multi-agent reasoning with agentic search offers a scalable and reproducible blueprint for deploying LLMs in high-stakes domains requiring precise legal retrieval and deliberation.
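The iterative reasoning-search-verification loop the abstract describes has a simple control-flow skeleton: decompose the query, gather evidence per subproblem, and let a judge decide whether the evidence suffices before synthesis. The sketch below is only that skeleton; every function body is a stub standing in for the real agents (which query Serper, local RAG, and CourtListener, and check jurisdiction and temporal validity), and none of the names are L-MARS's code.

```python
# Hedged sketch of an iterative reasoning-search-verification loop,
# as described for L-MARS. All stubs are illustrative assumptions.
def decompose(query):
    return [query]  # stand-in: the real system splits into subproblems

def search(subproblem, round_no):
    # Stand-in for targeted search across heterogeneous sources.
    return [f"evidence for {subproblem!r} (round {round_no})"]

def judge_sufficient(evidence):
    # Stand-in for the Judge Agent's sufficiency/jurisdiction/recency check.
    return len(evidence) >= 2

def answer_query(query, max_rounds=3):
    evidence = []
    for round_no in range(1, max_rounds + 1):
        for sub in decompose(query):
            evidence += search(sub, round_no)
        if judge_sufficient(evidence):
            break  # only synthesize once the judge approves the evidence
    return {"answer": f"grounded in {len(evidence)} sources",
            "evidence": evidence}
```

The key design point over single-pass RAG is visible even in the skeleton: synthesis is gated behind an explicit verification step, so insufficient or stale evidence triggers another search round instead of an answer.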


OLG++: A Semantic Extension of Obligation Logic Graph

Dasgupta, Subhasis, Stephens, Jon, Gupta, Amarnath

arXiv.org Artificial Intelligence

We present OLG++, a semantic extension of the Obligation Logic Graph (OLG) for modeling regulatory and legal rules in municipal and interjurisdictional contexts. OLG++ introduces richer node and edge types, including spatial, temporal, party group, defeasibility, and logical grouping constructs, enabling nuanced representations of legal obligations, exceptions, and hierarchies. The model supports structured reasoning over rules with contextual conditions, precedence, and complex triggers. We demonstrate its expressiveness through examples from food business regulations, showing how OLG++ supports legal question answering using property graph queries. OLG++ also improves over LegalRuleML by providing native support for subClassOf, spatial constraints, and reified exception structures. Our examples show that OLG++ is more expressive than prior graph-based models for legal knowledge representation.
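Two of the constructs the abstract highlights, subClassOf party groups and defeasible exceptions, can be illustrated with a toy obligation graph in the food-business setting it mentions. The dict-based structure, rule content, and field names below are invented for illustration; OLG++ itself targets property graphs queried with graph query languages, not Python dicts.

```python
# Toy sketch of OLG++-style obligation nodes with subClassOf party
# groups and defeasible exception edges. All rules are invented.
SUBCLASS = {"cottage_food_operator": "food_business"}  # subClassOf edge

RULES = {
    "R1": {"party": "food_business", "obligation": "obtain_health_permit",
           "defeated_by": ["R2"]},
    "R2": {"party": "cottage_food_operator", "obligation": None,  # exception node
           "defeated_by": []},
}

def is_a(party, group):
    """Walk subClassOf edges upward to test group membership."""
    while party is not None:
        if party == group:
            return True
        party = SUBCLASS.get(party)
    return False

def applies(rule, party):
    return is_a(party, rule["party"])

def obligations(party):
    out = []
    for rule in RULES.values():
        if rule["obligation"] is None or not applies(rule, party):
            continue
        if any(applies(RULES[r], party) for r in rule["defeated_by"]):
            continue  # a more specific exception defeats this rule
        out.append(rule["obligation"])
    return out
```

Here a generic food business inherits the permit obligation, while a cottage food operator, a subclass, is carved out by the exception node, which is the precedence-over-context behavior the model is built to express.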


IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders

Deshmukh, Sneha, Kamble, Prathmesh

arXiv.org Artificial Intelligence

Legal NLP remains underdeveloped in regions like India due to the scarcity of structured datasets. We introduce IndianBailJudgments-1200, a new benchmark dataset comprising 1200 Indian court judgments on bail decisions, annotated across 20+ attributes including bail outcome, IPC sections, crime type, and legal reasoning. Annotations were generated using a prompt-engineered GPT-4o pipeline and verified for consistency. This resource supports a wide range of legal NLP tasks such as outcome prediction, summarization, and fairness analysis, and is the first publicly available dataset focused specifically on Indian bail jurisprudence.