Law
The Consistency-Acceptability Divergence of LLMs in Judicial Decision-Making: Task and Stakeholder Dimensions
The integration of large language model (LLM) technology into judicial systems is fundamentally transforming legal practice worldwide. However, this global transformation has revealed an urgent paradox requiring immediate attention. This study introduces the concept of ``consistency-acceptability divergence'' for the first time, referring to the gap between technical consistency and social acceptance. While LLMs achieve high consistency at the technical level, this consistency demonstrates both positive and negative effects. Through comprehensive analysis of recent data on LLM judicial applications from 2023--2025, this study finds that addressing this challenge requires understanding both task and stakeholder dimensions. This study proposes the Dual-Track Deliberative Multi-Role LLM Judicial Governance Framework (DTDMR-LJGF), which enables intelligent task classification and meaningful interaction among diverse stakeholders. This framework offers both theoretical insights and practical guidance for building an LLM judicial ecosystem that balances technical efficiency with social legitimacy.
Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact
Qureshi, Rizwan, Sapkota, Ranjan, Shah, Abbas, Muneer, Amgad, Zafar, Anas, Vayani, Ashmal, Shoman, Maged, Eldaly, Abdelrahman B. M., Zhang, Kai, Sadak, Ferhat, Raza, Shaina, Fan, Xinqi, Shwartz-Ziv, Ravid, Yan, Hong, Jain, Vinjia, Chadha, Aman, Karkee, Manoj, Wu, Jia, Mirjalili, Seyedali
Can machines truly think, reason and act in domains like humans? This enduring question continues to shape the pursuit of Artificial General Intelligence (AGI). Despite the growing capabilities of models such as GPT-4.5, DeepSeek, Claude 3.5 Sonnet, Phi-4, and Grok 3, which exhibit multimodal fluency and partial reasoning, these systems remain fundamentally limited by their reliance on token-level prediction and lack of grounded agency. This paper offers a cross-disciplinary synthesis of AGI development, spanning artificial intelligence, cognitive neuroscience, psychology, generative models, and agent-based systems. We analyze the architectural and cognitive foundations of general intelligence, highlighting the role of modular reasoning, persistent memory, and multi-agent coordination. In particular, we emphasize the rise of Agentic RAG frameworks that combine retrieval, planning, and dynamic tool use to enable more adaptive behavior. We discuss generalization strategies, including information compression, test-time adaptation, and training-free methods, as critical pathways toward flexible, domain-agnostic intelligence. Vision-Language Models (VLMs) are reexamined not just as perception modules but as evolving interfaces for embodied understanding and collaborative task completion. We also argue that true intelligence arises not from scale alone but from the integration of memory and reasoning: an orchestration of modular, interactive, and self-improving components where compression enables adaptive behavior. Drawing on advances in neurosymbolic systems, reinforcement learning, and cognitive scaffolding, we explore how recent architectures begin to bridge the gap between statistical learning and goal-directed cognition. Finally, we identify key scientific, technical, and ethical challenges on the path to AGI.
Judges Don't Know What AI's Book Piracy Means
More than 40 lawsuits have been filed against AI companies since 2022. Late last month, there were rulings on two of these cases, first in a lawsuit against Anthropic and, two days later, in one against Meta. Both of the cases were brought by book authors who alleged that AI companies had trained large language models using authors' work without consent or compensation. In each case, the judges decided that the tech companies were engaged in "fair use" when they trained their models with authors' books. Both judges said that the use of these books was "transformative"--that training an LLM resulted in a fundamentally different product that does not directly compete with those books.
AI 'Nudify' Websites Are Raking in Millions of Dollars
For years, so-called "nudify" apps and websites have mushroomed online, allowing people to create nonconsensual and abusive images of women and girls, including child sexual abuse material. Despite some lawmakers and tech companies taking steps to limit the harmful services, every month, millions of people are still accessing the websites, and the sites' creators may be making millions of dollars each year, new research suggests. An analysis of 85 nudify and "undress" websites--which allow people to upload photos and use AI to generate "nude" pictures of the subjects with just a few clicks--has found that most of the sites rely on tech services from Google, Amazon, and Cloudflare to operate and stay online. The findings, revealed by Indicator, a publication investigating digital deception, say that the websites had a combined average of 18.5 million visitors for each of the past six months and collectively may be making up to 36 million per year. Alexios Mantzarlis, a cofounder of Indicator and an online safety researcher, says the murky nudifier ecosystem has become a "lucrative business" that "Silicon Valley's laissez-faire approach to generative AI" has allowed to persist.
Finding Common Ground: Using Large Language Models to Detect Agreement in Multi-Agent Decision Conferences
Heller, Selina, Ibrahim, Mohamed, Selby, David Antony, Vollmer, Sebastian
Decision conferences are structured, collaborative meetings that bring together experts from various fields to address complex issues and reach a consensus on recommendations for future actions or policies. These conferences often rely on facilitated discussions to ensure productive dialogue and collective agreement. Recently, Large Language Models (LLMs) have shown significant promise in simulating real-world scenarios, particularly through collaborative multi-agent systems that mimic group interactions. In this work, we present a novel LLM-based multi-agent system designed to simulate decision conferences, specifically focusing on detecting agreement among the participant agents. To achieve this, we evaluate six distinct LLMs on two tasks: stance detection, which identifies the position an agent takes on a given issue, and stance polarity detection, which identifies the sentiment as positive, negative, or neutral. These models are further assessed within the multi-agent system to determine their effectiveness in complex simulations. Our results indicate that LLMs can reliably detect agreement even in dynamic and nuanced debates. Incorporating an agreement-detection agent within the system can also improve the efficiency of group debates and enhance the overall quality and coherence of deliberations, making them comparable to real-world decision conferences regarding outcome and decision-making. These findings demonstrate the potential for LLM-based multi-agent systems to simulate group decision-making processes. They also highlight that such systems could be instrumental in supporting decision-making with expert elicitation workshops across various domains.
A Comprehensive Survey on Deep Learning Solutions for 3D Flood Mapping
Jia, Wenfeng, Liang, Bin, Liu, Yuxi, Khan, Muhammad Arif, Zheng, Lihong
Flooding remains a major global challenge, worsened by climate change and urbanization, demanding advanced solutions for effective disaster management. While traditional 2D flood mapping techniques provide limited insights, 3D flood mapping, powered by deep learning (DL), offers enhanced capabilities by integrating flood extent and depth. This paper presents a comprehensive survey of deep learning-based 3D flood mapping, emphasizing its advancements over 2D maps by integrating flood extent and depth for effective disaster management and urban planning. The survey categorizes deep learning techniques into task decomposition and end-to-end approaches, applicable to both static and dynamic flood features. We compare key DL architectures, highlighting their respective roles in enhancing prediction accuracy and computational efficiency. Additionally, this work explores diverse data sources such as digital elevation models, satellite imagery, rainfall, and simulated data, outlining their roles in 3D flood mapping. The applications reviewed range from real-time flood prediction to long-term urban planning and risk assessment. However, significant challenges persist, including data scarcity, model interpretability, and integration with traditional hydrodynamic models. This survey concludes by suggesting future directions to address these limitations, focusing on enhanced datasets, improved models, and policy implications for flood management. This survey aims to guide researchers and practitioners in leveraging DL techniques for more robust and reliable 3D flood mapping, fostering improved flood management strategies.
AI Safety Should Prioritize the Future of Work
Hazra, Sanchaita, Majumder, Bodhisattwa Prasad, Chakrabarty, Tuhin
Current efforts in AI safety prioritize filtering harmful content, preventing manipulation of human behavior, and eliminating existential risks in cybersecurity or biosecurity. While pressing, this narrow focus overlooks critical human-centric considerations that shape the long-term trajectory of a society. In this position paper, we identify the risks of overlooking the impact of AI on the future of work and recommend comprehensive transition support towards the evolution of meaningful labor with human agency. Through the lens of economic theories, we highlight the intertemporal impacts of AI on human livelihood and the structural changes in labor markets that exacerbate income inequality. Additionally, the closed-source approach of major stakeholders in AI development resembles rent-seeking behavior through exploiting resources, breeding mediocrity in creative labor, and monopolizing innovation. To address this, we argue in favor of a robust international copyright anatomy supported by implementing collective licensing that ensures fair compensation mechanisms for using data to train AI models. We strongly recommend a pro-worker framework of global AI governance to enhance shared prosperity and economic justice while reducing technical debt.
Generating Proto-Personas through Prompt Engineering: A Case Study on Efficiency, Effectiveness and Empathy
Ayach, Fernando, Lameirรฃo, Vitor, Leรฃo, Raul, Felizardo, Jerfferson, Sobrinho, Rafael, Borges, Vanessa, Matsubara, Patrรญcia, Fontรฃo, Awdren
Proto-personas are commonly used during early-stage Product Discovery, such as Lean Inception, to guide product definition and stakeholder alignment. However, the manual creation of proto-personas is often time-consuming, cognitively demanding, and prone to bias. In this paper, we propose and empirically investigate a prompt engineering-based approach to generate proto-personas with the support of Generative AI (GenAI). Our goal is to evaluate the approach in terms of efficiency, effectiveness, user acceptance, and the empathy elicited by the generated personas. We conducted a case study with 19 participants embedded in a real Lean Inception, employing a qualitative and quantitative methods design. The results reveal the approach's efficiency by reducing time and effort and improving the quality and reusability of personas in later discovery phases, such as Minimum Viable Product (MVP) scoping and feature refinement. While acceptance was generally high, especially regarding perceived usefulness and ease of use, participants noted limitations related to generalization and domain specificity. Furthermore, although cognitive empathy was strongly supported, affective and behavioral empathy varied significantly across participants. These results contribute novel empirical evidence on how GenAI can be effectively integrated into software Product Discovery practices, while also identifying key challenges to be addressed in future iterations of such hybrid design processes.
Using Large Language Models for Legal Decision-Making in Austrian Value-Added Tax Law: An Experimental Study
Luketina, Marina, Benkel, Andrea, Schuetz, Christoph G.
This paper provides an experimental evaluation of the capability of large language models (LLMs) to assist in legal decision-making within the framework of Austrian and European Union value-added tax (VAT) law. In tax consulting practice, clients often describe cases in natural language, making LLMs a prime candidate for supporting automated decision-making and reducing the workload of tax professionals. Given the requirement for legally grounded and well-justified analyses, the propensity of LLMs to hallucinate presents a considerable challenge. The experiments focus on two common methods for enhancing LLM performance: fine-tuning and retrieval-augmented generation (RAG). In this study, these methods are applied on both textbook cases and real-world cases from a tax consulting firm to systematically determine the best configurations of LLM-based systems and assess the legal-reasoning capabilities of LLMs. The findings highlight the potential of using LLMs to support tax consultants by automating routine tasks and providing initial analyses, although current prototypes are not ready for full automation due to the sensitivity of the legal domain. The findings indicate that LLMs, when properly configured, can effectively support tax professionals in VAT tasks and provide legally grounded justifications for decisions. However, limitations remain regarding the handling of implicit client knowledge and context-specific documentation, underscoring the need for future integration of structured background information.
The Curious Case of Factuality Finetuning: Models' Internal Beliefs Can Improve Factuality
Newman, Benjamin, Ravichander, Abhilasha, Jung, Jaehun, Xin, Rui, Ivison, Hamish, Kuznetsov, Yegor, Koh, Pang Wei, Choi, Yejin
Language models are prone to hallucination - generating text that is factually incorrect. Finetuning models on high-quality factual information can potentially reduce hallucination, but concerns remain; obtaining factual gold data can be expensive and training on correct but unfamiliar data may potentially lead to even more downstream hallucination. What data should practitioners finetune on to mitigate hallucinations in language models? In this work, we study the relationship between the factuality of finetuning data and the prevalence of hallucinations in long-form generation tasks. Counterintuitively, we find that finetuning on factual gold data is not as helpful as finetuning on model-generated data that models believe to be factual. Next, we evaluate filtering strategies applied on both factual gold data and model-generated data, and find that finetuning on model-generated data that is filtered by models' own internal judgments often leads to better overall factuality compared to other configurations: training on gold data filtered by models' judgments, training on gold data alone, or training on model-generated data that is supported by gold data. These factuality improvements transfer across three domains we study, suggesting that a models' own beliefs can provide a powerful signal for factuality.