Law
EthicAlly: a Prototype for AI-Powered Research Ethics Support for the Social Sciences and Humanities
In biomedical science, review by a Research Ethics Committee (REC) is an indispensable way of protecting human subjects from harm. However, in social science and the humanities, mandatory ethics compliance has long been met with scepticism as biomedical models of ethics can map poorly onto methodologies involving complex socio-political and cultural considerations. As a result, tailored ethics training and support as well as access to RECs with the necessary expertise is lacking in some areas, including parts of Europe and low- and middle-income countries. This paper suggests that Generative AI can meaningfully contribute to closing these gaps, illustrating this claim by presenting EthicAlly, a proof-of-concept prototype for an AI-powered ethics support system for social science and humanities researchers. Drawing on constitutional AI technology and a collaborative prompt development methodology, EthicAlly provides structured ethics assessment that incorporates both universal ethics principles and contextual and interpretive considerations relevant to most social science research. In supporting researchers in ethical research design and preparation for REC submission, this kind of system can also contribute to easing the burden on institutional RECs, without attempting to automate or replace human ethical oversight.
The Attribution Crisis in LLM Search Results
Strauss, Ilan, Yang, Jangho, O'Reilly, Tim, Rosenblat, Sruly, Moure, Isobel
Web-enabled LLMs frequently answer queries without crediting the web pages they consume, creating an "attribution gap" - the difference between relevant URLs read and those actually cited. Drawing on approximately 14,000 real-world LMArena conversation logs with search-enabled LLM systems, we document three exploitation patterns: 1) No Search: 34% of Google Gemini and 24% of OpenAI GPT-4o responses are generated without explicitly fetching any online content; 2) No citation: Gemini provides no clickable citation source in 92% of answers; 3) High-volume, low-credit: Perplexity's Sonar visits approximately 10 relevant pages per query but cites only three to four. A negative binomial hurdle model shows that the average query answered by Gemini or Sonar leaves about 3 relevant websites uncited, whereas GPT-4o's tiny uncited gap is best explained by its selective log disclosures rather than by better attribution. Citation efficiency - extra citations provided per additional relevant web page visited - varies widely across models, from 0.19 to 0.45 on identical queries, underscoring that retrieval design, not technical limits, shapes ecosystem impact. We recommend a transparent LLM search architecture based on standardized telemetry and full disclosure of search traces and citation logs.
Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes
Jiao, Rui, Zhang, Yue, Li, Jinku
We present a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) an enhanced Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability method examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across multi state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. Our approach significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. Furthermore, our neural activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.
Can Memory-Augmented LLM Agents Aid Journalism in Interpreting and Framing News for Diverse Audiences?
Modern news is often comprehensive, weaving together information from diverse domains, including technology, finance, and agriculture. This very comprehensiveness creates a challenge for interpretation, as audiences typically possess specialized knowledge related to their expertise, age, or standpoint. Consequently, a reader might fully understand the financial implications of a story but fail to grasp or even actively misunderstand its legal or technological dimensions, resulting in critical comprehension gaps. In this work, we investigate how to identify these comprehension gaps and provide solutions to improve audiences' understanding of news content, particularly in the aspects of articles outside their primary domains of knowledge. We propose MADES, an agent-based framework designed to simulate societal communication. The framework utilizes diverse agents, each configured to represent a specific occupation or age group. Each agent is equipped with a memory system. These agents are then simulated to discuss the news. This process enables us to monitor and analyze their behavior and cognitive processes. Our findings indicate that the framework can identify confusions and misunderstandings within news content through its iterative discussion process. Based on these accurate identifications, the framework then designs supplementary material. We validated these outcomes using both statistical analysis and human evaluation, and the results show that agents exhibit significantly improved news understanding after receiving this supplementary material.
LLMs on Trial: Evaluating Judicial Fairness for Large Language Models
Hu, Yiran, Xue, Zongyue, Li, Haitao, Zheng, Siyuan, Chen, Qingjing, Wang, Shaochun, Zhang, Xihan, Zheng, Ning, Liu, Yun, Ai, Qingyao, Liu, Yiqun, Clarke, Charles L. A., Shen, Weixing
Large Language Models (LLMs) are increasingly used in high-stakes fields where their decisions impact rights and equity. However, LLMs' judicial fairness and implications for social justice remain underexplored. When LLMs act as judges, the ability to fairly resolve judicial issues is a prerequisite to ensure their trustworthiness. Based on theories of judicial fairness, we construct a comprehensive framework to measure LLM fairness, leading to a selection of 65 labels and 161 corresponding values. Applying this framework to the judicial system, we compile an extensive dataset, JudiFair, comprising 177,100 unique case facts. To achieve robust statistical inference, we develop three evaluation metrics, inconsistency, bias, and imbalanced inaccuracy, and introduce a method to assess the overall fairness of multiple LLMs across various labels. Through experiments with 16 LLMs, we uncover pervasive inconsistency, bias, and imbalanced inaccuracy across models, underscoring severe LLM judicial unfairness. Particularly, LLMs display notably more pronounced biases on demographic labels, with slightly less bias on substance labels compared to procedure ones. Interestingly, increased inconsistency correlates with reduced biases, but more accurate predictions exacerbate biases. While we find that adjusting the temperature parameter can influence LLM fairness, model size, release date, and country of origin do not exhibit significant effects on judicial fairness. Accordingly, we introduce a publicly available toolkit containing all datasets and code, designed to support future research in evaluating and improving LLM fairness.
From Semantic Web and MAS to Agentic AI: A Unified Narrative of the Web of Agents
Petrova, Tatiana, Bliznioukov, Boris, Puzikov, Aleksandr, State, Radu
The concept of the Web of Agents (WoA), which transforms the static, document-centric Web into an environment of autonomous agents acting on users' behalf, has attracted growing interest as large language models (LLMs) become more capable. However, research in this area is still fragmented across different communities. Contemporary surveys catalog the latest LLM-powered frameworks, while the rich histories of Multi-Agent Systems (MAS) and the Semantic Web are often treated as separate, legacy domains. This fragmentation obscures the intellectual lineage of modern systems and hinders a holistic understanding of the field's trajectory. We present the first comprehensive evolutionary overview of the WoA. We show that modern protocols like A2A and the MCP, are direct evolutionary responses to the well-documented limitations of earlier standards like FIPA standards and OWL-based semantic agents. To systematize this analysis, we introduce a four-axis taxonomy (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism). This framework provides a unified analytical lens for comparing agent architectures across all generations, revealing a clear line of descent where others have seen a disconnect. Our analysis identifies a paradigm shift in the 'locus of intelligence': from being encoded in external data (Semantic Web) or the platform (MAS) to being embedded within the agent's core model (LLM). This shift is foundational to modern Agentic AI, enabling the scalable and adaptive systems the WoA has long envisioned. We conclude that while new protocols are essential, they are insufficient for building a robust, open, trustworthy ecosystem. Finally, we argue that the next research frontier lies in solving persistent socio-technical challenges, and we map out a new agenda focused on decentralized identity, economic models, security, and governance for the emerging WoA.
Greening AI-enabled Systems with Software Engineering: A Research Agenda for Environmentally Sustainable AI Practices
Cruz, Luรญs, Fernandes, Joรฃo Paulo, Kirkeby, Maja H., Martรญnez-Fernรกndez, Silverio, Sallou, June, Anwar, Hina, Roque, Enrique Barba, Bogner, Justus, Castaรฑo, Joel, Castor, Fernando, Chasmawala, Aadil, Cunha, Simรฃo, Feitosa, Daniel, Gonzรกlez, Alexandra, Jedlitschka, Andreas, Lago, Patricia, Muccini, Henry, Oprescu, Ana, Rani, Pooja, Saraiva, Joรฃo, Sarro, Federica, Selvan, Raghavendra, Vaidhyanathan, Karthik, Verdecchia, Roberto, Yamshchikov, Ivan P.
The environmental impact of Artificial Intelligence (AI)-enabled systems is increasing rapidly, and software engineering plays a critical role in developing sustainable solutions. The "Greening AI with Software Engineering" CECAM-Lorentz workshop (no. 1358, 2025) funded by the Centre Europรฉen de Calcul Atomique et Molรฉculaire and the Lorentz Center, provided an interdisciplinary forum for 29 participants, from practitioners to academics, to share knowledge, ideas, practices, and current results dedicated to advancing green software and AI research. The workshop was held February 3-7, 2025, in Lausanne, Switzerland. Through keynotes, flash talks, and collaborative discussions, participants identified and prioritized key challenges for the field. These included energy assessment and standardization, benchmarking practices, sustainability-aware architectures, runtime adaptation, empirical methodologies, and education. This report presents a research agenda emerging from the workshop, outlining open research directions and practical recommendations to guide the development of environmentally sustainable AI-enabled systems rooted in software engineering principles.
It's High Time: A Survey of Temporal Question Answering
Piryani, Bhawna, Abdallah, Abdelrahman, Mozafari, Jamshid, Anand, Avishek, Jatowt, Adam
Time plays a critical role in how information is generated, retrieved, and interpreted. In this survey, we provide a comprehensive overview of Temporal Question Answering (TQA), a research area that focuses on answering questions involving temporal constraints or context. As the amount of time-stamped content from sources like news articles, web archives, and knowledge bases increases, systems must address challenges such as detecting temporal intent, normalizing time expressions, ordering events, and reasoning over evolving or ambiguous facts. We focus on recent advances in TQA enabled by neural architectures, especially transformer-based models and Large Language Models (LLMs), highlighting progress in temporal language modeling, retrieval-augmented generation (RAG), and temporal reasoning. We also discuss benchmark datasets and evaluation strategies designed to test temporal robustness, recency awareness, and generalization.
Investigating Robotaxi Crash Severity with Geographical Random Forest and the Urban Environment
Jiao, Junfeng, Baik, Seung Gyu, Choi, Seung Jun, Xu, Yiming
This paper quantitatively investigates the crash severity of Autonomous Vehicles (AVs) with spatially localized machine learning and macroscopic measures of the urban built environment. Extending beyond the microscopic effects of individual infrastructure elements, we focus on the city-scale land use and behavioral patterns, while addressing spatial heterogeneity and spatial autocorrelation. We implemented a spatially localized machine learning technique called Geographical Random Forest (GRF) on the California AV collision dataset. Analyzing multiple urban measures, including points of interest, building footprint, and land use, we built a GRF model and visualized it as a crash severity risk map of San Francisco. This paper presents three findings. First, spatially localized machine learning outperformed regular machine learning in predicting AV crash severity. The bias-variance tradeoff was evident as we adjusted the localization weight hyperparameter. Second, land use was the most important predictor, compared to intersections, building footprints, public transit stops, and Points Of Interest (POIs). Third, AV crashes were more likely to result in low-severity incidents in city center areas with greater diversity and commercial activities, than in residential neighborhoods. Residential land use is likely associated with higher severity due to human behavior and less restrictive environments. Counterintuitively, residential areas were associated with higher crash severity, compared to more complex areas such as commercial and mixed-use areas. When robotaxi operators train their AV systems, it is recommended to: (1) consider where their fleet operates and make localized algorithms for their perception system, and (2) design safety measures specific to residential neighborhoods, such as slower driving speeds and more alert sensors.
Federated Graph Unlearning
Ai, Yuming, Li, Xunkai, Chao, Jiaqi, Fan, Bowen, Wu, Zhengyu, Zhu, Yinlin, Li, Rong-Hua, Wang, Guoren
The demand for data privacy has led to the development of frameworks like Federated Graph Learning (FGL), which facilitate decentralized model training. However, a significant operational challenge in such systems is adhering to the right to be forgotten. This principle necessitates robust mechanisms for two distinct types of data removal: the selective erasure of specific entities and their associated knowledge from local subgraphs and the wholesale removal of a user's entire dataset and influence. Existing methods often struggle to fully address both unlearning requirements, frequently resulting in incomplete data removal or the persistence of residual knowledge within the system. This work introduces a unified framework, conceived to provide a comprehensive solution to these challenges. The proposed framework employs a bifurcated strategy tailored to the specific unlearning request. For fine-grained Meta Unlearning, it uses prototype gradients to direct the initial local forgetting process, which is then refined by generating adversarial graphs to eliminate any remaining data traces among affected clients. In the case of complete client unlearning, the framework utilizes adversarial graph generation exclusively to purge the departed client's contributions from the remaining network. Extensive experiments on multiple benchmark datasets validate the proposed approach. The framework achieves substantial improvements in model prediction accuracy across both client and meta-unlearning scenarios when compared to existing methods. Furthermore, additional studies confirm its utility as a plug-in module, where it materially enhances the predictive capabilities and unlearning effectiveness of other established methods.