Law
AI Adoption in NGOs: A Systematic Literature Review
Rotter, Janne, Bailkoski, William
AI has the potential to significantly improve how NGOs utilize their limited resources for societal benefits, but evidence about how NGOs adopt AI remains scattered. In this study, we systematically investigate the types of AI adoption use cases in NGOs and identify common challenges and solutions, contextualized by organizational size and geographic context. We review the existing primary literature, including studies that investigate AI adoption in NGOs related to social impact between 2020 and 2025 in English. Following the PRISMA protocol, two independent reviewers conduct study selection, with regular cross-checking to ensure methodological rigour, resulting in a final literature body of 65 studies. Leveraging a thematic and narrative approach, we identify six AI use case categories in NGOs - Engagement, Creativity, Decision-Making, Prediction, Management, and Optimization - and extract common challenges and solutions within the Technology-Organization-Environment (TOE) framework. By integrating our findings, this review provides a novel understanding of AI adoption in NGOs, linking specific use cases and challenges to organizational and environmental factors. Our results demonstrate that while AI is promising, adoption among NGOs remains uneven and biased towards larger organizations. Nevertheless, following a roadmap grounded in literature can help NGOs overcome initial barriers to AI adoption, ultimately improving effectiveness, engagement, and social impact.
PRISON: Unmasking the Criminal Potential of Large Language Models
Wu, Xinyi, Hong, Geng, Chen, Pei, Chen, Yueyue, Pan, Xudong, Yang, Min
Scenario to be rewritten: { scenario } To rigorously evaluate whether large language models (LLMs) could still recognize the original source behind these rewritten scenarios, we designed three complementary prompt strategies, each probing different aspects of the models' recognition and reasoning capabilities. Zero-shot Direct Identification focuses on testing the model's raw ability to recall source material under minimal guidance (Brown et al., 2020; Mu et al., 2024). Paraphrased Queries introduce linguistic variation to reduce prompt-specific biases and measure the robustness of recognition (Liu et al., 2024a; Ngweta et al., 2025). Instruction-tuned T ask-framed Prompts leverage explicit role framing and step-by-step task descriptions to maximize retrieval pressure and analytical reasoning (Ouyang et al., 2022; Sivarajkumar et al., 2024). By combining these strategies, we construct a comprehensive recognition test that balances sensitivity and robustness, ensuring that a scenario is only deemed valid if no prompt family leads to a confident and correct identification of the original work. This integrated approach provides a stronger safeguard against hidden memorization and enables more reliable downstream behavioral analysis of the tested LLMs. V alidation Prompt We designed three prompt families for scenario source identification. Each family targets a different aspect of model behavior: Given the following scenario: 19 { scenario } 1. Zero-shot Identification Please determine whether this scenario originates from a known literary or cinematic work.
Internet of Agents: Fundamentals, Applications, and Challenges
Wang, Yuntao, Guo, Shaolong, Pan, Yanghe, Su, Zhou, Chen, Fahao, Luan, Tom H., Li, Peng, Kang, Jiawen, Niyato, Dusit
With the rapid proliferation of large language models and vision-language models, AI agents have evolved from isolated, task-specific systems into autonomous, interactive entities capable of perceiving, reasoning, and acting without human intervention. As these agents proliferate across virtual and physical environments, from virtual assistants to embodied robots, the need for a unified, agent-centric infrastructure becomes paramount. In this survey, we introduce the Internet of Agents (IoA) as a foundational framework that enables seamless interconnection, dynamic discovery, and collaborative orchestration among heterogeneous agents at scale. We begin by presenting a general IoA architecture, highlighting its hierarchical organization, distinguishing features relative to the traditional Internet, and emerging applications. Next, we analyze the key operational enablers of IoA, including capability notification and discovery, adaptive communication protocols, dynamic task matching, consensus and conflict-resolution mechanisms, and incentive models. Finally, we identify open research directions toward building resilient and trustworthy IoA ecosystems.
Self-evolving expertise in complex non-verifiable subject domains: dialogue as implicit meta-RL
So-called `wicked problems', those involving complex multi-dimensional settings, non-verifiable outcomes, heterogeneous impacts and a lack of single objectively correct answers, have plagued humans throughout history. Modern examples include decisions over justice frameworks, solving environmental pollution, planning for pandemic resilience and food security. The use of state-of-the-art artificial intelligence systems (notably Large Language Model-based agents) collaborating with humans on solving such problems is being actively explored. While the abilities of LLMs can be improved by, for example, fine-tuning, hand-crafted system prompts and scaffolding with external tools, LLMs lack endogenous mechanisms to develop expertise through experience in such settings. This work address this gap with Dialectica, a framework where agents engage in structured dialogue on defined topics, augmented by memory, self-reflection, and policy-constrained context editing. Formally, discussion is viewed as an implicit meta-reinforcement learning process. The `dialogue-trained' agents are evaluated post-hoc using judged pairwise comparisons of elicited responses. Across two model architectures (locally run Qwen3:30b and OpenAI's o4-mini) results show that enabling reflection-based context editing during discussion produces agents which dominate their baseline counterparts on Elo scores, normalized Bradley-Terry-Davidson ability, and AlphaRank mass. The predicted signatures of learning are observed qualitatively in statement and reflection logs, where reflections identify weaknesses and reliably shape subsequent statements. Agreement between quantitative and qualitative evidence supports dialogue-driven context evolution as a practical path to targeted expertise amplification in open non-verifiable domains.
Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection
Brook, Joshua Wolfe, Markov, Ilia
This research introduces a novel approach to textual and multimodal Hate Speech Detection (HSD), using Large Language Models (LLMs) as dynamic knowledge bases to generate background context and incorporate it into the input of HSD classifiers. Two context generation strategies are examined: one focused on named entities and the other on full-text prompting. Four methods of incorporating context into the classifier input are compared: text concatenation, embedding concatenation, a hierarchical transformer-based fusion, and LLM-driven text enhancement. Experiments are conducted on the textual Latent Hatred dataset of implicit hate speech and applied in a multimodal setting on the MAMI dataset of misogynous memes. Results suggest that both the contextual information and the method by which it is incorporated are key, with gains of up to 3 and 6 F1 points on textual and multimodal setups respectively, from a zero-context baseline to the highest-performing system, based on embedding concatenation.
DroneAudioset: An Audio Dataset for Drone-based Search and Rescue
Gupta, Chitralekha, Ramesh, Soundarya, Sasikumar, Praveen, Yeo, Kian Peen, Nanayakkara, Suranga
Unmanned Aerial Vehicles (UAVs) or drones, are increasingly used in search and rescue missions to detect human presence. Existing systems primarily leverage vision-based methods which are prone to fail under low-visibility or occlusion. Drone-based audio perception offers promise but suffers from extreme ego-noise that masks sounds indicating human presence. Existing datasets are either limited in diversity or synthetic, lacking real acoustic interactions, and there are no standardized setups for drone audition. To this end, we present DroneAudioset (The dataset is publicly available at https://huggingface.co/datasets/ahlab-drone-project/DroneAudioSet/ under the MIT license), a comprehensive drone audition dataset featuring 23.5 hours of annotated recordings, covering a wide range of signal-to-noise ratios (SNRs) from -57.2 dB to -2.5 dB, across various drone types, throttles, microphone configurations as well as environments. The dataset enables development and systematic evaluation of noise suppression and classification methods for human-presence detection under challenging conditions, while also informing practical design considerations for drone audition systems, such as microphone placement trade-offs, and development of drone noise-aware audio processing. This dataset is an important step towards enabling design and deployment of drone-audition systems.
Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics
Belem, Catarina G, Glenn, Parker, Samuel, Alfy, Kumar, Anoop, Liu, Daben
Automatic readability assessment plays a key role in ensuring effective and accessible written communication. Despite significant progress, the field is hindered by inconsistent definitions of readability and measurements that rely on surface-level text properties. In this work, we investigate the factors shaping human perceptions of readability through the analysis of 897 judgments, finding that, beyond surface-level cues, information content and topic strongly shape text comprehensibility. Furthermore, we evaluate 15 popular readability metrics across five English datasets, contrasting them with six more nuanced, model-based metrics. Our results show that four model-based metrics consistently place among the top four in rank correlations with human judgments, while the best performing traditional metric achieves an average rank of 8.6. These findings highlight a mismatch between current readability metrics and human perceptions, pointing to model-based approaches as a more promising direction.
FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain
Hu, Tiansheng, Hu, Tongyan, Bai, Liuyang, Zhao, Yilun, Cohan, Arman, Zhao, Chen
Recent LLMs have demonstrated promising ability in solving finance related problems. However, applying LLMs in real-world finance application remains challenging due to its high risk and high stakes property. This paper introduces FinTrust, a comprehensive benchmark specifically designed for evaluating the trustworthiness of LLMs in finance applications. Our benchmark focuses on a wide range of alignment issues based on practical context and features fine-grained tasks for each dimension of trustworthiness evaluation. We assess eleven LLMs on FinTrust and find that proprietary models like o4-mini outperforms in most tasks such as safety while open-source models like DeepSeek-V3 have advantage in specific areas like industry-level fairness. For challenging task like fiduciary alignment and disclosure, all LLMs fall short, showing a significant gap in legal awareness. We believe that FinTrust can be a valuable benchmark for LLMs' trustworthiness evaluation in finance domain.
The Economics of AI Foundation Models: Openness, Competition, and Governance
Xu, Fasheng, Wang, Xiaoyu, Chen, Wei, Xie, Karen
The strategic choice of model "openness" has become a defining issue for the foundation model (FM) ecosystem. While this choice is intensely debated, its underlying economic drivers remain underexplored. We construct a two-period game-theoretic model to analyze how openness shapes competition in an AI value chain, featuring an incumbent developer, a downstream deployer, and an entrant developer. Openness exerts a dual effect: it amplifies knowledge spillovers to the entrant, but it also enhances the incumbent's advantage through a "data flywheel effect," whereby greater user engagement today further lowers the deployer's future fine-tuning cost. Our analysis reveals that the incumbent's optimal first-period openness is surprisingly non-monotonic in the strength of the data flywheel effect. When the data flywheel effect is either weak or very strong, the incumbent prefers a higher level of openness; however, for an intermediate range, it strategically restricts openness to impair the entrant's learning. This dynamic gives rise to an "openness trap," a critical policy paradox where transparency mandates can backfire by removing firms' strategic flexibility, reducing investment, and lowering welfare. We extend the model to show that other common interventions can be similarly ineffective. Vertical integration, for instance, only benefits the ecosystem when the data flywheel effect is strong enough to overcome the loss of a potentially more efficient competitor. Likewise, government subsidies intended to spur adoption can be captured entirely by the incumbent through strategic price and openness adjustments, leaving the rest of the value chain worse off. By modeling the developer's strategic response to competitive and regulatory pressures, we provide a robust framework for analyzing competition and designing effective policy in the complex and rapidly evolving FM ecosystem.
MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evaluation
Juneja, Gurusha, Pasupulati, Jayanth Naga Sai, Albalak, Alon, Hua, Wenyue, Wang, William Yang
A core challenge for autonomous LLM agents in collaborative settings is balancing robust privacy understanding and preservation alongside task efficacy. Existing privacy benchmarks only focus on simplistic, single-turn interactions where private information can be trivially omitted without affecting task outcomes. In this paper, we introduce MAGPIE (Multi-AGent contextual PrIvacy Evaluation), a novel benchmark of 200 high-stakes tasks designed to evaluate privacy understanding and preservation in multi-agent collaborative, non-adversarial scenarios. MAGPIE integrates private information as essential for task resolution, forcing agents to balance effective collaboration with strategic information control. Our evaluation reveals that state-of-the-art agents, including GPT-5 and Gemini 2.5-Pro, exhibit significant privacy leakage, with Gemini 2.5-Pro leaking up to 50.7% and GPT-5 up to 35.1% of the sensitive information even when explicitly instructed not to. Moreover, these agents struggle to achieve consensus or task completion and often resort to undesirable behaviors such as manipulation and power-seeking (e.g., Gemini 2.5-Pro demonstrating manipulation in 38.2% of the cases). These findings underscore that current LLM agents lack robust privacy understanding and are not yet adequately aligned to simultaneously preserve privacy and maintain effective collaboration in complex environments.