adjudicator
A Self-Improving Architecture for Dynamic Safety in Large Language Models
Context: The integration of Large Language Models (LLMs) into core software systems is accelerating. However, existing software architecture patterns are static, while current safety assurance methods are not scalable, leaving systems vulnerable to novel adversarial threats. Objective: To design, implement, and evaluate a novel software architecture that enables an AI-driven system to autonomously and continuously adapt its own safety protocols at runtime. Method: We propose the Self-Improving Safety Framework (SISF), a runtime architecture that couples an unprotected, unaligned base LLM (mistralai/Mistral-7B-v0.1) with a dynamic feedback loop. This loop consists of an AI Adjudicator (GPT-4o) for breach detection and a Policy Synthesis Module (GPT-4 Turbo) that autonomously generates new, generalized safety policies (both heuristic and semantic) in response to failures. Results: We conducted a dynamic learning evaluation using the 520-prompt AdvBench dataset. The unprotected model was 100% vulnerable. Our SISF, starting from zero policies, demonstrated a clear learning curve: it detected 237 breaches, autonomously synthesized 234 new policies, and reduced the overall Attack Success Rate (ASR) to 45.58%. In a subsequent test on 520 benign prompts, the SISF achieved a 0.00% False Positive Rate (FPR), proving its ability to adapt without compromising user utility. Conclusion: An architectural approach to AI safety, based on the principles of self-adaptation, is a viable and effective strategy. Our framework demonstrates a practical path towards building more robust, resilient, and scalable AI-driven systems, shifting safety assurance from a static, pre-deployment activity to an automated, runtime process.
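The feedback loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `base_model`, `adjudicate`, and `synthesize_policy` are hypothetical stand-ins for the base LLM, the AI Adjudicator, and the Policy Synthesis Module, and policies are modeled as simple keyword predicates rather than the heuristic and semantic policies the paper describes.

```python
from typing import Callable, Dict, List

# A policy returns True when a prompt should be blocked.
Policy = Callable[[str], bool]

# Toy vocabulary used by the stand-in adjudicator (assumption, not from the paper).
HARMFUL_TERMS = ("bomb", "malware")


def base_model(prompt: str) -> str:
    # Hypothetical stand-in for the unaligned base LLM: it always complies.
    return "Sure, here is how to " + prompt


def adjudicate(prompt: str, response: str) -> bool:
    # Hypothetical stand-in for the AI Adjudicator: reports a breach when the
    # model complied with a prompt containing a harmful term.
    return response.startswith("Sure") and any(t in prompt for t in HARMFUL_TERMS)


def synthesize_policy(prompt: str) -> Policy:
    # Hypothetical stand-in for the Policy Synthesis Module: generalizes the
    # failing prompt into a keyword rule covering the terms it contained.
    matched = [t for t in HARMFUL_TERMS if t in prompt]
    return lambda p: any(t in p for t in matched)


def run_loop(prompts: List[str]) -> Dict[str, float]:
    policies: List[Policy] = []  # the system starts from zero policies
    breaches = 0
    for prompt in prompts:
        if any(policy(prompt) for policy in policies):
            continue  # blocked by a previously learned policy
        response = base_model(prompt)
        if adjudicate(prompt, response):
            breaches += 1
            policies.append(synthesize_policy(prompt))
    return {
        "breaches": breaches,
        "policies": len(policies),
        "asr": breaches / len(prompts),
    }
```

With the figures reported in the abstract, the same ratio gives 237 / 520 ≈ 45.58% for the Attack Success Rate. In this toy version every breach yields exactly one policy, which differs slightly from the 237 breaches and 234 policies reported.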
Shall We Play a Game? Language Models for Open-ended Wargames
Matlin, Glenn, Mahajan, Parv, Song, Isaac, Hao, Yixiong, Bard, Ryan, Topp, Stu, Montoya, Evan, Parwani, M. Rehan, Shetty, Soham, Riedl, Mark
Wargames are simulations of conflicts in which participants' decisions influence future events. While casual wargaming can be used for entertainment or socialization, serious wargaming is used by experts to explore strategic implications of decision-making and experiential learning. In this paper, we take the position that Artificial Intelligence (AI) systems, such as Language Models (LMs), are rapidly approaching human-expert capability for strategic planning -- and will one day surpass it. Military organizations have begun using LMs to provide insights into the consequences of real-world decisions during _open-ended wargames_ which use natural language to convey actions and outcomes. We argue the ability for AI systems to influence large-scale decisions motivates additional research into the safety, interpretability, and explainability of AI in open-ended wargames. To demonstrate, we conduct a scoping literature review with a curated selection of 100 unclassified studies on AI in wargames, and construct a novel ontology of open-endedness using the creativity afforded to players, adjudicators, and the novelty provided to observers. Drawing from this body of work, we distill a set of practical recommendations and critical safety considerations for deploying AI in open-ended wargames across common domains. We conclude by presenting the community with a set of high-impact open research challenges for future work.
Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction
Srivastava, Saurabh, Yao, Ziyu
Large Reasoning Models (LRMs) such as DeepSeek-R1 and OpenAI o1 have demonstrated remarkable capabilities in various reasoning tasks. Their strong capability to generate and reason over intermediate thoughts has also led to arguments that they may no longer require extensive prompt engineering or optimization to interpret human instructions and produce accurate outputs. In this work, we aim to systematically study this open question, using the structured task of event extraction for a case study. We experimented with two LRMs (DeepSeek-R1 and o1) and two general-purpose Large Language Models (LLMs) (GPT-4o and GPT-4.5), when they were used as task models or prompt optimizers. Our results show that on tasks as complicated as event extraction, LRMs as task models still benefit from prompt optimization, and that using LRMs as prompt optimizers yields more effective prompts. Our finding also generalizes to tasks beyond event extraction. Finally, we provide an error analysis of common errors made by LRMs and highlight the stability and consistency of LRMs in refining task instructions and event guidelines.
ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding
Hasegawa, Kimihiro, Imrattanatrai, Wiradee, Cheng, Zhi-Qi, Asada, Masaki, Holm, Susan, Wang, Yuran, Fukuda, Ken, Mitamura, Teruko
Multimodal systems have great potential to assist humans in procedural activities, where people follow instructions to achieve their goals. Despite diverse application scenarios, systems are typically evaluated on traditional classification tasks, e.g., action recognition or temporal action segmentation. In this paper, we present a novel evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios. ProMQA consists of 401 multimodal procedural QA pairs on user recordings of procedural activities coupled with their corresponding instructions. For QA annotation, we take a cost-effective human-LLM collaborative approach, where the existing annotation is augmented with LLM-generated QA pairs that are later verified by humans. We then provide benchmark results to set the baseline performance on ProMQA. Our experiment reveals a significant gap between human performance and that of current systems, including competitive proprietary multimodal models. We hope our dataset sheds light on new aspects of models' multimodal understanding capabilities.
GPT-4 is judged more human than humans in displaced and inverted Turing tests
Rathi, Ishika, Taylor, Sydney, Bergen, Benjamin K., Jones, Cameron R.
Everyday AI detection requires differentiating between people and AI in informal, online conversations. In many cases, people will not interact directly with AI systems but instead read conversations between AI systems and other people. We measured how well people and large language models can discriminate using two modified versions of the Turing test: inverted and displaced. GPT-3.5, GPT-4, and displaced human adjudicators judged whether an agent was human or AI on the basis of a Turing test transcript. We found that both AI and displaced human judges were less accurate than interactive interrogators, with below [...]
Figure 1: A summary of our experimental design. Transcripts were sampled from an interactive Turing test, where a human judge interrogates a witness to determine if they are human or AI. In an inverted Turing test, we present transcripts to AI models, who judge whether the same witnesses are human or AI. In a displaced Turing test, a separate group of human participants read the same transcripts and make this judgement.
Social Cue Detection and Analysis Using Transfer Entropy
Jiang, Haoyang, Croft, Elizabeth A., Burke, Michael G.
Robots that work close to humans need to understand and use social cues to act in a socially acceptable manner. Social cues are a form of communication (i.e., information flow) between people. In this paper, a framework is introduced to detect and analyse a class of perceptible social cues that are nonverbal and episodic, and the related information transfer using an information-theoretic measure, namely, transfer entropy. We use a group-joining setting to demonstrate the practicality of transfer entropy for analysing communications between humans. Then we demonstrate the framework in two settings involving social interactions between humans: object-handover and person-following. Our results show that transfer entropy can identify information flows between agents and when and where they occur. Potential applications of the framework include information flow or social cue analysis for interactive robot design and socially-aware robot planning.
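For readers unfamiliar with the measure, transfer entropy from a source series X to a target series Y quantifies how much knowing X's past reduces uncertainty about Y's next value beyond what Y's own past already explains. The sketch below is a simple plug-in (count-based) estimator for discrete series with a history length of one step. It is an illustrative assumption, not the estimator used in the paper, which may work with continuous signals and longer histories.

```python
from collections import Counter
from math import log2
from typing import Sequence


def transfer_entropy(source: Sequence[int], target: Sequence[int]) -> float:
    """Plug-in estimate (in bits) of transfer entropy source -> target, history 1."""
    if len(source) != len(target) or len(target) < 2:
        raise ValueError("series must have equal length of at least 2")

    triples = Counter()   # counts of (y_next, y_now, x_now)
    pairs_yx = Counter()  # counts of (y_now, x_now)
    pairs_yy = Counter()  # counts of (y_next, y_now)
    singles = Counter()   # counts of y_now
    n = len(target) - 1   # number of observed transitions

    for t in range(n):
        y_next, y_now, x_now = target[t + 1], target[t], source[t]
        triples[(y_next, y_now, x_now)] += 1
        pairs_yx[(y_now, x_now)] += 1
        pairs_yy[(y_next, y_now)] += 1
        singles[y_now] += 1

    te = 0.0
    for (y_next, y_now, x_now), count in triples.items():
        p_joint = count / n
        # p(y_next | y_now, x_now): prediction using both histories
        p_full = count / pairs_yx[(y_now, x_now)]
        # p(y_next | y_now): prediction using the target's own history only
        p_self = pairs_yy[(y_next, y_now)] / singles[y_now]
        te += p_joint * log2(p_full / p_self)
    return te
```

As a usage example, if `target` is a one-step delayed copy of `source`, the estimate is positive because the source fully determines the target's next value. If `source` is constant, it carries no information and the estimate is zero.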
Event Extraction as Question Generation and Answering
Lu, Di, Ran, Shihao, Tetreault, Joel, Jaimes, Alejandro
Recent work on Event Extraction has reframed the task as Question Answering (QA), with promising results. The advantage of this approach is that it addresses the error propagation issue found in traditional token-based classification approaches by directly predicting event arguments without extracting candidates first. However, the questions are typically based on fixed templates and they rarely leverage contextual information such as relevant arguments. In addition, prior QA-based approaches have difficulty handling cases where there are multiple arguments for the same role. In this paper, we propose QGA-EE, which enables a Question Generation (QG) model to generate questions that incorporate rich contextual information instead of using fixed templates. We also propose dynamic templates to assist the training of QG model. Experiments show that QGA-EE outperforms all prior single-task-based models on the ACE05 English dataset.
Retrieval-Augmented Generative Question Answering for Event Argument Extraction
Event argument extraction has long been studied as a sequential prediction problem with extractive-based methods, tackling each argument in isolation. Although recent work proposes generation-based methods to capture cross-argument dependency, they require generating and post-processing a complicated target sequence (template). Motivated by these observations and by recent pretrained language models' capability to learn from demonstrations, we propose a retrieval-augmented generative QA model (R-GQA) for event argument extraction. It retrieves the most similar QA pair and augments it as a prompt to the current example's context, then decodes the arguments as answers. Our approach substantially outperforms prior methods across various settings (i.e., fully supervised, domain transfer, and few-shot learning). Finally, we propose a clustering-based sampling strategy (JointEnc) and conduct a thorough analysis of how different strategies influence few-shot learning performance. The implementation is available at https://github.com/xinyadu/RGQA
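The retrieve-then-prompt step described in this abstract can be illustrated with a small sketch. It is not the R-GQA implementation: the bag-of-words cosine similarity, the `build_prompt` helper, and the prompt layout are all assumptions chosen to keep the example self-contained, whereas the actual model would use learned representations and its own input format.

```python
from collections import Counter
from math import sqrt
from typing import List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_prompt(question: str, context: str, store: List[Tuple[str, str]]) -> str:
    # Retrieve the stored (question, answer) pair most similar to the current
    # question and prepend it as a demonstration (hypothetical prompt layout).
    query = Counter(question.lower().split())
    demo_q, demo_a = max(
        store, key=lambda qa: cosine(query, Counter(qa[0].lower().split()))
    )
    return (
        f"Question: {demo_q}\nAnswer: {demo_a}\n\n"
        f"Context: {context}\nQuestion: {question}\nAnswer:"
    )
```

A generative model would then decode the argument as the continuation after the final `Answer:`.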
Bi-Directional Iterative Prompt-Tuning for Event Argument Extraction
Dai, Lu, Wang, Bang, Xiang, Wei, Mo, Yijun
Recently, prompt-tuning has attracted growing interest in event argument extraction (EAE). However, existing prompt-tuning methods have not achieved satisfactory performance because they do not consider entity information. In this paper, we propose a bi-directional iterative prompt-tuning method for EAE, where the EAE task is treated as a cloze-style task to take full advantage of entity information and pre-trained language models (PLMs). Furthermore, our method explores event argument interactions by introducing the argument roles of contextual entities into prompt construction. Since the template and verbalizer are two crucial components in a cloze-style prompt, we propose to utilize the role label semantic knowledge to construct a semantic verbalizer and design three kinds of templates for the EAE task. Experiments on the ACE 2005 English dataset with standard and low-resource settings show that the proposed method significantly outperforms the peer state-of-the-art methods. Our code is available at https://github.com/HustMinsLab/BIP.
DEGREE: A Data-Efficient Generative Event Extraction Model
Hsu, I-Hung, Huang, Kuan-Hao, Boschee, Elizabeth, Miller, Scott, Natarajan, Prem, Chang, Kai-Wei, Peng, Nanyun
Event extraction (EE) aims to identify structured events, including event triggers and their corresponding arguments, from unstructured text. Most existing works rely on a large number of labeled instances to train models, while labeled data can be expensive to obtain. In this work, we present a data-efficient event extraction method by formulating event extraction as a natural language generation problem. The formulation allows us to inject knowledge of label semantics, event structure, and output dependencies into the model. Given a passage and an event type, our model learns to summarize this passage into a templated sentence in a predefined structure. The template is event-type-specific, manually created, and contains event trigger and argument information. Lastly, a rule-based algorithm is used to derive the trigger and argument predictions from the generated sentence. Our method inherently enjoys the following benefits: (1) The pretraining of the generative language models helps incorporate the semantics of the labels for generative EE. (2) The autoregressive generation process and our end-to-end design for extracting triggers and arguments force the model to capture the dependencies among the output triggers and their arguments. (3) The predefined templates form concrete yet flexible rules that hint to the models which patterns are valid for each event type, reducing the models' burden of learning structures from the data. Empirical results show that our model achieves superior performance over strong baselines on EE tasks in the low-data regime and achieves results competitive with the current state of the art when more data becomes available.
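The final rule-based step, deriving trigger and argument predictions from a generated templated sentence, can be sketched as below. The template wording, the placeholder strings, and the `parse_generation` helper are hypothetical and only modeled on the description in the abstract; the templates and decoding rules of the actual system may differ.

```python
import re
from typing import Dict, Optional

# Hypothetical template for an "attack" event type (assumption for illustration):
#   "Event trigger is <Trigger>. some attacker attacked some target at somewhere."
TEMPLATE = re.compile(
    r"Event trigger is (?P<trigger>.+?)\. "
    r"(?P<attacker>.+?) attacked (?P<target>.+?) at (?P<place>.+?)\.$"
)

# Placeholder strings left unchanged by the generator mean "no prediction".
PLACEHOLDERS = {"<Trigger>", "some attacker", "some target", "somewhere"}


def parse_generation(sentence: str) -> Optional[Dict[str, Optional[str]]]:
    # Match the generated sentence against the template; return None when the
    # output does not follow the expected structure.
    match = TEMPLATE.match(sentence.strip())
    if match is None:
        return None
    # Slots still holding a placeholder are mapped to None.
    return {
        role: (None if value in PLACEHOLDERS else value)
        for role, value in match.groupdict().items()
    }
```

For example, a generated sentence in which the place slot still reads `somewhere` yields a prediction for the trigger, attacker, and target, and no prediction for the place.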