Formal Verification of Probabilistic Multi-Agent Systems for Ballistic Rocket Flight Using Probabilistic Alternating-Time Temporal Logic

Kurpiewski, Damian, Michalczyk, Jędrzej, Jamroga, Wojciech, Michalski, Jerzy Julian, Sidoruk, Teofil

arXiv.org Artificial Intelligence

This technical report presents a comprehensive formal verification approach for probabilistic agent systems modeling ballistic rocket flight trajectories using Probabilistic Alternating-Time Temporal Logic (PATL). We describe an innovative verification framework specifically designed for analyzing critical safety properties of ballistic rockets engineered to achieve microgravity conditions for scientific experimentation. Our model integrates authentic flight telemetry data encompassing velocity vectors, pitch angles, attitude parameters, and GPS coordinates to construct probabilistic state transition systems that rigorously account for environmental stochasticity, particularly meteorological variability. We formalize mission-critical safety properties through PATL specifications to systematically identify trajectory deviation states where the rocket risks landing in prohibited or hazardous zones. The verification framework facilitates real-time safety monitoring and enables automated intervention mechanisms, including emergency engine disengagement protocols, when predefined safety thresholds are exceeded. Experimental validation demonstrates the practical effectiveness and reliability of our approach in ensuring mission safety while maintaining scientific mission objectives.
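
The report's models are PATL game structures checked with a dedicated framework, and none of that tooling is reproduced here. As a rough, self-contained illustration of the kind of quantity such a safety check computes, the Python sketch below builds a tiny Markov chain over invented trajectory states, derives the probability of eventually reaching a hazardous landing state by value iteration, and compares it against a threshold in the spirit of a PATL property such as <<rocket>> P<=0.05 [ F hazard ]. All state names and probabilities are made up for illustration.

# Illustrative sketch only: a tiny discrete-time Markov chain over
# invented trajectory states. The paper's actual models are PATL game
# structures built from real telemetry, not plain chains.

# transitions[state] = list of (next_state, probability)
transitions = {
    "nominal":   [("nominal", 0.90), ("deviating", 0.09), ("landed", 0.01)],
    "deviating": [("nominal", 0.60), ("deviating", 0.25), ("hazard", 0.15)],
    "landed":    [("landed", 1.0)],   # absorbing: safe touchdown
    "hazard":    [("hazard", 1.0)],   # absorbing: prohibited-zone landing
}

def prob_reach(target, transitions, iters=10000, tol=1e-12):
    """P(eventually reach `target`) from each state, by value iteration."""
    p = {s: (1.0 if s == target else 0.0) for s in transitions}
    for _ in range(iters):
        new = {s: 1.0 if s == target else sum(pr * p[t] for t, pr in succs)
               for s, succs in transitions.items()}
        if max(abs(new[s] - p[s]) for s in new) < tol:
            return new
        p = new
    return p

p = prob_reach("hazard", transitions)
# A PATL-style check reduces here to a threshold test on the probability;
# exceeding it is what would trigger an engine-disengagement intervention.
print(f"P(reach hazard from nominal) = {p['nominal']:.3f}:",
      "safe" if p["nominal"] <= 0.05 else "UNSAFE: trigger engine cutoff")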


MAGMA-Edu: Multi-Agent Generative Multimodal Framework for Text-Diagram Educational Question Generation

Wu, Zhenyu, Li, Jian, Huang, Hua

arXiv.org Artificial Intelligence

Educational illustrations play a central role in communicating abstract concepts, yet current multimodal large language models (MLLMs) remain limited in producing pedagogically coherent and semantically consistent educational visuals. We introduce MAGMA-Edu, a self-reflective multi-agent framework that unifies textual reasoning and diagrammatic synthesis for structured educational problem generation. Unlike existing methods that treat text and image generation independently, MAGMA-Edu employs a two-stage co-evolutionary pipeline: (1) a generation-verification-reflection loop that iteratively refines question statements and solutions for mathematical accuracy, and (2) a code-based intermediate representation that enforces geometric fidelity and semantic alignment during image rendering. Both stages are guided by internal self-reflection modules that evaluate and revise outputs until domain-specific pedagogical constraints are met. Extensive experiments on multimodal educational benchmarks demonstrate the superiority of MAGMA-Edu over state-of-the-art MLLMs. Compared to GPT-4o, MAGMA-Edu improves the average textual metric from 57.01 to 92.31 (+35.3 pp) and boosts image-text consistency (ITC) from 13.20 to 85.24 (+72 pp). Across all model backbones, MAGMA-Edu achieves the highest scores (Avg-Text 96.20, ITC 99.12), establishing a new state of the art for multimodal educational content generation and demonstrating the effectiveness of self-reflective multi-agent collaboration in pedagogically aligned vision-language reasoning.
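
MAGMA-Edu's agents, prompts, and code-based diagram representation are not public, so the following is only a schematic sketch of the first stage, the generation-verification-reflection loop. llm_call and check_math are hypothetical stand-ins for the paper's generator and verifier agents.

# Schematic sketch of a generation-verification-reflection loop.
# llm_call() and check_math() are hypothetical stand-ins; the real
# system uses MLLM agents and domain-specific pedagogical checks.

def llm_call(prompt):
    """Stand-in for an MLLM call; returns a canned draft here."""
    return "Q: A triangle has two 60-degree angles; find the third. A: 60"

def check_math(draft):
    """Stand-in verifier: recompute the solution, return a list of issues."""
    return []  # empty list means the draft passed every check

def generate_question(topic, max_rounds=3):
    draft = llm_call(f"Write a {topic} problem with a worked solution.")
    for _ in range(max_rounds):
        issues = check_math(draft)            # verification step
        if not issues:
            return draft                      # accepted: all constraints met
        draft = llm_call(                     # reflection step: targeted revision
            f"Revise the draft to fix: {issues}\n\nDraft:\n{draft}")
    return draft                              # best effort after max_rounds

print(generate_question("triangle geometry"))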


Towards Continuous Assurance with Formal Verification and Assurance Cases

Abeywickrama, Dhaminda B., Fisher, Michael, Wheeler, Frederic, Dennis, Louise

arXiv.org Artificial Intelligence

Autonomous systems must sustain justified confidence in their correctness and safety across their operational lifecycle, from design and deployment through post-deployment evolution. Traditional assurance methods often separate development-time assurance from runtime assurance, yielding fragmented arguments that cannot adapt to runtime changes or system updates, a significant challenge for assured autonomy. To address this, we propose a unified Continuous Assurance Framework that integrates design-time, runtime, and evolution-time assurance within a traceable, model-driven workflow as a step towards assured autonomy. In this paper, we specifically instantiate the design-time phase of the framework using two formal verification methods: RoboChart for functional correctness and PRISM for probabilistic risk analysis. We also propose a model-driven transformation pipeline, implemented as an Eclipse plugin, that automatically regenerates structured assurance arguments whenever formal specifications or their verification results change, thereby ensuring traceability. We demonstrate our approach on a nuclear inspection robot scenario, and discuss its alignment with the Trilateral AI Principles, reflecting regulator-endorsed best practices.
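
The Eclipse plugin itself is not reproduced here; the sketch below only illustrates the traceability mechanism the abstract describes, in which each assurance-case claim records a digest of the formal evidence it rests on, so that a changed verification result immediately flags the arguments needing regeneration. The claim names and evidence strings are invented.

# Illustrative sketch of the traceability idea (not the authors' Eclipse
# plugin): each claim stores a hash of the verification evidence it
# rests on, so a changed result flags exactly the stale arguments.

import hashlib

def digest(text):
    return hashlib.sha256(text.encode()).hexdigest()

# Hypothetical claims linked to made-up PRISM/RoboChart evidence.
claims = {
    "C1: robot avoids radiation zone": digest("PRISM: P<=0.01 [ F overdose ] : true"),
    "C2: mission completes":           digest("RoboChart: deadlock-free : true"),
}

def stale_claims(claims, fresh_evidence):
    """Return claims whose recorded evidence digest no longer matches."""
    return [c for c, h in claims.items()
            if digest(fresh_evidence[c]) != h]

# A re-run of the verifier produced a different result for C2:
fresh = {
    "C1: robot avoids radiation zone": "PRISM: P<=0.01 [ F overdose ] : true",
    "C2: mission completes":           "RoboChart: deadlock-free : FALSE",
}
print(stale_claims(claims, fresh))   # -> ['C2: mission completes']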


A Additional qualitative results

Neural Information Processing Systems

We begin by illustrating successful verification results in Appendix A.1. These standard HRs encompass a multitude of verified patches; for visual clarity, we outline only the SIFT points within the regions (HRs) identified by our method on ROxford. In Appendix A.2 we compare our standard verification results with those of the SP method: to further contextualize our TP's advantages, we juxtapose our topological verification outcomes against SP's, and our method suitably ranks these accurate index images highly. In addition to successful verification instances, we also explore cases where our method fails; in the false negative cases, our method fails to detect any HRs.


Fact in Fragments: Deconstructing Complex Claims via LLM-based Atomic Fact Extraction and Verification

Zheng, Liwen, Li, Chaozhuo, Liu, Zheng, Huang, Feiran, Jia, Haoran, Ye, Zaisheng, Zhang, Xi

arXiv.org Artificial Intelligence

Fact verification plays a vital role in combating misinformation by assessing the veracity of claims through evidence retrieval and reasoning. However, traditional methods struggle with complex claims requiring multi-hop reasoning over fragmented evidence, as they often rely on static decomposition strategies and surface-level semantic retrieval, which fail to capture the nuanced structure and intent of the claim. This results in accumulated reasoning errors, noisy evidence contamination, and limited adaptability to diverse claims, ultimately undermining verification accuracy in complex scenarios. To address this, we propose Atomic Fact Extraction and Verification (AFEV), a novel framework that iteratively decomposes complex claims into atomic facts, enabling fine-grained retrieval and adaptive reasoning. AFEV dynamically refines claim understanding and reduces error propagation through iterative fact extraction, reranks evidence to filter noise, and leverages context-specific demonstrations to guide the reasoning process. Extensive experiments on five benchmark datasets demonstrate that AFEV achieves state-of-the-art performance in both accuracy and interpretability. Fact verification is a critical task aimed at assessing the veracity of claims and typically involves two fundamental stages: evidence retrieval and claim verification. During the evidence retrieval phase, relevant pieces of information are identified from large-scale corpora. In the subsequent verification phase, the retrieved evidence is synthesized and analyzed to determine the veracity of the claim [1].
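
AFEV's prompts and rerankers are not shown in the abstract; the sketch below only captures the control flow it describes: extract one atomic fact at a time, retrieve and filter evidence for that fact alone, and stop early if any atom is refuted. llm and retrieve are hypothetical callables standing in for the paper's components.

# Minimal sketch of AFEV-style iterative decomposition; llm() and
# retrieve() are hypothetical stand-ins for the paper's components.

def verify_claim(claim, llm, retrieve, max_iters=5):
    verified = []
    for _ in range(max_iters):
        # Extraction is conditioned on what has been verified so far,
        # so claim understanding is refined iteratively.
        fact = llm(f"Claim: {claim}\nVerified: {verified}\n"
                   "Extract the next unverified atomic fact, or NONE.")
        if fact == "NONE":
            break
        evidence = retrieve(fact)               # fine-grained retrieval per atom
        evidence = llm(f"Keep only evidence relevant to: {fact}\n{evidence}")  # rerank
        label = llm(f"Fact: {fact}\nEvidence: {evidence}\nSUPPORTED or REFUTED?")
        if label == "REFUTED":
            return "REFUTED", fact              # one refuted atom sinks the claim
        verified.append(fact)
    return "SUPPORTED", verified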


VerifiAgent: a Unified Verification Agent in Language Model Reasoning

Han, Jiuzhou, Buntine, Wray, Shareghi, Ehsan

arXiv.org Artificial Intelligence

Large language models demonstrate remarkable reasoning capabilities but often produce unreliable or incorrect responses. Existing verification methods are typically model-specific or domain-restricted, requiring significant computational resources and lacking scalability across diverse reasoning tasks. To address these limitations, we propose VerifiAgent, a unified verification agent that integrates two levels of verification: meta-verification, which assesses completeness and consistency in model responses, and tool-based adaptive verification, where VerifiAgent autonomously selects appropriate verification tools based on the reasoning type, including mathematical, logical, or commonsense reasoning. This adaptive approach ensures both efficiency and robustness across different verification scenarios. Experimental results show that VerifiAgent outperforms baseline verification methods (e.g., deductive verifier, backward verifier) across all reasoning tasks. Additionally, it can further enhance reasoning accuracy by leveraging feedback from verification results. VerifiAgent can also be effectively applied to inference scaling, achieving better results with fewer generated samples and costs compared to existing process reward models in the mathematical reasoning domain. Code is available at https://github.com/Jiuzhouh/VerifiAgent
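
The actual agent lives in the linked repository; the toy sketch below only illustrates the two-level idea: a cheap meta-check for completeness, then dispatch to a type-appropriate tool, here a recomputation check for arithmetic. All names are invented.

# Toy sketch of the two-level routing idea; see the authors' repository
# for the real agent, which uses an LLM to pick tools.

def check_math(resp):      # e.g. re-run the arithmetic with Python
    return resp["claimed"] == resp["recompute"]()

def check_logic(resp):     # e.g. call a SAT/deduction checker
    return resp["checker"]()

TOOLS = {"mathematical": check_math, "logical": check_logic}

def verify(resp, reasoning_type):
    # Meta-verification first: is the response complete at all?
    if not resp.get("final_answer"):
        return False, "incomplete response"
    tool = TOOLS.get(reasoning_type)
    if tool is None:
        return False, f"no tool for {reasoning_type}"
    return tool(resp), f"checked with {tool.__name__}"

# Hypothetical usage: verify 17 * 24 = 408 by recomputation.
resp = {"final_answer": "408", "claimed": 408, "recompute": lambda: 17 * 24}
print(verify(resp, "mathematical"))   # -> (True, 'checked with check_math')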


Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering

Guan, Xinyan, Liu, Yanjiang, Lu, Xinyu, Cao, Boxi, He, Ben, Han, Xianpei, Sun, Le, Lou, Jie, Yu, Bowen, Lu, Yaojie, Lin, Hongyu

arXiv.org Machine Learning

The evolution of machine learning has increasingly prioritized the development of powerful models and more scalable supervision signals. However, the emergence of foundation models presents significant challenges in providing effective supervision signals necessary for further enhancing their capabilities. Consequently, there is an urgent need to explore novel supervision signals and technical approaches. In this paper, we propose verifier engineering, a novel post-training paradigm specifically designed for the era of foundation models. The core of verifier engineering involves leveraging a suite of automated verifiers to perform verification tasks and deliver meaningful feedback to foundation models. We systematically categorize the verifier engineering process into three essential stages: search, verify, and feedback, and provide a comprehensive review of state-of-the-art research developments within each stage. We believe that verifier engineering constitutes a fundamental pathway toward achieving Artificial General Intelligence.
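
As a schematic of the three stages, the sketch below wires hypothetical generate, verifier, and update callables into one post-training step: sample candidates (search), score each with every verifier (verify), and train toward the best-scoring response (feedback). Nothing here is from the paper beyond the stage names.

# Schematic search-verify-feedback step; generate(), verifiers, and
# update() are hypothetical stand-ins, not the paper's machinery.

def post_train_step(model, prompt, generate, verifiers, update, k=8):
    candidates = generate(model, prompt, n=k)             # SEARCH: sample k responses
    scored = [(sum(v(prompt, c) for v in verifiers), c)   # VERIFY: aggregate verifier scores
              for c in candidates]
    best_score, best = max(scored, key=lambda t: t[0])    # FEEDBACK: select the winner...
    return update(model, prompt, best, best_score)        # ...and push the model toward it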


On the verification of Embeddings using Hybrid Markov Logic

Shakya, Anup, Magar, Abisha Thapa, Sarkhel, Somdeb, Venugopal, Deepak

arXiv.org Artificial Intelligence

The standard approach to verify representations learned by Deep Neural Networks is to use them in specific tasks such as classification or regression, and measure their performance based on accuracy in such tasks. However, in many cases, we would want to verify more complex properties of a learned representation. To do this, we propose a framework based on a probabilistic first-order language, namely, Hybrid Markov Logic Networks (HMLNs) where we specify properties over embeddings mixed with symbolic domain knowledge. We present an approach to learn parameters for the properties within this framework. Further, we develop a verification method to test embeddings in this framework by encoding this task as a Mixed Integer Linear Program for which we can leverage existing state-of-the-art solvers. We illustrate verification in Graph Neural Networks, Deep Knowledge Tracing and Intelligent Tutoring Systems to demonstrate the generality of our approach.
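
The paper's actual HMLN encoding is much richer; the PuLP sketch below only shows the basic move of turning a property check into a mixed integer program, here maximizing the weighted satisfaction of two made-up ground formulas with a standard linearization of conjunction. Weights and formulas are invented; PuLP's bundled CBC solver does the search.

# Toy "verification as MILP" sketch using PuLP (not the authors'
# encoding): find the assignment maximizing the weighted sum of two
# made-up ground formulas, f1: a AND b, f2: NOT a.

import pulp

prob = pulp.LpProblem("hmln_sketch", pulp.LpMaximize)
a = pulp.LpVariable("a", cat="Binary")
b = pulp.LpVariable("b", cat="Binary")
f1 = pulp.LpVariable("f1", cat="Binary")   # truth value of (a AND b)
f2 = pulp.LpVariable("f2", cat="Binary")   # truth value of (NOT a)

# Standard linearization of f1 = a AND b:
prob += f1 <= a
prob += f1 <= b
prob += f1 >= a + b - 1
# f2 = NOT a:
prob += f2 == 1 - a

# Hypothetical learned weights; maximize the weighted satisfied formulas.
prob += 2.0 * f1 + 0.5 * f2
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.value(a), pulp.value(b), pulp.value(prob.objective))  # 1.0 1.0 2.0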


Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

Zhou, Aojun, Wang, Ke, Lu, Zimu, Shi, Weikang, Luo, Sichun, Qin, Zipeng, Lu, Shaoqing, Jia, Anya, Song, Linqi, Zhan, Mingjie, Li, Hongsheng

arXiv.org Artificial Intelligence

Recent progress in large language models (LLMs) like GPT-4 and PaLM-2 has brought significant advancements in addressing math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We found that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit code-based self-verification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. In instances where the verification state registers as "False", the model shall automatically amend its solution, analogous to our approach of rectifying errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on the MATH dataset (from 53.9% to 84.3%). Large language models (LLMs) (Brown et al., 2020; OpenAI, 2023; Anil et al., 2023) have shown impressive success in various tasks, such as common sense understanding and code generation. However, they still fall short in mathematical reasoning, often producing nonsensical or inaccurate content and struggling with complex calculations. Previous attempts to tackle these challenges include the Chain-of-Thought (CoT) (Wei et al., 2022) framework, which enhances LLMs' logical reasoning abilities by generating intermediate steps in their reasoning process.
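
The CSV prompt itself is not reproduced in the abstract; the sketch below just mimics the pattern it describes with SymPy on a toy equation: produce an answer with code, then verify it with more code by substituting it back, yielding a True/False verification state of the kind the paper uses to trigger amendment and to weight votes.

# Sketch of the code-based self-verification (CSV) pattern: solve with
# code, then verify the answer with more code instead of trusting text.

from sympy import Eq, solve, symbols

x = symbols("x")
equation = Eq(3 * x + 7, 22)           # toy stand-in for a MATH problem
candidates = solve(equation, x)        # "solve with code" step -> [5]

def csv_check(eq, value):
    """Verification step: substitute the candidate back into the equation."""
    return eq.lhs.subs(x, value) == eq.rhs

states = [(c, csv_check(equation, c)) for c in candidates]
print(states)   # [(5, True)]: a "True" verification state; a "False"
                # state would trigger the model to amend its solution,
                # and the states weight votes in majority voting.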