python script
Toward Automated and Trustworthy Scientific Analysis and Visualization with LLM-Generated Code
Chakroborti, Apu Kumar, Ding, Yi, Wan, Lipeng
As modern science becomes increasingly data-intensive, the ability to analyze and visualize large-scale, complex datasets is critical to accelerating discovery. However, many domain scientists lack the programming expertise required to develop custom data analysis workflows, creating barriers to timely and effective insight. Large language models (LLMs) offer a promising solution by generating executable code from natural language descriptions. In this paper, we investigate the trustworthiness of open-source LLMs in autonomously producing Python scripts for scientific data analysis and visualization. We construct a benchmark suite of domain-inspired prompts that reflect real-world research tasks and systematically evaluate the executability and correctness of the generated code. Our findings show that, without human intervention, the reliability of LLM-generated code is limited, with frequent failures caused by ambiguous prompts and the models' insufficient understanding of domain-specific contexts. To address these challenges, we design and assess three complementary strategies: data-aware prompt disambiguation, retrieval-augmented prompt enhancement, and iterative error repair. While these methods significantly improve execution success rates and output quality, further refinement is needed. This work highlights both the promise and current limitations of LLM-driven automation in scientific workflows and introduces actionable techniques and a reusable benchmark for building more inclusive, accessible, and trustworthy AI-assisted research tools.
- Government (0.70)
- Health & Medicine (0.46)
- Information Technology (0.46)
ChatGPT Unveils Its Limits: Principles of Law Deliver Checkmate
Molinari, Marianna, Amantea, Ilaria Angela, Quaranta, Marinella, Governatori, Guido
This study examines the performance of ChatGPT with an experiment in the legal domain. We compare the outcome with it a baseline using regular expressions (Regex), rather than focusing solely on the assessment against human performance. The study reveals that even if ChatGPT has access to the necessary knowledge and competencies, it is unable to assemble them, reason through, in a way that leads to an exhaustive result. This unveils a major limitation of ChatGPT. Intelligence encompasses the ability to break down complex issues and address them according to multiple required competencies, providing a unified and comprehensive solution. In the legal domain, one of the most crucial tasks is reading legal decisions and extracting key passages condensed from principles of law (PoLs), which are then incorporated into subsequent rulings by judges or defense documents by lawyers. In performing this task, artificial intelligence lacks an all-encompassing understanding and reasoning, which makes it inherently limited. Genuine intelligence, remains a uniquely human trait, at least in this particular field.
- Oceania > Australia > Queensland (0.04)
- Europe > Italy > Piedmont > Turin Province > Turin (0.04)
- Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
- Europe > Belgium (0.04)
- Research Report > New Finding (0.67)
- Research Report > Experimental Study (0.48)
Text-to-Layout: A Generative Workflow for Drafting Architectural Floor Plans Using LLMs
Duggempudi, Jayakrishna, Gao, Lu, Senouci, Ahmed, Han, Zhe, Zhang, Yunpeng
This paper presents the development of an AI-powered workflow that uses Large Language Models (LLMs) to assist in drafting schematic architectural floor plans from natural language prompts. The proposed system interprets textual input to automatically generate layout options including walls, doors, windows, and furniture arrangements. It combines prompt engineering, a furniture placement refinement algorithm, and Python scripting to produce spatially coherent draft plans compatible with design tools such as Autodesk Revit. A case study of a mid-sized residential layout demonstrates the approach's ability to generate functional and structured outputs with minimal manual effort. The workflow is designed for transparent replication, with all key prompt specifications documented to enable independent implementation by other researchers. In addition, the generated models preserve the full range of Revit-native parametric attributes required for direct integration into professional BIM processes.
Cyber-Zero: Training Cybersecurity Agents without Runtime
Zhuo, Terry Yue, Wang, Dingmin, Ding, Hantian, Kumar, Varun, Wang, Zijian
Large Language Models (LLMs) have achieved remarkable success in software engineering tasks when trained with executable runtime environments, particularly in resolving GitHub issues. However, such runtime environments are often unavailable in other domains, especially cybersecurity, where challenge configurations and execution contexts are ephemeral or restricted. We present Cyber-Zero, the first runtime-free framework for synthesizing high-quality agent trajectories to train cybersecurity LLMs. Cyber-Zero leverages publicly available CTF writeups and employs persona-driven LLM simulation to reverse-engineer runtime behaviors and generate realistic, long-horizon interaction sequences without actual environments. Using trajectories synthesized by Cyber-Zero, we train LLM-based agents that achieve up to 13.1% absolute performance gains over baseline models on three prominent CTF benchmarks: InterCode-CTF, NYU CTF Bench, and Cybench. Our best model, Cyber-Zero-32B, establishes new state-of-the-art performance among open-weight models, matching the capabilities of proprietary systems like DeepSeek-V3-0324 and Claude-3.5-Sonnet while offering superior cost-effectiveness, and demonstrating that runtime-free trajectory synthesis can effectively democratize the development of state-of-the-art cybersecurity agents.
- Europe > Switzerland > Basel-City > Basel (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Africa > Mozambique > Gaza Province > Xai-Xai (0.04)
- Information Technology > Security & Privacy (1.00)
- Government > Military > Cyberwarfare (0.89)
Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in Large Language Models
Zheng, Yujia, Li, Tianhao, Huang, Haotian, Zeng, Tianyu, Lu, Jingyu, Chu, Chuangxin, Huang, Yuekai, Jiang, Ziyou, Xiong, Qian, Ge, Yuyao, Li, Mingyang
Prompt-based adversarial attacks have become an effective means to assess the robustness of large language models (LLMs). However, existing approaches often treat prompts as monolithic text, overlooking their structural heterogeneity-different prompt components contribute unequally to adversarial robustness. Prior works like PromptRobust assume prompts are value-neutral, but our analysis reveals that complex, domain-specific prompts with rich structures have components with differing vulnerabilities. To address this gap, we introduce PromptAnatomy, an automated framework that dissects prompts into functional components and generates diverse, interpretable adversarial examples by selectively perturbing each component using our proposed method, ComPerturb. To ensure linguistic plausibility and mitigate distribution shifts, we further incorporate a perplexity (PPL)-based filtering mechanism. As a complementary resource, we annotate four public instruction-tuning datasets using the PromptAnatomy framework, verified through human review. Extensive experiments across these datasets and five advanced LLMs demonstrate that ComPerturb achieves state-of-the-art attack success rates. Ablation studies validate the complementary benefits of prompt dissection and PPL filtering. Our results underscore the importance of prompt structure awareness and controlled perturbation for reliable adversarial robustness evaluation in LLMs. Code and data are available at https://github.com/Yujiaaaaa/PACP.
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Therapeutic Area (0.69)
Large Language Model-Driven Code Compliance Checking in Building Information Modeling
Madireddy, Soumya, Gao, Lu, Din, Zia, Kim, Kinam, Senouci, Ahmed, Han, Zhe, Zhang, Yunpeng
This research addresses the time-consuming and error-prone nature of manual code compliance checking in Building Information Modeling (BIM) by introducing a Large Language Model (LLM)-driven approach to semi-automate this critical process. The developed system integrates LLMs such as GPT, Claude, Gemini, and Llama, with Revit software to interpret building codes, generate Python scripts, and perform semi-automated compliance checks within the BIM environment. Case studies on a single-family residential project and an office building project demonstrated the system's ability to reduce the time and effort required for compliance checks while improving accuracy. It streamlined the identification of violations, such as non-compliant room dimensions, material usage, and object placements, by automatically assessing relationships and generating actionable reports. Compared to manual methods, the system eliminated repetitive tasks, simplified complex regulations, and ensured reliable adherence to standards. By offering a comprehensive, adaptable, and cost-effective solution, this proposed approach offers a promising advancement in BIM-based compliance checking, with potential applications across diverse regulatory documents in construction projects.
- North America > United States > Texas > Harris County > Houston (0.05)
- North America > United States > California (0.04)
- Europe > Italy > Apulia > Bari (0.04)
- (2 more...)
- Law (1.00)
- Construction & Engineering (1.00)
SentinelAgent: Graph-based Anomaly Detection in Multi-Agent Systems
He, Xu, Wu, Di, Zhai, Yan, Sun, Kun
The rise of large language model (LLM)-based multi-agent systems (MAS) introduces new security and reliability challenges. While these systems show great promise in decomposing and coordinating complex tasks, they also face multi-faceted risks across prompt manipulation, unsafe tool usage, and emergent agent miscoordination. Existing guardrail mechanisms offer only partial protection, primarily at the input-output level, and fall short in addressing systemic or multi-point failures in MAS. In this work, we present a system-level anomaly detection framework tailored for MAS, integrating structural modeling with runtime behavioral oversight. Our approach consists of two components. First, we propose a graph-based framework that models agent interactions as dynamic execution graphs, enabling semantic anomaly detection at node, edge, and path levels. Second, we introduce a pluggable SentinelAgent, an LLM-powered oversight agent that observes, analyzes, and intervenes in MAS execution based on security policies and contextual reasoning. By bridging abstract detection logic with actionable enforcement, our method detects not only single-point faults and prompt injections but also multi-agent collusion and latent exploit paths. We validate our framework through two case studies, including an email assistant and Microsoft's Magentic-One system, demonstrating its ability to detect covert risks and provide explainable root-cause attribution. Our work lays the foundation for more trustworthy, monitorable, and secure agent-based AI ecosystems.
- Research Report (0.64)
- Overview (0.46)
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL
Feng, Yichen, Xu, Zhangchen, Jiang, Fengqing, Li, Yuetai, Ramasubramanian, Bhaskar, Niu, Luyao, Lin, Bill Yuchen, Poovendran, Radha
Vision language models (VLMs) are expected to perform effective multimodal reasoning and make logically coherent decisions, which is critical to tasks such as diagram understanding and spatial problem solving. However, current VLM reasoning lacks large-scale and well-structured training datasets. To bridge this gap, we propose VisualSphinx, a first-of-its-kind large-scale synthetic visual logical reasoning training data. To tackle the challenge of image synthesis with grounding answers, we propose a rule-to-image synthesis pipeline, which extracts and expands puzzle rules from seed questions and generates the code of grounding synthesis image synthesis for puzzle sample assembly. Experiments demonstrate that VLM trained using GRPO on VisualSphinx benefit from logical coherence and readability of our dataset and exhibit improved performance on logical reasoning tasks. The enhanced reasoning capabilities developed from VisualSphinx also benefit other reasoning tasks such as algebraic reasoning, arithmetic reasoning and geometry reasoning.
- North America > United States (0.28)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Singapore (0.04)
- (2 more...)
Multi-Agent Systems Execute Arbitrary Malicious Code
Triedman, Harold, Jha, Rishi, Shmatikov, Vitaly
Multi-agent systems coordinate LLM-based agents to perform tasks on users' behalf. In real-world applications, multi-agent systems will inevitably interact with untrusted inputs, such as malicious Web content, files, email attachments, etc. Using several recently proposed multi-agent frameworks as concrete examples, we demonstrate that adversarial content can hijack control and communication within the system to invoke unsafe agents and functionalities. This results in a complete security breach, up to execution of arbitrary malicious code on the user's device and/or exfiltration of sensitive data from the user's containerized environment. We show that control-flow hijacking attacks succeed even if the individual agents are not susceptible to direct or indirect prompt injection, and even if they refuse to perform harmful actions.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- (4 more...)
- Research Report (0.84)
- Workflow (0.67)
Multi-Agent Actor-Critic Generative AI for Query Resolution and Analysis
Rahman, Mohammad Wali Ur, Nevarez, Ric, Mim, Lamia Tasnim, Hariri, Salim
In this paper, we introduce MASQRAD (Multi-Agent Strategic Query Resolution and Diagnostic tool), a transformative framework for query resolution based on the actor-critic model, which utilizes multiple generative AI agents. MASQRAD is excellent at translating imprecise or ambiguous user inquiries into precise and actionable requests. This framework generates pertinent visualizations and responses to these focused queries, as well as thorough analyses and insightful interpretations for users. MASQRAD addresses the common shortcomings of existing solutions in domains that demand fast and precise data interpretation, such as their incapacity to successfully apply AI for generating actionable insights and their challenges with the inherent ambiguity of user queries. MASQRAD functions as a sophisticated multi-agent system but "masquerades" to users as a single AI entity, which lowers errors and enhances data interaction. This approach makes use of three primary AI agents: Actor Generative AI, Critic Generative AI, and Expert Analysis Generative AI. Each is crucial for creating, enhancing, and evaluating data interactions. The Actor AI generates Python scripts to generate data visualizations from large datasets within operational constraints, and the Critic AI rigorously refines these scripts through multi-agent debate. Finally, the Expert Analysis AI contextualizes the outcomes to aid in decision-making. With an accuracy rate of 87\% when handling tasks related to natural language visualization, MASQRAD establishes new benchmarks for automated data interpretation and showcases a noteworthy advancement that has the potential to revolutionize AI-driven applications.
- North America > United States > California (0.14)
- North America > United States > Arizona > Pima County > Tucson (0.14)
- North America > United States > Ohio (0.04)
- (2 more...)
- Research Report (1.00)
- Overview > Innovation (0.46)
- Information Technology > Security & Privacy (0.67)
- Government > Regional Government (0.46)