log file
Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting
Arora, Nirmit, Joel, Sathvik, Kavathekar, Ishan, Palak, Gandhi, Rohan, Pandya, Yash, Ganu, Tanuja, Kanade, Aditya, Nambi, Akshay
LLM-based agents are increasingly deployed in multi-agent systems (MAS). As these systems move toward real-world applications, their security becomes paramount. Existing research largely evaluates single-agent security, leaving a critical gap in understanding the vulnerabilities introduced by multi-agent design. However, existing systems fall short due to the lack of unified frameworks and metrics focusing on unique rejection modes in MAS. We present SafeAgents, a unified and extensible framework for fine-grained security assessment of MAS. SafeAgents systematically exposes how design choices such as plan construction strategies, inter-agent context sharing, and fallback behaviors affect susceptibility to adversarial prompting. We introduce Dharma, a diagnostic measure that helps identify weak links within multi-agent pipelines. Using SafeAgents, we conduct a comprehensive study across five widely adopted multi-agent architectures (centralized, decentralized, and hybrid variants) on four datasets spanning web tasks, tool use, and code generation. Our findings reveal that common design patterns carry significant vulnerabilities. For example, centralized systems that delegate only atomic instructions to sub-agents obscure harmful objectives, reducing robustness. Our results highlight the need for security-aware design in MAS. Code is available at https://github.com/microsoft/SafeAgents
Supplementary Material: Learning Distilled Collaboration Graph for Multi-Agent Perception
Vehicles are spawned in CARLA via SUMO and managed by the Traffic Manager. We employ the dataset format of nuScenes and extend it to multi-agent scenarios, as seen in Fig. IV. Each log file can produce 100 scenes, and each scene includes 100 frames. The input BEV map's dimension is (c, w, h) = (13, 256, 256). II.1 Architecture of student/teacher encoder. We describe the architecture of the encoder below.
- Transportation > Passenger (1.00)
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks > Manufacturer (1.00)
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents
Naik, Akshat, Quinn, Patrick, Bosch, Guillermo, Gouné, Emma, Zabala, Francisco Javier Campos, Brown, Jason Ross, Young, Edward James
As Large Language Model (LLM) agents become more widespread, associated misalignment risks increase. While prior research has studied agents' ability to produce harmful outputs or follow malicious instructions, it remains unclear how likely agents are to spontaneously pursue unintended goals in realistic deployments. In this work, we approach misalignment as a conflict between the internal goals pursued by the model and the goals intended by its deployer. We introduce AgentMisalignment, a benchmark suite designed to evaluate the propensity of LLM agents to misalign in realistic scenarios. Evaluations cover behaviours such as avoiding oversight, resisting shutdown, sandbagging, and power-seeking. Testing frontier models, we find that more capable agents tend to exhibit higher misalignment on average. We also systematically vary agent personalities through different system prompts and observe that persona characteristics can strongly and unpredictably influence misalignment, sometimes more than the choice of model itself. Our results reveal the limitations of current alignment methods for autonomous LLM agents and underscore the need to rethink misalignment in realistic deployment settings.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Virginia (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (0.92)
- Government (0.67)
SchemaCoder: Automatic Log Schema Extraction Coder with Residual Q-Tree Boosting
Wan, Lily Jiaxin, Ho, Chia-Tung, Liang, Rongjian, Yu, Cunxi, Chen, Deming, Ren, Haoxing
Log schema extraction is the process of deriving human-readable templates from massive volumes of log data, which is essential yet notoriously labor-intensive. Recent studies have attempted to streamline this task by leveraging Large Language Models (LLMs) for automated schema extraction. However, existing methods invariably rely on predefined regular expressions, necessitating human domain expertise and severely limiting productivity gains. To fundamentally address this limitation, we introduce SchemaCoder, the first fully automated schema extraction framework applicable to a wide range of log file formats without requiring human customization within the flow. At its core, SchemaCoder features a novel Residual Question-Tree (Q-Tree) Boosting mechanism that iteratively refines schema extraction through targeted, adaptive queries driven by LLMs. Specifically, our method partitions logs into semantic chunks via context-bounded segmentation, selects representative patterns using embedding-based sampling, and generates schema code through hierarchical Q-Tree-driven LLM queries, iteratively refined by our textual-residual evolutionary optimizer and residual boosting. Experimental validation demonstrates SchemaCoder's superiority on the widely used LogHub-2.0 benchmark, achieving an average improvement of 21.3% over state-of-the-art methods.
- North America > United States > Maryland > Prince George's County > College Park (0.04)
- North America > United States > Illinois > Champaign County > Champaign (0.04)
"Give me the code" -- Log Analysis of First-Year CS Students' Interactions With GPT
Alves, Pedro, Cipriano, Bruno Pereira
The impact of Large Language Models (LLMs) like GPT-3, GPT-4, and Bard in computer science (CS) education is expected to be profound. Students now have the power to generate code solutions for a wide array of programming assignments. For first-year students, this may be particularly problematic since the foundational skills are still in development and an over-reliance on generative AI tools can hinder their ability to grasp essential programming concepts. This paper analyzes the prompts used by 69 freshmen undergraduate students to solve a certain programming problem within a project assignment, without giving them prior prompt training. We also present the rules of the exercise that motivated the prompts, designed to foster critical thinking skills during the interaction. Despite using unsophisticated prompting techniques, our findings suggest that the majority of students successfully leveraged GPT, incorporating the suggested solutions into their projects. Additionally, half of the students demonstrated the ability to exercise judgment in selecting from multiple GPT-generated solutions, showcasing the development of their critical thinking skills in evaluating AI-generated code.
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Portugal (0.04)
- Asia > India > Telangana > Hyderabad (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.35)
What makes a good BIM design: quantitative linking between design behavior and quality
Ni, Xiang-Rui, Pan, Peng, Lin, Jia-Rui
In the Architecture, Engineering & Construction (AEC) industry, how design behaviors impact design quality remains unclear. This study proposes a novel approach which, for the first time, identifies and quantitatively describes the relationship between design behaviors and quality of design based on Building Information Modeling (BIM). Real-time collection and log mining are integrated to collect raw data of design behaviors. Feature engineering and various machine learning models are then utilized for quantitative modeling and interpretation. Results confirm that a quantifiable relationship exists and can be learned by various models. The best-performing model, using Extremely Random Trees, achieved an R² value of 0.88 on the test set. Behavioral features related to the designer's skill level and changes of design intentions are identified as having significant impacts on design quality. These findings deepen our understanding of the design process and help form BIM designs of better quality.
- Research Report > New Finding (1.00)
- Instructional Material > Course Syllabus & Notes (0.93)
- Construction & Engineering (1.00)
- Information Technology > Security & Privacy (0.46)
Diagnosing Robotics Systems Issues with Large Language Models
Herrmann, Jordis Emilia, Gopinath, Aswath Mandakath, Norrlof, Mikael, Müller, Mark Niklas
Quickly resolving issues reported in industrial applications is crucial to minimize economic impact. However, the required data analysis makes diagnosing the underlying root causes a challenging and time-consuming task, even for experts. In contrast, large language models (LLMs) excel at analyzing large amounts of data. Indeed, prior work in AI-Ops demonstrates their effectiveness in analyzing IT systems. Here, we extend this work to the challenging and largely unexplored domain of robotics systems. To this end, we create SYSDIAGBENCH, a proprietary system diagnostics benchmark for robotics, containing over 2500 reported issues. We leverage SYSDIAGBENCH to investigate the performance of LLMs for root cause analysis, considering a range of model sizes and adaptation techniques. Our results show that QLoRA finetuning can be sufficient to let a 7B-parameter model outperform GPT-4 in terms of diagnostic accuracy while being significantly more cost-effective. We validate our LLM-as-a-judge results with a human expert study and find that our best model achieves approval ratings similar to those of our reference labels.
- Europe > Switzerland > Zürich > Zürich (0.04)
- Europe > Sweden > Östergötland County > Linköping (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology (0.94)
- Health & Medicine > Diagnostic Medicine (0.34)
Towards Explainable Evolution Strategies with Large Language Models
This paper introduces an approach that integrates self-adaptive Evolution Strategies (ES) with Large Language Models (LLMs) to enhance the explainability of complex optimization processes. By employing a self-adaptive ES equipped with a restart mechanism, we effectively navigate the challenging landscapes of benchmark functions, capturing detailed logs of the optimization journey, including fitness evolution, step-size adjustments, and restart events due to stagnation. An LLM is then utilized to process these logs, generating concise, user-friendly summaries that highlight key aspects such as convergence behavior, optimal fitness achievements, and encounters with local optima. Our case study on the Rastrigin function demonstrates how our approach makes the complexities of ES optimization transparent and accessible. Our findings highlight the potential of using LLMs to bridge the gap between advanced optimization algorithms and their interpretability.
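The kind of optimization log described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple (1, λ) evolution strategy with log-normal step-size self-adaptation and a stagnation-triggered restart, and all names (`run_es`, `summarize_log`, `patience`, the log field names) are hypothetical. The summary function only builds plain text that could be handed to an LLM; no LLM is called.

```python
import math
import random


def rastrigin(x):
    """Rastrigin benchmark function; its global minimum is 0 at the origin."""
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)


def run_es(dim=5, lam=20, generations=200, patience=25, seed=0):
    """Assumed (1, lambda)-ES with self-adaptive step size and restarts.

    Returns the best fitness seen and a per-generation log of dicts.
    """
    rng = random.Random(seed)
    tau = 1.0 / math.sqrt(dim)  # learning rate for the step-size mutation

    def fresh_parent():
        # Random start in the usual Rastrigin domain, step size reset to 1.0.
        return [rng.uniform(-5.12, 5.12) for _ in range(dim)], 1.0

    parent, sigma = fresh_parent()
    run_best = rastrigin(parent)   # best fitness since the last restart
    best_overall = run_best        # best fitness over all restarts
    stall = 0                      # generations without improvement
    log = []

    for gen in range(generations):
        offspring = []
        for _ in range(lam):
            # Mutate the step size first, then the solution with the new step size.
            child_sigma = sigma * math.exp(tau * rng.gauss(0, 1))
            child = [xi + child_sigma * rng.gauss(0, 1) for xi in parent]
            offspring.append((rastrigin(child), child, child_sigma))

        # Comma selection: the best offspring replaces the parent.
        fit, parent, sigma = min(offspring, key=lambda t: t[0])
        best_overall = min(best_overall, fit)

        if fit < run_best:
            run_best, stall = fit, 0
        else:
            stall += 1

        entry = {"gen": gen, "fitness": fit, "sigma": sigma, "event": "step"}
        if stall >= patience:
            # Stagnation: record the restart event and re-initialise the search.
            entry["event"] = "restart"
            parent, sigma = fresh_parent()
            run_best = rastrigin(parent)
            best_overall = min(best_overall, run_best)
            stall = 0
        log.append(entry)

    return best_overall, log


def summarize_log(log):
    """Condense the log into plain text suitable as input to an LLM prompt."""
    restarts = [e["gen"] for e in log if e["event"] == "restart"]
    best = min(e["fitness"] for e in log)
    return (
        f"generations={len(log)}; best_offspring_fitness={best:.4f}; "
        f"restarts={len(restarts)} at generations {restarts}; "
        f"final_sigma={log[-1]['sigma']:.4g}"
    )


if __name__ == "__main__":
    best, log = run_es()
    print(summarize_log(log))
```

Each log entry records the three quantities the abstract mentions (fitness, step size, restart events), so a downstream summarizer sees when the search stalled in a local optimum and restarted.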
Automated Generation of Multiple-Choice Cloze Questions for Assessing English Vocabulary Using GPT-turbo 3.5
Wang, Qiao, Rose, Ralph, Orita, Naho, Sugawara, Ayaka
A common way of assessing language learners' mastery of vocabulary is via multiple-choice cloze (i.e., fill-in-the-blank) questions. But the creation of test items can be laborious for individual teachers or in large-scale language programs. In this paper, we evaluate a new method for automatically generating these types of questions using large language models (LLMs). The VocaTT (vocabulary teaching and training) engine is written in Python and comprises three basic steps: pre-processing target word lists, generating sentences and candidate word options using GPT, and finally selecting suitable word options. To test the efficiency of this system, 60 questions were generated targeting academic words. The generated items were reviewed by expert reviewers who judged the well-formedness of the sentences and word options, adding comments to items judged not well-formed. Results showed a 75% rate of well-formedness for sentences and a 66.85% rate for suitable word options. This is a marked improvement over the generator used earlier in our research, which did not take advantage of GPT's capabilities. Post-hoc qualitative analysis reveals several points for improvement in future work, including cross-referencing part-of-speech tagging, better sentence validation, and improving GPT prompts.
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > France (0.04)