AITopics

In this work, we report what happens when two large language models respond to each other for many turns without any outside input in a multi-agent setup. The setup begins with a short seed sentence. After that, each model reads the other's output and generates a response. This continues for a fixed number of steps. We used Mistral Nemo Base 2407 and Llama 2 13B hf. We observed that most conversations start coherently but later fall into repetition. In many runs, a short phrase appears and repeats across turns. Once repetition begins, both models tend to produce similar output rather than introducing a new direction in the conversation. This leads to a loop where the same or similar text is produced repeatedly. We describe this behavior as a form of convergence. It occurs even though the models are large, trained separately, and not given any prompt instructions. To study this behavior, we apply lexical and embedding-based metrics to measure how far the conversation drifts from the initial seed and how similar the outputs of the two models becomes as the conversation progresses.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

2512.06256

Country: North America > United States > West Virginia (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Masters, Charlie, Grześkiewicz, Marta, Albrecht, Stefano V.

ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment

As agents based on large language models are increasingly deployed to long-horizon tasks, maintaining their alignment with stakeholder preferences becomes critical. Effective alignment in such settings requires reward models that are interpretable so that stakeholders can understand and audit model objectives. Moreover, reward models must be capable of steering agents at interaction time, allowing preference shifts to be incorporated without retraining. We introduce ARCANE, a framework that frames alignment as a multi-agent collaboration problem that dynamically represents stakeholder preferences as natural-language rubrics: weighted sets of verifiable criteria that can be generated on-the-fly from task context. Inspired by utility theory, we formulate rubric learning as a reconstruction problem and apply a regularized Group-Sequence Policy Optimization (GSPO) procedure that balances interpretability, faithfulness, and computational efficiency. Using a corpus of 219 labeled rubrics derived from the GDPV al benchmark, we evaluate ARCANE on challenging tasks requiring multi-step reasoning and tool use. The learned rubrics produce compact, legible evaluations and enable configurable trade-offs (e.g., correctness vs. conciseness) without retraining. Our results show that rubric-based reward models offer a promising path toward interpretable, test-time adaptive alignment for complex, long-horizon AI systems.

artificial intelligence, machine learning, natural language, (19 more...)

2512.06196

Country:

Europe > United Kingdom > England (0.28)
Europe > Austria (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Das, Ratnangshu, Upadhyay, Siddhartha, Jagtap, Pushpak

Real-Time Spatiotemporal Tubes for Dynamic Unsafe Sets

This paper presents a real-time control framework for nonlinear pure-feedback systems with unknown dynamics to satisfy reach-avoid-stay tasks within a prescribed time in dynamic environments. To achieve this, we introduce a real-time spatiotemporal tube (STT) framework. An STT is defined as a time-varying ball in the state space whose center and radius adapt online using only real-time sensory input. A closed-form, approximation-free control law is then derived to constrain the system output within the STT, ensuring safety and task satisfaction. We provide formal guarantees for obstacle avoidance and on-time task completion. The effectiveness and scalability of the framework are demonstrated through simulations and hardware experiments on a mobile robot and an aerial vehicle, navigating in cluttered dynamic environments.

artificial intelligence, obstacle, real time system, (19 more...)

2512.06151

Country: Asia > India (0.29)

Genre: Research Report (0.50)

Industry: Government (0.46)

Technology:

Information Technology > Architecture > Real Time Systems (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.49)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.46)

Hariharan, Mohanakrishnan

Reinforcement Learning Integrated Agentic RAG for Software Test Cases Authoring

This paper introduces a framework that integrates reinforcement learning (RL) with autonomous agents to enable continuous improvement in the automated process of software test cases authoring from business requirement documents within Quality Engineering (QE) workflows. Conventional systems employing Large Language Models (LLMs) generate test cases from static knowledge bases, which fundamentally limits their capacity to enhance performance over time. Our proposed Reinforcement Infused Agentic RAG (Retrieve, Augment, Generate) framework overcomes this limitation by employing AI agents that learn from QE feedback, assessments, and defect discovery outcomes to automatically improve their test case generation strategies. The system combines specialized agents with a hybrid vector-graph knowledge base that stores and retrieves software testing knowledge. Through advanced RL algorithms, specifically Proximal Policy Optimization (PPO) and Deep Q-Networks (DQN), these agents optimize their behavior based on QE-reported test effectiveness, defect detection rates, and workflow metrics. As QEs execute AI-generated test cases and provide feedback, the system learns from this expert guidance to improve future iterations. Experimental validation on enterprise Apple projects yielded substantive improvements: a 2.4% increase in test generation accuracy (from 94.8% to 97.2%), and a 10.8% improvement in defect detection rates. The framework establishes a continuous knowledge refinement loop driven by QE expertise, resulting in progressively superior test case quality that enhances, rather than replaces, human testing capabilities.

machine learning, natural language, reinforcement learning, (14 more...)

2512.0606

Country: North America > United States (0.14)

Genre:

Instructional Material > Course Syllabus & Notes (0.54)
Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Beyond Prototyping: Autonomous, Enterprise-Grade Frontend Development from Pixel to Production via a Specialized Multi-Agent Framework

Ganesaraja, Ramprasath, N, Swathika, AP, Saravanan, Rathinasamy, Kamalkumar, Amancharla, Chetana, Das, Rahul, Panse, Sahil Dilip, Batwe, Aditya, Vijayan, Dileep, Ashok, Veena, P, Thanushree A, Rao, Kausthubh J, Olivero, Alden, Roshan, null, Manthena, Rajeshwar Reddy, A, Asmitha Yuga Sre, Tripathi, Harsh, Selvaraj, Suganya, Chin, Vito, Bhaskar, Kasthuri Rangan, Bhaskar, Kasthuri Rangan, R, Venkatraman, Vijayakumar, Sajit

We present AI4UI, a framework of autonomous front-end development agents purpose-built to meet the rigorous requirements of enterprise-grade application delivery. Unlike general-purpose code assistants designed for rapid prototyping, AI4UI focuses on production readiness delivering secure, scalable, compliant, and maintainable UI code integrated seamlessly into enterprise workflows. AI4UI operates with targeted human-in-the-loop involvement: at the design stage, developers embed a Gen-AI-friendly grammar into Figma prototypes to encode requirements for precise interpretation; and at the post processing stage, domain experts refine outputs for nuanced design adjustments, domain-specific optimizations, and compliance needs. Between these stages, AI4UI runs fully autonomously, converting designs into engineering-ready UI code. Technical contributions include a Figma grammar for autonomous interpretation, domain-aware knowledge graphs, a secure abstract/package code integration strategy, expertise driven architecture templates, and a change-oriented workflow coordinated by specialized agent roles. In large-scale benchmarks against industry baselines and leading competitor systems, AI4UI achieved 97.24% platform compatibility, 87.10% compilation success, 86.98% security compliance, 78.00% feature implementation success, 73.50% code-review quality, and 73.36% UI/UX consistency. In blind preference studies with 200 expert evaluators, AI4UI emerged as one of the leaders demonstrating strong competitive standing among leading solutions. Operating asynchronously, AI4UI generates thousands of validated UI screens in weeks rather than months, compressing delivery timeline

agent, ai4ui, artificial intelligence, (17 more...)

2512.06046

Genre:

Workflow (1.00)
Research Report > Experimental Study (0.34)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization

Xuefei, null, Wang, null, Horstmann, Kai A., Lin, Ethan, Chen, Jonathan, Farhang, Alexander R., Stiles, Sophia, Sehgal, Atharva, Light, Jonathan, Van Valen, David, Yue, Yisong, Sun, Jennifer J.

Adapting production-level computer vision tools to bespoke scientific datasets is a critical "last mile" bottleneck. Current solutions are impractical: fine-tuning requires large annotated datasets scientists often lack, while manual code adaptation costs scientists weeks to months of effort. W e consider using AI agents to automate this manual coding, and focus on the open question of optimal agent design for this targeted task. W e introduce a systematic evaluation framework for agentic code optimization and use it to study three production-level biomedical imaging pipelines. W e demonstrate that a simple agent framework consistently generates adaptation code that outperforms human-expert solutions. Our analysis reveals that common, complex agent architectures are not universally beneficial, leading to a practical roadmap for agent design.

artificial intelligence, deep learning, machine learning, (17 more...)

2512.06006

Country: North America > United States (1.00)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Briman, Eyal, Shapiro, Ehud, Talmon, Nimrod

AI-Generated Compromises for Coalition Formation: Modeling, Simulation, and a Textual Case Study

The challenge of finding compromises between agent proposals is fundamental to AI sub-fields such as argumentation, mediation, and negotiation. Building on this tradition, Elkind et al. (2021) introduced a process for coalition formation that seeks majority-supported proposals preferable to the status quo, using a metric space where each agent has an ideal point. The crucial step in this iterative process involves identifying compromise proposals around which agent coalitions can unite. How to effectively find such compromise proposals, however, remains an open question. We address this gap by formalizing a holistic model that encompasses agent bounded rationality and uncertainty and developing AI models to generate such compromise proposals. We focus on the domain of collaboratively writing text documents -- e.g., to enable the democratic creation of a community constitution. We apply NLP (Natural Language Processing) techniques and utilize LLMs (Large Language Models) to create a semantic metric space for text and develop algorithms to suggest suitable compromise points. To evaluate the effectiveness of our algorithms, we simulate various coalition formation processes and demonstrate the potential of AI to facilitate large-scale democratic text editing, such as collaboratively drafting a constitution, an area where traditional tools are limited.

large language model, machine learning, natural language, (21 more...)

doi: 10.4204/EPTCS.437.32

2512.05983

Country:

Europe (0.67)
North America > United States (0.46)
Asia > Middle East > Israel (0.14)

Genre: Research Report (0.64)

Industry: Law (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

FlockVote: LLM-Empowered Agent-Based Modeling for Simulating U.S. Presidential Elections

Zhou, Lingfeng, Xu, Yi, Wang, Zhenyu, Wang, Dequan

Modeling complex human behavior, such as voter decisions in national elections, is a long-standing challenge for computational social science. Traditional agent-based models (ABMs) are limited by oversimplified rules, while large-scale statistical models often lack interpretability. We introduce FlockVote, a novel framework that uses Large Language Models (LLMs) to build a "computational laboratory" of LLM agents for political simulation. Each agent is instantiated with a high-fidelity demographic profile and dynamic contextual information (e.g. candidate policies), enabling it to perform nuanced, generative reasoning to simulate a voting decision. We deploy this framework as a testbed on the 2024 U.S. Presidential Election, focusing on seven key swing states. Our simulation's macro-level results successfully replicate the real-world outcome, demonstrating the high fidelity of our "virtual society". The primary contribution is not only the prediction, but also the framework's utility as an interpretable research tool. FlockVote moves beyond black-box outputs, allowing researchers to probe agent-level rationale and analyze the stability and sensitivity of LLM-driven social simulations.

large language model, machine learning, natural language, (20 more...)

2512.05982

Country: North America > United States (1.00)

Genre:

Questionnaire & Opinion Survey (0.92)
Research Report > New Finding (0.67)

Industry:

Government > Voting & Elections (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

PPTArena: A Benchmark for Agentic PowerPoint Editing

Ofengenden, Michael, Man, Yunze, Pang, Ziqi, Wang, Yu-Xiong

W e introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2,125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.

benchmark, large language model, machine learning, (23 more...)

2512.03042

Country:

North America > United States (0.92)
Europe > Austria > Vienna (0.14)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
Information Technology > Communications (0.93)
(3 more...)

Large Language Models Miss the Multi-Agent Mark

La Malfa, Emanuele, La Malfa, Gabriele, Marro, Samuele, Zhang, Jie M., Black, Elizabeth, Luck, Michael, Torr, Philip, Wooldridge, Michael

Recent interest in Multi-Agent Systems of Large Language Models (MAS LLMs) has led to an increase in frameworks leveraging multiple LLMs to tackle complex tasks. However, much of this literature appropriates the terminology of MAS without engaging with its foundational principles. In this position paper, we highlight critical discrepancies between MAS theory and current MAS LLMs implementations, focusing on four key areas: the social aspect of agency, environment design, coordination and communication protocols, and measuring emergent behaviours. Our position is that many MAS LLMs lack multi-agent characteristics such as autonomy, social interaction, and structured environments, and often rely on oversimplified, LLM-centric architectures. The field may slow down and lose traction by revisiting problems the MAS literature has already addressed. Therefore, we systematically analyse this issue and outline associated research opportunities; we advocate for better integrating established MAS concepts and more precise terminology to avoid mischaracterisation and missed opportunities.

artificial intelligence, large language model, natural language, (16 more...)

2505.21298

Country:

North America > United States (0.28)
Europe > United Kingdom > England (0.28)

Genre:

Research Report (1.00)
Overview (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)