Goto

Collaborating Authors

 Agents


InferA: A Smart Assistant for Cosmological Ensemble Data

arXiv.org Artificial Intelligence

Analyzing large-scale scientific datasets presents substantial challenges due to their sheer volume, structural complexity, and the need for specialized domain knowledge. Automation tools, such as PandasAI, typically require full data ingestion and lack context of the full data structure, making them impractical as intelligent data analysis assistants for datasets at the terabyte scale. To overcome these limitations, we propose InferA, a multi-agent system that leverages large language models to enable scalable and efficient scientific data analysis. At the core of the architecture is a supervisor agent that orchestrates a team of specialized agents responsible for distinct phases of the data retrieval and analysis. The system engages interactively with users to elicit their analytical intent and confirm query objectives, ensuring alignment between user goals and system actions. To demonstrate the framework's usability, we evaluate the system using ensemble runs from the HACC cosmology simulation which comprises several terabytes.


From Literal to Liberal: A Meta-Prompting Framework for Eliciting Human-Aligned Exception Handling in Large Language Models

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly being deployed as the reasoning engines for agentic AI systems, yet they exhibit a critical flaw: a rigid adherence to explicit rules that leads to decisions misaligned with human common sense and intent. This "rule-rigidity" is a significant barrier to building trustworthy autonomous agents. While prior work has shown that supervised fine-tuning (SFT) with human explanations can mitigate this issue, SFT is computationally expensive and inaccessible to many practitioners. To address this gap, we introduce the Rule-Intent Distinction (RID) Framework, a novel, low-compute meta-prompting technique designed to elicit human-aligned exception handling in LLMs in a zero-shot manner. The RID framework provides the model with a structured cognitive schema for deconstructing tasks, classifying rules, weighing conflicting outcomes, and justifying its final decision. We evaluated the RID framework against baseline and Chain-of-Thought (CoT) prompting on a custom benchmark of 20 scenarios requiring nuanced judgment across diverse domains. Our human-verified results demonstrate that the RID framework significantly improves performance, achieving a 95% Human Alignment Score (HAS), compared to 80% for the baseline and 75% for CoT. Furthermore, it consistently produces higher-quality, intent-driven reasoning. This work presents a practical, accessible, and effective method for steering LLMs from literal instruction-following to liberal, goal-oriented reasoning, paving the way for more reliable and pragmatic AI agents.


VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages

arXiv.org Artificial Intelligence

Vision Language Models (VLMs) are pivotal for advancing perception in intelligent agents. Yet, evaluation of VLMs remains limited to predominantly English-centric benchmarks in which the image-text pairs comprise short texts. To evaluate VLM fine-grained abilities, in four languages under long-text settings, we introduce a novel multilingual benchmark VLURes featuring eight vision-and-language tasks, and a pioneering unrelatedness task, to probe the fine-grained Visual and Linguistic Understanding capabilities of VLMs across English, Japanese, and low-resource languages, Swahili, and Urdu. Our datasets, curated from web resources in the target language, encompass ten diverse image categories and rich textual context, introducing valuable vision-language resources for Swahili and Urdu. By prompting VLMs to generate responses and rationales, evaluated automatically and by native speakers, we uncover performance disparities across languages and tasks critical to intelligent agents, such as object recognition, scene understanding, and relationship understanding. We conducted evaluations of ten VLMs with VLURes. The best performing model, GPT-4o, achieves an overall accuracy of 90.8% and lags human performance by 6.7%, though the gap is larger for open-source models. The gap highlights VLURes' critical role in developing intelligent agents to tackle multi-modal visual reasoning.


AutoPR: Let's Automate Your Academic Promotion!

arXiv.org Artificial Intelligence

As the volume of peer-reviewed research surges, scholars increasingly rely on social platforms for discovery, while authors invest considerable effort in promoting their work to ensure visibility and citations. To streamline this process and reduce the reliance on human effort, we introduce Automatic Promotion (AutoPR), a novel task that transforms research papers into accurate, engaging, and timely public content. To enable rigorous evaluation, we release PRBench, a multimodal benchmark that links 512 peer-reviewed articles to high-quality promotional posts, assessing systems along three axes: Fidelity (accuracy and tone), Engagement (audience targeting and appeal), and Alignment (timing and channel optimization). We also introduce PRAgent, a multi-agent framework that automates AutoPR in three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform-specific adaptation to optimize norms, tone, and tagging for maximum reach. When compared to direct LLM pipelines on PRBench, PRAgent demonstrates substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute the most to these gains. Our results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication.


Coordination Requires Simplification: Thermodynamic Bounds on Multi-Objective Compromise in Natural and Artificial Intelligence

arXiv.org Artificial Intelligence

Information-processing systems that coordinate multiple agents and objectives face fundamental thermodynamic constraints. We show that solutions with maximum utility to act as coordination focal points have a much higher selection pressure for being findable across agents rather than accuracy. We derive that the information-theoretic minimum description length of coordination protocols to precision $\varepsilon$ scales as $L(P)\geq NK\log_2 K+N^2d^2\log (1/\varepsilon)$ for $N$ agents with $d$ potentially conflicting objectives and internal model complexity $K$. This scaling forces progressive simplification, with coordination dynamics changing the environment itself and shifting optimization across hierarchical levels. Moving from established focal points requires re-coordination, creating persistent metastable states and hysteresis until significant environmental shifts trigger phase transitions through spontaneous symmetry breaking. We operationally define coordination temperature to predict critical phenomena and estimate coordination work costs, identifying measurable signatures across systems from neural networks to restaurant bills to bureaucracies. Extending the topological version of Arrow's theorem on the impossibility of consistent preference aggregation, we find it recursively binds whenever preferences are combined. This potentially explains the indefinite cycling in multi-objective gradient descent and alignment faking in Large Language Models trained with reinforcement learning with human feedback. We term this framework Thermodynamic Coordination Theory (TCT), which demonstrates that coordination requires radical information loss.


FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

arXiv.org Artificial Intelligence

GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.


Winners and Losers of the AI Revolution: Artificial Intelligence Is Radically Changing the Employment Landscape

Der Spiegel International

Artificial intelligence is becoming a permanent element in the world of work, with Silicon Valley calling it the dawning of a new age. Many people are afraid of losing their job, but Germany is well-prepared. In the northern part of the U.S. state of Louisiana, right next to the prison on the outskirts of Shreveport, looms a gigantic building of concrete and steel. Welcome to the future," reads a colorful greeting painted on the wall at the entrance, right next to the obligatory American flag. It is 9:30 a.m., a busy time of day. Yet the halls and corridors of SHV1, as the building is referred to internally, are completely empty of people. A blueprint for the future," as the site manager calls it. The Seattle-based company operates the largest fleet of industrial robots in the world, more than a million of them, and many are outfitted with artificial intelligence, helping them to lift, sort, search, weigh and scan. Guided and directed completely by AI. Without the massive use of this technology," says Aaron Parness, a former NASA aerospace engineer who now heads up the retail giant's AI robotic department, we would be a different company." The article you are reading originally appeared in German in issue 41/2025 (October 2nd, 2025) of DER SPIEGEL. Amazon, though, also employs people. But their role is changing rapidly.


Future-proofing business capabilities with AI technologies

MIT Technology Review

AI agents are moving from pilots to enterprise scale, but trust and governance remain the linchpins of success, says Cloudera's Manasi Vartak, Forrester's Mike Gualtieri, and AWS' Eddie Kim. Artificial intelligence has always promised speed, efficiency, and new ways of solving problems. But what's changed in the past few years is how quickly those promises are becoming reality. From oil and gas to retail, logistics to law, AI is no longer confined to pilot projects or speculative labs. It is being deployed in critical workflows, reducing processes that once took hours to just minutes, and freeing up employees to focus on higher-value work. "Business process automation has been around a long while.


AI Agents for the Dhumbal Card Game: A Comparative Study

arXiv.org Artificial Intelligence

Abstract--This study evaluates Artificial Intelligence (AI) agents for Dhumbal, a culturally significant multiplayer card game with imperfect information, through a systematic comparison of rule-based, search-based, and learning-based strategies. We formalize Dhumbal's mechanics and implement diverse agents, including heuristic approaches (Aggressive, Conservative, Balanced, Opportunistic), search-based methods such as Monte Carlo Tree Search (MCTS) and Information Set Monte Carlo Tree Search (ISMCTS), and reinforcement learning approaches including Deep Q-Network (DQN) and Proximal Policy Optimization (PPO), and a random baseline. Evaluation involves within-category tournaments followed by a cross-category championship. Performance is measured via win rate, economic outcome, Jhyap success, cards discarded per round, risk assessment, and decision efficiency. Statistical significance is assessed using Welch's t-test with Bonferroni correction, effect sizes via Cohen's d, and 95% confidence intervals (CI). Across 1024 simulated rounds, the rule-based Aggressive agent achieves the highest win rate (88.3%, 95% CI: [86.3, 90.3]), outperforming ISMCTS (9.0%) and PPO (1.5%) through effective exploitation of Jhyap declarations. The study contributes a reproducible AI framework, insights into heuristic efficacy under partial information, and open-source code, thereby advancing AI research and supporting digital preservation of cultural games. HUMBAL, also known as Jhyap in Nepal and Y aniv in Israel, is a traditional draw-and-discard card game that combines strategic decision-making, imperfect information, and risk management. It is widely played across South Asia during family gatherings, festivals, and social events, fostering intergenerational bonds and reflecting communal spirit [1]. Played with 2 to 5 players using a standard 52-card deck, the objective is to minimize the total point value of cards in hand.


Nash Equilibria in Games with Playerwise Concave Coupling Constraints: Existence and Computation

arXiv.org Artificial Intelligence

We study the existence and computation of Nash equilibria in continuous static games where the players' admissible strategies are subject to shared coupling constraints, i.e., constraints that depend on their \emph{joint} strategies. Specifically, we focus on a class of games characterized by playerwise concave utilities and playerwise concave constraints. Prior results on the existence of Nash equilibria are not applicable to this class, as they rely on strong assumptions such as joint convexity of the feasible set. By leveraging topological fixed point theory and novel structural insights into the contractibility of feasible sets under playerwise concave constraints, we give an existence proof for Nash equilibria under weaker conditions. Having established existence, we then focus on the computation of Nash equilibria via independent gradient methods under the additional assumption that the utilities admit a potential function. To account for the possibly nonconvex feasible region, we employ a log barrier regularized gradient ascent with adaptive stepsizes. Starting from an initial feasible strategy profile and under exact gradient feedback, the proposed method converges to an $ε$-approximate constrained Nash equilibrium within $\mathcal{O}(ε^{-3})$ iterations.