

How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations

Roig, JV

arXiv.org Artificial Intelligence

We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1's superior reliability derives primarily from post-training reinforcement learning rather than architecture or size. Across models, we identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. These patterns highlight the need for agentic evaluation methods that emphasize interactive grounding, recovery behavior, and environment-aware adaptation, suggesting that reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.


Strategic Communication and Language Bias in Multi-Agent LLM Coordination

Buscemi, Alessio, Proverbio, Daniele, Di Stefano, Alessandro, Han, The Anh, Castignani, German, Liò, Pietro

arXiv.org Artificial Intelligence

Large Language Model (LLM)-based agents are increasingly deployed in multi-agent scenarios where coordination is crucial but not always assured. Research shows that the way strategic scenarios are framed linguistically can affect cooperation. This paper explores whether allowing agents to communicate amplifies these language-driven effects. Leveraging FAIRGAME, we simulate one-shot and repeated games across different languages and models, both with and without communication. Our experiments, conducted with two advanced LLMs, GPT-4o and Llama 4 Maverick, reveal that communication significantly influences agent behavior, though its impact varies by language, personality, and game structure. These findings underscore the dual role of communication in fostering coordination and reinforcing biases.
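The abstract's setup of one-shot and repeated games can be illustrated with a minimal sketch. The payoff values, strategy names, and five-round horizon below are illustrative assumptions, not details from the paper; in FAIRGAME the stand-in strategies would be replaced by LLM calls whose prompts vary by language, personality, and whether pre-play communication is allowed.

```python
# Payoff matrix for a Prisoner's Dilemma: (row payoff, column payoff).
# The specific payoff numbers here are conventional, not from the paper.
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

def play_round(agent_a, agent_b, history):
    """One round: each agent maps the shared game history to an action."""
    a, b = agent_a(history), agent_b(history)
    history.append((a, b))
    return PAYOFFS[(a, b)]

# Stand-in strategies (hypothetical); an LLM-backed agent would instead
# receive the framed scenario and history as a prompt and return an action.
always_defect = lambda h: "defect"
tit_for_tat = lambda h: h[-1][0] if h else "cooperate"  # mirror opponent's last move

history, totals = [], [0, 0]
for _ in range(5):  # a short repeated game
    pa, pb = play_round(always_defect, tit_for_tat, history)
    totals[0] += pa
    totals[1] += pb
```

Swapping the deterministic strategies for model calls, and varying the prompt language or adding a message-exchange phase before each round, reproduces the kind of manipulation the experiments measure.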


ImCoref-CeS: An Improved Lightweight Pipeline for Coreference Resolution with LLM-based Checker-Splitter Refinement

Luo, Kangyang, Bai, Yuzhuo, Si, Shuzheng, Gao, Cheng, Wang, Zhitong, Shen, Yingli, Li, Wenhao, Liu, Zhu, Han, Yufeng, Wu, Jiayi, Kong, Cunliang, Sun, Maosong

arXiv.org Artificial Intelligence

Coreference Resolution (CR) is a critical task in Natural Language Processing (NLP). Current research faces a key dilemma: whether to further explore the potential of supervised neural methods based on small language models, whose detect-then-cluster pipeline still delivers top performance, or embrace the powerful capabilities of Large Language Models (LLMs). However, effectively combining their strengths remains underexplored. To this end, we propose ImCoref-CeS, a novel framework that integrates an enhanced supervised model with LLM-based reasoning. First, we present an improved CR method (ImCoref) to push the performance boundaries of the supervised neural method by introducing a lightweight bridging module to enhance long-text encoding capability, devising a biaffine scorer to comprehensively capture positional information, and invoking a hybrid mention regularization to improve training efficiency. Importantly, we employ an LLM acting as a multi-role Checker-Splitter agent to validate candidate mentions (filtering out invalid ones) and coreference results (splitting erroneous clusters) predicted by ImCoref. Extensive experiments demonstrate the effectiveness of ImCoref-CeS, which achieves superior performance compared to existing state-of-the-art (SOTA) methods.
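The checker-splitter refinement described above can be sketched as a simple post-processing pass. Everything here is a schematic assumption rather than the paper's implementation: `check` and `split` stand in for the two LLM roles, and the toy mentions and clusters are invented for illustration.

```python
def checker_splitter(mentions, clusters, check, split):
    """Refine a supervised CR model's output with two roles:
    a checker that filters out invalid candidate mentions, and
    a splitter that breaks up erroneous coreference clusters."""
    valid = [m for m in mentions if check(m)]
    keep = set(valid)
    refined = []
    for cluster in clusters:
        kept = [m for m in cluster if m in keep]
        if kept:
            refined.extend(split(kept))
    return valid, refined

# Stand-in checker/splitter; in ImCoref-CeS these would be LLM calls.
check = lambda m: m != "the idea"  # reject a non-referring span
def split(cluster):
    # toy heuristic: separate masculine mentions from the rest
    masc = [m for m in cluster if m in ("he", "Bob")]
    rest = [m for m in cluster if m not in ("he", "Bob")]
    return [g for g in (rest, masc) if g]

mentions = ["Alice", "she", "the idea", "Bob", "he"]
clusters = [["Alice", "she", "he"], ["Bob", "the idea"]]
valid, refined = checker_splitter(mentions, clusters, check, split)
```

The interesting design point is that the LLM never produces clusters from scratch; it only vetoes mentions and partitions existing clusters, which keeps the supervised pipeline's efficiency while adding LLM-level judgment.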


Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case

González-Bustamante, Bastián, Verelst, Nando, Cisternas, Carla

arXiv.org Artificial Intelligence

Traditional public opinion surveys face a number of challenges and risks related to measurement and representation dimensions, including, for example, coverage error due to incomplete frames and hard-to-reach groups, sampling error resulting from finite samples and complex designs, nonresponse error stemming from low participation and interview fatigue, measurement error introduced by questionnaire wording, and processing errors in coding and post-survey adjustments, among others (Groves, 1989; Groves and Lyberg, 2010; Weisberg, 2005). These errors could be amplified by substantial financial, human, and logistical demands, such as time spent on instrument design, piloting, and fieldwork, which often forces a cost-quality trade-off that may distort population inferences. Consequently, there is a growing demand in the social sciences and market research for methods that reduce burden and cost while maintaining and improving overall data quality. Against this backdrop, Large Language Models (LLMs), trained extensively on vast and diverse data, emerge as promising alternatives for new research possibilities and applied research, including handling the abovementioned survey research limitations and measurement and representation errors. Indeed, recent advances in generative artificial intelligence (AI) suggest LLMs could serve in a number of classification tasks, including the creation of synthetic samples, providing simulated responses reflective of broader societal attitudes and behaviours (Argyle et al., 2023; Gilardi et al., 2023; González-Bustamante, 2024). The synthetic samples specifically may leverage the ability of LLMs to generate contextually informed responses based on individual-level demographic characteristics and attitudes, and, in this way, potentially emulate public opinion without direct interaction with human respondents.
This methodological innovation opens new avenues for rapid data collection, experimentation with sensitive topics, and a deeper understanding of complex public opinion dynamics that complement or even partially substitute for traditional surveys. Thus, the primary objective of this working paper is to evaluate the effectiveness and reliability of LLM-generated synthetic survey responses in reflecting real-world public opinion in Chile. Specifically, we aim to assess the predictive accuracy of a number of state-of-the-art private and open-source LLMs by comparing their synthetic respondents against human probabilistic responses.
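One natural way to quantify how well synthetic respondents reflect real-world opinion is to compare answer distributions directly. The sketch below uses total variation distance over a single categorical item; the response counts are entirely hypothetical and stand in for real survey marginals and LLM-generated samples conditioned on the same demographic profiles.

```python
from collections import Counter

def tv_distance(human, synthetic):
    """Total variation distance between two categorical response
    distributions, given as raw answer counts (0 = identical, 1 = disjoint)."""
    cats = set(human) | set(synthetic)
    h_n, s_n = sum(human.values()), sum(synthetic.values())
    return 0.5 * sum(
        abs(human.get(c, 0) / h_n - synthetic.get(c, 0) / s_n) for c in cats
    )

# Hypothetical counts for one survey item: real respondents vs.
# LLM-generated synthetic respondents with matched demographics.
human = Counter({"agree": 412, "neutral": 305, "disagree": 283})
synthetic = Counter({"agree": 455, "neutral": 280, "disagree": 265})

gap = tv_distance(human, synthetic)
```

Aggregating such per-item distances across models and questions gives one simple, interpretable way to rank the predictive accuracy of different LLMs against human probabilistic responses.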


BOOKCOREF: Coreference Resolution at Book Scale

Martinelli, Giuliano, Bonomo, Tommaso, Cabot, Pere-Lluís Huguet, Navigli, Roberto

arXiv.org Artificial Intelligence

Coreference Resolution systems are typically evaluated on benchmarks containing small- to medium-scale documents. When it comes to evaluating long texts, however, existing benchmarks, such as LitBank, remain limited in length and do not adequately assess system capabilities at the book scale, i.e., when co-referring mentions span hundreds of thousands of tokens. To fill this gap, we first put forward a novel automatic pipeline that produces high-quality Coreference Resolution annotations on full narrative texts. Then, we adopt this pipeline to create the first book-scale coreference benchmark, BOOKCOREF, with an average document length of more than 200,000 tokens. We carry out a series of experiments showing the robustness of our automatic procedure and demonstrating the value of our resource, which enables current long-document coreference systems to gain up to +20 CoNLL-F1 points when evaluated on full books. Moreover, we report on the new challenges introduced by this unprecedented book-scale setting, highlighting that current models fail to deliver the same performance they achieve on smaller documents. We release our data and code to encourage research and development of new book-scale Coreference Resolution systems at https://github.com/sapienzanlp/bookcoref.


Meta introduces Llama 4 with two new AI models available now, and two more on the way

Engadget

Meta has released the first two models from its multimodal Llama 4 suite: Llama 4 Scout and Llama 4 Maverick. Maverick is "the workhorse" of the two and excels at image and text understanding for "general assistant and chat use cases," the company said in a blog post, while the smaller model Scout could tackle things like "multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over vast codebases." The company also introduced Llama 4 Behemoth, an upcoming model it says is "among the world's smartest LLMs" -- and CEO Mark Zuckerberg said we'll be hearing about a fourth model, Llama 4 Reasoning, "in the next month." Both Maverick and Scout are available to download now from the Llama website and Hugging Face, and they've been added to Meta AI, including for WhatsApp, Messenger and Instagram DMs. Scout has 17 billion active parameters with 16 experts, Meta says.


Maverick: Efficient and Accurate Coreference Resolution Defying Recent Trends

Martinelli, Giuliano, Barba, Edoardo, Navigli, Roberto

arXiv.org Artificial Intelligence

Large autoregressive generative models have emerged as the cornerstone for achieving the highest performance across several Natural Language Processing tasks. However, the urge to attain superior results has, at times, led to the premature replacement of carefully designed task-specific approaches without exhaustive experimentation. The Coreference Resolution task is no exception; all recent state-of-the-art solutions adopt large generative autoregressive models that outperform encoder-based discriminative systems. In this work, we challenge this recent trend by introducing Maverick, a carefully designed - yet simple - pipeline, which enables running a state-of-the-art Coreference Resolution system within the constraints of an academic budget, outperforming models with up to 13 billion parameters with as few as 500 million parameters. Maverick achieves state-of-the-art performance on the CoNLL-2012 benchmark, training with up to 0.006x the memory resources and obtaining a 170x faster inference compared to previous state-of-the-art systems. We extensively validate the robustness of the Maverick framework with an array of diverse experiments, reporting improvements over prior systems in data-scarce, long-document, and out-of-domain settings. We release our code and models for research purposes at https://github.com/SapienzaNLP/maverick-coref.


Maverick-Aware Shapley Valuation for Client Selection in Federated Learning

Yang, Mengwei, Jarin, Ismat, Buyukates, Baturalp, Avestimehr, Salman, Markopoulou, Athina

arXiv.org Artificial Intelligence

Federated Learning (FL) allows clients to train a model collaboratively without sharing their private data. One key challenge in practical FL systems is data heterogeneity, particularly in handling clients with rare data, also referred to as Mavericks. These clients own one or more data classes exclusively, and the model performance becomes poor without their participation. Thus, utilizing Mavericks throughout training is crucial. In this paper, we first design a Maverick-aware Shapley valuation that fairly evaluates the contribution of Mavericks. The main idea is to compute the clients' Shapley values (SV) class-wise, i.e., per label. Next, we propose FedMS, a Maverick-Shapley client selection mechanism for FL that intelligently selects the clients that contribute the most in each round, by employing our Maverick-aware SV-based contribution score. We show that, compared to an extensive list of baselines, FedMS achieves better model performance and fairer Shapley Rewards distribution.
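The class-wise Shapley idea in the abstract can be sketched exactly for a toy federation. The utility function and client datasets below are invented assumptions for illustration (per-class coverage rather than a trained model's per-label accuracy), but the Shapley computation itself is the standard exact formula, applied once per label and summed.

```python
from itertools import combinations
from math import factorial

def shapley(players, v):
    """Exact Shapley value of each player under utility v: set -> float."""
    n = len(players)
    sv = {p: 0.0 for p in players}
    for i in players:
        others = [p for p in players if p != i]
        for k in range(n):
            for S in combinations(others, k):
                # weight of coalition S in the Shapley average
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                coalition = frozenset(S)
                sv[i] += w * (v(coalition | {i}) - v(coalition))
    return sv

# Toy federation: client -> set of class labels it holds.
# Client "m" is a Maverick: the sole owner of class 2.
data = {"a": {0, 1}, "b": {0, 1}, "m": {2}}

def coverage(target):
    """Stand-in per-class utility: fraction of `target` labels a coalition
    covers (a real system would use the model's per-label accuracy)."""
    def v(S):
        if not S:
            return 0.0
        held = set().union(*(data[p] for p in S))
        return len(held & target) / len(target)
    return v

# Class-wise valuation: compute Shapley values per label, then aggregate.
classwise = {p: 0.0 for p in data}
for label in (0, 1, 2):
    for p, s in shapley(list(data), coverage({label})).items():
        classwise[p] += s
```

Under this toy utility, the Maverick "m" receives the entire credit for class 2, which is exactly the effect the per-label decomposition is meant to surface; a pooled utility would dilute that contribution across the whole label set. A selection rule like FedMS would then favor clients with high aggregated scores each round.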


Tom Cruise gave 'Mission: Impossible' co-stars skydiving lessons, shark diving trips and coconut cake

FOX News

Go behind the scenes with Tom Cruise as he performs one of the "most dangerous sports in the world" for "Mission: Impossible." (Credit: Paramount Pictures/Skydance) Tom Cruise's "Mission: Impossible – Dead Reckoning" co-stars have revealed the "thoughtful" gifts the 61-year-old star gave them during filming, and they are fitting of an action hero. "He's always very keen to show his appreciation," Simon Pegg, who also starred with Cruise in the last four "Mission: Impossible" films, told People magazine. "I think he's so used to being the focus of attention, it's naturally his instinct to kind of reflect everything back. And he's always incredibly sort of generous in terms of his gratitude to us and how he thanks us and how he lets us know that we're valued." Pegg said one day, when the cast had the afternoon off, Cruise flew them in a helicopter to go shark diving.


A step toward safe and reliable autopilots for flying

Robohub

MIT researchers developed a machine-learning technique that can autonomously drive a car or fly a plane through a very difficult "stabilize-avoid" scenario, in which the vehicle must stabilize its trajectory to arrive at and stay within some goal region, while avoiding obstacles. In the film "Top Gun: Maverick," Maverick, played by Tom Cruise, is charged with training young pilots to complete a seemingly impossible mission -- to fly their jets deep into a rocky canyon, staying so low to the ground they cannot be detected by radar, then rapidly climb out of the canyon at an extreme angle, avoiding the rock walls. Spoiler alert: With Maverick's help, these human pilots accomplish their mission. A machine, on the other hand, would struggle to complete the same pulse-pounding task. To an autonomous aircraft, for instance, the most straightforward path toward the target is in conflict with what the machine needs to do to avoid colliding with the canyon walls or staying undetected.