Ramchandran, Kannan
Why Do Multi-Agent LLM Systems Fail?
Cemri, Mert, Pan, Melissa Z., Yang, Shuyi, Agrawal, Lakshya A., Chopra, Bhavya, Tiwari, Rishabh, Keutzer, Kurt, Parameswaran, Aditya, Klein, Dan, Ramchandran, Kannan, Zaharia, Matei, Gonzalez, Joseph E., Stoica, Ion
Despite growing enthusiasm for Multi-Agent Systems (MAS), where multiple LLM agents collaborate to accomplish tasks, their performance gains across popular benchmarks remain minimal compared to single-agent frameworks. This gap highlights the need to analyze the challenges hindering MAS effectiveness. In this paper, we present the first comprehensive study of MAS challenges. We analyze five popular MAS frameworks across over 150 tasks, involving six expert human annotators. We identify 14 unique failure modes and organize them into the Multi-Agent System Failure Taxonomy (MASFT), a comprehensive taxonomy applicable to various MAS frameworks. The taxonomy was developed iteratively from agreements among three expert annotators, achieving a Cohen's Kappa score of 0.88. These fine-grained failure modes fall into three categories: (i) specification and system design failures, (ii) inter-agent misalignment, and (iii) task verification and termination. To support scalable evaluation, we integrate MASFT with an LLM-as-a-Judge pipeline. We also explore whether the identified failures could be easily prevented by proposing two interventions: improved specification of agent roles and enhanced orchestration strategies. Our findings reveal that the identified failures require more complex solutions, highlighting a clear roadmap for future research. We open-source our dataset and LLM annotator.
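The reported agreement statistic is straightforward to reproduce. Below is a minimal sketch of Cohen's Kappa between two annotators' failure-mode labels; the labels and category names are hypothetical, not drawn from the paper's dataset.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical failure-mode labels on 8 MAS traces.
a = ["spec", "spec", "align", "verify", "align", "spec", "verify", "align"]
b = ["spec", "align", "align", "verify", "align", "spec", "verify", "align"]
kappa = cohens_kappa(a, b)  # 17/21, roughly 0.81 for this toy pair
```

Values above 0.8 are conventionally read as near-perfect agreement, which is why the 0.88 figure supports the taxonomy's reliability.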
Online Assortment and Price Optimization Under Contextual Choice Models
Erginbas, Yigit Efe, Courtade, Thomas A., Ramchandran, Kannan
In online marketplaces, dynamic assortment selection and pricing for sequentially arriving buyers presents a challenge for online learning. Since buyer preferences are varying and uncertain, adaptive strategies are essential to meet their needs and maximize the effectiveness of offers. To address this problem, we investigate the application of online learning techniques to contextual assortment selection and pricing. Assortment selection involves the seller choosing a subset of items from a vast catalog to present to buyers and dynamically assigning prices to the offered items. The overall goal is to maximize revenue over the course of repeated interactions. Dynamic assortment selection and pricing strategies are deployed in a variety of online sectors, including e-commerce (e.g., Amazon), food delivery (e.g., Uber Eats), and hospitality (e.g., Airbnb). With similar systems becoming ubiquitous in our daily lives, there is a growing opportunity to deliver tailored product recommendations and pricing adjustments. Therefore, it is crucial to consider data-driven approaches that can enhance user experiences and boost profitability in today's highly competitive digital industry.
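As a concrete illustration of the revenue objective, the multinomial logit (MNL) model is a standard contextual choice model; under it, the expected revenue of an offered assortment is computed as below. This is a generic sketch, not the paper's exact model, and the utilities and prices are made up.

```python
import math

def mnl_expected_revenue(utilities, prices):
    """Expected revenue of an assortment under a multinomial logit (MNL) choice model.
    The buyer picks item i with probability exp(u_i) / (1 + sum_j exp(u_j));
    the '1' in the denominator is the no-purchase option."""
    weights = [math.exp(u) for u in utilities]
    denom = 1.0 + sum(weights)
    return sum(p * w / denom for p, w in zip(prices, weights))

# Hypothetical 3-item assortment; utilities are assumed net of price effects.
rev = mnl_expected_revenue([1.0, 0.5, -0.2], [10.0, 8.0, 12.0])  # about 8.11
```

The seller's problem is then to pick the subset and prices maximizing this quantity while simultaneously learning the utilities from buyer feedback.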
SPEX: Scaling Feature Interaction Explanations for LLMs
Kang, Justin Singh, Butler, Landon, Agarwal, Abhineet, Erginbas, Yigit Efe, Pedarsani, Ramtin, Ramchandran, Kannan, Yu, Bin
Large language models (LLMs) have revolutionized machine learning due to their ability to capture complex interactions between input features. Popular post-hoc explanation methods like SHAP provide marginal feature attributions, while their extensions to interaction importances only scale to small input lengths ($\approx 20$). We propose Spectral Explainer (SPEX), a model-agnostic interaction attribution algorithm that efficiently scales to large input lengths ($\approx 1000$). SPEX exploits underlying natural sparsity among interactions -- common in real-world data -- and applies a sparse Fourier transform using a channel decoding algorithm to efficiently identify important interactions. We perform experiments across three difficult long-context datasets that require LLMs to utilize interactions between inputs to complete the task. For large inputs, SPEX outperforms marginal attribution methods by up to 20% in terms of faithfully reconstructing LLM outputs. Further, SPEX successfully identifies key features and interactions that strongly influence model output. For one of our datasets, HotpotQA, SPEX provides interactions that align with human annotations. Finally, we use our model-agnostic approach to generate explanations to demonstrate abstract reasoning in closed-source LLMs (GPT-4o mini) and compositional reasoning in vision-language models.
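For intuition about what SPEX recovers, the brute-force Boolean (Walsh-Hadamard) Fourier transform below makes the sparsity explicit on a toy 3-input function; SPEX's contribution is recovering such coefficients without the $2^n$ enumeration used here. The toy function is invented for illustration.

```python
from itertools import product

def fourier_coefficients(f, n):
    """Brute-force Boolean Fourier coefficients of f: {0,1}^n -> R.
    SPEX avoids this 2^n enumeration via sparse recovery; this is only a reference check."""
    coeffs = {}
    for subset in product([0, 1], repeat=n):
        total = 0.0
        for x in product([0, 1], repeat=n):
            sign = (-1) ** sum(s * xi for s, xi in zip(subset, x))
            total += sign * f(x)
        coeffs[subset] = total / 2 ** n
    return coeffs

# Toy "model": output depends on a pure interaction between inputs 0 and 2.
f = lambda x: 1.0 + 2.0 * ((-1) ** (x[0] + x[2]))
c = fourier_coefficients(f, 3)  # only the constant and the {0,2} coefficient are non-zero
```

The sparsity assumption in the abstract corresponds exactly to most entries of `c` being (near) zero, which is what makes sample-efficient recovery possible.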
From Markov to Laplace: How Mamba In-Context Learns Markov Chains
Bondaschi, Marco, Rajaraman, Nived, Wei, Xiuying, Ramchandran, Kannan, Pascanu, Razvan, Gulcehre, Caglar, Gastpar, Michael, Makkuva, Ashok Vardhan
While transformer-based language models have driven the AI revolution thus far, their computational complexity has spurred growing interest in viable alternatives, such as structured state space sequence models (SSMs) and Selective SSMs. Among these, Mamba (S6) and its variant Mamba-2 have shown remarkable inference speedups over transformers while achieving comparable or superior performance on complex language modeling tasks. However, despite these architectural innovations and empirical successes, the fundamental learning capabilities of Mamba remain poorly understood. In this paper, we address this gap by studying in-context learning (ICL) on Markov chains and uncovering a surprising phenomenon: unlike transformers, even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator, which is both Bayes and minimax optimal, for all Markovian orders. To explain this, we theoretically characterize the representation capacity of Mamba and reveal the fundamental role of convolution in enabling it to represent the optimal Laplacian smoothing. These theoretical insights align strongly with empirical results and, to the best of our knowledge, represent the first formal connection between Mamba and optimal statistical estimators. Finally, we outline promising research directions inspired by these findings.
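The Laplacian smoothing estimator referenced above is standard add-$\beta$ smoothing of transition counts. A minimal first-order sketch follows (the paper covers general Markovian orders; the sequence and $\beta$ here are illustrative):

```python
from collections import Counter

def laplace_smoothed_next(seq, state, vocab, beta=1.0):
    """Add-beta (Laplacian) smoothing for a first-order Markov chain:
    P(s | state) = (count(state -> s) + beta) / (count(state -> *) + beta * |vocab|)."""
    trans = Counter(zip(seq, seq[1:]))
    total = sum(trans[(state, s)] for s in vocab)
    return {s: (trans[(state, s)] + beta) / (total + beta * len(vocab)) for s in vocab}

# In-context sequence "0010110": from state "0" we observed 0->0 once and 0->1 twice,
# so with beta=1 the smoothed estimate is P(0|0)=2/5, P(1|0)=3/5.
probs = laplace_smoothed_next("0010110", "0", vocab="01")
```

The paper's claim is that a single-layer Mamba, given the context, outputs next-symbol probabilities matching this estimator.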
VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data
Zeng, Thomas, Zhang, Shuibai, Wu, Shutong, Classen, Christian, Chae, Daewon, Ewer, Ethan, Lee, Minjae, Kim, Heeju, Kang, Wonjun, Kunde, Jackson, Fan, Ying, Kim, Jungtaek, Koo, Hyung Il, Ramchandran, Kannan, Papailiopoulos, Dimitris, Lee, Kangwook
Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data, and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs have poor performance in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method.

Outcome Reward Models (ORMs) are used to provide supervision based solely on the correctness of the final outcome. However, ORMs fail to address errors in intermediate steps, limiting their effectiveness for complex, multi-step reasoning tasks (Luo et al., 2024; Lightman et al., 2024; Sun et al., 2024). Because ORMs suffer from this limitation, Process Reward Models (PRMs) have been proposed to offer fine-grained, step-by-step feedback on the correctness of each reasoning step (Lightman et al., 2024; Uesato et al., 2022). PRMs have proven highly effective during inference, improving the reranking of generated solutions and guiding LLMs through search-based algorithms (Wan et al., 2024; Wang et al., 2024a).
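One common way a PRM is used at inference time is weighted majority voting over sampled solutions, with each candidate weighted by its per-step PRM scores. The sketch below assumes product-of-step-scores aggregation, which is one common choice rather than necessarily the paper's; the candidates and scores are invented.

```python
from collections import defaultdict

def weighted_majority_vote(candidates):
    """Aggregate sampled solutions by PRM score. Each candidate is (final_answer,
    step_scores); its weight is the product of per-step scores, and the answer
    accumulating the largest total weight wins."""
    totals = defaultdict(float)
    for answer, step_scores in candidates:
        weight = 1.0
        for s in step_scores:
            weight *= s
        totals[answer] += weight
    return max(totals, key=totals.get)

# Hypothetical PRM step scores for four sampled chains of thought.
cands = [("A", [0.9, 0.8]), ("B", [0.99, 0.97]), ("A", [0.6, 0.5]), ("B", [0.9, 0.9])]
best = weighted_majority_vote(cands)  # "B": 0.96 + 0.81 beats 0.72 + 0.30
```

The multi-domain question is whether the step scores remain informative when the reasoning is legal or medical rather than mathematical.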
Quantifying Positional Biases in Text Embedding Models
Goel, Samarth, Lee, Reagan J., Ramchandran, Kannan
Embedding models are crucial for tasks in Information Retrieval (IR) and semantic similarity measurement, yet their handling of longer texts and the associated positional biases remains underexplored. In this study, we investigate the impact of content position and input size on text embeddings. Our experiments reveal that embedding models, irrespective of their positional encoding mechanisms, disproportionately prioritize the beginning of an input. Ablation studies demonstrate that inserting irrelevant text at, or removing text from, the start of a document reduces the cosine similarity between the altered and original embeddings by up to 12.3% more than the same ablations at the end. Regression analysis further confirms this bias, with sentence importance declining as position moves further from the start, even in content-agnostic settings. We hypothesize that this effect arises from pre-processing strategies and chosen positional encoding techniques. These findings quantify the sensitivity of retrieval systems and suggest a new lens on embedding model robustness.
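The ablation metric is plain cosine similarity between the original and perturbed embeddings. A self-contained sketch of the comparison follows; `embed()` stands in for any text-embedding model and is deliberately left as a comment rather than implemented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Shape of the paper's ablation (embed() is a placeholder for an embedding model):
#   sim_start = cosine(embed(doc), embed(irrelevant + doc))
#   sim_end   = cosine(embed(doc), embed(doc + irrelevant))
# The reported finding is that sim_start drops by up to 12.3% more than sim_end.
```

Running this over many documents and perturbation positions yields the regression data behind the positional-importance curve.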
EmbedLLM: Learning Compact Representations of Large Language Models
Zhuang, Richard, Wu, Tianhao, Wen, Zhaojin, Li, Andrew, Jiao, Jiantao, Ramchandran, Kannan
With hundreds of thousands of language models available on Huggingface today, efficiently evaluating and utilizing these models across various downstream tasks has become increasingly critical. Many existing methods repeatedly learn task-specific representations of Large Language Models (LLMs), which leads to inefficiencies in both time and computational resources. To address this, we propose EmbedLLM, a framework designed to learn compact vector representations of LLMs that facilitate downstream applications involving many models, such as model routing. We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness. Empirical results show that EmbedLLM outperforms prior methods in model routing in both accuracy and latency. Additionally, we demonstrate that our method can forecast a model's performance on multiple benchmarks without incurring additional inference cost. Extensive probing experiments validate that the learned embeddings capture key model characteristics, e.g., whether the model is specialized for coding tasks, even without being explicitly trained on them. We open-source our dataset, code and embedder to facilitate further research and application.
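Once per-model embeddings are learned, routing reduces to scoring each model's embedding against a query representation. The sketch below uses a plain dot-product "decoder" and made-up 3-dimensional embeddings purely for illustration; EmbedLLM's actual encoder-decoder is learned from model-question correctness data.

```python
def route(model_embs, question_emb):
    """Pick the model whose embedding scores highest against a question embedding.
    A stand-in for a learned decoder: here the score is just a dot product."""
    scores = {name: sum(m * q for m, q in zip(emb, question_emb))
              for name, emb in model_embs.items()}
    return max(scores, key=scores.get)

# Hypothetical 3-d embeddings: dimension 0 ~ coding skill, 1 ~ math, 2 ~ chat.
models = {
    "coder-7b": [0.9, 0.1, 0.2],
    "math-7b":  [0.1, 0.9, 0.2],
    "chat-7b":  [0.2, 0.2, 0.9],
}
assert route(models, [1.0, 0.0, 0.0]) == "coder-7b"  # coding query goes to the coder
```

The benchmark-forecasting claim follows the same pattern: score a model embedding against a benchmark representation instead of a single question.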
Looped Transformers for Length Generalization
Fan, Ying, Du, Yilun, Ramchandran, Kannan, Lee, Kangwook
Recent work has shown that Transformers trained from scratch can successfully solve various arithmetic and algorithmic tasks, such as adding numbers and computing parity. While these Transformers generalize well on unseen inputs of the same length, they struggle with length generalization, i.e., handling inputs of unseen lengths. In this work, we demonstrate that looped Transformers with an adaptive number of steps significantly improve length generalization. We focus on tasks with a known iterative solution, involving multiple iterations of a RASP-L operation - a length-generalizable operation that can be expressed by a finite-sized Transformer. We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.
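The idea of reusing one block for an adaptive number of steps can be illustrated on parity: a single fixed, length-generalizable step applied until the input collapses to one bit. This toy loop is an analogy for the looped architecture, not the paper's training algorithm.

```python
def loop_step(bits):
    """One 'loop iteration': XOR adjacent pairs, roughly halving the sequence.
    The same fixed operation works at every length (a RASP-L-style step)."""
    if len(bits) % 2:
        bits = bits + [0]  # pad odd lengths with the XOR identity
    return [a ^ b for a, b in zip(bits[0::2], bits[1::2])]

def parity(bits):
    """Adaptive step count: apply the same block until one bit remains,
    so longer inputs simply get more iterations of the identical step."""
    while len(bits) > 1:
        bits = loop_step(bits)
    return bits[0]
```

Because the per-step computation never depends on the input length, the loop generalizes to lengths never seen in training, which is the behavior the paper demonstrates for looped Transformers.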
Toward a Theory of Tokenization in LLMs
Rajaraman, Nived, Jiao, Jiantao, Ramchandran, Kannan
While there has been a large body of research attempting to circumvent tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the current consensus is that it is a necessary initial step for designing state-of-the-art performant language models. In this paper, we investigate tokenization from a theoretical point of view by studying the behavior of transformers on simple data generating processes. When trained on data drawn from certain simple $k^{\text{th}}$-order Markov processes for $k > 1$, transformers exhibit a surprising phenomenon - in the absence of tokenization, they empirically fail to learn the right distribution and predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and are able to model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. With this observation as a starting point, we study the end-to-end cross-entropy loss achieved by transformers with and without tokenization. With the appropriate tokenization, we show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from $k^{\text{th}}$-order Markov sources near-optimally. Our analysis provides a justification for the use of tokenization in practice through studying the behavior of transformers on Markovian data.
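The benefit of tokenization for a unigram learner is easy to see on a toy deterministic first-order Markov source: a character-level unigram model pays about 1 bit per character, while a unigram model over 2-character tokens captures the alternation essentially for free. The sequence and token choice below are illustrative, not the paper's exact construction.

```python
import math
from collections import Counter

def unigram_xent_per_char(units, chars_per_unit):
    """Empirical cross-entropy (bits per character) of a unigram model
    fit on the given sequence of units (characters or tokens)."""
    counts = Counter(units)
    n = len(units)
    xent_per_unit = -sum(c / n * math.log2(c / n) for c in counts.values())
    return xent_per_unit / chars_per_unit

# A source that strictly alternates: a character unigram sees p(0)=p(1)=0.5
# and pays 1 bit/char, while a unigram over "01" tokens is deterministic.
seq = "01" * 500
char_bits = unigram_xent_per_char(list(seq), 1)                        # 1.0 bit/char
token_bits = unigram_xent_per_char([seq[i:i + 2] for i in range(0, len(seq), 2)], 2)
```

With the right tokenizer, the token stream looks i.i.d. even though the character stream is strongly Markovian, which is the mechanism behind the paper's near-optimality result.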
Learning to Understand: Identifying Interactions via the Mobius Transform
Kang, Justin S., Erginbas, Yigit E., Butler, Landon, Pedarsani, Ramtin, Ramchandran, Kannan
One of the most fundamental problems in machine learning is finding interpretable representations of the functions we learn. The Mobius transform is a useful tool for this because its coefficients correspond to unique importance scores on sets of input variables. The Mobius transform is strongly related (and in some cases equivalent) to the concept of Shapley value, which is a widely used game-theoretic notion of importance. This work focuses on the (typical) regime where the fraction of non-zero Mobius coefficients (and thus interactions between inputs) is small compared to the set of all $2^n$ possible interactions between $n$ inputs. When there are $K = O(2^{n\delta})$ non-zero coefficients, with $\delta \leq \frac{1}{3}$, chosen uniformly at random, our algorithm exactly recovers the Mobius transform in $O(Kn)$ samples and $O(Kn^2)$ time with vanishing error as $K \rightarrow \infty$, the first non-adaptive algorithm to do so. We also uncover a surprising connection between group testing and the Mobius transform. In the case where all interactions are between at most $t = \Theta(n^{\alpha})$ inputs, for $\alpha < 0.409$, we are able to leverage results from group testing to provide the first algorithm that computes the Mobius transform in $O(Kt\log n)$ sample complexity and $O(K\mathrm{poly}(n))$ time with vanishing error as $K \rightarrow \infty$. Finally, we present a robust version of this algorithm that achieves the same sample and time complexity under some assumptions, but with a factor depending on noise variance. Our work is deeply interdisciplinary, drawing on tools spanning signal processing, algebra, information theory, learning theory, and group testing to address this important problem at the forefront of machine learning.
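For reference, the Mobius transform recovered here is $a(S) = \sum_{T \subseteq S} (-1)^{|S|-|T|} f(T)$. The brute-force check below computes it on a toy 3-input function with one pure interaction; the paper's algorithms avoid exactly this exponential enumeration, and the toy function is invented for illustration.

```python
from itertools import combinations

def mobius_transform(f, n):
    """Brute-force Mobius transform of f over subsets of {0, ..., n-1}:
    a(S) = sum over T subseteq S of (-1)^{|S|-|T|} f(T).
    Exponential in n; only a reference implementation for checking sparsity."""
    items = range(n)
    subsets = [frozenset(c) for r in range(n + 1) for c in combinations(items, r)]
    a = {}
    for S in subsets:
        a[S] = sum((-1) ** (len(S) - len(T)) * f(T)
                   for T in subsets if T <= S)
    return a

# Toy set function: a base value plus a pure interaction between inputs 0 and 1.
f = lambda T: 1.0 + (3.0 if {0, 1} <= T else 0.0)
a = mobius_transform(f, 3)  # only a(empty set) and a({0,1}) are non-zero
```

Only 2 of the 8 coefficients are non-zero here, which is precisely the sparse regime ($K \ll 2^n$) the paper's sample-efficient algorithms target.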