alice
- North America > Canada > Quebec > Montreal (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- North America > United States (0.14)
- Research Report > Experimental Study (1.00)
- Research Report > Strength High (0.68)
- Law (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues
Yu, Fangxu, Jiang, Lai, Huang, Shenyi, Wu, Zhen, Dai, Xinyu
The ability to understand and predict the mental states of oneself and others, known as Theory of Mind (ToM), is crucial for effective social interactions. Although recent studies have evaluated whether Large Language Models (LLMs) exhibit a form of ToM, existing benchmarks focus predominantly on physical perception, following the principles of the Sally-Anne test in synthetic stories and conversations, and fail to capture the complex psychological activities of mental states in real-life social interactions. To address this gap, we propose PersuasiveToM, a benchmark designed to evaluate the ToM abilities of LLMs in persuasive dialogues. Our framework introduces two categories of questions: (1) ToM Reasoning, assessing the capacity of LLMs to track evolving mental states (e.g., desire shifts in persuadees), and (2) ToM Application, evaluating whether LLMs can use inferred mental states to select effective persuasion strategies (e.g., emphasize rarity) and to judge the effectiveness of persuasion strategies. Experiments across eight state-of-the-art LLMs reveal that while models excel on multiple questions, they struggle with questions that require tracking the dynamics and shifts of mental states and comprehensively understanding the mental states across the whole dialogue. Our aim with PersuasiveToM is to allow an effective evaluation of the ToM reasoning ability of LLMs with more focus on complex psychological activities. Our code is available at https://github.com/Yu-Fangxu/PersuasiveToM.
- Asia > China (0.28)
- North America > United States > California (0.14)
- Europe > France (0.14)
- Education (0.46)
- Health & Medicine (0.46)
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents
Sun, Haochen, Zhang, Shuwen, Ren, Lei, Xu, Hao, Fu, Hao, Yuan, Caixia, Wang, Xiaojie
Large language model (LLM)-based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-powered Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks from two novel perspectives. First, it provides a multi-agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments over 10 popular LLMs and show that, while the LLMs present a strong ability in goal interpretation, there is a significant discrepancy in active collaboration and continuous adaptation, both of which are critical for efficiently fulfilling complicated tasks. Notably, we highlight the strengths and weaknesses in LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-sourced benchmark. Environments, 30 open-ended tasks, and an integrated evaluation package are now publicly available at https://github.com/YusaeMeow/Collab-Overcooked.
- Workflow (1.00)
- Research Report (1.00)
It's Not All Black and White: Degree of Truthfulness for Risk-Avoiding Agents
Hartman, Eden, Segal-Halevi, Erel, Tao, Biaoshuai
The classic notion of truthfulness requires that no agent has a profitable manipulation -- an untruthful report that, for some combination of reports of the other agents, increases her utility. This strong notion implicitly assumes that the manipulating agent either knows what all other agents are going to report, or is willing to take the risk and act as if she knows their reports. Without knowledge of the others' reports, most manipulations are risky -- they might decrease the manipulator's utility for some other combinations of reports by the other agents. Accordingly, a recent paper (Bu, Song and Tao, "On the existence of truthful fair cake cutting mechanisms", Artificial Intelligence 319 (2023), 103904) suggests a relaxed notion, which we refer to as risk-avoiding truthfulness (RAT), which requires only that no agent can gain from a safe manipulation -- one that is sometimes beneficial and never harmful. Truthfulness and RAT are two extremes: the former considers manipulators with complete knowledge of others, whereas the latter considers manipulators with no knowledge at all. In reality, agents often know about some -- but not all -- of the other agents. This paper introduces the RAT-degree of a mechanism, defined as the smallest number of agents whose reports, if known, may allow another agent to safely manipulate, or $n$ if there is no such number. This notion interpolates between classic truthfulness (degree $n$) and RAT (degree at least $1$): a mechanism with a higher RAT-degree is harder to manipulate safely. To illustrate the generality and applicability of this concept, we analyze the RAT-degree of prominent mechanisms across various social choice settings, including auctions, indivisible goods allocations, cake-cutting, voting, and stable matchings.
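The definition of a safe manipulation in the abstract above (sometimes beneficial, never harmful) can be checked by brute force on a toy example. The sketch below is illustrative only: the two auction rules, the discrete bid grid, the tie-breaking rule, and all function names are assumptions introduced here, not taken from the paper. It covers only the case where the manipulator knows none of the other reports.

```python
from itertools import product

# Hypothetical toy mechanisms: single-item auctions over a discrete bid grid.
# Assumption: the bidder wins only with a strictly highest bid.

def first_price_utility(value, bid, other_bids):
    """Winner pays her own bid."""
    return value - bid if bid > max(other_bids) else 0

def second_price_utility(value, bid, other_bids):
    """Winner pays the highest competing bid."""
    top = max(other_bids)
    return value - top if bid > top else 0

def is_safe_manipulation(utility, value, lie, bid_space, n_others):
    """True if reporting `lie` instead of `value` never lowers utility and
    raises it for at least one combination of the other agents' reports."""
    helps = False
    for others in product(bid_space, repeat=n_others):
        truthful = utility(value, value, others)
        manipulated = utility(value, lie, others)
        if manipulated < truthful:
            return False  # harmful for some profile, so not safe
        if manipulated > truthful:
            helps = True
    return helps

def has_safe_manipulation(utility, value, bid_space, n_others):
    """Search every untruthful report for a safe manipulation."""
    return any(
        is_safe_manipulation(utility, value, lie, bid_space, n_others)
        for lie in bid_space
        if lie != value
    )

BIDS = range(6)  # bids 0..5 (illustrative grid)
first_price_has_safe = has_safe_manipulation(first_price_utility, 3, BIDS, 2)
second_price_has_safe = has_safe_manipulation(second_price_utility, 3, BIDS, 2)
print(first_price_has_safe, second_price_has_safe)
```

In this toy setting, shading the bid in the first-price rule is safe because a truthful bid never yields positive utility, whereas under the second-price rule every untruthful bid is either harmful for some profile or never beneficial.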
- North America > United States (0.28)
- Asia > China (0.14)
- Europe > Greece (0.14)
- Europe > Germany (0.14)
Verifying Classification with Limited Disclosure
Bhandari, Siddharth, Shan, Liren
We consider the multi-party classification problem introduced by Dong, Hartline, and Vijayaraghavan (2022), motivated by electronic discovery. In this problem, our goal is to design a protocol that guarantees the requesting party receives nearly all responsive documents while minimizing the disclosure of nonresponsive documents. We develop verification protocols that certify the correctness of a classifier by disclosing a few nonresponsive documents. We introduce a combinatorial notion called the Leave-One-Out dimension of a family of classifiers and show that the number of nonresponsive documents disclosed by our protocol is at most this dimension in the realizable setting, where a perfect classifier exists in this family. For linear classifiers with a margin, we characterize the trade-off between the margin and the number of nonresponsive documents that must be disclosed for verification. Specifically, we establish a trichotomy in this requirement: for $d$-dimensional instances, when the margin exceeds $1/3$, verification can be achieved by revealing only $O(1)$ nonresponsive documents; when the margin is exactly $1/3$, in the worst case, at least $\Omega(d)$ nonresponsive documents must be disclosed; when the margin is smaller than $1/3$, verification requires $\Omega(e^d)$ nonresponsive documents. We believe this result is of independent interest with applications to coding theory and combinatorial geometry. We further extend our protocols to the nonrealizable setting by defining an analogous combinatorial quantity, the robust Leave-One-Out dimension, and to scenarios where the protocol is tolerant to misclassification errors by Alice.
Review for NeurIPS paper: Assisted Learning: A Framework for Multi-Organization Learning
Weaknesses: The paper states that model selection or model averaging approaches will not significantly improve over the best of the models (Alice's or Bob's) used in the assisted learning procedure because they fail to utilize the full data (the union of Alice's and Bob's features). However, ensemble techniques such as stacked regression (Breiman 1996) are often successfully used to improve predictive performance by combining not only different models trained on the same set of features, but also different models trained on different subsets of features. In all experiments performed in the paper, only comparisons between assisted learning and the oracle model were presented. The paper would be considerably stronger if it showed that assisted learning compares favorably against, for instance, a stacked model built from the predictions of the different models on modules M_1, …, M_m (trained with the original public responses). Note that under the paper's assumption that the labels/responses (as well as some sort of identifier needed to collate the labels/responses with the features) are publicly available, a simpler ensemble approach such as stacking could also be used directly to improve learning without sharing the private feature data.
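The stacking baseline the reviewer suggests can be sketched in a few lines. This is a minimal illustration on synthetic data, not the paper's setup: the data-generating process, the single-feature linear base models, and all names are assumptions, and it uses a simple holdout split rather than the cross-validated procedure of Breiman (1996). Only predictions and the public response are exchanged; neither party sees the other's feature.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def predict_line(model, xs):
    slope, intercept = model
    return [slope * x + intercept for x in xs]

def fit_stack(p1, p2, ys):
    """Least-squares weights for y ~ w1 * p1 + w2 * p2 (2x2 normal equations)."""
    a11 = sum(u * u for u in p1)
    a12 = sum(u * v for u, v in zip(p1, p2))
    a22 = sum(v * v for v in p2)
    b1 = sum(u * y for u, y in zip(p1, ys))
    b2 = sum(v * y for v, y in zip(p2, ys))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Synthetic data (assumption): Alice holds x_a, Bob holds x_b, y is public.
rng = random.Random(0)
n = 300
x_a = [rng.uniform(-1, 1) for _ in range(n)]
x_b = [rng.uniform(-1, 1) for _ in range(n)]
y = [2 * a + 3 * b + rng.gauss(0, 0.1) for a, b in zip(x_a, x_b)]

base, stack, test = slice(0, 100), slice(100, 200), slice(200, 300)

# Each party fits a model on its own private feature and the public response.
model_alice = fit_line(x_a[base], y[base])
model_bob = fit_line(x_b[base], y[base])

# The stacker is fit on held-out predictions only; no features are shared.
w_alice, w_bob = fit_stack(
    predict_line(model_alice, x_a[stack]),
    predict_line(model_bob, x_b[stack]),
    y[stack],
)

pred_alice = predict_line(model_alice, x_a[test])
pred_bob = predict_line(model_bob, x_b[test])
pred_stacked = [w_alice * pa + w_bob * pb for pa, pb in zip(pred_alice, pred_bob)]

mse_alice = mse(pred_alice, y[test])
mse_bob = mse(pred_bob, y[test])
stacked_mse = mse(pred_stacked, y[test])
print(mse_alice, mse_bob, stacked_mse)
```

On this additive synthetic target the stacked predictor should beat either single-party model, which is the kind of baseline comparison the review asks for; real data with feature interactions may behave differently.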