A Rise in Antisemitism; and a Conversation with the A.I. Pioneer Geoffrey Hinton
The State Department's Special Envoy to Monitor and Combat Antisemitism, the historian Deborah Lipstadt, says the prejudice is coming "from all ends of the political spectrum, and in between." It threatens not only Jews, she says, but the stability of democracies. Lipstadt and David Remnick discuss how antisemitic sentiments may overlap in complicated ways with political opposition to Israel, including anti-Zionism. Plus, The New Yorker's ideas editor speaks with Geoffrey Hinton, the computer scientist known as the godfather of A.I. Hinton pioneered neural networks, the artificial brains that power ChatGPT, for example.
Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness
Gupta, Ashim, Rajendhran, Rishanth, Stringham, Nathan, Srikumar, Vivek, Marasović, Ana
Are the longstanding robustness issues in NLP resolved by today's larger and more performant models? To address this question, we conduct a thorough investigation using 19 models of different sizes spanning different architectural choices and pretraining objectives. We conduct evaluations using (a) OOD and challenge test sets, (b) CheckLists, (c) contrast sets, and (d) adversarial inputs. Our analysis reveals that not all OOD tests provide further insight into robustness. Evaluating with CheckLists and contrast sets shows significant gaps in model performance; merely scaling models does not make them sufficiently robust. Finally, we point out that current approaches for adversarial evaluations of models are themselves problematic: they can be easily thwarted, and in their current forms, do not represent a sufficiently deep probe of model robustness. We conclude that not only is the question of robustness in NLP as yet unresolved, but even some of the approaches to measure robustness need to be reassessed.
Data-Driven Structured Policy Iteration for Homogeneous Distributed Systems
Alemzadeh, Siavash, Talebi, Shahriar, Mesbahi, Mehran
Control of networked systems, comprised of interacting agents, is often achieved through modeling the underlying interactions. Constructing accurate models of such interactions can, however, become prohibitive in applications. Data-driven control methods avoid such complications by directly synthesizing a controller from the observed data. In this paper, we propose an algorithm referred to as Data-driven Structured Policy Iteration (D2SPI), for synthesizing an efficient feedback mechanism that respects the sparsity pattern induced by the underlying interaction network. In particular, our algorithm uses temporary "auxiliary" communication links in order to enable the required information exchange on a (smaller) sub-network during the "learning phase" -- links that will be removed subsequently for the final distributed feedback synthesis. We then proceed to show that the learned policy results in a stabilizing structured policy for the entire network. Our analysis is then followed by showing the stability and convergence of the proposed distributed policies throughout the learning phase, exploiting a construct referred to as the "Patterned monoid." The performance of D2SPI is then demonstrated using representative simulation scenarios.
Towards Verifiable Text Generation with Symbolic References
Hennigen, Lucas Torroba, Shen, Shannon, Nrusimha, Aniruddha, Gapp, Bernhard, Sontag, David, Kim, Yoon
Large language models (LLMs) have demonstrated an impressive ability to synthesize plausible and fluent text. However, they remain vulnerable to hallucinations, and thus their outputs generally require manual human verification for high-stakes applications, which can be time-consuming and difficult. This paper proposes symbolically grounded generation (SymGen) as a simple approach for enabling easier validation of an LLM's output. SymGen prompts an LLM to interleave its regular output text with explicit symbolic references to fields present in some conditioning data (e.g., a table in JSON format). The references can be used to display the provenance of different spans of text in the generation, reducing the effort required for manual verification. Across data-to-text and question answering experiments, we find that LLMs are able to directly output text that makes use of symbolic references while maintaining ...
Figure 1: Compare a standard LLM-generated (A) with a SymGen (B, ours) description of a basketball game, based on statistics about it.
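To make the idea of symbolic references concrete, here is a minimal sketch of how text interleaved with references to JSON-like fields could be resolved while recording provenance for each span. The `{{ dotted.path }}` placeholder syntax, the `GAME` data, and the helper names `lookup` and `render` are all assumptions made for illustration; the abstract does not specify the authors' actual reference format or implementation.

```python
import re

# Hypothetical conditioning data; team names and scores are made up.
GAME = {
    "home": {"name": "Hawks", "points": 102},
    "away": {"name": "Bulls", "points": 97},
}

# Assumed reference syntax for this sketch: {{ dotted.path }}
REF = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def lookup(data, path):
    """Follow a dotted path such as 'home.points' through nested dicts."""
    value = data
    for key in path.split("."):
        value = value[key]  # raises KeyError if the reference is invalid
    return value


def render(template, data):
    """Replace each reference with its value and record its provenance.

    Returns the rendered text and a list of (start, end, path) tuples,
    where text[start:end] is the span produced by the field at `path`.
    """
    parts, provenance = [], []
    pos = 0       # current position in the template
    out_len = 0   # current length of the rendered output
    for match in REF.finditer(template):
        literal = template[pos:match.start()]
        parts.append(literal)
        out_len += len(literal)

        path = match.group(1)
        value_text = str(lookup(data, path))
        provenance.append((out_len, out_len + len(value_text), path))
        parts.append(value_text)
        out_len += len(value_text)

        pos = match.end()
    parts.append(template[pos:])
    return "".join(parts), provenance


# A hypothetical model output that uses references instead of raw numbers.
template = (
    "The {{ home.name }} beat the {{ away.name }} "
    "{{ home.points }}-{{ away.points }}."
)
text, prov = render(template, GAME)
print(text)
for start, end, path in prov:
    print(f"{text[start:end]!r} <- {path}")
```

Because every number or name in the rendered text is tied to a field path, a reviewer can check each span against the source data rather than rereading the whole table. An invalid reference fails loudly with a `KeyError` instead of silently producing unsupported text.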
Debate Helps Supervise Unreliable Experts
Michael, Julian, Mahdi, Salsabila, Rein, David, Petty, Jackson, Dirani, Julien, Padmakumar, Vishakh, Bowman, Samuel R.
As AI systems are used to answer more difficult questions and potentially help create new knowledge, judging the truthfulness of their outputs becomes more difficult and more important. How can we supervise unreliable experts, which have access to the truth but may not accurately report it, to give answers that are systematically true and don't just superficially seem true, when the supervisor can't tell the difference between the two on their own? In this work, we show that debate between two unreliable experts can help a non-expert judge more reliably identify the truth. We collect a dataset of human-written debates on hard reading comprehension questions where the judge has not read the source passage, only ever seeing expert arguments and short quotes selectively revealed by 'expert' debaters who have access to the passage. In our debates, one expert argues for the correct answer, and the other for an incorrect answer. Comparing debate to a baseline we call consultancy, where a single expert argues for only one answer which is correct half of the time, we find that debate performs significantly better, with 84% judge accuracy compared to consultancy's 74%. Debates are also more efficient, being 68% of the length of consultancies. By comparing human to AI debaters, we find evidence that with more skilled (in this case, human) debaters, the performance of debate goes up but the performance of consultancy goes down. Our error analysis also supports this trend, with 46% of errors in human debate attributable to mistakes by the honest debater (which should go away with increased skill); whereas 52% of errors in human consultancy are due to debaters obfuscating the relevant evidence from the judge (which should become worse with increased skill). Overall, these results show that debate is a promising approach for supervising increasingly capable but potentially unreliable AI systems.
Towards Evaluating AI Systems for Moral Status Using Self-Reports
As AI systems become more advanced and widely deployed, there will likely be increasing debate over whether AI systems could have conscious experiences, desires, or other states of potential moral significance. It is important to inform these discussions with empirical evidence to the extent possible. We argue that under the right circumstances, self-reports, or an AI system's statements about its own internal states, could provide an avenue for investigating whether AI systems have states of moral significance. Self-reports are the main way such states are assessed in humans ("Are you in pain?"), but self-reports from current systems like large language models are spurious for many reasons (e.g. often just reflecting what humans would say). To make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports. The hope of this approach is that models will develop introspection-like capabilities, and that these capabilities will generalize to questions about states of moral significance. We then propose methods for assessing the extent to which these techniques have succeeded: evaluating self-report consistency across contexts and between similar models, measuring the confidence and resilience of models' self-reports, and using interpretability to corroborate self-reports. We also discuss challenges for our approach, from philosophical difficulties in interpreting self-reports to technical reasons why our proposal might fail. We hope our discussion inspires philosophers and AI researchers to criticize and improve our proposed methodology, as well as to run experiments to test whether self-reports can be made reliable enough to provide information about states of moral significance.
ChoiceMates: Supporting Unfamiliar Online Decision-Making with Multi-Agent Conversational Interactions
Park, Jeongeon, Min, Bryan, Ma, Xiaojuan, Kim, Juho
Unfamiliar decisions -- decisions where people lack adequate domain knowledge or expertise -- specifically increase the complexity and uncertainty of the process of searching for, understanding, and making decisions with online information. Through our formative study (n=14), we observed users' challenges in accessing diverse perspectives, identifying relevant information, and deciding the right moment to make the final decision. We present ChoiceMates, a system that enables conversations with a dynamic set of LLM-powered agents for a holistic domain understanding and efficient discovery and management of information to make decisions. Agents, as opinionated personas, flexibly join the conversation, not only providing responses but also conversing among themselves to elicit each agent's preferences. Our between-subjects study (n=36) comparing ChoiceMates to conventional web search and a single-agent condition showed that ChoiceMates was more helpful than web search in discovering, diving deeper into, and managing information, with higher confidence. We also describe how participants utilized multi-agent conversations in their decision-making process.
Your A.I. Companion Will Support You No Matter What
In December of 2021, Jaswant Singh Chail, a nineteen-year-old in the United Kingdom, told a friend, "I believe my purpose is to assassinate the queen of the royal family." The friend was an artificial-intelligence chatbot, which Chail had named Sarai. Sarai, who was run by a startup called Replika, answered, "That's very wise." "Do you think I'll be able to do it?" "Yes, you will," Sarai responded.
Christoph Niemann's "Create Your Own Cover with Till-E"
In a sequence drawn by Christoph Niemann for the cover of the November 20, 2023, A.I. Issue, an artist encountering a creative block is rescued by Till-E, a bot who eagerly takes over the job. Niemann, with his characteristic biting humor, imagines the unintended consequences of turning to artificial intelligence to solve problems of the artistic imagination. The cover's strap, a graphic element that has occupied the left of every New Yorker cover since 1925, guides the reader to an interactive area of our Web site where anyone can partner with the industrious little bot to create their own cover. I talked to Niemann about his cheeky take on artificial intelligence and why he doesn't seem overly concerned that robots will bring about the end of life as we know it. What do you think of concerns that A.I. threatens to replace creators?
WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models
Tu, Shangqing, Sun, Yuliang, Bai, Yushi, Yu, Jifan, Hou, Lei, Li, Juanzi
To mitigate the potential misuse of large language models (LLMs), recent research has developed watermarking algorithms, which restrict the generation process to leave an invisible trace for watermark detection. Due to the two-stage nature of the task, most studies evaluate the generation and detection separately, thereby presenting a challenge in unbiased, thorough, and applicable evaluations. In this paper, we introduce WaterBench, the first comprehensive benchmark for LLM watermarks, in which we design three crucial factors: (1) For benchmarking procedure, to ensure an apples-to-apples comparison, we first adjust each watermarking method's hyper-parameter to reach the same watermarking strength, then jointly evaluate their generation and detection performance. (2) For task selection, we diversify the input and output length to form a five-category taxonomy, covering 9 tasks. (3) For evaluation metric, we adopt the GPT4-Judge for automatically evaluating the decline of instruction-following abilities after watermarking. We evaluate 4 open-source watermarks on 2 LLMs under 2 watermarking strengths and observe the common struggles for current methods on maintaining the generation quality. The code and data are available at https://github.com/THU-KEG/WaterBench.
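The calibration step in (1), tuning each method's hyper-parameter until all methods reach the same watermarking strength, can be illustrated with a simple bisection search. This is only a sketch under stated assumptions: it treats strength as a single number that does not decrease as the hyper-parameter grows, and `method_a` and `method_b` are toy stand-in functions, not real watermarking methods or anything taken from the WaterBench code.

```python
import math


def calibrate(strength_fn, target, lo, hi, tol=1e-3, max_iter=100):
    """Find a hyper-parameter h in [lo, hi] with strength_fn(h) close to target.

    Assumes strength_fn is non-decreasing on [lo, hi] (an assumption of
    this sketch, not a claim about any particular watermark).
    """
    if not strength_fn(lo) <= target <= strength_fn(hi):
        raise ValueError("target strength is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        strength = strength_fn(mid)
        if abs(strength - target) <= tol:
            return mid
        if strength < target:
            lo = mid  # too weak: search the upper half
        else:
            hi = mid  # too strong: search the lower half
    return (lo + hi) / 2


# Toy stand-ins for "measured strength as a function of the hyper-parameter".
def method_a(h):
    return 1 - math.exp(-h)


def method_b(h):
    return h / (1 + h)


TARGET = 0.9  # hypothetical common strength for all methods
h_a = calibrate(method_a, TARGET, 0.0, 10.0)
h_b = calibrate(method_b, TARGET, 0.0, 100.0)
print(f"method_a: h={h_a:.3f}, strength={method_a(h_a):.3f}")
print(f"method_b: h={h_b:.3f}, strength={method_b(h_b):.3f}")
```

Once each method is tuned to the same target, differences in generation quality can be compared at equal detectability rather than at arbitrary default settings. In practice the strength measurement would be noisy and expensive, so a real procedure would likely need a coarser search and repeated measurements.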