Trustworthy Retrosynthesis: Eliminating Hallucinations with a Diverse Ensemble of Reaction Scorers
Sadowski, Michal, Radusinović, Tadija, Wyrzykowska, Maria, Sztukiewicz, Lukasz, Rzymkowski, Jan, Włodarczyk-Pruszyński, Paweł, Sacha, Mikołaj, Kozakowski, Piotr, van Workum, Ruard, Jastrzebski, Stanislaw Kamil
Retrosynthesis is one of the domains transformed by the rise of generative models, and it is one where the problem of nonsensical or erroneous outputs (hallucinations) is particularly insidious: reliable assessment of synthetic plans is time-consuming, with automatic methods lacking. In this work, we present RetroTrim, a retrosynthesis system that successfully avoids nonsensical plans on a set of challenging drug-like targets. Compared to common baselines in the field, our system is not only the sole method that succeeds in filtering out hallucinated reactions, but it also results in the highest number of high-quality paths overall. The key insight behind RetroTrim is the combination of diverse reaction scoring strategies, based on machine learning models and existing chemical databases. We show that our scoring strategies capture different classes of hallucinations by analyzing them on a dataset of labeled retrosynthetic intermediates. This approach formed the basis of our winning solution to the Standard Industries $1 million Retrosynthesis Challenge. To measure the performance of retrosynthesis systems, we propose a novel evaluation protocol for reactions and synthetic paths based on a structured review by expert chemists. Using this protocol, we compare systems on a set of 32 novel targets, curated to reflect recent trends in drug structures. While the insights behind our methodology are broadly applicable to retrosynthesis, our focus is on targets in the drug-like domain. By releasing our benchmark targets and the details of our evaluation protocol, we hope to inspire further research into reliable retrosynthesis.
Sequences of Logits Reveal the Low Rank Structure of Language Models
Golowich, Noah, Liu, Allen, Shetty, Abhishek
A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model's logits for varying sets of prompts and responses have low approximate rank. We then show that this low-rank structure can be leveraged for generation -- in particular, we can generate a response to a target prompt using a linear combination of the model's outputs on unrelated, or even nonsensical prompts. On the theoretical front, we observe that studying the approximate rank of language models in the sense discussed above yields a simple universal abstraction whose theoretical predictions parallel our experiments. We then analyze the representation power of the abstraction and give provable learning guarantees.
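To make the low-rank idea above concrete, here is a minimal sketch rather than the authors' method. It uses made-up numbers in place of real model logits, and it measures plain numerical rank by Gaussian elimination, which is only a stand-in for whatever notion of approximate rank the paper defines. The names `numerical_rank`, `prompt_a`, `prompt_b`, and `target` are hypothetical. The point is only that when one prompt's row of logits is a linear combination of the rows for other prompts, the matrix rank stays small.

```python
from typing import List

Matrix = List[List[float]]


def numerical_rank(matrix: Matrix, tol: float = 1e-9) -> int:
    """Count the pivots found by Gaussian elimination with partial pivoting."""
    rows = [row[:] for row in matrix]  # work on a copy, leave the input intact
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Choose the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # nothing left in this column, so it adds no rank
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


# Hypothetical logits: each row is a prompt, each column a candidate response.
prompt_a = [1.0, 0.5, -2.0, 3.0, 0.0]
prompt_b = [0.0, 1.5, 1.0, -1.0, 2.0]
# Assumption for illustration: the target prompt's row is 2 * prompt_a - prompt_b.
target = [2.0 * a - 1.0 * b for a, b in zip(prompt_a, prompt_b)]

logit_matrix = [prompt_a, prompt_b, target]
print(numerical_rank(logit_matrix))  # three prompts, but only two independent rows
```

In this toy setting the target row carries no information beyond the other two rows, which is the property the abstract describes exploiting for generation.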
Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
Wang, Yang, Xiao, Chenghao, Hsiao, Chia-Yi, Chang, Zi Yan, Chen, Chi-Li, Loakman, Tyler, Lin, Chenghua
We introduce Drivelology, a unique linguistic phenomenon characterised as "nonsense with depth" - utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a benchmark dataset of over 1,200 meticulously curated and diverse examples across English, Mandarin, Spanish, French, Japanese, and Korean. Each example underwent careful expert review to verify its Drivelological characteristics, involving multiple rounds of discussion and adjudication to address disagreements. Using this dataset, we evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss implied rhetorical functions altogether. These findings highlight a deep representational gap in LLMs' pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.
Five things you need to know about AI right now
The video is now available (thank you, SXSW London). Below is a quick look at my top five. Let me know if you would have picked different ones! Maybe you think that's obvious. But I am constantly having to check my assumptions about how fast this technology is progressing--and it's my job to keep up.
The Two Word Test: A Semantic Benchmark for Large Language Models
Riccardi, Nicholas, Desai, Rutvik H.
Large Language Models (LLMs) have shown remarkable abilities recently, including passing advanced professional exams and demanding benchmark tests. This performance has led many to suggest that they are close to achieving humanlike or 'true' understanding of language, and even Artificial General Intelligence (AGI). Here, we provide a new open-source benchmark that assesses the semantic abilities of LLMs with two-word phrases, using a task that humans can perform relatively easily without advanced training. Combining multiple words into a single concept is a fundamental aspect of human language and intelligence. The test requires meaningfulness judgments of 1768 noun-noun combinations that have been rated as meaningful (e.g., baby boy) or not meaningful (e.g., goat sky) by 150 human raters. We provide versions of the task that probe meaningfulness ratings on a 0-4 scale as well as binary judgments. We conducted a series of experiments using the TWT on GPT-4, GPT-3.5, and Bard, with both versions. Results demonstrated that, compared to humans, all models perform poorly at rating meaningfulness of these phrases. GPT-3.5 and Bard are also unable to make binary discriminations between sensible and nonsense phrases. GPT-4 makes a substantial improvement in binary discrimination of combinatorial phrases but is still significantly worse than human performance. The TWT can be used to understand the limitations and weaknesses of current LLMs, and potentially improve them. The test also reminds us that caution is warranted in attributing 'true understanding' or AGI to LLMs. TWT is available at: https://github.com/NickRiccardi/two-word-test
Google improves Bard to compete with ChatGPT: here's what's new
Google has recently improved its AI chatbot, Bard, in an effort to rival its competitor, ChatGPT. The tech giant has optimized the AI's responses in some areas and improved the chatbot's abilities in mathematics and logic. The first feedback on Bard was not positive, with testers criticizing the many restrictions put in place by Google. The company had locked down the experience to avoid abuse. To address Bard's limitations, Google has pledged to make improvements to its artificial intelligence.
GPT-4 Has the Memory of a Goldfish
By this point, the many defects of AI-based language models have been analyzed to death--their incorrigible dishonesty, their capacity for bias and bigotry, their lack of common sense. GPT-4, the newest and most advanced such model yet, is already being subjected to the same scrutiny, and it still seems to misfire in pretty much all the ways earlier models did. But large language models have another shortcoming that has so far gotten relatively little attention: their shoddy recall. These multibillion-dollar programs, which require several city blocks' worth of energy to run, may now be able to code websites, plan vacations, and draft company-wide emails in the style of William Faulkner. But they have the memory of a goldfish.
AutoReply: Detecting Nonsense in Dialogue Introspectively with Discriminative Replies
Shi, Weiyan, Dinan, Emily, Renduchintala, Adi, Fried, Daniel, Jacob, Athul Paul, Yu, Zhou, Lewis, Mike
Existing approaches build separate classifiers to detect nonsense in dialogues. In this paper, we show that without external classifiers, dialogue models can detect errors in their own messages introspectively, by calculating the likelihood of replies that are indicative of poor messages. For example, if an agent believes its partner is likely to respond "I don't understand" to a candidate message, that message may not make sense, so an alternative message should be chosen. We evaluate our approach on a dataset from the game Diplomacy, which contains long dialogues richly grounded in the game state, on which existing models make many errors. We first show that hand-crafted replies can be effective for the task of detecting nonsense in applications as complex as Diplomacy. We then design AutoReply, an algorithm to search for such discriminative replies automatically, given a small number of annotated dialogue examples. We find that AutoReply-generated replies outperform hand-crafted replies and perform on par with carefully fine-tuned large supervised models. Results also show that a single reply, without much computational overhead, can detect dialogue nonsense reasonably well.
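The introspective check described above can be sketched as follows. This is an illustration under stated assumptions, not the AutoReply algorithm itself. The scorer `toy_log_prob` is a hypothetical lookup table standing in for a dialogue model's log-likelihood of a reply, and the 0.2 threshold and the function names are made up for the example. The automatic search for discriminative replies is not shown.

```python
import math
from typing import Callable, Optional, Sequence

# Hypothetical scorer type: log P(reply | candidate message), as a dialogue model might give.
ReplyScorer = Callable[[str, str], float]


def nonsense_score(candidate: str, replies: Sequence[str], log_prob: ReplyScorer) -> float:
    """Sum the probability the scorer assigns to each discriminative reply."""
    return sum(math.exp(log_prob(candidate, reply)) for reply in replies)


def pick_message(
    candidates: Sequence[str],
    replies: Sequence[str],
    log_prob: ReplyScorer,
    threshold: float = 0.2,  # assumed cut-off, not taken from the paper
) -> Optional[str]:
    """Return the first candidate whose nonsense score is below the threshold."""
    for candidate in candidates:
        if nonsense_score(candidate, replies, log_prob) < threshold:
            return candidate
    return None  # every candidate looked like nonsense


# Toy numbers standing in for a real model's likelihoods.
TOY_LOG_PROBS = {
    ("I move my fleet to the moon.", "I don't understand"): math.log(0.6),
    ("I will support your move into Munich.", "I don't understand"): math.log(0.05),
}


def toy_log_prob(candidate: str, reply: str) -> float:
    # Unknown pairs get a tiny probability instead of raising an error.
    return TOY_LOG_PROBS.get((candidate, reply), math.log(1e-6))


chosen = pick_message(
    ["I move my fleet to the moon.", "I will support your move into Munich."],
    ["I don't understand"],
    toy_log_prob,
)
print(chosen)
```

With these toy values the first candidate is rejected because "I don't understand" is a likely reply to it, and the second candidate is selected instead.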
Productizing Large Language Models
Large Language Models (LLMs) are known for their near-magical ability to learn from very few examples -- as few as zero -- to create language wonders. LLMs can chat, write poetry, write code, and even do basic arithmetic. However, the same properties that make LLMs magical also make them challenging from an engineering perspective. At Replit we have deployed transformer-based language models of all sizes: 100M-parameter models for search and spam, 1-10B models for a code autocomplete product we call GhostWriter, and 100B models for features that require higher reasoning ability. In this post we'll talk about what we've learned about building and hosting large language models.
How Would an AI Chatbot Handle the Complexities of Oral Language?
Joseph Wilson, a linguist and journalist who has done considerable work with oral languages (languages not yet written down), offers some thoughts on claims that chatbots like Blake Lemoine's LaMDA really speak like human persons. But this excludes all unwritten forms of communication: sign language, oral histories, body language, tone of voice, and the broader cultural context in which people find themselves speaking. In other words, it leaves out much of the interesting stuff that makes nuanced communication between people possible. We really don't know how old spoken language is (Wilson suggests 50,000 years), but written language can be traced back only about 5,400 years. And only about half of all languages (he estimates 7,100 currently) have ever been written down.