Tang, Zhisheng
Humanlike Cognitive Patterns as Emergent Phenomena in Large Language Models
Tang, Zhisheng, Kejriwal, Mayank
Research on emergent patterns in Large Language Models (LLMs) has gained significant traction in both psychology and artificial intelligence, motivating the need for a comprehensive review that synthesizes this complex landscape. In this article, we systematically review LLMs' capabilities across three important cognitive domains: decision-making biases, reasoning, and creativity. Drawing on empirical studies that use established psychological tests, we compare LLMs' performance to human benchmarks. On decision-making, our synthesis reveals that while LLMs demonstrate several human-like biases, some biases observed in humans are absent, indicating cognitive patterns that only partially align with human decision-making. On reasoning, advanced LLMs like GPT-4 exhibit deliberative reasoning akin to human System-2 thinking, while smaller models fall short of human-level performance. A distinct dichotomy emerges in creativity: while LLMs excel in language-based creative tasks, such as storytelling, they struggle with divergent thinking tasks that require real-world context. Nonetheless, studies suggest that LLMs hold considerable potential as collaborators, augmenting creativity in human-machine problem-solving settings. Discussing key limitations, we also offer guidance for future research in areas such as memory, attention, and open-source model development.
GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning
Tang, Zhisheng, Kejriwal, Mayank
Spatial reasoning, an important faculty of human cognition with many practical applications, is one of the core commonsense skills that is not purely language-based and, for satisficing (as opposed to optimal) solutions, requires some minimum degree of planning. Existing benchmarks of Commonsense Spatial Reasoning (CSR) tend to evaluate how Large Language Models (LLMs) interpret text-based spatial descriptions rather than directly evaluate a plan produced by the LLM in response to a spatial reasoning scenario. In this paper, we construct a large-scale benchmark called GRASP, which consists of 16,000 grid-based environments where the agent is tasked with an energy collection problem. These environments comprise 100 grid instances instantiated under each of 160 different grid settings, involving five different energy distributions, two modes of agent starting position, and two distinct obstacle configurations, as well as three kinds of agent constraints. Using GRASP, we compare classic baseline approaches, such as random walk and greedy search methods, with advanced LLMs like GPT-3.5-Turbo and GPT-4o. The experimental results indicate that even these advanced LLMs struggle to consistently achieve satisfactory solutions.
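To make the kind of non-LLM baseline mentioned above concrete, the following is a minimal sketch, under our own assumptions rather than the paper's actual implementation, of a greedy agent on a toy GRASP-style grid: it repeatedly moves toward the nearest remaining energy token within a fixed step budget. The names `make_grid` and `greedy_collect`, the grid encoding, and the budget are hypothetical illustrations only.

```python
import random

def make_grid(size=5, n_energy=6, seed=0):
    """Toy stand-in for a GRASP-style environment: a square grid with
    randomly placed energy tokens (no obstacles or agent constraints)."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(size) for c in range(size)]
    return set(rng.sample(cells, n_energy))

def greedy_collect(energy, start=(0, 0), budget=20):
    """Greedy baseline: repeatedly walk (Manhattan moves) to the nearest
    remaining energy token until the step budget is exhausted."""
    pos, steps, collected = start, 0, 0
    remaining = set(energy)
    while remaining and steps < budget:
        target = min(remaining, key=lambda e: abs(e[0] - pos[0]) + abs(e[1] - pos[1]))
        dist = abs(target[0] - pos[0]) + abs(target[1] - pos[1])
        if steps + dist > budget:
            break
        pos, steps = target, steps + dist
        remaining.discard(target)
        collected += 1
    return collected, steps

if __name__ == "__main__":
    grid = make_grid()
    collected, steps = greedy_collect(grid)
    print(f"Greedy agent collected {collected} tokens in {steps} steps")
```

An LLM baseline, by contrast, would be prompted with a serialized description of the same grid and asked to produce a move sequence, which is then scored by the same energy-collection criterion; this is what is meant above by directly evaluating the plan rather than the text interpretation.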
An Evaluation of Estimative Uncertainty in Large Language Models
Tang, Zhisheng, Shen, Ke, Kejriwal, Mayank
Words of estimative probability (WEPs), such as ''maybe'' or ''probably not'' are ubiquitous in natural language for communicating estimative uncertainty, compared with direct statements involving numerical probability. Human estimative uncertainty, and its calibration with numerical estimates, has long been an area of study -- including by intelligence agencies like the CIA. This study compares estimative uncertainty in commonly used large language models (LLMs) like GPT-4 and ERNIE-4 to that of humans, and to each other. Here we show that LLMs like GPT-3.5 and GPT-4 align with human estimates for some, but not all, WEPs presented in English. Divergence is also observed when the LLM is presented with gendered roles and Chinese contexts. Further study shows that an advanced LLM like GPT-4 can consistently map between statistical and estimative uncertainty, but a significant performance gap remains. The results contribute to a growing body of research on human-LLM alignment.
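To illustrate the alignment check in the simplest possible terms, the comparison can be thought of as asking whether a model's numeric reading of a WEP falls inside a human-derived reference interval. The sketch below uses an approximate Kent-style scale; the specific phrases, intervals, and the `within_human_range` helper are illustrative assumptions, not the study's actual protocol or data.

```python
# Approximate Kent-style reference intervals for a few WEPs (illustrative only).
HUMAN_RANGES = {
    "almost certain":       (0.87, 0.99),
    "probable":             (0.63, 0.87),
    "chances about even":   (0.40, 0.60),
    "probably not":         (0.20, 0.40),
    "almost certainly not": (0.02, 0.12),
}

def within_human_range(wep: str, model_estimate: float) -> bool:
    """Return True if the model's numeric probability for a WEP falls
    inside the human reference interval for that phrase."""
    lo, hi = HUMAN_RANGES[wep]
    return lo <= model_estimate <= hi

# Example: if an LLM, asked what "probably not" means numerically,
# answers 0.25, that counts as aligned with the human interval.
print(within_human_range("probably not", 0.25))    # True
print(within_human_range("almost certain", 0.70))  # False
```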
A Pilot Evaluation of ChatGPT and DALL-E 2 on Decision Making and Spatial Reasoning
Tang, Zhisheng, Kejriwal, Mayank
An early popular example is the Bidirectional Encoder Representations from Transformers (BERT) model [2], which soon led to many domain-specific variants, as well as a more optimized version that was able to yield significant improvements without major changes to the original BERT architecture [3]. Perhaps because of its success, researchers have been attempting to empirically understand the properties (including biases and blind spots [4]) of even early transformer models such as BERT, along multiple dimensions [5-7]. While these tests, some of which have been adversarial by design, have revealed some problems, a growing body of research also shows that these models have achieved truly impressive, non-incremental performance advances on various natural language understanding problems [8]. While it can be convenient to overweight mistakes by the models, especially if the mistakes are 'un-humanlike' and made in seemingly simple situations, and to dismiss the models as incapable of semantics or symbolic processing, such commentary potentially opens the door to confirmation bias. We are not denying the utility of critical and adversarial testing of such models [9, 10]; however, we do caution that there is a danger of the results of such tests being interpreted out of context. Arguably, the latest transformer models, such as ChatGPT and DALL-E, captured the public spotlight by being able to process relatively complex human inputs with unprecedented skill [11]. They have also ignited an AI arms race of sorts between large technology corporations. Some of this discourse is hyped, but some could be argued to be justified as correctly describing a major leap in AI progress, at least in an empirical sense [12, 13].