Turing's Test


Thinking Machines Cofounder's Office Relationship Preceded His Termination

WIRED

Leaders at Mira Murati's startup believe Barret Zoph engaged in an incident of "serious misconduct." The details are now coming to light. Leaders at Mira Murati's Thinking Machines Lab confronted the startup's cofounder and former CTO, Barret Zoph, over an alleged relationship with another employee last summer, WIRED has learned. That relationship was likely the alleged "misconduct" mentioned in prior reporting, including by WIRED. To protect the privacy of the individuals involved, WIRED is not naming the employee in question.


Turing Test 2.0: The General Intelligence Threshold

Mappouras, Georgios

arXiv.org Artificial Intelligence

With the rise of artificial intelligence (A.I.) and large language models like ChatGPT, a new race for achieving artificial general intelligence (A.G.I.) has started. While many speculate how and when A.I. will achieve A.G.I., there is no clear agreement on how A.G.I. can be detected in A.I. models, even when popular tools like the Turing test (and its modern variations) are used to measure their intelligence. In this work, we discuss why traditional methods like the Turing test do not suffice for measuring or detecting A.G.I. and provide a new, practical method that can be used to decide if a system (computer or any other) has reached or surpassed A.G.I. To achieve this, we make two new contributions. First, we present a clear definition for general intelligence (G.I.) and set a G.I. Threshold (G.I.T.) that can be used to distinguish between systems that achieve A.G.I. and systems that do not. Second, we present a new framework on how to construct tests that can detect if a system has achieved G.I. in a simple, comprehensive, and clear-cut fail/pass way. We call this novel framework the Turing test 2.0. We then demonstrate real-life examples of applying tests that follow our Turing test 2.0 framework on modern A.I. models.


In Defense of the Turing Test and its Legacy

Gonçalves, Bernardo

arXiv.org Artificial Intelligence

The Turing test has faced criticism for decades, most recently at the Royal Society event "Celebrating the 75th Anniversary of the Turing Test." The question of the test's significance has intensified with recent advances in large language model technology, which now enable machines to pass it. In this article, I argue that Turing's original test was co-opted by Weizenbaum and that six of the most common criticisms of the Turing test are unfair to both Turing's argument and the historical development of AI. The six criticisms are: the Turing test encourages fooling people; Turing overestimated human intelligence, as people can be easily fooled (the ELIZA effect); the Turing test is not a good benchmark for AI; Turing's 1950 paper is not serious and/or has contradictions; imitation should not be a goal for AI, and it is also harmful to society; and passing the Turing test teaches nothing about AI. All six largely derive from Joseph Weizenbaum's influential reinterpretation of the Turing test. The first four fail to withstand a close examination of the internal logic of Turing's 1950 paper, particularly when the paper is situated within its mid-twentieth-century context.


Normality and the Turing Test

Kabbach, Alexandre

arXiv.org Artificial Intelligence

This paper proposes to revisit the Turing test through the concept of normality. Its core argument is that the Turing test is a test of normal intelligence as assessed by a normal judge. First, in the sense that the Turing test targets normal/average rather than exceptional human intelligence, so that successfully passing the test requires machines to "make mistakes" and display imperfect behavior just like normal/average humans. Second, in the sense that the Turing test is a statistical test where judgments of intelligence are never carried out by a single "average" judge (understood as non-expert) but always by a full jury. As such, the notion of "average human interrogator" that Turing talks about in his original paper should be understood primarily as referring to a mathematical abstraction made of the normalized aggregate of individual judgments of multiple judges. Its conclusions are twofold. First, it argues that large language models such as ChatGPT are unlikely to pass the Turing test as those models precisely target exceptional rather than normal/average human intelligence. As such, they constitute models of what it proposes to call artificial smartness rather than artificial intelligence, insofar as they deviate from the original goal of Turing for the modeling of artificial minds. Second, it argues that the objectivization of normal human behavior in the Turing test fails due to the game configuration of the test which ends up objectivizing normative ideals of normal behavior rather than normal behavior per se.


Perfect AI Mimicry and the Epistemology of Consciousness: A Solipsistic Dilemma

Li, Shurui

arXiv.org Artificial Intelligence

Rapid advances in artificial intelligence necessitate a re-examination of the epistemological foundations upon which we attribute consciousness. As AI systems increasingly mimic human behavior and interaction with high fidelity, the concept of a "perfect mimic" -- an entity empirically indistinguishable from a human through observation and interaction -- shifts from hypothetical to technologically plausible. This paper argues that such developments pose a fundamental challenge to the consistency of our mind-recognition practices. Consciousness attributions rely heavily, if not exclusively, on empirical evidence derived from behavior and interaction. If a perfect mimic provides evidence identical to that of humans, any refusal to grant it equivalent epistemic status must invoke inaccessible factors, such as qualia, substrate requirements, or origin. Selectively invoking such factors risks a debilitating dilemma: either we undermine the rational basis for attributing consciousness to others (epistemological solipsism), or we accept inconsistent reasoning. I contend that epistemic consistency demands we ascribe the same status to empirically indistinguishable entities, regardless of metaphysical assumptions. The perfect mimic thus acts as an epistemic mirror, forcing critical reflection on the assumptions underlying intersubjective recognition in light of advancing AI. This analysis carries significant implications for theories of consciousness and ethical frameworks concerning artificial agents.


Exclusive: Mira Murati's Stealth AI Lab Launches Its First Product

WIRED

Thinking Machines Lab, led by a group of prominent former OpenAI researchers, is betting that fine-tuning cutting-edge models will be the next frontier in AI. Thinking Machines Lab, a heavily funded startup cofounded by prominent researchers from OpenAI, has revealed its first product--a tool called Tinker that automates the creation of custom frontier AI models. "We believe [Tinker] will help empower researchers and developers to experiment with models and will make frontier capabilities much more accessible to all people," said Mira Murati, cofounder and CEO of Thinking Machines, in an interview with WIRED ahead of the announcement. Big companies and academic labs already fine-tune open source AI models to create new variants that are optimized for specific tasks, like solving math problems, drafting legal agreements, or answering medical questions. Typically, this work involves acquiring and managing clusters of GPUs and using various software tools to ensure that large-scale training runs are stable and efficient.


ChatGPT-generated texts show authorship traits that identify them as non-human

Dentella, Vittoria, Huang, Weihang, Mansi, Silvia Angela, Grieve, Jack, Leivada, Evelina

arXiv.org Artificial Intelligence

Large Language Models can emulate different writing styles, ranging from composing poetry that appears indistinguishable from that of famous poets to using slang that can convince people that they are chatting with a human online. While differences in style may not always be visible to the untrained eye, we can generally distinguish the writing of different people, like a linguistic fingerprint. This work examines whether a language model can also be linked to a specific fingerprint. Through stylometric and multidimensional register analyses, we compare human-authored and model-authored texts from different registers. We find that the model can successfully adapt its style depending on whether it is prompted to produce a Wikipedia entry vs. a college essay, but not in a way that makes it indistinguishable from humans. Concretely, the model shows more limited variation when producing outputs in different registers. Our results suggest that the model prefers nouns to verbs, thus showing a distinct linguistic backbone from humans, who tend to anchor language in the highly grammaticalized dimensions of tense, aspect, and mood. It is possible that the more complex domains of grammar reflect a mode of thought unique to humans, thus acting as a litmus test for Artificial Intelligence. Scholars from different disciplines have been addressing the question of what makes us human for centuries. For Nobel laureate Bertrand Russell, the answer is language, for "no matter how eloquently a dog may bark, he cannot tell you that his parents were poor but honest". Human language is both flexible and constrained at the same time, and this is why the Turing Test, described as a litmus test for Artificial Intelligence [Shieber 1994, French 2000], is linked to achieving a level of conversational proficiency that is highly complex, akin to that of a human [Turing 1950].
Human language is flexible in the sense that we all make different choices when conversing. Every human is thought to have a distinct linguistic fingerprint called idiolect [Halliday et al. 1964, Coulthard 2004]. This idiolect, which can be defined as an individual's unique use of linguistic forms (including lexical choices, collocations and fixed expressions, punctuation patterns, misspellings, and grammatical style), is critical for authorship attribution in a range of situations: from identifying that a poem with dashes, elliptical syntax, and unconventional capitalization is more likely authored by Emily Dickinson and not by William Shakespeare, to pinning down a person of interest in the course of a criminal investigation, as happened in the Unabomber case.
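The noun-vs-verb preference the authors report can be quantified with a simple part-of-speech ratio. The sketch below is illustrative only: the tagged tokens are hand-made toy data, not the paper's corpus, and in practice the tags would come from a real POS tagger.

```python
# Toy sketch of a noun-to-verb ratio, one stylometric signal of a
# "nominal" vs. "verbal" writing style. Tags follow the Universal
# Dependencies inventory (NOUN, VERB, ...); the example tokens are assumed.

def noun_verb_ratio(tagged_tokens):
    """Ratio of noun tokens to verb tokens in a POS-tagged text."""
    nouns = sum(1 for _, tag in tagged_tokens if tag == "NOUN")
    verbs = sum(1 for _, tag in tagged_tokens if tag == "VERB")
    return nouns / verbs if verbs else float("inf")

# Hypothetical model-authored vs. human-authored fragments:
model_text = [("analysis", "NOUN"), ("of", "ADP"), ("data", "NOUN"),
              ("shows", "VERB"), ("results", "NOUN")]
human_text = [("we", "PRON"), ("ran", "VERB"), ("tests", "NOUN"),
              ("and", "CCONJ"), ("found", "VERB"), ("errors", "NOUN")]

print(noun_verb_ratio(model_text))  # higher ratio: more nominal style
print(noun_verb_ratio(human_text))  # lower ratio: more verbal style
```

On this toy data the model-like fragment scores higher, matching the direction of the paper's finding, though any real comparison would need register-matched corpora and a trained tagger.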


The next question after Turing's question: Introducing the Grow-AI test

Tugui, Alexandru

arXiv.org Artificial Intelligence

This study aims to extend the framework for assessing artificial intelligence, called GROW-AI (Growth and Realization of Autonomous Wisdom), designed to answer the question "Can machines grow up?" -- a natural successor to the Turing Test. The methodology applied is based on a system of six primary criteria (C1-C6), each assessed through a specific "game", divided into four arenas that explore both the human dimension and its transposition into AI. All decisions and actions of the entity are recorded in a standardized AI Journal, the primary source for calculating composite scores. The assessment uses the prior expert method to establish initial weights, and the global score -- Grow Up Index -- is calculated as the arithmetic mean of the six scores, with interpretation on maturity thresholds. The results show that the methodology allows for a coherent and comparable assessment of the level of "growth" of AI entities, regardless of their type (robots, software agents, LLMs). The multi-game structure highlights strengths and vulnerable areas, and the use of a unified journal guarantees traceability and replicability in the evaluation. The originality of the work lies in the conceptual transposition of the process of "growing" from the human world to that of artificial intelligence, in an integrated testing format that combines perspectives from psychology, robotics, computer science, and ethics. Through this approach, GROW-AI not only measures performance but also captures the evolutionary path of an AI entity towards maturity.
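The aggregation step described above, a Grow Up Index computed as the arithmetic mean of the six criterion scores and read against maturity thresholds, can be sketched as follows. The criterion scores, band names, and threshold values here are assumptions for illustration, not taken from the paper.

```python
# Minimal sketch of the GROW-AI aggregation: six criterion scores (C1-C6)
# averaged into a Grow Up Index, then mapped onto maturity bands.
# Score range [0, 1], band labels, and thresholds are all assumed.

def grow_up_index(scores):
    """Arithmetic mean of the six criterion scores C1-C6."""
    if len(scores) != 6:
        raise ValueError("expected exactly six criterion scores")
    return sum(scores) / 6

def maturity_band(index, thresholds=(0.4, 0.7)):
    """Map the index onto illustrative maturity bands (thresholds assumed)."""
    low, high = thresholds
    if index < low:
        return "immature"
    if index < high:
        return "developing"
    return "mature"

scores = [0.8, 0.6, 0.7, 0.5, 0.9, 0.7]  # one score per game/criterion
idx = grow_up_index(scores)
print(round(idx, 3), maturity_band(idx))
```

Because the index is a plain arithmetic mean, a strong score on one criterion can mask a weak one; the paper's multi-game structure is what surfaces those vulnerable areas alongside the composite.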


Meta's AI Recruiting Campaign Finds a New Target

WIRED

Mark Zuckerberg is on a warpath to recruit top talent in the AI field for his newly formed Meta Superintelligence Labs. After trying to gut OpenAI (and successfully poaching several top researchers), he appears to have set his sights on his next target. More than a dozen people at Mira Murati's 50-person startup, Thinking Machines Lab, have been approached or received offers from the tech giant. One of those offers was more than $1 billion over a multi-year span, a source with knowledge of the negotiations tells WIRED. The rest were between $200 million and $500 million over a four-year span, multiple sources confirm.


Dual Turing Test: A Framework for Detecting and Mitigating Undetectable AI

Messina, Alberto

arXiv.org Artificial Intelligence

In this short note, we propose a unified framework that bridges three areas: (1) a flipped perspective on the Turing Test, the "dual Turing test", in which a human judge's goal is to identify an AI rather than reward a machine for deception; (2) a formal adversarial classification game with explicit quality constraints and worst-case guarantees; and (3) a reinforcement learning (RL) alignment pipeline that uses an undetectability detector and a set of quality-related components in its reward model. We review historical precedents, from inverted and meta-Turing variants to modern supervised reverse-Turing classifiers, and highlight the novelty of combining quality thresholds, phased difficulty levels, and minimax bounds. We then formalize the dual test: we define the judge's task over N independent rounds with fresh prompts drawn from a prompt space Q, introduce a quality function Q and parameters tau and delta, and cast the interaction as a two-player zero-sum game over the adversary's feasible strategy set M. Next, we map this minimax game onto an RLHF-style alignment loop, in which an undetectability detector D provides negative reward for stealthy outputs, balanced by a quality proxy that preserves fluency. Throughout, we include detailed explanations of each component's notation, the meaning of inner minimization over sequences, phased tests, and iterative adversarial training, and conclude with a suggestion for a couple of immediate actions.
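The reward shaping described above, a detector penalty balanced against a quality proxy, can be sketched as a linear combination. The functional form, the lambda weight, and both stand-in scoring functions below are assumptions for illustration; the paper's exact reward model may differ.

```python
# Sketch of the dual-test reward: an undetectability detector D penalizes
# outputs that evade the judge, while a quality proxy Q preserves fluency.
# The linear form Q(x) - lambda * D(x) is an assumed instantiation.

def reward(output, quality_fn, detector_fn, lam=1.0):
    """Reward = quality score minus a weighted penalty for stealthy outputs."""
    q = quality_fn(output)    # quality proxy Q(x), assumed in [0, 1]
    d = detector_fn(output)   # D(x): chance the output evades detection
    return q - lam * d

# Toy stand-ins (assumed, not from the paper):
quality = lambda x: min(len(set(x.split())) / 10.0, 1.0)  # crude fluency proxy
stealth = lambda x: 0.9 if "as an AI" not in x else 0.1   # toy detector

print(reward("the cat sat on the mat", quality, stealth, lam=0.5))
```

Under this shaping, an output that openly reveals its machine origin keeps its quality score, while a stealthy one is pushed toward negative reward, which is the alignment pressure the pipeline relies on.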