Testing GPT-4-o1-preview on math and science problems: A follow-up study
Davis, Ernest
In August 2023, Scott Aaronson and I reported the results of testing GPT-4 with the Wolfram Alpha and Code Interpreter plug-ins on a collection of 105 original high-school-level and college-level science and math problems (Davis and Aaronson, 2023). On September 12, 2024, OpenAI (2024) released two preliminary versions, "ChatGPT-o1-preview" and "ChatGPT-o1-mini," of a forthcoming product, "ChatGPT-o1." In September 2024, I tested the recently released model GPT-4-o1-preview on the same collection. Overall, I found that performance had significantly improved but was still considerably short of perfect; in particular, problems that involve spatial reasoning are often stumbling blocks.
Testing GPT-4 with Wolfram Alpha and Code Interpreter plug-ins on math and science problems
Davis, Ernest, Aaronson, Scott
Our test sets were too small and too haphazard to support statistically valid conclusions, but they were suggestive of a number of conclusions. We summarize these here and discuss them at greater length in section 7. Over the kinds of problems tested, GPT-4 with either plug-in is significantly stronger than GPT-4 by itself, or, almost certainly, than any AI that existed a year ago. However, it is still far from reliable; it often outputs a wrong answer or fails to output any answer. In terms of overall score, we would judge that these systems perform at the level of a middling undergraduate student. However, their capacities and weaknesses do not align with those of a human student; the systems solve some problems that even capable students would find challenging, whereas they fail on some problems that even middling high school students would find easy.
Benchmarks for Automated Commonsense Reasoning: A Survey
Davis, Ernest
More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of artificial intelligence (AI) systems. However, these benchmarks are often flawed, and many aspects of common sense remain untested. Consequently, we do not currently have any reliable way of measuring to what extent existing AI systems have achieved these abilities. This paper surveys the development and uses of AI commonsense benchmarks. We discuss the nature of common sense; the role of common sense in AI; the goals served by constructing commonsense benchmarks; and desirable features of commonsense benchmarks. We analyze the common flaws in benchmarks, and we argue that it is worthwhile to invest the work needed to ensure that benchmark examples are of consistently high quality. We survey the various methods of constructing commonsense benchmarks. We enumerate 139 commonsense benchmarks that have been developed: 102 text-based, 18 image-based, 12 video-based, and 7 simulated physical environments. We discuss the gaps in the existing benchmarks and aspects of commonsense reasoning that are not addressed in any existing benchmark. We conclude with a number of recommendations for future development of commonsense AI benchmarks.
Mathematics, word problems, common sense, and artificial intelligence
Davis, Ernest
The paper discusses the capacities and limitations of current artificial intelligence (AI) technology to solve word problems that combine elementary knowledge with commonsense reasoning. No existing AI systems can solve these reliably. We review three approaches that have been developed, using AI natural language technology: outputting the answer directly, outputting a computer program that solves the problem, and outputting a formalized representation that can be input to an automated theorem verifier. We review some benchmarks that have been developed to evaluate these systems and some experimental studies. We discuss the limitations of the existing technology at solving these kinds of problems. We argue that it is not clear whether these kinds of limitations will be important in developing AI technology for pure mathematical research, but that they will be important in applications of mathematics, and may well be important in developing programs capable of reading and understanding mathematical content written by humans.
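To make the second of these approaches concrete, here is a minimal Python sketch of the "output a computer program" strategy; the word problem and the program below are invented for illustration and are not taken from the paper or from any benchmark.

    # Hypothetical word problem (invented for illustration):
    # "A carton holds 12 eggs. Maria buys 3 cartons and uses 7 eggs.
    #  How many eggs does she have left?"
    # Under the "output a program" approach, the system emits a short
    # program like this one, and the program's output is the answer.

    def eggs_left(cartons: int, eggs_per_carton: int, used: int) -> int:
        """Eggs remaining after using some of the purchased eggs."""
        return cartons * eggs_per_carton - used

    print(eggs_left(3, 12, 7))  # prints 29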
The Defeat of the Winograd Schema Challenge
Kocijan, Vid, Davis, Ernest, Lukasiewicz, Thomas, Marcus, Gary, Morgenstern, Leora
The Winograd Schema Challenge - a set of twin sentences involving pronoun reference disambiguation that seem to require the use of commonsense knowledge - was proposed by Hector Levesque in 2011. By 2019, a number of AI systems, based on large pre-trained transformer-based language models and fine-tuned on these kinds of problems, achieved better than 90% accuracy. In this paper, we review the history of the Winograd Schema Challenge and discuss the lasting contributions of the flurry of research that has taken place on the WSC in the last decade. We discuss the significance of various datasets developed for WSC, and the research community's deeper understanding of the role of surrogate tasks in assessing the intelligence of an AI system.
Physical Reasoning in an Open World
Zeng, Zhuoran, Davis, Ernest
Most work on physical reasoning, both in artificial intelligence and in cognitive science, has focused on closed-world reasoning, in which it is assumed that the problem specification includes all relevant objects and substances, all their relations in an initial situation, and all exogenous events. However, in many situations it is important to do open-world reasoning; that is, to draw valid conclusions from very incomplete information. We have implemented in Prolog an open-world reasoner for a toy microworld of containers that can be loaded, unloaded, sealed, unsealed, carried, and dumped.
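As a rough illustration of the microworld's action vocabulary, here is a small Python sketch; the state representation and preconditions are guesses made for exposition only, and, unlike the paper's Prolog reasoner, this closed-world simulator does not perform open-world inference from incomplete information.

    # Illustrative sketch only; not the authors' Prolog reasoner.
    from dataclasses import dataclass, field

    @dataclass
    class Container:
        name: str
        sealed: bool = False
        contents: set = field(default_factory=set)
        location: str = "start"

    def load(c: Container, obj: str) -> None:
        assert not c.sealed, "cannot load a sealed container"
        c.contents.add(obj)

    def unload(c: Container, obj: str) -> None:
        assert not c.sealed and obj in c.contents
        c.contents.discard(obj)

    def seal(c: Container) -> None:
        c.sealed = True

    def unseal(c: Container) -> None:
        c.sealed = False

    def carry(c: Container, destination: str) -> None:
        c.location = destination  # contents travel with the container

    def dump(c: Container) -> set:
        assert not c.sealed, "cannot dump a sealed container"
        emptied, c.contents = c.contents, set()
        return emptied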
A Flawed Dataset for Symbolic Equation Verification
Davis, Ernest
Arabshahi, Singh, and Anandkumar (2018) propose a method for creating a dataset of symbolic mathematical equations for the tasks of symbolic equation verification and equation completion. Unfortunately, a dataset constructed using the method they propose will suffer from two serious flaws. First, the class of true equations that the procedure can generate will be very limited. Second, because true and false equations are generated in completely different ways, there are likely to be artifactual features that allow easy discrimination. Moreover, over the class of equations they consider, there is an extremely simple probabilistic procedure that solves the problem of equation verification with extremely high reliability. The usefulness of this problem in general as a testbed for AI systems is therefore doubtful.
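One natural reading of the "extremely simple probabilistic procedure" is random numerical evaluation: evaluate both sides of a candidate equation at randomly chosen points and accept the equation only if the two sides agree at every sample. The Python sketch below illustrates that idea; the sampling range and tolerance are arbitrary choices made for illustration, and this should not be taken as the paper's exact procedure.

    import math
    import random

    def probably_equal(lhs, rhs, trials=25, tol=1e-9):
        """Heuristic equation verification: evaluate both sides at random
        points and report True only if they agree (within tolerance)
        wherever both sides are defined."""
        for _ in range(trials):
            x = random.uniform(-10.0, 10.0)
            try:
                left, right = lhs(x), rhs(x)
            except (ValueError, ZeroDivisionError, OverflowError):
                continue  # x lies outside the domain of one side; skip it
            if abs(left - right) > tol * max(1.0, abs(left), abs(right)):
                return False
        return True

    # A true identity and a false equation in one variable:
    print(probably_equal(lambda x: math.sin(2 * x),
                         lambda x: 2 * math.sin(x) * math.cos(x)))  # True
    print(probably_equal(lambda x: math.sin(2 * x),
                         lambda x: 2 * math.sin(x)))  # almost surely False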
Unanswerable Questions about Images and Texts
Davis, Ernest
It will be useful to set up a general, abstract framework in which to discuss these issues. Generally speaking, for AI systems, and for that matter for computer programs of any kind designed for a particular task, the ultimate objective can be formulated as follows. There is a class X of inputs that are "reasonable" problems for the task. There is a class Y of possible outputs. The task defines a relation Q(x, y), meaning "y is a good output [or an acceptable output, or the best possible output] on the task for input x." We assume that for every x ∈ X there is at least one y ∈ Y such that Q(x, y).
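Restated compactly in standard notation (this is only a transcription of the definitions above, not an addition to them):

    % X: the class of "reasonable" inputs for the task
    % Y: the class of possible outputs
    % Q(x, y): "y is a good (or acceptable, or best possible) output for input x"
    \[
       Q \subseteq X \times Y, \qquad \forall x \in X \;\exists y \in Y : Q(x, y).
    \]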
The First Winograd Schema Challenge at IJCAI-16
Davis, Ernest (New York University) | Morgenstern, Leora (Leidos) | Ortiz, Charles L. (Nuance Communications)
The Winograd Schema Challenge is concerned with finding the referents of pronouns, or solving the pronoun disambiguation problem. Doing this correctly appears to rely on having a solid base of commonsense knowledge and the ability to reason intelligently with that knowledge. This can be seen by considering an example Winograd schema: a pair of twin sentences mentioning a backpack and a water bottle, in which the referent of "it" in the first sentence is the backpack and the referent of "it" in the second sentence is the water bottle. Six systems were entered in the challenge at IJCAI-16, exploiting a variety of technologies. None of the systems was able to advance from the first round to the second and final round.