Winograd Schema Challenge


We compared with GPT-2 (345M) on the Winograd Schema Challenge

Neural Information Processing Systems

It would be interesting to see how well the proposed model does under such a zero-shot setup (i.e., without fine-tuning the model on any particular supervised task). We compared with GPT-2 (345M) on the Winograd Schema Challenge; the GPT-2 accuracy is taken from their paper. The BERT paper reports that BooksCorpus and Wikipedia contain 0.8B and 2.5B words, respectively; for our processed data, BooksCorpus and Wikipedia contain 0.75B and 2B words, respectively. The segment embedding is implemented in the same way as the word embedding, i.e., as a lookup table over the segment labels ("Segment 1" and "Segment 2") whose output is added to the model input to indicate which segment each input token belongs to.
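
As a concrete illustration of the segment-embedding description above, here is a minimal PyTorch sketch; the class name, dimensions, and example values are illustrative assumptions, not details from the entry itself.

import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Token + segment embeddings; the segment embedding is a plain
    lookup table over segment ids (0 = "Segment 1", 1 = "Segment 2")."""

    def __init__(self, vocab_size: int, hidden_dim: int, num_segments: int = 2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        # Implemented exactly like the word embedding: an id -> vector lookup.
        self.segment_emb = nn.Embedding(num_segments, hidden_dim)

    def forward(self, token_ids, segment_ids):
        # segment_ids marks which segment each input token belongs to.
        return self.token_emb(token_ids) + self.segment_emb(segment_ids)

# A six-token input whose first three tokens belong to "Segment 1".
emb = InputEmbedding(vocab_size=30000, hidden_dim=768)
tokens = torch.randint(0, 30000, (1, 6))
segments = torch.tensor([[0, 0, 0, 1, 1, 1]])
print(emb(tokens, segments).shape)  # torch.Size([1, 6, 768])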


Concept-Reversed Winograd Schema Challenge: Evaluating and Improving Robust Reasoning in Large Language Models via Abstraction

Han, Kaiqiao, Fang, Tianqing, Wang, Zhaowei, Song, Yangqiu, Steedman, Mark

arXiv.org Artificial Intelligence

While Large Language Models (LLMs) have showcased remarkable proficiency in reasoning, concerns remain about hallucinations and unreliable reasoning arising from semantic associations and superficial logical chains. To evaluate the extent to which LLMs perform robust reasoning rather than relying on superficial logical chains, we propose a new evaluation dataset, the Concept-Reversed Winograd Schema Challenge (CR-WSC), based on the well-known Winograd Schema Challenge (WSC) dataset. By simply reversing the concepts to ones more associated with the wrong answer, we find that the performance of LLMs drops significantly even though the rationale of the reasoning remains the same. Furthermore, we propose Abstraction-of-Thought (AoT), a novel prompting method that uses conceptual abstraction to recover adversarial cases to normal cases, improving LLMs' robustness and consistency in reasoning, as demonstrated by experiments on CR-WSC.
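
To make the concept-reversal idea concrete, here is a small hedged sketch; the example items and the AoT-style prompt wording are invented for illustration and are not taken from the paper.

# The reversed variant swaps in a concept ("truck") that is lexically
# associated with the wrong answer for "big", while the underlying
# rationale (the contained object is too big) stays the same.
original = {
    "sentence": "The trophy didn't fit in the suitcase because it was too big.",
    "pronoun": "it",
    "candidates": ["the trophy", "the suitcase"],
    "answer": "the trophy",
}
reversed_variant = {
    "sentence": "The marble didn't fit in the toy truck because it was too big.",
    "pronoun": "it",
    "candidates": ["the marble", "the toy truck"],
    "answer": "the marble",  # association favors "truck", logic favors "marble"
}

def abstraction_of_thought_prompt(item: dict) -> str:
    """An assumed AoT-style prompt: abstract the concrete nouns to roles,
    resolve the pronoun at the abstract level, then map back."""
    return (
        f"Sentence: {item['sentence']}\n"
        "Step 1: Replace the candidate nouns with abstract roles, e.g. "
        "'the contained object' and 'the container'.\n"
        f"Step 2: Decide which role '{item['pronoun']}' refers to using only "
        "the stated rationale.\n"
        "Step 3: Map the chosen role back to the concrete noun."
    )

print(abstraction_of_thought_prompt(reversed_variant))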


Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities

Shmidman, Shaltiel, Shmidman, Avi, Cohen, Amir DN, Koppel, Moshe

arXiv.org Artificial Intelligence

Training large language models (LLMs) in low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce DictaLM2.0 and DictaLM2.0-Instruct, two LLMs derived from the Mistral model, trained on a substantial corpus of approximately 200 billion tokens in both Hebrew and English. Adapting a pre-trained model to a new language involves specialized techniques that differ significantly from training a model from scratch or further training existing models on well-resourced languages such as English. We outline these novel training methodologies, which facilitate effective learning and adaptation to the linguistic properties of Hebrew. Additionally, we fine-tuned DictaLM2.0-Instruct on a comprehensive instruct dataset to enhance its performance on task-specific instructions. To rigorously evaluate our models, we introduce a new benchmark suite for Hebrew LLM evaluation, covering a diverse set of tasks including Question Answering, Sentiment Analysis, Winograd Schema Challenge, Translation, and Summarization. Our work not only addresses the intricacies of training LLMs in low-resource languages but also proposes a framework that can be leveraged for adapting other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.
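
The abstract does not spell out the adaptation recipe, but the generic "extend the vocabulary, then continue pretraining" pattern it alludes to can be sketched as follows; the model id and token list are placeholders, not DictaLM 2.0's actual configuration.

# Hedged sketch using the Hugging Face transformers API.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-v0.1"  # DictaLM 2.0 is derived from Mistral
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Add Hebrew tokens so common words stop fragmenting into many subword pieces.
new_tokens = ["שלום", "ירושלים"]  # placeholder examples
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix to cover the enlarged vocabulary; the new rows are
# freshly initialized and then learned during continued pretraining.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")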


Picturing Ambiguity: A Visual Twist on the Winograd Schema Challenge

Park, Brendan, Janecek, Madeline, Ezzati-Jivan, Naser, Li, Yifeng, Emami, Ali

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable success in tasks like the Winograd Schema Challenge (WSC), showcasing advanced textual common-sense reasoning. However, applying this reasoning to multimodal domains, where understanding text and images together is essential, remains a substantial challenge. To address this, we introduce WinoVis, a novel dataset specifically designed to probe text-to-image models on pronoun disambiguation within multimodal contexts. Utilizing GPT-4 for prompt generation and Diffusion Attentive Attribution Maps (DAAM) for heatmap analysis, we propose a novel evaluation framework that isolates the models' pronoun disambiguation ability from other visual processing challenges. Evaluation of successive model versions reveals that, despite incremental advancements, Stable Diffusion 2.0 achieves a precision of 56.7% on WinoVis, only marginally surpassing random guessing. Further error analysis identifies important areas for future research aimed at advancing text-to-image models in their ability to interpret and interact with the complex visual world.
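
A simplified sketch of the kind of heatmap-based decision rule such a framework needs is shown below; the rule (compare the attribution mass of the two candidate nouns inside the pronoun's image region) is an illustration, not WinoVis's exact metric.

import numpy as np

def predict_referent(heatmaps: dict, pronoun_mask: np.ndarray) -> str:
    """heatmaps: candidate noun -> DAAM-style attribution map (H x W).
    pronoun_mask: boolean mask of the image region depicting the pronoun."""
    scores = {noun: float(hm[pronoun_mask].mean()) for noun, hm in heatmaps.items()}
    return max(scores, key=scores.get)

# Toy example with two random 4x4 attribution maps.
rng = np.random.default_rng(0)
maps = {"trophy": rng.random((4, 4)), "suitcase": rng.random((4, 4))}
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(predict_referent(maps, mask))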


Thai Winograd Schemas: A Benchmark for Thai Commonsense Reasoning

Artkaew, Phakphum

arXiv.org Artificial Intelligence

Commonsense reasoning is one of the important aspects of natural language understanding, with several benchmarks developed to evaluate it. However, only a few of these benchmarks are available in languages other than English. Developing parallel benchmarks facilitates cross-lingual evaluation, enabling a better understanding of different languages. This research introduces a collection of Winograd Schemas in Thai, a novel dataset designed to evaluate commonsense reasoning capabilities in the context of the Thai language. Through a methodology involving native speakers, professional translators, and thorough validation, the schemas aim to closely reflect Thai language nuances, idioms, and cultural references while maintaining ambiguity and commonsense challenges. We evaluate the performance of popular large language models on this benchmark, revealing their strengths and limitations and providing insights into the current state of the art. Results indicate that while models like GPT-4 and Claude-3-Opus achieve high accuracy in English, their performance drops significantly in Thai, highlighting the need for further advancements in multilingual commonsense reasoning.
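
The evaluation protocol for such a benchmark reduces to pairwise-choice accuracy; a minimal sketch follows, where query_model is a hypothetical stand-in for a call to GPT-4, Claude-3-Opus, or any other model under test.

def evaluate(schemas: list, query_model) -> float:
    """Fraction of schema items whose pronoun the model resolves correctly."""
    correct = 0
    for item in schemas:
        prompt = (
            f"{item['sentence']}\n"
            f"Does '{item['pronoun']}' refer to {item['candidates'][0]} or "
            f"{item['candidates'][1]}? Answer with exactly one of the two."
        )
        if query_model(prompt).strip().lower() == item["answer"].lower():
            correct += 1
    return correct / len(schemas)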


EvoGrad: A Dynamic Take on the Winograd Schema Challenge with Human Adversaries

Sun, Jing Han, Emami, Ali

arXiv.org Artificial Intelligence

While Large Language Models (LLMs) excel at the Winograd Schema Challenge (WSC), a coreference resolution task testing common-sense reasoning through pronoun disambiguation, they struggle with instances that feature minor alterations or rewording. To address this, we introduce EvoGrad, an open-source platform that harnesses a human-in-the-loop approach to create a dynamic dataset tailored to such altered WSC instances. Leveraging ChatGPT's capabilities, we expand our task instances from 182 to 3,691, setting a new benchmark for diverse common-sense reasoning datasets. Additionally, we introduce the error depth metric, assessing model stability in dynamic tasks. Our results emphasize the challenge posed by EvoGrad: even the best-performing LLM, GPT-3.5, achieves an accuracy of 65.0% with an average error depth of 7.2, a stark contrast to human performance of 92.8% accuracy without perturbation errors. This highlights ongoing model limitations and the value of dynamic datasets in uncovering them.
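
The abstract does not define error depth, so the sketch below encodes one plausible reading: the number of successive perturbations of an instance a model answers correctly before its first error (so deeper means more stable).

def error_depth(answers_correct: list) -> int:
    """answers_correct[k] is True if the model answered the k-th successive
    perturbation correctly (index 0 = the original instance)."""
    for depth, correct in enumerate(answers_correct):
        if not correct:
            return depth
    return len(answers_correct)  # never erred within this perturbation chain

# A model that survives 7 perturbations and fails on the 8th has depth 7.
print(error_depth([True] * 7 + [False]))  # 7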


A Human-Machine Collaboration Framework for the Development of Schemas

Isaak, Nicos

arXiv.org Artificial Intelligence

The Winograd Schema Challenge (WSC), a seemingly well-thought-out test for machine intelligence, has been proposed to shed light on developing systems that exhibit human behavior. Since its introduction, it has aimed to pivot the focus of the AI community from the technology to the science of AI. While common and trivial for humans, studies show that it is still challenging for machines, especially when they have to deal with novel schemas, that is, well-designed sentences that require resolving definite pronouns. As researchers have become increasingly interested in the challenge itself, an extensive collection of Winograd schemas is presumably needed, one that goes beyond what human experts can reasonably develop themselves, especially after ways of utilizing schemas as novel forms of CAPTCHAs were proposed. To address this necessity, we propose a novel framework that explicitly focuses on how humans and machines can collaborate as teammates to design novel schemas from scratch. This is accomplished by combining two recent studies from the literature: i) Winventor, a machine-driven approach for the development of large numbers of Winograd schemas, albeit not of high quality, and ii) WinoFlexi, an online crowdsourcing system that allows crowd workers to develop a limited number of schemas, often of quality similar to that of experts. Our proposal crafts a new road map toward developing a novel collaborative platform that amplifies human and machine intelligence by combining their complementary strengths.


Generalised Winograd Schema and its Contextuality

Lo, Kin Ian, Sadrzadeh, Mehrnoosh, Mansfield, Shane

arXiv.org Artificial Intelligence

Ambiguities in natural language give rise to probability distributions over interpretations. The distributions are often over multiple ambiguous words at a time; a multiplicity which makes them a suitable topic for sheaf-theoretic models of quantum contextuality. Previous research showed that different quantitative measures of contextuality correlate well with psycholinguistic research on lexical ambiguities. In this work, we focus on coreference ambiguities and investigate the Winograd Schema Challenge (WSC), a test proposed by Levesque in 2011 to evaluate the intelligence of machines. The WSC consists of a collection of multiple-choice questions that require disambiguating pronouns in sentences structured according to the Winograd schema, in a way that makes it difficult for machines to determine the correct referents but remains intuitive for human comprehension. In this study, we propose an approach that analogously models the Winograd schema as an experiment in quantum physics. However, we argue that the original Winograd Schema is inherently too simplistic to facilitate contextuality. We introduce a novel mechanism for generalising the schema, rendering it analogous to a Bell-CHSH measurement scenario. We report an instance of this generalised schema, complemented by the human judgements we gathered via a crowdsourcing platform. The resulting model violates the Bell-CHSH inequality by 0.192, thus exhibiting contextuality in a coreference resolution setting.
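
For readers unfamiliar with the quantity being violated, here is a short sketch of the Bell-CHSH computation; the probabilities are fabricated solely to reproduce a violation of 0.192 and are not the paper's crowdsourced data.

def correlator(p_same: float) -> float:
    """E = P(same referent choice) - P(different) for one measurement pair."""
    return 2.0 * p_same - 1.0

def chsh(e00: float, e01: float, e10: float, e11: float) -> float:
    """Bell-CHSH quantity; classical (non-contextual) models obey |S| <= 2."""
    return abs(e00 + e01 + e10 - e11)

S = chsh(correlator(0.899), correlator(0.899), correlator(0.899), correlator(0.601))
print(f"S = {S:.3f}, violation = {S - 2:.3f}")  # S = 2.192, violation = 0.192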


The Defeat of the Winograd Schema Challenge

Kocijan, Vid, Davis, Ernest, Lukasiewicz, Thomas, Marcus, Gary, Morgenstern, Leora

arXiv.org Artificial Intelligence

The Winograd Schema Challenge - a set of twin sentences involving pronoun reference disambiguation that seem to require the use of commonsense knowledge - was proposed by Hector Levesque in 2011. By 2019, a number of AI systems, based on large pre-trained transformer-based language models and fine-tuned on these kinds of problems, achieved better than 90% accuracy. In this paper, we review the history of the Winograd Schema Challenge and discuss the lasting contributions of the flurry of research that has taken place on the WSC in the last decade. We discuss the significance of various datasets developed for WSC, and the research community's deeper understanding of the role of surrogate tasks in assessing the intelligence of an AI system.