Winograd Schema Challenge


We compared with GPT-2 (345M) on the Winograd Schema Challenge

Neural Information Processing Systems

It would be interesting to see how well the proposed model does under such a zero-shot setup (i.e., without fine-tuning the model on any particular supervised task). We compared with GPT-2 (345M) on the Winograd Schema Challenge; the GPT-2 accuracy is taken from their paper. The BERT paper reports that BooksCorpus and Wikipedia contain 0.8B and 2.5B words, respectively; for our processed data, BooksCorpus and Wikipedia contain 0.75B and 2B words, respectively. The segment embedding is implemented in the same way as the word embedding, i.e., as a lookup table over the segment labels ("Segment 1" and "Segment 2") whose output is added to the model input to indicate which segment each input token belongs to.
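
As a concrete illustration of the segment-embedding description above, here is a minimal PyTorch sketch; the class name, dimensions, and example values are illustrative assumptions, not details from the entry itself.

import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Token + segment embeddings; the segment embedding is a plain
    lookup table over segment ids (0 = "Segment 1", 1 = "Segment 2")."""

    def __init__(self, vocab_size: int, hidden_dim: int, num_segments: int = 2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        # Implemented exactly like the word embedding: an id -> vector lookup.
        self.segment_emb = nn.Embedding(num_segments, hidden_dim)

    def forward(self, token_ids, segment_ids):
        # segment_ids marks which segment each input token belongs to.
        return self.token_emb(token_ids) + self.segment_emb(segment_ids)

# A six-token input whose first three tokens belong to "Segment 1".
emb = InputEmbedding(vocab_size=30000, hidden_dim=768)
tokens = torch.randint(0, 30000, (1, 6))
segments = torch.tensor([[0, 0, 0, 1, 1, 1]])
print(emb(tokens, segments).shape)  # torch.Size([1, 6, 768])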


Concept-Reversed Winograd Schema Challenge: Evaluating and Improving Robust Reasoning in Large Language Models via Abstraction

Han, Kaiqiao, Fang, Tianqing, Wang, Zhaowei, Song, Yangqiu, Steedman, Mark

arXiv.org Artificial Intelligence

While Large Language Models (LLMs) have showcased remarkable proficiency in reasoning, concerns remain about hallucinations and unreliable reasoning arising from semantic associations and superficial logical chains. To evaluate the extent to which LLMs perform robust reasoning rather than relying on superficial logical chains, we propose a new evaluation dataset, the Concept-Reversed Winograd Schema Challenge (CR-WSC), based on the well-known Winograd Schema Challenge (WSC) dataset. By simply reversing the concepts to ones more associated with the wrong answer, we find that the performance of LLMs drops significantly even though the rationale of the reasoning remains the same. Furthermore, we propose Abstraction-of-Thought (AoT), a novel prompting method that uses conceptual abstraction to recover adversarial cases to normal cases, improving LLMs' robustness and consistency in reasoning, as demonstrated by experiments on CR-WSC.
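
To make the concept-reversal idea concrete, here is a small hedged sketch; the example items and the AoT-style prompt wording are invented for illustration and are not taken from the paper.

# The reversed variant swaps in a concept ("truck") that is lexically
# associated with the wrong answer for "big", while the underlying
# rationale (the contained object is too big) stays the same.
original = {
    "sentence": "The trophy didn't fit in the suitcase because it was too big.",
    "pronoun": "it",
    "candidates": ["the trophy", "the suitcase"],
    "answer": "the trophy",
}
reversed_variant = {
    "sentence": "The marble didn't fit in the toy truck because it was too big.",
    "pronoun": "it",
    "candidates": ["the marble", "the toy truck"],
    "answer": "the marble",  # association favors "truck", logic favors "marble"
}

def abstraction_of_thought_prompt(item: dict) -> str:
    """An assumed AoT-style prompt: abstract the concrete nouns to roles,
    resolve the pronoun at the abstract level, then map back."""
    return (
        f"Sentence: {item['sentence']}\n"
        "Step 1: Replace the candidate nouns with abstract roles, e.g. "
        "'the contained object' and 'the container'.\n"
        f"Step 2: Decide which role '{item['pronoun']}' refers to using only "
        "the stated rationale.\n"
        "Step 3: Map the chosen role back to the concrete noun."
    )

print(abstraction_of_thought_prompt(reversed_variant))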


Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities

Shmidman, Shaltiel, Shmidman, Avi, Cohen, Amir DN, Koppel, Moshe

arXiv.org Artificial Intelligence

Training large language models (LLMs) in low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce DictaLM2.0 and DictaLM2.0-Instruct, two LLMs derived from the Mistral model, trained on a substantial corpus of approximately 200 billion tokens in both Hebrew and English. Adapting a pre-trained model to a new language involves specialized techniques that differ significantly from training a model from scratch or further training existing models on well-resourced languages such as English. We outline these novel training methodologies, which facilitate effective learning and adaptation to the linguistic properties of Hebrew. Additionally, we fine-tuned DictaLM2.0-Instruct on a comprehensive instruct dataset to enhance its performance on task-specific instructions. To rigorously evaluate our models, we introduce a new benchmark suite for Hebrew LLM evaluation, covering a diverse set of tasks including Question Answering, Sentiment Analysis, Winograd Schema Challenge, Translation, and Summarization. Our work not only addresses the intricacies of training LLMs in low-resource languages but also proposes a framework that can be leveraged for adapting other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.
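
The abstract does not spell out the adaptation recipe, but the generic "extend the vocabulary, then continue pretraining" pattern it alludes to can be sketched as follows; the model id and token list are placeholders, not DictaLM 2.0's actual configuration.

# Hedged sketch using the Hugging Face transformers API.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-v0.1"  # DictaLM 2.0 is derived from Mistral
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Add Hebrew tokens so common words stop fragmenting into many subword pieces.
new_tokens = ["שלום", "ירושלים"]  # placeholder examples
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix to cover the enlarged vocabulary; the new rows are
# freshly initialized and then learned during continued pretraining.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")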


Picturing Ambiguity: A Visual Twist on the Winograd Schema Challenge

Park, Brendan, Janecek, Madeline, Ezzati-Jivan, Naser, Li, Yifeng, Emami, Ali

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable success in tasks like the Winograd Schema Challenge (WSC), showcasing advanced textual common-sense reasoning. However, applying this reasoning to multimodal domains, where understanding text and images together is essential, remains a substantial challenge. To address this, we introduce WinoVis, a novel dataset specifically designed to probe text-to-image models on pronoun disambiguation within multimodal contexts. Utilizing GPT-4 for prompt generation and Diffusion Attentive Attribution Maps (DAAM) for heatmap analysis, we propose a novel evaluation framework that isolates the models' pronoun disambiguation ability from other visual processing challenges. Evaluation of successive model versions reveals that, despite incremental advancements, Stable Diffusion 2.0 achieves a precision of 56.7% on WinoVis, only marginally surpassing random guessing. Further error analysis identifies important areas for future research aimed at advancing text-to-image models in their ability to interpret and interact with the complex visual world.
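
A simplified sketch of the kind of heatmap-based decision rule such a framework needs is shown below; the rule (compare the attribution mass of the two candidate nouns inside the pronoun's image region) is an illustration, not WinoVis's exact metric.

import numpy as np

def predict_referent(heatmaps: dict, pronoun_mask: np.ndarray) -> str:
    """heatmaps: candidate noun -> DAAM-style attribution map (H x W).
    pronoun_mask: boolean mask of the image region depicting the pronoun."""
    scores = {noun: float(hm[pronoun_mask].mean()) for noun, hm in heatmaps.items()}
    return max(scores, key=scores.get)

# Toy example with two random 4x4 attribution maps.
rng = np.random.default_rng(0)
maps = {"trophy": rng.random((4, 4)), "suitcase": rng.random((4, 4))}
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(predict_referent(maps, mask))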


Thai Winograd Schemas: A Benchmark for Thai Commonsense Reasoning

Artkaew, Phakphum

arXiv.org Artificial Intelligence

Commonsense reasoning is one of the important aspects of natural language understanding, with several benchmarks developed to evaluate it. However, only a few of these benchmarks are available in languages other than English. Developing parallel benchmarks facilitates cross-lingual evaluation, enabling a better understanding of different languages. This research introduces a collection of Winograd Schemas in Thai, a novel dataset designed to evaluate commonsense reasoning capabilities in the context of the Thai language. Through a methodology involving native speakers, professional translators, and thorough validation, the schemas aim to closely reflect Thai language nuances, idioms, and cultural references while maintaining ambiguity and commonsense challenges. We evaluate the performance of popular large language models on this benchmark, revealing their strengths and limitations and providing insights into the current state of the art. Results indicate that while models like GPT-4 and Claude-3-Opus achieve high accuracy in English, their performance drops significantly in Thai, highlighting the need for further advancements in multilingual commonsense reasoning.
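
The evaluation protocol for such a benchmark reduces to pairwise-choice accuracy; a minimal sketch follows, where query_model is a hypothetical stand-in for a call to GPT-4, Claude-3-Opus, or any other model under test.

def evaluate(schemas: list, query_model) -> float:
    """Fraction of schema items whose pronoun the model resolves correctly."""
    correct = 0
    for item in schemas:
        prompt = (
            f"{item['sentence']}\n"
            f"Does '{item['pronoun']}' refer to {item['candidates'][0]} or "
            f"{item['candidates'][1]}? Answer with exactly one of the two."
        )
        if query_model(prompt).strip().lower() == item["answer"].lower():
            correct += 1
    return correct / len(schemas)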


EvoGrad: A Dynamic Take on the Winograd Schema Challenge with Human Adversaries

Sun, Jing Han, Emami, Ali

arXiv.org Artificial Intelligence

While Large Language Models (LLMs) excel at the Winograd Schema Challenge (WSC), a coreference resolution task testing common-sense reasoning through pronoun disambiguation, they struggle with instances that feature minor alterations or rewording. To address this, we introduce EvoGrad, an open-source platform that harnesses a human-in-the-loop approach to create a dynamic dataset tailored to such altered WSC instances. Leveraging ChatGPT's capabilities, we expand our task instances from 182 to 3,691, setting a new benchmark for diverse common-sense reasoning datasets. Additionally, we introduce the error depth metric, assessing model stability in dynamic tasks. Our results emphasize the challenge posed by EvoGrad: even the best-performing LLM, GPT-3.5, achieves an accuracy of 65.0% with an average error depth of 7.2, a stark contrast to human performance of 92.8% accuracy without perturbation errors. This highlights ongoing model limitations and the value of dynamic datasets in uncovering them.
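
The abstract does not define error depth, so the sketch below encodes one plausible reading: the number of successive perturbations of an instance a model answers correctly before its first error (so deeper means more stable).

def error_depth(answers_correct: list) -> int:
    """answers_correct[k] is True if the model answered the k-th successive
    perturbation correctly (index 0 = the original instance)."""
    for depth, correct in enumerate(answers_correct):
        if not correct:
            return depth
    return len(answers_correct)  # never erred within this perturbation chain

# A model that survives 7 perturbations and fails on the 8th has depth 7.
print(error_depth([True] * 7 + [False]))  # 7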


A Human-Machine Collaboration Framework for the Development of Schemas

Isaak, Nicos

arXiv.org Artificial Intelligence

The Winograd Schema Challenge (WSC), a seemingly well-thought-out test for machine intelligence, has been proposed to shed light on developing systems that exhibit human behavior. Since its introduction, it has aimed to pivot the focus of the AI community from the technology to the science of AI. While common and trivial for humans, studies show that it is still challenging for machines, especially when they have to deal with novel schemas, that is, well-designed sentences that require resolving definite pronouns. As researchers have become increasingly interested in the challenge itself, an extensive collection of Winograd schemas is presumably needed, one that goes beyond what human experts can reasonably develop themselves, especially after ways of utilizing schemas as novel forms of CAPTCHAs were proposed. To address this necessity, we propose a novel framework that explicitly focuses on how humans and machines can collaborate as teammates to design novel schemas from scratch. This is accomplished by combining two recent studies from the literature: i) Winventor, a machine-driven approach for the development of large numbers of Winograd schemas, albeit not of high quality, and ii) WinoFlexi, an online crowdsourcing system that allows crowd workers to develop a limited number of schemas, often of quality similar to that of experts. Our proposal crafts a new road map toward developing a novel collaborative platform that amplifies human and machine intelligence by combining their complementary strengths.


Generalised Winograd Schema and its Contextuality

Lo, Kin Ian, Sadrzadeh, Mehrnoosh, Mansfield, Shane

arXiv.org Artificial Intelligence

Ambiguities in natural language give rise to probability distributions over interpretations. The distributions are often over multiple ambiguous words at a time; a multiplicity which makes them a suitable topic for sheaf-theoretic models of quantum contextuality. Previous research showed that different quantitative measures of contextuality correlate well with psycholinguistic research on lexical ambiguities. In this work, we focus on coreference ambiguities and investigate the Winograd Schema Challenge (WSC), a test proposed by Levesque in 2011 to evaluate the intelligence of machines. The WSC consists of a collection of multiple-choice questions that require disambiguating pronouns in sentences structured according to the Winograd schema, in a way that makes it difficult for machines to determine the correct referents but remains intuitive for human comprehension. In this study, we propose an approach that analogously models the Winograd schema as an experiment in quantum physics. However, we argue that the original Winograd Schema is inherently too simplistic to facilitate contextuality. We introduce a novel mechanism for generalising the schema, rendering it analogous to a Bell-CHSH measurement scenario. We report an instance of this generalised schema, complemented by the human judgements we gathered via a crowdsourcing platform. The resulting model violates the Bell-CHSH inequality by 0.192, thus exhibiting contextuality in a coreference resolution setting.
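
For readers unfamiliar with the quantity being violated, here is a short sketch of the Bell-CHSH computation; the probabilities are fabricated solely to reproduce a violation of 0.192 and are not the paper's crowdsourced data.

def correlator(p_same: float) -> float:
    """E = P(same referent choice) - P(different) for one measurement pair."""
    return 2.0 * p_same - 1.0

def chsh(e00: float, e01: float, e10: float, e11: float) -> float:
    """Bell-CHSH quantity; classical (non-contextual) models obey |S| <= 2."""
    return abs(e00 + e01 + e10 - e11)

S = chsh(correlator(0.899), correlator(0.899), correlator(0.899), correlator(0.601))
print(f"S = {S:.3f}, violation = {S - 2:.3f}")  # S = 2.192, violation = 0.192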


The Defeat of the Winograd Schema Challenge

Kocijan, Vid, Davis, Ernest, Lukasiewicz, Thomas, Marcus, Gary, Morgenstern, Leora

arXiv.org Artificial Intelligence

The Winograd Schema Challenge - a set of twin sentences involving pronoun reference disambiguation that seem to require the use of commonsense knowledge - was proposed by Hector Levesque in 2011. By 2019, a number of AI systems, based on large pre-trained transformer-based language models and fine-tuned on these kinds of problems, achieved better than 90% accuracy. In this paper, we review the history of the Winograd Schema Challenge and discuss the lasting contributions of the flurry of research that has taken place on the WSC in the last decade. We discuss the significance of various datasets developed for WSC, and the research community's deeper understanding of the role of surrogate tasks in assessing the intelligence of an AI system.