Large Language Model
Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models
Mikula, Lukáš, Štefánik, Michal, Petrovič, Marek, Sojka, Petr
While the Large Language Models (LLMs) dominate a majority of language understanding tasks, previous work shows that some of these results are supported by modelling spurious correlations of training datasets. Authors commonly assess model robustness by evaluating their models on out-of-distribution (OOD) datasets of the same task, but these datasets might share the bias of the training dataset. We propose a simple method for measuring a scale of models' reliance on any identified spurious feature and assess the robustness towards a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA). We find that the reported OOD gains of debiasing methods can not be explained by mitigated reliance on biased features, suggesting that biases are shared among QA datasets. We further evidence this by measuring that performance of OOD models depends on bias features comparably to the ID model, motivating future work to refine the reports of LLMs' robustness to a level of known spurious features.
SemEval-2023 Task 7: Multi-Evidence Natural Language Inference for Clinical Trial Data
Jullien, Maël, Valentino, Marco, Frost, Hannah, O'Regan, Paul, Landers, Donal, Freitas, André
This paper describes the results of SemEval 2023 task 7 -- Multi-Evidence Natural Language Inference for Clinical Trial Data (NLI4CT) -- consisting of 2 tasks, a Natural Language Inference (NLI) task, and an evidence selection task on clinical trial data. The proposed challenges require multi-hop biomedical and numerical reasoning, which are of significant importance to the development of systems capable of large-scale interpretation and retrieval of medical evidence, to provide personalized evidence-based care. Task 1, the entailment task, received 643 submissions from 40 participants, and Task 2, the evidence selection task, received 364 submissions from 23 participants. The tasks are challenging, with the majority of submitted systems failing to significantly outperform the majority class baseline on the entailment task, and we observe significantly better performance on the evidence selection task than on the entailment task. Increasing the number of model parameters leads to a direct increase in performance, far more significant than the effect of biomedical pre-training. Future works could explore the limitations of large models for generalization and numerical inference, and investigate methods to augment clinical datasets to allow for more rigorous testing and to facilitate fine-tuning. We envisage that the dataset, models, and results of this task will be useful to the biomedical NLI and evidence retrieval communities. The dataset, competition leaderboard, and website are publicly available.
FolkScope: Intention Knowledge Graph Construction for E-commerce Commonsense Discovery
Yu, Changlong, Wang, Weiqi, Liu, Xin, Bai, Jiaxin, Song, Yangqiu, Li, Zheng, Gao, Yifan, Cao, Tianyu, Yin, Bing
Understanding users' intentions in e-commerce platforms requires commonsense knowledge. In this paper, we present FolkScope, an intention knowledge graph construction framework to reveal the structure of humans' minds about purchasing items. As commonsense knowledge is usually ineffable and not expressed explicitly, it is challenging to perform information extraction. Thus, we propose a new approach that leverages the generation power of large language models~(LLMs) and human-in-the-loop annotation to semi-automatically construct the knowledge graph. LLMs first generate intention assertions via e-commerce-specific prompts to explain shopping behaviors, where the intention can be an open reason or a predicate falling into one of 18 categories aligning with ConceptNet, e.g., IsA, MadeOf, UsedFor, etc. Then we annotate plausibility and typicality labels of sampled intentions as training data in order to populate human judgments to all automatic generations. Last, to structurize the assertions, we propose pattern mining and conceptualization to form more condensed and abstract knowledge. Extensive evaluations and studies demonstrate that our constructed knowledge graph can well model e-commerce knowledge and have many potential applications.
Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt Augmented by ChatGPT
In this paper, we aim to develop a large language model (LLM) with the reasoning ability on complex graph data. Currently, LLMs have achieved very impressive performance on various natural language learning tasks, extensions of which have also been applied to study the vision tasks with multi-modal data. However, when it comes to the graph learning tasks, existing LLMs present very serious flaws due to their several inherited weaknesses in performing {multi-step logic reasoning}, {precise mathematical calculation} and {perception about the spatial and temporal factors}. To address such challenges, in this paper, we will investigate the principles, methodologies and algorithms to empower existing LLMs with graph reasoning ability, which will have tremendous impacts on the current research of both LLMs and graph learning. Inspired by the latest ChatGPT and Toolformer models, we propose the Graph-ToolFormer (Graph Reasoning oriented Toolformer) framework to teach LLMs themselves with prompts augmented by ChatGPT to use external graph reasoning API tools. Specifically, we will investigate to teach Graph-ToolFormer to handle various graph data reasoning tasks in this paper, including both (1) very basic graph data loading and graph property reasoning tasks, ranging from simple graph order and size to the graph diameter and periphery, and (2) more advanced reasoning tasks on real-world graph data, such as bibliographic networks, protein molecules, sequential recommender systems, social networks and knowledge graphs.
Summarizing, Simplifying, and Synthesizing Medical Evidence Using GPT-3 (with Varying Success)
Shaib, Chantal, Li, Millicent L., Joseph, Sebastian, Marshall, Iain J., Li, Junyi Jessy, Wallace, Byron C.
Large language models, particularly GPT-3, are able to produce high quality summaries of general domain news articles in few- and zero-shot settings. However, it is unclear if such models are similarly capable in more specialized, high-stakes domains such as biomedicine. In this paper, we enlist domain experts (individuals with medical training) to evaluate summaries of biomedical articles generated by GPT-3, given zero supervision. We consider both single- and multi-document settings. In the former, GPT-3 is tasked with generating regular and plain-language summaries of articles describing randomized controlled trials; in the latter, we assess the degree to which GPT-3 is able to \emph{synthesize} evidence reported across a collection of articles. We design an annotation scheme for evaluating model outputs, with an emphasis on assessing the factual accuracy of generated summaries. We find that while GPT-3 is able to summarize and simplify single biomedical articles faithfully, it struggles to provide accurate aggregations of findings over multiple documents. We release all data and annotations used in this work.
The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain
Moskvichev, Arseny, Odouard, Victor Vikram, Mitchell, Melanie
The abilities to form and abstract concepts is key to human intelligence, but such abilities remain lacking in state-of-the-art AI systems. There has been substantial research on conceptual abstraction in AI, particularly using idealized domains such as Raven's Progressive Matrices and Bongard problems, but even when AI systems succeed on such problems, the systems are rarely evaluated in depth to see if they have actually grasped the concepts they are meant to capture. In this paper we describe an in-depth evaluation benchmark for the Abstraction and Reasoning Corpus (ARC), a collection of few-shot abstraction and analogy problems developed by Chollet [2019]. In particular, we describe ConceptARC, a new, publicly available benchmark in the ARC domain that systematically assesses abstraction and generalization abilities on a number of basic spatial and semantic concepts. ConceptARC differs from the original ARC dataset in that it is specifically organized around "concept groups" -- sets of problems that focus on specific concepts and that are vary in complexity and level of abstraction. We report results on testing humans on this benchmark as well as three machine solvers: the top two programs from a 2021 ARC competition and OpenAI's GPT-4. Our results show that humans substantially outperform the machine solvers on this benchmark, showing abilities to abstract and generalize concepts that are not yet captured by AI systems. We believe that this benchmark will spur improvements in the development of AI systems for conceptual abstraction and in the effective evaluation of such systems.
Musketeer (All for One, and One for All): A Generalist Vision-Language Model with Task Explanation Prompts
Zhang, Zhaoyang, Shen, Yantao, Shi, Kunyu, Cai, Zhaowei, Fang, Jun, Deng, Siqi, Yang, Hao, Modolo, Davide, Tu, Zhuowen, Soatto, Stefano
We present a sequence-to-sequence vision-language model whose parameters are jointly trained on all tasks (all for one) and fully shared among multiple tasks (one for all), resulting in a single model which we named Musketeer. The integration of knowledge across heterogeneous tasks is enabled by a novel feature called Task Explanation Prompt (TEP). TEP reduces interference among tasks, allowing the model to focus on their shared structure. With a single model, Musketeer achieves results comparable to or better than strong baselines trained on single tasks, almost uniformly across multiple tasks.
Exploring Zero and Few-shot Techniques for Intent Classification
Parikh, Soham, Vohra, Quaizar, Tumbade, Prashil, Tiwari, Mitul
Conversational NLU providers often need to scale to thousands of intent-classification models where new customers often face the cold-start problem. Scaling to so many customers puts a constraint on storage space as well. In this paper, we explore four different zero and few-shot intent classification approaches with this low-resource constraint: 1) domain adaptation, 2) data augmentation, 3) zero-shot intent classification using descriptions large language models (LLMs), and 4) parameter-efficient fine-tuning of instruction-finetuned language models. Our results show that all these approaches are effective to different degrees in low-resource settings. Parameter-efficient fine-tuning using T-few recipe (Liu et al., 2022) on Flan-T5 (Chang et al., 2022) yields the best performance even with just one sample per intent. We also show that the zero-shot method of prompting LLMs using intent descriptions
Google unveils its multilingual, code-generating PaLM 2 language model
Google has stood at the forefront at many of the tech industry's AI breakthroughs in recent years, Zoubin Ghahramani, Vice President of Google DeepMind, declared in a blog post while asserting that the company's work in foundation models, are "the bedrock for the industry and the AI-powered products that billions of people use daily." On Wednesday, Ghahramani and other Google executives took the Shoreline Amphitheater stage to show off its latest and greatest large language model, PaLM 2, which now comes in four sizes able to run locally on everything from mobile devices to server farms. PaLM 2, obviously, is the successor to Google's existing PaLM model that, until recently, powered its experimental Bard AI. "Think of PaLM as a general model that then can be fine tuned to achieve particular tasks," he explained during a reporters call earlier in the week. "For example: health research teams have fine tuned PaLM with with medical knowledge to help answer questions and summarize insights from a variety of dense medical texts." Ghahramani also notes that PaLM was "the first large language model to perform an expert level on the US medical licensing exam."
Google launches new AI PaLM 2 in attempt to regain leadership of the pack
Google is attempting to reclaim its crown as the leader in artificial intelligence with PaLM 2, a "next-generation language model" that the company says outperforms other leading systems on some tasks. Revealing the cutting-edge AI at its annual I/O conference, alongside a foldable Pixel phone and a new tablet, Google said it would be built in to 25 new products and features, as the company races to catch up with competitors after years of producing AI research but few products. Like other "large language models" such as OpenAI's GPT, PaLM 2 is a general-purpose AI model, which can be used to power ChatGPT-style chatbots but also translate between languages, write computer code, or even analyse and respond to images. Combining those capabilities, a user could ask a question in English about a restaurant in Bulgaria, and the system would be able to search the web for Bulgarian responses, find an answer, translate the answer into English, add a picture of the location – and then follow up with a code snippet to create a database entry for the place. "The neural network revolution that we are now experiencing started around 10 years ago," said Slav Petrov, the co-lead of the PaLM 2 project, "and it started in part at Google."