ACPBench: Reasoning about Action, Change, and Planning

Kokel, Harsha, Katz, Michael, Srinivas, Kavitha, Sohrabi, Shirin

arXiv.org Artificial Intelligence

There is an increasing body of work using Large Language Models (LLMs) as agents for orchestrating workflows and making decisions in domains that require planning and multi-step reasoning. As a result, it is imperative to evaluate LLMs on the core skills required for planning. In this work, we present ACPBench, a benchmark for evaluating reasoning tasks in the field of planning. The benchmark consists of 7 reasoning tasks over 13 planning domains. The collection is constructed from planning domains described in a formal language, which allows us to synthesize problems with provably correct solutions across many tasks and domains. It also affords scale without additional human effort: many more problems can be generated automatically. Our extensive evaluation of 22 LLMs and the OpenAI o1 reasoning models highlights a significant gap in the reasoning capabilities of current LLMs. Our findings with OpenAI o1, a multi-turn reasoning model, reveal significant gains on multiple-choice questions yet, surprisingly, no notable progress on boolean questions. The ACPBench collection is available at https://ibm.github.io/ACPBench.
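The abstract's headline finding contrasts two answer formats. As a concrete illustration, the sketch below scores hypothetical benchmark items and reports accuracy separately for boolean and multiple-choice questions; the item schema, field names, and example questions here are invented for illustration and are not ACPBench's actual format or API.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    kind: str      # "bool" or "mcq"
    answer: str    # "yes"/"no" for boolean items, a choice label such as "B" for MCQ

def correct(item: Item, model_output: str) -> bool:
    """Exact match after light normalization (case, whitespace, trailing period)."""
    return model_output.strip().lower().rstrip(".") == item.answer.lower()

# Hypothetical planning-reasoning items in the spirit of the benchmark.
items = [
    Item("Is pick(ball1, roomA) applicable in the current state?", "bool", "yes"),
    Item("Which fact holds after move(roomA, roomB)? (A) at(roomA) (B) at(roomB)",
         "mcq", "B"),
]
model_outputs = ["Yes.", "B"]  # stand-ins for real model responses

# Report accuracy per answer format, mirroring the bool-vs-MCQ comparison.
by_kind = defaultdict(list)
for item, out in zip(items, model_outputs):
    by_kind[item.kind].append(correct(item, out))
for kind, results in by_kind.items():
    print(f"{kind}: {sum(results) / len(results):.2f}")
```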


"What is the value of {templates}?" Rethinking Document Information Extraction Datasets for LLMs

Zmigrod, Ran, Shetty, Pranav, Sibue, Mathieu, Ma, Zhiqiang, Nourbakhsh, Armineh, Liu, Xiaomo, Veloso, Manuela

arXiv.org Artificial Intelligence

The rise of large language models (LLMs) for visually rich document understanding (VRDU) has kindled a need for prompt-response, document-based datasets. As annotating new datasets from scratch is labor-intensive, the existing literature has generated prompt-response datasets from available resources using simple templates. For key information extraction (KIE), one of the most common VRDU tasks, past work has typically employed the template "What is the value for the {key}?". However, given the variety of questions encountered in the wild, simple and uniform templates are insufficient for building robust models in research and industrial contexts. In this work, we present K2Q, a diverse collection of five datasets converted from KIE to a prompt-response format using a plethora of bespoke templates. The questions in K2Q can span multiple entities and be extractive or boolean. We empirically compare the performance of seven baseline generative models on K2Q with zero-shot prompting. We further compare three of these models when trained on K2Q versus on simpler templates, to motivate the need for our work. We find that creating diverse and intricate KIE questions improves the performance and robustness of VRDU models. We hope this work encourages future studies on data quality for generative model training.
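To make the template contrast concrete, the following sketch converts a document's key-value annotations into prompt-response pairs, first under the single uniform template quoted in the abstract and then under a small set of diverse templates, including a boolean and a multi-entity form. The diverse template wordings and document fields are invented stand-ins, not K2Q's actual templates.

```python
import random

UNIFORM = "What is the value for the {key}?"    # the template quoted in the abstract
DIVERSE = [                                     # invented stand-ins for bespoke templates
    "What is the value for the {key}?",
    "Which {key} appears on this document?",
    "Does the document mention a {key}?",       # boolean form
]
MULTI = "List the {key_a} and the {key_b}."     # multi-entity form

def uniform_pairs(kv: dict[str, str]) -> list[tuple[str, str]]:
    return [(UNIFORM.format(key=k), v) for k, v in kv.items()]

def diverse_pairs(kv: dict[str, str], rng: random.Random) -> list[tuple[str, str]]:
    pairs = []
    for k, v in kv.items():
        template = rng.choice(DIVERSE)
        # Boolean templates take yes/no answers; the rest stay extractive.
        answer = "yes" if template.startswith("Does") else v
        pairs.append((template.format(key=k), answer))
    if len(kv) >= 2:                            # add one question spanning two entities
        a, b = rng.sample(list(kv), 2)
        pairs.append((MULTI.format(key_a=a, key_b=b), f"{kv[a]}; {kv[b]}"))
    return pairs

rng = random.Random(0)
doc = {"invoice number": "INV-0042", "total amount": "$1,280.00"}
print(uniform_pairs(doc))
print(diverse_pairs(doc, rng))
```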


Characterizing LLM Abstention Behavior in Science QA with Context Perturbations

Wen, Bingbing, Howe, Bill, Wang, Lucy Lu

arXiv.org Artificial Intelligence

The correct model response in the face of uncertainty is to abstain from answering a question so as not to mislead the user. In this work, we study the ability of LLMs to abstain from answering context-dependent science questions when provided with insufficient or incorrect context. We probe model sensitivity in several settings: removing the gold context, replacing the gold context with irrelevant context, and adding irrelevant context on top of the gold context. In experiments on four QA datasets with four LLMs, we show that performance varies greatly across models, across the type of context provided, and by question type; in particular, many LLMs seem unable to abstain from answering boolean questions using standard QA prompts. Our analysis also highlights the unexpected impact of abstention performance on QA task accuracy. Counter-intuitively, in some settings, replacing the gold context with irrelevant context, or adding irrelevant context to the gold context, can improve abstention in a way that also improves task performance. Our results imply that changes are needed in QA dataset design and evaluation to more effectively assess the correctness and downstream impacts of model abstention.
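The three perturbations described above are straightforward to reproduce. Below is a minimal sketch of how such probes might be constructed; the prompt wording, the abstention instruction, and the example contexts are assumptions for illustration, not the paper's exact setup.

```python
import random

def perturb(gold: str, distractors: list[str], mode: str, rng: random.Random) -> str:
    """Build the context for one question under a given perturbation."""
    if mode == "remove":               # gold context withheld entirely
        return ""
    if mode == "replace":              # irrelevant context instead of gold
        return rng.choice(distractors)
    if mode == "add":                  # gold plus irrelevant context
        return gold + "\n\n" + rng.choice(distractors)
    return gold                        # "none": unperturbed baseline

def build_prompt(question: str, context: str) -> str:
    # An abstention-aware QA prompt: the model is allowed to decline.
    return (f"Context: {context}\n"
            f"Question: {question}\n"
            "Answer the question, or reply 'unanswerable' if the context "
            "does not contain the answer.")

rng = random.Random(0)
gold = "At sea-level pressure, water boils at 100 degrees Celsius."
distractors = ["Mitochondria are often called the powerhouse of the cell."]
for mode in ("none", "remove", "replace", "add"):
    prompt = build_prompt("At what temperature does water boil?",
                          perturb(gold, distractors, mode, rng))
    print(f"--- {mode} ---\n{prompt}\n")
```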


An AI-based Solution for Enhancing Delivery of Digital Learning for Future Teachers

Kang, Yong-Bin, Forkan, Abdur Rahim Mohammad, Jayaraman, Prem Prakash, Wieland, Natalie, Kollias, Elizabeth, Du, Hung, Thomson, Steven, Li, Yuan-Fang

arXiv.org Artificial Intelligence

However, up until the COVID-19 pandemic caused a seismic shift in the education sector, few educational institutions had fully developed digital learning models in place, and adoption of digital models was ad hoc or only partially integrated alongside traditional teaching modes [1]. In the wake of the disruptive impact of the pandemic, the education sector, and more importantly educators, have had to move rapidly to adopt digital solutions to continue delivering learning. At the most rudimentary level, this has meant moving to online teaching through platforms such as Zoom, Google and Teams, using interactive whiteboards, and delivering pre-recorded educational materials via learning management systems (e.g., Echo). Digital learning is now simply part of the education landscape, both in the traditional education sector and within the context of corporate and workplace learning. A key challenge future teachers face when delivering educational content via digital learning is being able to assess what the learner knows and understands, the depth of that knowledge and understanding, and any gaps in that learning. Assessment also occurs in the context of the cohort and the relevant band or level of learning. The Teachers' Guide to Assessment produced by the Australian Capital Territory Government [2] identified that teachers and learning designers were particularly challenged by the assessment process, and that new technologies have the potential to transform existing digital teaching and learning practices through refined information gathering and the ability to enhance the nature of learner feedback. Artificial Intelligence (AI) is part of the next generation of digital learning, enabling educators to create learning content, stream content to suit individual learner needs, and access and respond to data on learner performance and feedback [3]. AI has the capacity to provide significant benefits to teachers in delivering nuanced and personalised experiences to learners.