AITopics | Jurayj, William

Collaborating Authors

Jurayj, William

CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?

Ou, Jiefu, Walden, William Gantt, Sanders, Kate, Jiang, Zhengping, Sun, Kaiser, Cheng, Jeffrey, Jurayj, William, Wanner, Miriam, Liang, Shaobo, Morgan, Candice, Han, Seunghoon, Wang, Weiqi, May, Chandler, Recknor, Hannah, Khashabi, Daniel, Van Durme, Benjamin

arXiv.org Artificial Intelligence, Mar-27-2025

A core part of scientific peer review involves providing expert critiques that directly assess the scientific claims a paper makes. While it is now possible to automatically generate plausible (if generic) reviews, ensuring that these reviews are sound and grounded in the papers' claims remains challenging. To facilitate LLM benchmarking on these challenges, we introduce CLAIMCHECK, an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. CLAIMCHECK is richly annotated by ML experts for weakness statements in the reviews and the paper claims that they dispute, as well as fine-grained labels of the validity, objectivity, and type of the identified weaknesses. We benchmark several LLMs on three claim-centric tasks supported by CLAIMCHECK, requiring models to (1) associate weaknesses with the claims they dispute, (2) predict fine-grained labels for weaknesses and rewrite the weaknesses to enhance their specificity, and (3) verify a paper's claims with grounded reasoning. Our experiments reveal that cutting-edge LLMs, while capable of predicting weakness labels in (2), continue to underperform relative to human experts on all other tasks.

artificial intelligence, large language model, natural language

2503.21717

Country:

Asia (0.68)
North America > United States (0.67)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

Jurayj, William, Cheng, Jeffrey, Van Durme, Benjamin

arXiv.org Artificial Intelligence, Feb-19-2025

Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
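The thresholding idea in this abstract can be illustrated with a small sketch. It is not the authors' evaluation code: the `Prediction` record, the `selective_score` function, the `wrong_penalty` parameter, and the toy confidence values are all assumptions made for illustration. The sketch answers only when confidence meets a threshold, scores an abstention as 0, and charges a penalty for a wrong answer, so that a penalty of 0 corresponds to a zero-risk setting and a positive penalty to a setting with response risk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    confidence: float  # hypothetical confidence score in [0, 1]
    correct: bool      # whether the answer matched the reference


def selective_score(preds: List[Prediction], threshold: float,
                    wrong_penalty: float) -> float:
    """Mean utility when the system answers only at or above `threshold`.

    Assumed utility: +1 for a correct answer, 0 for abstaining,
    -wrong_penalty for an incorrect answer.
    """
    if not preds:
        raise ValueError("preds must not be empty")
    total = 0.0
    for p in preds:
        if p.confidence < threshold:
            continue  # abstain: contributes 0 to the total
        total += 1.0 if p.correct else -wrong_penalty
    return total / len(preds)


# Toy data (made up): two confident correct answers, two unsure wrong
# answers, and one unsure correct answer.
preds = [
    Prediction(0.95, True),
    Prediction(0.90, True),
    Prediction(0.60, False),
    Prediction(0.50, False),
    Prediction(0.40, True),
]

always_answer_no_risk = selective_score(preds, threshold=0.0, wrong_penalty=0.0)
always_answer_risky = selective_score(preds, threshold=0.0, wrong_penalty=1.0)
selective_risky = selective_score(preds, threshold=0.8, wrong_penalty=1.0)
```

On this toy data, answering everything scores best when wrong answers cost nothing, while abstaining below the threshold scores better once wrong answers are penalized.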

large language model, machine learning, natural language

2502.13962

Country:

North America > United States (0.14)
Asia (0.14)
Oceania > Australia (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.52)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.43)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.34)

Garden-Path Traversal in GPT-2

Jurayj, William, Rudman, William, Eickhoff, Carsten

arXiv.org Artificial Intelligence, Oct-20-2022

In recent years, large-scale transformer decoders such as the GPT-x family of models have become increasingly popular. Studies examining the behavior of these models tend to focus only on the output of the language modeling head and avoid analysis of the internal states of the transformer decoder. In this study, we present a collection of methods to analyze the hidden states of GPT-2 and use the model's navigation of garden path sentences as a case study. To enable this, we compile the largest currently available dataset of garden path sentences. We show that Manhattan distances and cosine similarities provide more reliable insights compared to established surprisal methods that analyze next-token probabilities computed by a language modeling head. Using these methods, we find that negating tokens have minimal impacts on the model's representations for unambiguous forms of sentences with ambiguity solely over what the object of a verb is, but have a more substantial impact on representations for unambiguous sentences whose ambiguity would stem from the voice of a verb. Further, we find that analyzing the decoder model's hidden states reveals periods of ambiguity that might conclude in a garden path effect but happen not to, whereas surprisal analyses routinely miss this detail.

Figure 1: Hidden state relations (Top: cosine similarity, Middle: Manhattan distance, Bottom: surprisal difference) between negated and non-negated forms of garden path and unambiguous sentences. The ambiguous verb "walked" primes the effect later in the sentence, while the unambiguous "taken" avoids it. The verb "lit" ...
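The three quantities named above (Manhattan distance, cosine similarity, and surprisal difference) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's code: the function names are hypothetical, the vectors are toy values rather than real GPT-2 hidden states, and measuring surprisal in bits (log base 2) is an assumption.

```python
import math
from typing import Sequence


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences (L1 distance)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def surprisal(prob: float) -> float:
    """Surprisal of a token with probability `prob`, in bits (assumed base 2)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)


# Toy stand-ins for the hidden state at one token position in a
# non-negated sentence and in its negated counterpart.
h_plain = [1.0, 2.0, 0.0]
h_negated = [1.0, 0.0, 2.0]

l1 = manhattan_distance(h_plain, h_negated)
cos = cosine_similarity(h_plain, h_negated)

# Toy next-token probabilities for the same token in the two forms.
surprisal_diff = surprisal(0.125) - surprisal(0.5)
```

The first two functions compare internal representations directly, whereas the third depends only on a next-token probability, which is the contrast the abstract draws between hidden-state analysis and surprisal methods.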

artificial intelligence, machine learning, natural language

2205.12302

Country: North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)
