AITopics | gold standard

Final-Model-Only Data Attribution with a Unifying View of Gradient-Based Methods

Neural Information Processing Systems | Jun-20-2026, 09:11:35 GMT

Training data attribution (TDA) is concerned with understanding model behavior in terms of the training data. This paper draws attention to the common setting where one has access only to the final trained model, and not the training algorithm or intermediate information from training.

artificial intelligence, machine learning, similarity, (17 more...)

Country: North America > United States > Minnesota (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking

Neural Information Processing Systems | Mar-20-2026, 20:47:45 GMT

While deep learning has revolutionized computer-aided drug discovery, the AI community has predominantly focused on model innovation and placed less emphasis on establishing best benchmarking practices. We posit that without a sound model evaluation framework, the AI community's efforts cannot reach their full potential, thereby slowing the progress and transfer of innovation into real-world drug discovery. Thus, in this paper, we seek to establish a new gold standard for small molecule drug discovery benchmarking.

artificial intelligence, machine learning, proceedings, (12 more...)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.39)

Algorithmic Assurance: An Active Approach to Algorithmic Testing using Bayesian Optimisation

Shivapratap Gopakumar, Sunil Gupta, Santu Rana, Vu Nguyen, Svetha Venkatesh

Neural Information Processing Systems | Mar-16-2026, 08:11:09 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, artificial intelligence, machine learning, (17 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Algorithmic Assurance: An Active Approach to Algorithmic Testing using Bayesian Optimisation

Shivapratap Gopakumar, Sunil Gupta, Santu Rana, Vu Nguyen, Svetha Venkatesh

Neural Information Processing Systems | Feb-14-2026, 15:02:20 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, assurance, optimization, (15 more...)

Country:

Oceania > Australia (0.14)
North America > Canada > Quebec > Montreal (0.04)
Asia (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

b5d17ed2b502da15aa727af0d51508d6-AuthorFeedback.pdf

Neural Information Processing Systems | Feb-9-2026, 22:57:49 GMT

annotation, dataset, reliability, (15 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (0.72)
Information Technology > Artificial Intelligence > Machine Learning (0.72)

4ea14e6090343523ddcd5d3ca449695f-Supplemental-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems | Feb-8-2026, 20:44:31 GMT

dataset, participant, search image, (14 more...)

Technology:

Information Technology > Software > Programming Languages (0.69)
Information Technology > Artificial Intelligence > Machine Learning (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.46)

Bench4KE: Benchmarking Automated Competency Question Generation

Lippolis, Anna Sofia; Ragagni, Minh Davide; Ciancarini, Paolo; Nuzzolese, Andrea Giovanni; Presutti, Valentina

arXiv.org Artificial Intelligence | Dec-10-2025

The availability of Large Language Models (LLMs) presents a unique opportunity to reinvigorate research on Knowledge Engineering (KE) automation. This trend is already evident in recent efforts developing LLM-based methods and tools for the automatic generation of Competency Questions (CQs), natural language questions used by ontology engineers to define the functional requirements of an ontology. However, the evaluation of these tools lacks standardization. This undermines the methodological rigor and hinders the replication and comparison of results. To address this gap, we introduce Bench4KE, an extensible API-based benchmarking system for KE automation. The presented release focuses on evaluating tools that generate CQs automatically. Bench4KE provides a curated gold standard consisting of CQ datasets from 17 real-world ontology engineering projects and uses a suite of similarity metrics to assess the quality of the CQs generated. We present a comparative analysis of 6 recent CQ generation systems, which are based on LLMs, establishing a baseline for future research. Bench4KE is also designed to accommodate additional KE automation tasks, such as SPARQL query generation, ontology testing and drafting. Code and datasets are publicly available under the Apache 2.0 license.

large language model, machine learning, natural language, (21 more...)

arXiv: 2505.24554

Country:

Europe > Italy (0.47)
North America > United States (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Save 30 on This All-Clad Nonstick Frying Pan Set

WIRED | Oct-29-2025, 19:39:14 GMT

Life is too short to use bad nonstick cookware. These All-Clad pans are the gold standard, and they're less expensive than usual. It can be hard to build an Adulting Arsenal.

all-clad nonstick frying pan set, nonstick frying pan set, pan set, (14 more...)

Country:

North America > United States > New Mexico (0.05)
North America > United States > California (0.05)
North America > Mexico > Mexico City > Mexico City (0.05)
(2 more...)

Industry:

Transportation (0.52)
Retail (0.36)

Technology:

Information Technology > Communications (0.48)
Information Technology > Artificial Intelligence (0.48)

Trusted Knowledge Extraction for Operations and Maintenance Intelligence

Mealey, Kathleen P.; Karr, Jonathan A. Jr.; Moreira, Priscila Saboia; Brenner, Paul R.; Vardeman, Charles F. II

arXiv.org Artificial Intelligence | Oct-28-2025

Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.

large language model, machine learning, natural language, (21 more...)

doi: 10.1016/j.nlp.2025.100187

arXiv: 2507.22935

Country:

Europe (0.92)
North America > United States > Massachusetts (0.28)

Genre: Research Report > New Finding (0.67)

Industry:

Transportation > Air (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Aerospace & Defense > Aircraft (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?

Thelwall, Mike; Mohammadi, Ehsan

arXiv.org Artificial Intelligence | Oct-28-2025

Assessing published academic journal articles is a common task for evaluations of departments and individuals. Whilst it is sometimes supported by citation data, Large Language Models (LLMs) may give more useful indications of article quality. Evidence of this capability exists for two of the largest LLM families, ChatGPT and Gemini, and the medium sized LLM Gemma3 27b, but it is unclear whether smaller LLMs and reasoning models have similar abilities. This is important because larger models may be slow and impractical in some situations, and reasoning models may perform differently. Four relevant questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1, on a dataset of 2,780 medical, health and life science papers in 6 fields, with two different gold standards, one novel. The results suggest that smaller (open weights) and reasoning LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Moreover, averaging scores from multiple identical queries seems to be a universally successful strategy, and few-shot prompts (four examples) tended to help but the evidence was equivocal. Reasoning models did not have a clear advantage. Overall, the results show, for the first time, that smaller LLMs >4b, including reasoning models, have a substantial capability to score journal articles for research quality, especially if score averaging is used.

correlation, large language model, machine learning, (22 more...)

arXiv: 2510.22389

Country:

Europe > United Kingdom (0.28)
North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
