To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis

Bianchi, Federico, Kwon, Yongchan, Izzo, Zachary, Zhang, Linjun, Zou, James

arXiv.org Artificial Intelligence

How many mistakes do published AI papers contain? Peer-reviewed publications form the foundation upon which new research and knowledge are built. Errors that persist in the literature can propagate unnoticed, creating confusion in follow-up studies and complicating reproducibility. The accelerating pace of research and the increasing demands on the peer-review system make such mistakes harder to detect and avoid. To address this, we developed a Paper Correctness Checker based on GPT-5 to systematically identify mistakes in papers previously published at top AI conferences and journals. Our analysis focuses on objective mistakes (e.g., errors in formulas, derivations, calculations, figures, and tables) that have a clearly verifiable ground truth. We intentionally exclude subjective considerations such as novelty, importance, or writing quality. We find that published papers contain a non-negligible number of objective mistakes and that the average number of mistakes per paper has increased over time: from 3.8 in NeurIPS 2021 to 5.9 in NeurIPS 2025 (a 55.3% increase); from 4.1 in ICLR 2018 to 5.2 in ICLR 2025; and from 5.0 in TMLR 2022/23 to 5.5 in TMLR 2025. Human experts reviewed 316 potential mistakes identified by the AI Checker and confirmed that 263 were actual mistakes, corresponding to a precision of 83.2%. While most identified issues are relatively minor, correcting them would reduce confusion in the literature and strengthen reproducibility. The AI Checker also surfaced potentially more substantive mistakes that could affect the interpretation of results. Moreover, we show that the AI Checker can propose correct fixes for 75.8% of the identified mistakes. Overall, this study highlights the potential of frontier LLMs to detect and correct objective mistakes in published papers, helping to establish a firmer foundation of knowledge.
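
As a quick sanity check, the headline figures quoted above can be recomputed directly; the snippet below is a minimal illustration using only the numbers reported in the abstract, not code from the paper.

```python
# Recompute the summary statistics quoted in the abstract.
confirmed, flagged = 263, 316            # human-confirmed mistakes / AI-flagged candidates
precision = confirmed / flagged          # 263 / 316 = 0.832...

neurips_2021, neurips_2025 = 3.8, 5.9    # average mistakes per paper
increase = (neurips_2025 - neurips_2021) / neurips_2021  # 2.1 / 3.8 = 0.553...

print(f"precision: {precision:.1%}")                    # -> precision: 83.2%
print(f"NeurIPS 2021->2025 increase: {increase:.1%}")   # -> 55.3%
```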



Aria: An Agent For Retrieval and Iterative Auto-Formalization via Dependency Graph

Wang, Hanyu, Xie, Ruohan, Wang, Yutong, Gao, Guoxiong, Yu, Xintao, Dong, Bin

arXiv.org Artificial Intelligence

Accurate auto-formalization of theorem statements is essential for advancing automated discovery and verification of research-level mathematics, yet remains a major bottleneck for LLMs due to hallucinations, semantic mismatches, and their inability to synthesize new definitions. To tackle these issues, we present Aria (Agent for Retrieval and Iterative Autoformalization), a system for conjecture-level formalization in Lean that emulates human expert reasoning via a two-phase Graph-of-Thought process: recursively decomposing statements into a dependency graph and then constructing formalizations from grounded concepts. To ensure semantic correctness, we introduce AriaScorer, a checker that retrieves definitions from Mathlib for term-level grounding, enabling rigorous and reliable verification. We evaluate Aria on diverse benchmarks. On ProofNet, it achieves a 91.6% compilation success rate and 68.5% final accuracy, surpassing previous methods. On FATE-X, a suite of challenging algebra problems from the research literature, it outperforms the best baseline with 44.0% vs. 24.0%. On a dataset of homological conjectures, Aria reaches 42.9% final accuracy while all other models score 0%. In recent years, Interactive Theorem Provers (ITPs) such as Coq (Barras et al., 1999), Isabelle (Paulson, 1994), and Lean (Moura & Ullrich, 2021) have become crucial ecosystems for formalized mathematics. Among these, Lean 4, together with its comprehensive library Mathlib (mathlib Community, 2020), is pioneering a new paradigm for formalization. However, the continued growth of this ecosystem is constrained by the immense manual effort and deep expertise that formalization demands. To address this, the research community has turned to Large Language Models (LLMs) for auto-formalization: the process of translating informal (natural-language) mathematical statements and proofs into their formal counterparts. While these two processes are interconnected, the accurate formalization of statements is the foundational first step. A correctly formalized statement is a prerequisite for any valid proof and, on its own, is a valuable asset to the mathematical ecosystem, enabling better search, integration, and verification. Thus, despite progress in proof automation (Ren et al., 2025; Chen et al., 2025), the fidelity of this initial statement translation remains a critical bottleneck. LLMs frequently generate formal statements that suffer not only from compilation errors but also from more insidious semantic flaws, a challenge that intensifies when formalizing more complex research or conjecture-level statements.
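
To make the two-phase Graph-of-Thought process easier to picture, here is a minimal structural sketch in Python; it is not code from the paper, and the node class and the decompose / formalize / retrieve_definition names are illustrative placeholders. Phase one builds a dependency graph of concepts; phase two formalizes the leaves first and composes upward, grounding each concept against Mathlib.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class ConceptNode:
    """One mathematical concept in the dependency graph (hypothetical structure)."""
    name: str
    children: list = field(default_factory=list)  # concepts this one depends on
    formalization: Optional[str] = None           # Lean snippet once grounded

def decompose(statement: str) -> ConceptNode:
    """Phase 1 (sketch): recursively split a statement into constituent concepts.
    A real system would query an LLM here; this stub only shows the structure."""
    root = ConceptNode(name=statement)
    # ... identify sub-concepts with an LLM and recurse on each ...
    return root

def formalize(node: ConceptNode, retrieve_definition: Callable[[str], Optional[str]]) -> str:
    """Phase 2 (sketch): formalize leaves first, then compose upward,
    grounding each concept against Mathlib via `retrieve_definition`."""
    for child in node.children:
        formalize(child, retrieve_definition)
    grounded = retrieve_definition(node.name)  # e.g., an existing Mathlib declaration
    node.formalization = grounded or f"-- TODO: synthesize a new definition for {node.name}"
    return node.formalization
```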



Non-Termination Proving: 100 Million LoC and Beyond

Vanegue, Julien, Villard, Jules, O'Hearn, Peter, Raad, Azalea

arXiv.org Artificial Intelligence

We report on our tool, Pulse Infinite, that uses proof techniques to show non-termination (divergence) in large programs. Pulse Infinite works compositionally and under-approximately: the former supports scale, and the latter ensures soundness for proving divergence. Prior work focused on small benchmarks in the tens or hundreds of lines of code (LoC), and this limited scale restricts practicality: a single company may have tens of millions, or even hundreds of millions, of LoC or more. We report on applying Pulse Infinite to over a hundred million lines of open-source and proprietary software written in C, C++, and Hack, identifying over 30 previously unknown issues and establishing a new state of the art for detecting divergence in real-world codebases.


CodeContests+: High-Quality Test Case Generation for Competitive Programming

Wang, Zihan, Liu, Siyao, Sun, Yang, Li, Hongyan, Shen, Kai

arXiv.org Artificial Intelligence

Competitive programming, due to its high reasoning difficulty and precise correctness feedback, has become a key task for both training and evaluating the reasoning capabilities of large language models (LLMs). However, while a large amount of public problem data, such as problem statements and solutions, is available, the test cases of these problems are often difficult to obtain. Therefore, test case generation is a necessary task for building large-scale datasets, and the quality of the test cases directly determines the accuracy of the evaluation. In this paper, we introduce an LLM-based agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new version with improved test cases, named CodeContests+. We evaluated the quality of test cases in CodeContests+. First, we used 1.72 million submissions with pass/fail labels to examine the accuracy of these test cases in evaluation. The results indicated that CodeContests+ achieves significantly higher accuracy than CodeContests, particularly with a notably higher True Positive Rate (TPR). Subsequently, our experiments in LLM Reinforcement Learning (RL) further confirmed that improvements in test case quality yield considerable advantages for RL.
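
The evaluation described above amounts to measuring how well verdicts produced by the generated test cases agree with the ground-truth pass/fail labels of the 1.72 million submissions. A minimal sketch of that kind of agreement measurement is below; the data structures and callables are placeholders, not the paper's actual format.

```python
def evaluate_test_cases(submissions, run_tests):
    """Compare ground-truth labels against verdicts from generated test cases.

    submissions: iterable of (code, is_correct) pairs, where is_correct is True
                 for an accepted submission and False for a rejected one.
    run_tests:   callable that runs a submission against the generated tests
                 and returns True iff every test passes.
    """
    tp = fn = tn = fp = 0
    for code, is_correct in submissions:
        passed = run_tests(code)
        if is_correct and passed:
            tp += 1      # correct solution accepted by the tests
        elif is_correct and not passed:
            fn += 1      # correct solution wrongly rejected
        elif not is_correct and not passed:
            tn += 1      # wrong solution caught
        else:
            fp += 1      # wrong solution slips through weak tests
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # True Positive Rate
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"TPR": tpr, "FPR": fpr}
```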


GAM-Agent: Game-Theoretic and Uncertainty-Aware Collaboration for Complex Visual Reasoning

Zhang, Jusheng, Fan, Yijia, Lin, Wenjun, Chen, Ruiqi, Jiang, Haoyi, Chai, Wenhao, Wang, Jian, Wang, Keze

arXiv.org Artificial Intelligence

We propose GAM-Agent, a game-theoretic multi-agent framework for enhancing vision-language reasoning. Unlike prior single-agent or monolithic models, GAM-Agent formulates the reasoning process as a non-zero-sum game between base agents, each specializing in visual perception subtasks, and a critical agent that verifies logical consistency and factual correctness. Agents communicate via structured claims, evidence, and uncertainty estimates. The framework introduces an uncertainty-aware controller to dynamically adjust agent collaboration, triggering multi-round debates when disagreement or ambiguity is detected. This process yields more robust and interpretable predictions. Experiments on four challenging benchmarks (MMMU, MMBench, MVBench, and V*Bench) demonstrate that GAM-Agent significantly improves performance across various VLM backbones. Notably, GAM-Agent boosts the accuracy of small-to-mid-scale models (e.g., Qwen2.5-VL-7B, InternVL3-14B) by 5-6%, and still enhances strong models like GPT-4o by up to 2-3%. Our approach is modular, scalable, and generalizable, offering a path toward reliable and explainable multi-agent multimodal reasoning.
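
A rough picture of the uncertainty-aware controller described above: keep debating while agents disagree or report high uncertainty, up to a round limit. The sketch below is a hypothetical outline; the agent and critic interfaces, the threshold, and the aggregation step are assumptions for illustration, not the paper's implementation.

```python
def debate(base_agents, critic, query, max_rounds=3, uncertainty_threshold=0.3):
    """Hypothetical uncertainty-aware controller loop.

    Each base agent returns a claim object carrying evidence and an uncertainty
    score in [0, 1]; the critic checks logical/factual consistency across claims.
    Debate continues while agents disagree or remain uncertain.
    """
    history = []
    for _ in range(max_rounds):
        claims = [agent.answer(query, history) for agent in base_agents]
        history.append(claims)
        consistent = critic.check(claims)              # agreement on logic and facts
        if consistent and max(c.uncertainty for c in claims) < uncertainty_threshold:
            break                                      # confident consensus reached
        query = critic.feedback(claims)                # steer the next debate round
    return critic.aggregate(history)                   # final prediction
```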


MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

Fan, Wan-Cyuan, Rahman, Tanzila, Sigal, Leonid

arXiv.org Artificial Intelligence

With advances in foundational and vision-language models, and effective fine-tuning techniques, a large number of both general- and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model is able to handle all tasks and/or applications that may be envisioned by potential users. Recent approaches, such as visual programming and multimodal LLMs with integrated tools, aim to tackle complex visual tasks by way of program synthesis. However, such approaches overlook user constraints (e.g., performance / computational needs), produce test-time, sample-specific solutions that are difficult to deploy, and sometimes require low-level instructions that may be beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components, acting like a solution search engine across various available models. Based on a task description, a few sample input-output pairs, and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio-lingual tools from its model repository. In addition to synthesizing these solutions, MMFactory also proposes metrics and benchmarks performance / resource characteristics, allowing users to pick a solution that meets their unique design constraints. From a technical perspective, we also introduce a committee-based solution proposer that leverages multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications. The project page is available at https://davidhalladay.github.io/mmfactory_demo.
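
One way to picture the solution search loop described above: propose candidate tool pipelines from the model repository, discard those that violate user constraints, benchmark the rest on the provided example pairs, and rank them. The sketch below follows that reading; the repository, pipeline, and constraint objects are hypothetical stand-ins, not MMFactory's actual interfaces.

```python
def search_solutions(task_description, examples, repository, constraints=None):
    """Hypothetical solution search: propose, filter by constraints, benchmark, rank.

    examples:    list of (input, expected_output) pairs supplied by the user.
    repository:  object whose propose(task_description) yields candidate
                 pipelines; each pipeline is callable and carries cost metadata.
    constraints: optional dict, e.g. {"max_latency_s": 1.0}.
    """
    constraints = constraints or {}
    ranked = []
    for pipeline in repository.propose(task_description):
        latency = pipeline.cost.get("latency_s", 0.0)
        if latency > constraints.get("max_latency_s", float("inf")):
            continue                                 # violates the user's resource budget
        accuracy = sum(pipeline(x) == y for x, y in examples) / len(examples)
        ranked.append((accuracy, latency, pipeline))
    ranked.sort(key=lambda r: r[0], reverse=True)    # best-performing solutions first
    return ranked                                    # the user picks per their constraints
```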