AITopics | integer

Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions

Neural Information Processing Systems, Feb-12-2026, 19:32:21 GMT

We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs.

large language model, machine learning, programming language, (19 more...)

Country: North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre:

Research Report (1.00)
Questionnaire & Opinion Survey (0.67)

Industry:

Education (0.67)
Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

A Proofs A.1 Learning D

Neural Information Processing Systems, Dec-27-2025, 20:39:29 GMT

For an overview of its proof, see Appendix B. Lemma A.1. [...] In the following lemma, we use Lemma A.1 in order to show RSA T -hardness of [...] By Assumption 2.1, there is K such that CSP [...] K literals in the clause are satisfied by ψ, and otherwise [...] A.3 Hardness of learning random fully-connected neural networks. Let n = ( n [...] Let M be a diagonal-blocks matrix. By Lemma A.3, we have [...] By Lemma A.4, we have with probability 1 o [...] Finally, Theorem 3.1 follows immediately from Theorem A.1 and the following lemma. By Lemma A.6, we have that [...] By Theorem A.1, we need to show that SCAT [...] We say that a distribution is isotropic if it has mean zero and its covariance matrix is the identity.

artificial intelligence, log 2, machine learning, (16 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)

Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models

Guo, Dadi, Liu, Jiayu, Fan, Zhiyuan, He, Zhitao, Li, Haoran, Li, Yuxin, Wang, Yumeng, Fung, Yi R.

arXiv.org Artificial Intelligence, Dec-10-2025

Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high reported accuracy of these advanced models on popular datasets, combined with reliance on purely numerical evaluation and potential benchmark leakage, often masks their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models' performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, which reveal fundamental limitations in current large reasoning models: 1) large reasoning models struggle with mathematical proofs, with some generating entirely correct proofs for fewer than 20% of problems and failing even on basic ones; 2) models exhibit a diverse spectrum of reasoning failures, most prominently a lack of guarantees for the correctness and rigor of single-step reasoning; and 3) models show hallucination and incompleteness during the reasoning process. Our findings reveal that models' self-reflection is insufficient to resolve the current logical dilemmas, necessitating formalized and fine-grained logical training.

large language model, machine learning, natural language, (19 more...)

2506.17114

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Austria > Vienna (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Educational Setting (0.68)
Education > Curriculum (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.99)
(2 more...)

How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations

Roig, JV

arXiv.org Artificial Intelligence, Dec-10-2025

We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1's superior reliability derives primarily from post-training reinforcement learning rather than architecture or size. Across models, we identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. These patterns highlight the need for agentic evaluation methods that emphasize interactive grounding, recovery behavior, and environment-aware adaptation, suggesting that reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.

granite 4, large language model, machine learning, (21 more...)

2512.07497

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Going All-In on LLM Accuracy: Fake Prediction Markets, Real Confidence Signals

Todasco, Michael

arXiv.org Artificial Intelligence, Dec-9-2025

Large language models are increasingly used to evaluate other models, yet these judgments typically lack any representation of confidence. This pilot study tests whether framing an evaluation task as a betting game (a fictional prediction market with its own LLM currency) improves forecasting accuracy and surfaces calibrated confidence signals. We generated 100 math and logic questions with verifiable answers. Six Baseline models (three current-generation, three prior-generation) answered all items. Three Predictor models then forecasted, for each question-baseline pair, whether the baseline would answer correctly. Each predictor completed matched runs in two conditions: Control (simple correct/incorrect predictions) and Incentive (predictions plus wagers of 1-100,000 LLMCoin under even odds, starting from a 1,000,000 LLMCoin bankroll). Across 5,400 predictions per condition, Incentive runs showed modestly higher accuracy (81.5% vs. 79.1%, p = .089, d = 0.86) and significantly faster learning across rounds (12.0 vs. 2.9 percentage-point improvement from Round 1 to Round 4, p = .011). Most notably, stake size tracked confidence. "Whale" bets of 40,000+ coins were correct ~99% of the time, while small bets (<1,000 coins) showed only ~74% accuracy. The key finding is not that fictional money makes models smarter; accuracy gains were modest and did not reach statistical significance (p = .089) in this pilot. Rather, the betting mechanic created a legible confidence signal absent from binary yes/no outputs. This suggests that simple financial framing may help transform LLMs into risk-aware forecasters, making their internal beliefs visible and usable. The protocol offers a foundation for future work on meta-evaluation systems and what may become LLM-to-LLM prediction markets.
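The stake-as-confidence signal described in this abstract can be illustrated by grouping predictions by wager size and computing accuracy per group. The following is a minimal sketch, not the study's analysis code: the `(stake, was_correct)` record format, the function names, and the `toy` data are all hypothetical and invented for illustration. Only the two thresholds (under 1,000 coins for small bets, 40,000 or more for "whale" bets) come from the abstract.

```python
from collections import defaultdict


def stake_bucket(stake: int) -> str:
    """Map a wager to a bucket; the two thresholds are taken from the abstract."""
    if stake < 1_000:
        return "small"
    if stake >= 40_000:
        return "whale"
    return "medium"


def accuracy_by_bucket(predictions):
    """Compute accuracy per stake bucket.

    predictions: iterable of (stake, was_correct) pairs (hypothetical format).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for stake, was_correct in predictions:
        bucket = stake_bucket(stake)
        totals[bucket] += 1
        hits[bucket] += int(was_correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Made-up records for illustration only; these are NOT the study's data.
toy = [
    (500, True), (800, False), (900, True),   # small bets
    (5_000, True), (20_000, False),           # medium bets
    (50_000, True), (100_000, True),          # whale bets
]
result = accuracy_by_bucket(toy)
for bucket, accuracy in sorted(result.items()):
    print(f"{bucket}: {accuracy:.2f}")
```

If stake size tracks confidence, accuracy should rise from the small bucket to the whale bucket; a flat profile would suggest the wagers carry no usable signal.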

accuracy, large language model, machine learning, (20 more...)

doi: 10.17605/OSF.IO/DC24T

2512.05998

Country: North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry: Banking & Finance > Trading (0.81)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Plantain: Plan-Answer Interleaved Reasoning

Liang, Anthony, Berant, Jonathan, Fisch, Adam, Goyal, Abhimanyu, Krishna, Kalpesh, Eisenstein, Jacob

arXiv.org Artificial Intelligence, Dec-4-2025

Reasoning models often spend a significant amount of time thinking before they generate a visible response. In the meantime, they do not give the user any hints as to whether their reasoning is on the right track, and do not give the user any recourse to stop and correct them if their reasoning is flawed. This creates a frustrating, but unfortunately common, experience: the user's time is wasted while the model reasons from a false premise that could have easily been corrected. In contrast, human speakers typically perform lightweight, incremental grounding acts to ensure that participants in the conversation are on the same page; here we ask whether language models can learn to leverage a similar type of behavior. With this motivation, we propose interleaved reasoning (IR), in which the model alternates between thinking and surfacing intermediate responses, as an alternative to the standard "think-then-answer" approach. By providing useful information to the user earlier, IR reduces perceived latency, the time a user waits for an initial output, without compromising the quality of the final response. We further introduce a specialization of interleaved reasoning, Plantain (Plan-Thought-Answer Interleaving), where the first intermediate response is an explicit, step-by-step plan for executing the task. This plan-first strategy allows for user intervention and early feedback for subsequent reasoning steps. We demonstrate that Plantain yields a ~6% improvement in pass@1 across several challenging math reasoning and coding benchmarks, while reducing time-to-first-response by over 60% relative to think-then-answer baselines.

artificial intelligence, arxivpreprintarxiv, natural language, (17 more...)

2512.03176

Country: Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.67)

Evidence-Guided Schema Normalization for Temporal Tabular Reasoning

Thanga, Ashish, Dixit, Vibhu, Shankarampeta, Abhilash, Gupta, Vivek

arXiv.org Artificial Intelligence, Dec-2-2025

Temporal reasoning over evolving semi-structured tables poses a challenge to current QA systems. We propose a SQL-based approach that involves (1) generating a 3NF schema from Wikipedia infoboxes, (2) generating SQL queries, and (3) query execution. Our central finding challenges model scaling assumptions: the quality of schema design has a greater impact on QA precision than model capacity. We establish three evidence-based principles: normalization that preserves context, semantic naming that reduces ambiguity, and consistent temporal anchoring. Our best configuration (Gemini 2.5 Flash schema + Gemini-2.0-Flash queries) achieves 80.39 EM, a 16.8% improvement over the baseline (68.89 EM).
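The three steps named in this abstract (a normalized schema, a SQL query, query execution) can be sketched with Python's standard `sqlite3` module. This is only an illustration under stated assumptions, not the authors' pipeline: the tables `player`, `club`, and `club_stint`, the helper `club_in_year`, and every row of data are hypothetical, and the schema and query are written by hand here, whereas the paper generates both with language models.

```python
import sqlite3

# Step 1: a small normalized schema (hypothetical; hand-written for illustration).
# Time-varying facts live in their own table with explicit start/end years,
# which is one way to read "consistent temporal anchoring".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE player (
    player_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE club (
    club_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE club_stint (
    player_id  INTEGER NOT NULL REFERENCES player(player_id),
    club_id    INTEGER NOT NULL REFERENCES club(club_id),
    start_year INTEGER NOT NULL,
    end_year   INTEGER,  -- NULL means the stint is ongoing
    PRIMARY KEY (player_id, club_id, start_year)
);
""")

# Made-up rows for illustration only.
conn.execute("INSERT INTO player VALUES (1, 'Example Player')")
conn.executemany("INSERT INTO club VALUES (?, ?)", [(1, "Club A"), (2, "Club B")])
conn.executemany(
    "INSERT INTO club_stint VALUES (?, ?, ?, ?)",
    [(1, 1, 2010, 2014), (1, 2, 2014, None)],
)


def club_in_year(conn, player_name, year):
    """Steps 2 and 3: run a temporal query and return the matching club names."""
    rows = conn.execute(
        """
        SELECT c.name
        FROM club_stint AS s
        JOIN player AS p ON p.player_id = s.player_id
        JOIN club   AS c ON c.club_id   = s.club_id
        WHERE p.name = ?
          AND s.start_year <= ?
          AND (s.end_year IS NULL OR s.end_year > ?)
        ORDER BY s.start_year
        """,
        (player_name, year, year),
    ).fetchall()
    return [row[0] for row in rows]


print(club_in_year(conn, "Example Player", 2012))
```

Treating each stint as a half-open interval (start year inclusive, end year exclusive) is a design choice of this sketch; it keeps a transfer year from matching two clubs at once.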

foreign key, large language model, machine learning, (19 more...)

2512.00329

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
Europe > Gibraltar (0.04)
North America > Barbados (0.04)
(5 more...)

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Sports (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Testing Transformer Learnability on the Arithmetic Sequence of Rooted Trees

Breccia, Alessandro, Gerace, Federica, Lippi, Marco, Sicuro, Gabriele, Contucci, Pierluigi

arXiv.org Artificial Intelligence, Dec-2-2025

Prime factorization, the decomposition of a natural number into its constituent primes, lies at the crossroads of arithmetic, complexity theory, and computational practice. While every integer admits a unique factorization, the operational effort required to obtain it grows quickly with its magnitude. State-of-the-art algorithms achieve remarkable performance for moderately large inputs, yet their complexity escalates rapidly when confronted with truly large instances. Moreover, in this limit, the sequence of integers with known prime factorizations becomes effectively sparse, with regions where the factorizations of intermediate values are computationally inaccessible. It is therefore natural to ask whether modern machine learning methods, and more specifically Large Language Models (LLMs), can offer any advantages from this perspective.
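As a concrete reference point for the decomposition this abstract describes, here is a minimal trial-division sketch. It is a textbook baseline rather than one of the state-of-the-art algorithms the abstract alludes to, and it is not taken from the paper; the function name `prime_factors` is an assumption made for illustration.

```python
def prime_factors(n: int) -> dict[int, int]:
    """Return the prime factorization of n as {prime: exponent}, by trial division."""
    if n < 2:
        raise ValueError("n must be an integer greater than 1")
    factors: dict[int, int] = {}
    divisor = 2
    # Only divisors up to sqrt(n) need checking; any remainder above 1 is prime.
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors[divisor] = factors.get(divisor, 0) + 1
            n //= divisor
        # After 2, test odd candidates only.
        divisor += 1 if divisor == 2 else 2
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


print(prime_factors(360))  # {2: 3, 3: 2, 5: 1}
```

In the worst case the loop runs on the order of the square root of the input, which is exponential in the number of digits. That is why this direct approach becomes impractical for large inputs.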

large language model, machine learning, natural language, (22 more...)

2512.0187

Country: