AITopics | gsm8k

Country:

North America > United States > Virginia (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)
Europe > France (0.04)
(3 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Neural Information Processing Systems, Feb-17-2026, 16:53:17 GMT

Training Chain-of-Thought via Latent-Variable Inference

Du Phan, Matthew D. Hoffman

One can also improve LLMs' performance on a specific task by

large language model, machine learning, natural language, (18 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > Italy > Tuscany > Florence (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Neural Information Processing Systems, Feb-17-2026, 05:41:24 GMT

Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning

Recent successes in natural language processing have led to the proliferation of large language models (LLMs) by multiple providers. Each LLM offering has different inference accuracy, monetary cost, and latency, and their accuracy further depends on the exact wording of the question ( i .

large language model, machine learning, natural language, (21 more...)

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
North America > Mexico (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.92)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing Systems, Feb-13-2026, 09:40:39 GMT

A Careful Examination of Large Language Model Performance on Grade School Arithmetic

Further analysis suggests a positive relationship (Spearman's r

large language model, machine learning, natural language, (19 more...)

Country:

Asia > Singapore (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Indonesia > Bali (0.04)
(7 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing Systems, Feb-11-2026, 14:45:58 GMT

3d5aa9a7ce28cdc710fbd044fd3610f3-Paper-Datasets_and_Benchmarks_Track.pdf

large language model, machine learning, natural language, (21 more...)

Country:

North America > Canada (0.04)
Europe > Monaco (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Indonesia > Bali (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Leisure & Entertainment > Sports > Basketball (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Neural Information Processing Systems, Dec-24-2025, 18:11:19 GMT

Large Language Models are Zero-Shot Reasoners

Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved the state-of-the-art performances in arithmetics and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetics (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g.
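As a rough illustration of the prompt change the abstract describes, the sketch below builds a zero-shot prompt with and without the trigger phrase. The "Q:" / "A:" layout, the function names, and the sample question are assumptions made for this example; only the trigger phrase itself comes from the abstract.

```python
COT_TRIGGER = "Let's think step by step."


def build_zero_shot_prompt(question: str) -> str:
    """Plain zero-shot baseline: the question followed by an empty answer slot."""
    # The "Q:" / "A:" layout is an assumed template, not taken from the abstract.
    return f"Q: {question.strip()}\nA:"


def build_zero_shot_cot_prompt(question: str) -> str:
    """Same layout, with the trigger phrase placed where the answer begins."""
    return f"{build_zero_shot_prompt(question)} {COT_TRIGGER}"


if __name__ == "__main__":
    # Made-up question, used only to show the resulting prompt text.
    question = "A pack has 12 pencils. How many pencils are in 3 packs?"
    print(build_zero_shot_cot_prompt(question))
```

The two builders differ only in the trailing phrase, which makes it easy to compare model outputs for the same question under both prompts.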

language model, name change, zero-shot reasoner, (9 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Simoni, Marco, Fontana, Aleksandar, Rossolini, Giulio, Saracino, Andrea, Mori, Paolo

GTPO: Stabilizing Group Relative Policy Optimization via Gradient and Entropy Control

arXiv.org Artificial Intelligence, Dec-12-2025

Group Relative Policy Optimization (GRPO) is a promising policy-based approach for Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and analyze two main GRPO issues: (i) the token-level penalization, where valuable tokens shared across different responses receive contradictory feedback signals, leading to conflicting gradient updates that can reduce their likelihood; and (ii) the policy collapse, where negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, destabilizing training process. To address these issues we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which prevents conflicting gradients on valuable tokens by skipping negative updates while amplifying positive ones and filters out completions whose entropy exceeds a provable threshold, to prevent policy collapse. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, as validated through multiple experiments on GSM8K, MATH, AIME 2024, AIME 2025 and AMC 2023.
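The two mechanisms named in the abstract (entropy filtering and skipping negative updates on shared tokens) can be pictured with a toy sketch. This is not the paper's algorithm. It assumes that a token counts as "shared" when the same token sits at the same position in a positively rewarded completion, that advantages are standardized over the completions kept after filtering, and that `max_entropy` is a hand-picked cutoff rather than the threshold the paper derives. The amplification of positive updates is left out, and all function and parameter names are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Standardize rewards within the group, GRPO style."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def token_weights(completions, rewards, entropies, max_entropy):
    """Per-token update weights after entropy filtering and conflict masking.

    completions: list of token lists, one per sampled response
    rewards:     one scalar reward per completion
    entropies:   one average policy entropy per completion
    max_entropy: hypothetical cutoff (assumption; not the paper's threshold)
    """
    # 1) Drop completions whose entropy exceeds the cutoff.
    kept = [i for i, h in enumerate(entropies) if h <= max_entropy]
    if not kept:
        return {}
    adv = dict(zip(kept, group_relative_advantages([rewards[i] for i in kept])))

    # 2) Collect (position, token) pairs used by positively rewarded completions.
    rewarded = {
        (pos, tok)
        for i in kept if adv[i] > 0
        for pos, tok in enumerate(completions[i])
    }

    # 3) Zero out negative updates on tokens a rewarded completion also used.
    return {
        i: [
            0.0 if adv[i] < 0 and (pos, tok) in rewarded else adv[i]
            for pos, tok in enumerate(completions[i])
        ]
        for i in kept
    }
```

In this toy setup, a wrong completion that shares its first tokens with a correct one is penalized only on the tokens where it diverges, and a high-entropy completion contributes nothing to the update.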

completion, large language model, machine learning, (18 more...)

2508.03772

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > Canada > British Columbia > Vancouver (0.04)
Europe > Italy (0.04)
(5 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence, Dec-2-2025

Exploring System 1 and 2 communication for latent reasoning in LLMs

Coda-Forno, Julian, Zhao, Zhuokai, Zhang, Qiang, Tamboli, Dipesh, Li, Weiwei, Fan, Xiangjun, Zhang, Lizhu, Schulz, Eric, Tseng, Hsiao-Ping

Should LLM reasoning live in a separate module, or within a single model's forward pass and representational space? We study dual-architecture latent reasoning, where a fluent Base exchanges latent messages with a Coprocessor, and test two hypotheses aimed at improving latent communication over Liu et al. (2024): (H1) increase channel capacity; (H2) learn communication via joint finetuning. Under matched latent-token budgets on GPT-2 and Qwen-3, H2 is consistently strongest while H1 yields modest gains. A unified soft-embedding baseline, a single model with the same forward pass and shared representations, using the same latent-token budget, nearly matches H2 and surpasses H1, suggesting current dual designs mostly add compute rather than qualitatively improving reasoning. Across GSM8K, ProsQA, and a Countdown stress test with increasing branching factor, scaling the latent-token budget beyond small values fails to improve robustness. Latent analyses show overlapping subspaces with limited specialization, consistent with weak reasoning gains. We conclude dual-model latent reasoning remains promising in principle, but likely requires objectives and training schedules that explicitly shape latent spaces for algorithmic planning.

large language model, machine learning, natural language, (19 more...)

2510.00494

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)

Genre: Research Report (0.64)

Industry:

Health & Medicine (0.46)
Energy (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Shen, Jucheng, Ro, Yeonju

Beyond Static Cutoffs: One-Shot Dynamic Thresholding for Diffusion Language Models

arXiv.org Artificial Intelligence, Dec-1-2025

Masked diffusion language models (MDLMs) are becoming competitive with their autoregressive counterparts but typically decode with fixed steps and sequential unmasking. To accelerate decoding, recent work such as Fast-dLLM enables parallel decoding via a static global confidence threshold, yet we observe strong block- and step-wise confidence fluctuations and, within a dataset, near-identical confidence trajectories across inputs as measured by cosine similarity. Motivated by these observations, we introduce One-Shot Dynamic Thresholding (OSDT), which calibrates thresholds on a single sequence and applies them to subsequent inputs with negligible overhead. On GPQA, GSM8K, and HumanEval, OSDT attains superior accuracy-throughput trade-offs (+24% tokens/s on GSM8K at the best accuracy, +45% on GPQA with comparable accuracy, and +50% on HumanEval with a modest accuracy gap). Beyond these results, our findings suggest broader opportunities to leverage reusable task-level confidence signatures for more general-purpose algorithmic and systems innovations in diffusion decoding.
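The calibrate-once, reuse-later idea in the abstract can be sketched as follows. The abstract does not say how thresholds are computed from the calibration sequence, so the rule used here (a scaled mean of the observed confidences at each step) is purely an assumption, as are the function names and the `scale` parameter.

```python
from statistics import mean


def calibrate_thresholds(step_confidences, scale=0.9):
    """Derive one threshold per decoding step from a single calibration sequence.

    step_confidences: list of lists; entry t holds the confidences of the
                      still-masked tokens at step t of the calibration run
    scale:            hypothetical slack factor (assumption, not from the paper)
    """
    return [scale * mean(confs) for confs in step_confidences]


def select_tokens_to_unmask(confidences, threshold):
    """Return indices of masked positions whose confidence clears the threshold."""
    chosen = [i for i, c in enumerate(confidences) if c >= threshold]
    # Always unmask the most confident position so that decoding advances.
    if not chosen:
        chosen = [max(range(len(confidences)), key=confidences.__getitem__)]
    return chosen
```

Because the thresholds are computed once and then only looked up, applying them to later inputs costs a single comparison per masked token at each step.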

artificial intelligence, natural language, threshold, (16 more...)

2511.02077

Country: North America > United States > Texas > Travis County > Austin (0.04)

Genre: Research Report > New Finding (0.54)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Acker, Alexander, Becker, Soeren, Nedelkoski, Sasho, Scheinert, Dominik, Kao, Odej, Wiesner, Philipp

What happens when nanochat meets DiLoCo?

arXiv.org Artificial Intelligence, Nov-19-2025

Although LLM training is typically centralized with high-bandwidth interconnects and large compute budgets, emerging methods target communication-constrained training in distributed environments. The model trade-offs introduced by this shift remain underexplored, and our goal is to study them. We use the open-source nanochat project, a compact 8K-line full-stack ChatGPT-like implementation containing tokenization, pretraining, fine-tuning, and serving, as a controlled baseline. We implement the DiLoCo algorithm as a lightweight wrapper over nanochat's training loop, performing multiple local steps per worker before synchronization with an outer optimizer, effectively reducing communication by orders of magnitude. This inner-outer training is compared against a standard data-parallel (DDP) setup. Because nanochat is small and inspectable, it enables controlled pipeline adaptations and allows direct comparison with the conventional centralized baseline. DiLoCo achieves stable convergence and competitive loss in pretraining but yields worse MMLU, GSM8K, and HumanEval scores after mid-training and SFT. We discover that using DiLoCo-pretrained weights and running mid- and post-training with DDP fails to recover performance, revealing irreversible representation drift from asynchronous updates that impairs downstream alignment. We provide this implementation as an official fork of nanochat on GitHub.
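The inner-outer loop described in the abstract (several local steps per worker, then one synchronization through an outer optimizer) can be illustrated on a one-parameter toy problem. This is a minimal sketch, not the nanochat fork: the quadratic loss, plain SGD for the local steps, heavy-ball momentum for the outer step, and every name and hyperparameter below are assumptions chosen for brevity.

```python
def local_steps(theta, target, lr, num_steps):
    """Run plain SGD on the toy loss 0.5 * (theta - target) ** 2."""
    for _ in range(num_steps):
        grad = theta - target
        theta -= lr * grad
    return theta


def diloco_round(theta, worker_targets, inner_lr, inner_steps,
                 outer_lr, momentum, velocity):
    """One synchronization round: local training, then a single outer update."""
    # Each worker starts from the shared parameter and trains on its own data.
    local = [local_steps(theta, t, inner_lr, inner_steps) for t in worker_targets]
    # The averaged parameter change acts as a pseudo-gradient for the outer step.
    pseudo_grad = sum(theta - w for w in local) / len(local)
    velocity = momentum * velocity + pseudo_grad
    return theta - outer_lr * velocity, velocity


def train(worker_targets, rounds=50, inner_steps=10):
    """Workers communicate once per round instead of once per gradient step."""
    theta, velocity = 0.0, 0.0
    for _ in range(rounds):
        theta, velocity = diloco_round(
            theta, worker_targets, inner_lr=0.1, inner_steps=inner_steps,
            outer_lr=0.7, momentum=0.5, velocity=velocity,
        )
    return theta
```

With `inner_steps=10`, the workers exchange parameters ten times less often than a setup that synchronizes after every step, which is the communication saving this kind of training targets.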

large language model, machine learning, natural language, (21 more...)

2511.13761

Country: Europe > Germany > Berlin (0.04)

Genre: Research Report (0.85)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)