

SFT model




Checklist

Neural Information Processing Systems

Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Did you describe the limitations of your work? Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? If your work uses existing assets, did you cite the creators? Did you mention the license of the assets? Did you include any new assets either in the supplemental material or as a URL? [Yes] Did you discuss whether and how consent was obtained from people whose data you're using/curating? We thereby state that we bear all responsibility in case of violation of rights, etc., and confirmation of ... For what purpose was the dataset created? For the novel task of data analysis as explained ... Who created the dataset and on behalf of which entity? This dataset is created during a ... Who funded the creation of the dataset? What do the instances that comprise the dataset represent?


Boosting Accuracy and Efficiency of Budget Forcing in LLMs via Reinforcement Learning for Mathematical Reasoning

Tarunokusumo, Ravindra Aribowo, Cunha, Rafael Fernandes

arXiv.org Artificial Intelligence

Test-time scaling methods have seen a rapid increase in popularity for their computational efficiency and parameter-independent training to improve reasoning performance in Large Language Models. One such method is budget forcing, a decoding intervention strategy which allocates extra compute budget for thinking and elicits the model's inherent self-correcting behavior. However, this relies on supervised fine-tuning (SFT) on long-context reasoning traces, which causes performance degradation on smaller models due to verbose responses. For this reason, we offer a framework integrating reinforcement learning (RL) to improve token efficiency and boost the performance of a 1.5B model for mathematical reasoning. We demonstrate this using only 1.5K training samples and find that our SFT+RL model performs better on the GSM8K dataset across varying compute budgets. Our main findings show overall higher accuracy while token usage is reduced by over 40% compared to the SFT model, revealing how RL can recover the losses caused by long-context training while improving overall performance in mathematical reasoning.
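
The abstract describes budget forcing only at a high level; as a rough, hedged sketch of the idea (not the authors' implementation), the helper below suppresses the end-of-thinking marker and appends a continuation cue such as "Wait" until the thinking budget is spent. The `generate_until` callback, the `</think>` marker, and the cue string are all assumptions made for illustration.

```python
def budget_forced_decode(generate_until, prompt, think_budget,
                         end_think="</think>", cue="Wait"):
    """Illustrative budget-forcing loop (hypothetical API, see lead-in)."""
    text, used = prompt, 0
    while used < think_budget:
        # generate_until is assumed to return (new_text, n_tokens, hit_stop):
        # it decodes until `stop` is produced or `max_new_tokens` is reached.
        chunk, n_tokens, hit_stop = generate_until(
            text, stop=end_think, max_new_tokens=think_budget - used)
        text, used = text + chunk, used + n_tokens
        if not hit_stop:
            break                      # thinking budget exhausted mid-thought
        if used < think_budget:
            text += f" {cue},"         # suppress end-of-thinking, keep reasoning
    # Close the thinking block; the final answer would be decoded after this.
    return text + f"\n{end_think}\n"
```

The sketch only captures the decoding-time intervention; the SFT and RL training stages the abstract discusses are separate from this loop.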





Behavior Injection: Preparing Language Models for Reinforcement Learning

Cen, Zhepeng, Yao, Yihang, Han, William, Liu, Zuxin, Zhao, Ding

arXiv.org Artificial Intelligence

Reinforcement learning (RL) has emerged as a powerful post-training technique to incentivize the reasoning ability of large language models (LLMs). However, LLMs can respond very inconsistently to RL finetuning: some show substantial performance gains, while others plateau or even degrade. To understand this divergence, we analyze the per-step influence of the RL objective and identify two key conditions for effective post-training: (1) RL-informative rollout accuracy, and (2) strong data co-influence, which quantifies how much the training data affects performance on other samples. Guided by these insights, we propose behavior injection, a task-agnostic data augmentation scheme applied prior to RL. Behavior injection enriches the supervised finetuning (SFT) data by seeding exploratory and exploitative behaviors, effectively making the model more RL-ready. We evaluate our method across two reasoning benchmarks with multiple base models. The results demonstrate that our theoretically motivated augmentation can significantly increase the performance gain from RL over the pre-RL model.
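
To make "seeding exploratory and exploitative behaviors" concrete, here is a minimal sketch of what such an augmentation pass over SFT demonstrations could look like; the behavior snippets, the splice probability `p`, and the line-level splice heuristic are hypothetical illustrations, not the paper's actual scheme.

```python
import random

# Hypothetical behavior snippets: exploration (branching to an alternative
# approach) and exploitation (verifying or reusing an intermediate result).
EXPLORE = ["Wait, let me try a different approach.",
           "Alternatively, consider solving it another way:"]
EXPLOIT = ["Let me verify this intermediate result before continuing.",
           "Reusing the earlier result,"]

def inject_behaviors(example, p=0.5, rng=random):
    """Return an SFT example whose response is enriched with one behavior cue."""
    steps = example["response"].split("\n")
    if len(steps) < 2 or rng.random() > p:
        return example                        # leave some data untouched
    cue = rng.choice(EXPLORE if rng.random() < 0.5 else EXPLOIT)
    pos = rng.randrange(1, len(steps))        # splice the cue between two steps
    steps.insert(pos, cue)
    return {"prompt": example["prompt"], "response": "\n".join(steps)}

# Example: augment a (tiny) SFT dataset before fine-tuning and subsequent RL.
sft_data = [{"prompt": "Compute 12*7.", "response": "12*7 = 84.\nAnswer: 84"}]
augmented = [inject_behaviors(ex) for ex in sft_data]
```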


Beyond Imitation: Recovering Dense Rewards from Demonstrations

Li, Jiangnan, Vu, Thuy-Trang, Abbasnejad, Ehsan, Haffari, Gholamreza

arXiv.org Artificial Intelligence

Conventionally, supervised fine-tuning (SFT) is treated as a simple imitation learning process that only trains a policy to imitate expert behavior on demonstration datasets. In this work, we challenge this view by establishing a fundamental equivalence between SFT and Inverse Reinforcement Learning. We prove that the SFT objective is a special case of Inverse Q-Learning, which implies that the SFT process does not just learn a policy, but also an implicit, dense, token-level reward model that explains the expert demonstrations. We then show how to recover this dense reward signal directly from the SFT model by formulating a baseline-relative reward function. The availability of such a dense reward model offers numerous benefits, providing granular credit assignment for each token generated. We demonstrate one key application by using these recovered rewards to further improve the policy with reinforcement learning. Our method, Dense-Path REINFORCE, consistently outperforms the original SFT models on instruction-following benchmarks. This work reframes SFT not merely as policy imitation but as a powerful reward learning mechanism, opening new possibilities for leveraging expert demonstrations.
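
A minimal sketch of one plausible form of a baseline-relative, token-level reward recovered from an SFT model is given below: the per-token reward is taken as the SFT policy's log-probability of the demonstrated token minus a reference policy's log-probability. The exact reward function in the paper may differ; this is an assumption-labeled illustration in plain PyTorch.

```python
import torch
import torch.nn.functional as F

def token_level_rewards(sft_logits, ref_logits, token_ids):
    """Illustrative baseline-relative dense reward (assumed form).

    sft_logits, ref_logits: [T, vocab] logits from the SFT and reference
    models on the same sequence; token_ids: [T] demonstrated tokens.
    Reward per token = log pi_sft(a_t|s_t) - log pi_ref(a_t|s_t).
    """
    sft_logp = F.log_softmax(sft_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    idx = token_ids.unsqueeze(-1)
    return (sft_logp.gather(-1, idx) - ref_logp.gather(-1, idx)).squeeze(-1)

# Tiny example with random logits: `rewards` holds one scalar per token,
# which a REINFORCE-style update could then use for credit assignment.
T, V = 5, 100
rewards = token_level_rewards(torch.randn(T, V), torch.randn(T, V),
                              torch.randint(0, V, (T,)))
```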


Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead

Kang, Feiyang, Kuchnik, Michael, Padthe, Karthik, Vlastelica, Marin, Jia, Ruoxi, Wu, Carole-Jean, Ardalani, Newsha

arXiv.org Artificial Intelligence

In post-training for reasoning Large Language Models (LLMs), the current state of practice trains LLMs in two independent stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR, shortened as ``RL'' below). In this work, we challenge whether high SFT scores translate to improved performance after RL. We provide extensive counter-examples where this is not true. We find that high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness. In some cases, RL training on models with improved SFT performance can lead to substantially worse outcomes compared to RL on the base model without SFT. We study alternative metrics and identify generalization loss on held-out reasoning examples and Pass@large k performance as strong proxies for the RL outcome. We trained hundreds of models of up to 12B parameters with SFT and RLVR via GRPO and ran extensive evaluations on 7 math benchmarks with up to 256 repetitions, spending $>$1M GPU hours. Experiments include models from Llama3, Mistral-Nemo, Qwen3 and multiple state-of-the-art SFT/RL datasets. Compared to directly predicting from pre-RL performance, prediction based on generalization loss and Pass@large k achieves substantially higher precision, improving the $R^2$ coefficient and Spearman's rank correlation coefficient by up to 0.5 (2x). This provides strong utility for broad use cases. For example, in most experiments, we find that SFT training on unique examples for one epoch underperforms training on half the examples for two epochs, either after SFT or SFT-then-RL; with the same SFT budget, training only on short examples may lead to better SFT performance, though it often leads to worse outcomes after RL compared to training on examples with varying lengths. The evaluation tool will be open-sourced.
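
Since the abstract leans on Pass@large-k as a proxy metric, the standard unbiased pass@k estimator from n samples with c correct (popularized by the Codex evaluation) is sketched below; whether this exact estimator matches the paper's evaluation protocol is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k from n sampled solutions, c of them correct.

    pass@k = 1 - C(n - c, k) / C(n, k): the probability that at least one of
    k samples drawn without replacement from the n generations is correct.
    """
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws: always a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 256 generations per problem, 40 of them correct, evaluated at k = 64.
print(round(pass_at_k(256, 40, 64), 4))
```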