Generative AI
The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs
Chen, Jierun, Yu, Tiezheng, Bai, Haoli, Yao, Lewei, Wu, Jiannan, Li, Kaican, Mi, Fei, Tao, Chaofan, Zhu, Lei, Zhang, Manyi, Li, Xiaohui, Hou, Lu, Shang, Lifeng, Liu, Qun
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions by in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, all fails to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This "synergy dilemma" highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.
Figure 1: Accuracy gains from various post-training techniques across five difficulty levels (L1, easy to L5, hard) on five multimodal reasoning benchmarks. Long-CoT SFT boosts Qwen2.5-VL-7B on harder questions but hurts easier ones, while RL yields steady gains across the board. Hybrid strategies consistently trade off strengths rather than achieving true synergy.
Large language models (LLMs) like OpenAI's o1/o3 (Jaech et al., 2024) and DeepSeek-R1 (Guo et al., 2025) have demonstrated remarkable reasoning abilities by thinking before answering.
Position: We Need An Algorithmic Understanding of Generative AI
Eberle, Oliver, McGee, Thomas, Giaffar, Hamza, Webb, Taylor, Momennejad, Ida
What algorithms do LLMs actually learn and use to solve problems? Studies addressing this question are sparse, as research priorities are focused on improving performance through scale, leaving a theoretical and empirical gap in understanding emergent algorithms. This position paper proposes AlgEval: a framework for systematic research into the algorithms that LLMs learn and use. AlgEval aims to uncover algorithmic primitives, reflected in latent representations, attention, and inference-time compute, and their algorithmic composition to solve task-specific problems. We highlight potential methodological paths and a case study toward this goal, focusing on emergent search algorithms. Our case study illustrates both the formation of top-down hypotheses about candidate algorithms, and bottom-up tests of these hypotheses via circuit-level analysis of attention patterns and hidden states. The rigorous, systematic evaluation of how LLMs actually solve tasks provides an alternative to resource-intensive scaling, reorienting the field toward a principled understanding of underlying computations. Such algorithmic explanations offer a pathway to human-understandable interpretability, enabling comprehension of the model's internal reasoning performance measures. This can in turn lead to more sample-efficient methods for training and improving performance, as well as novel architectures for end-to-end and multi-agent systems.
MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning
Tran, Hieu, Yao, Zonghai, Jang, Won Seok, Sultana, Sharmin, Chang, Allen, Zhang, Yuan, Yu, Hong
Generative AI has demonstrated strong potential in healthcare, from clinical decision support to patient-facing chatbots that improve outcomes. A critical challenge for deployment is effective human-AI communication, where content must be both personalized and understandable. We introduce MedReadCtrl, a readability-controlled instruction tuning framework that enables LLMs to adjust output complexity without compromising meaning. Evaluations on nine datasets and three tasks across medical and general domains show that MedReadCtrl achieves significantly lower readability instruction-following errors than GPT-4 (e.g., 1.39 vs. 1.59 on ReadMe, p<0.001) and delivers substantial gains on unseen clinical tasks (e.g., +14.7 ROUGE-L, +6.18 SARI on MTSamples). Experts consistently preferred MedReadCtrl (71.7% vs. 23.3%), especially at low literacy levels. These gains reflect MedReadCtrl's ability to restructure clinical content into accessible, readability-aligned language while preserving medical intent, offering a scalable solution to support patient education and expand equitable access to AI-enabled care.
Dr. ChatGPT Will See You Now
A poster on Reddit lived with a painful clicking jaw, the result of a boxing injury, for five years. They saw specialists, got MRIs, but no one could give them a solution to fix it, until they described the problem to ChatGPT. The AI chatbot suggested a specific jaw-alignment issue might be the problem and offered a technique involving tongue placement as a treatment. The individual tried it, and the clicking stopped. "After five years of just living with it," they wrote on Reddit in April, "this AI gave me a fix in a minute."
Elon Musk Unveils Grok 4 Amid Controversy Over Chatbot's Antisemitic Posts
Elon Musk on Thursday unveiled Grok 4, the latest AI model from xAI, his multibillion-dollar initiative to rival OpenAI and Google. Without citing detailed evidence, Musk claimed that the model aces standardized tests and exhibits doctorate-level knowledge in a wide array of different disciplines. "Grok 4 is a postgrad-level in everything," Musk said during an hour-long live broadcast, which began after midnight in New York. "At least with respect to academic questions, Grok 4 is better than PhD level in every subject." Competing AI developers, such as OpenAI and Google, have routinely released similar publications for their models.
Test-Time Scaling with Reflective Generative Model
Wang, Zixiao, Wang, Yuxin, Wang, Xiaorui, Xing, Mengting, Gao, Jie, Xu, Jianjun, Liu, Guangcan, Jin, Chenhui, Wang, Zhuo, Zhang, Shengzhuo, Xie, Hongtao
We introduce our first reflective generative model, MetaStone-S1, which obtains OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for the policy and process reward model: we share the backbone network and use task-specific heads for predicting and scoring reasoning trajectories, respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can directly learn high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suitable for test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
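As a rough illustration of the trajectory selection described above, the sketch below scores several candidate reasoning trajectories step by step and keeps the highest-scoring one. It is not the MetaStone-S1 implementation: `score_step` is a hypothetical stand-in for a process reward model, step scores are assumed to lie in [0, 1], and aggregating them with a geometric mean is an assumption made only for this example.

```python
from math import prod
from typing import Callable, List, Sequence

Trajectory = List[str]  # one string per reasoning step


def trajectory_score(steps: Trajectory, score_step: Callable[[str], float]) -> float:
    """Aggregate per-step scores with a geometric mean (an assumed rule)."""
    if not steps:
        return 0.0
    scores = [score_step(step) for step in steps]  # assumed to lie in [0, 1]
    return prod(scores) ** (1.0 / len(scores))


def select_best(
    candidates: Sequence[Trajectory],
    score_step: Callable[[str], float],
) -> Trajectory:
    """Return the candidate trajectory with the highest aggregated score."""
    if not candidates:
        raise ValueError("at least one candidate trajectory is required")
    return max(candidates, key=lambda t: trajectory_score(t, score_step))


def toy_scorer(step: str) -> float:
    """Hypothetical scorer: prefers steps that mention a check."""
    return 0.9 if "check" in step else 0.5


best = select_best([["guess"], ["compute", "check"]], toy_scorer)
```

In a real system the scorer would be a learned model head rather than a keyword rule; the selection logic around it would stay the same.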
Federated Breast Cancer Detection Enhanced by Synthetic Ultrasound Image Augmentation
Pan, Hongyi, Hong, Ziliang, Durak, Gorkem, Xu, Ziyue, Bagci, Ulas
Federated learning (FL) has emerged as a promising paradigm for collaboratively training deep learning models across institutions without exchanging sensitive medical data. However, its effectiveness is often hindered by limited data availability and non-independent and identically distributed (non-IID) data across participating clients, which can degrade model performance and generalization. To address these challenges, we propose a generative-AI-based data augmentation framework that integrates synthetic image sharing into the federated training process for breast cancer diagnosis via ultrasound images. Specifically, we train two simple class-specific Deep Convolutional Generative Adversarial Networks: one for benign and one for malignant lesions. We then simulate a realistic FL setting using three publicly available breast ultrasound image datasets: BUSI, BUS-BRA, and UDIAT. FedAvg and FedProx are adopted as baseline FL algorithms. Experimental results show that incorporating a suitable number of synthetic images improved the average AUC from 0.9206 to 0.9237 for FedAvg and from 0.9429 to 0.9538 for FedProx. We also note that excessive use of synthetic data reduced performance, underscoring the importance of maintaining a balanced ratio of real and synthetic samples. Our findings highlight the potential of generative-AI-based data augmentation to enhance FL results in the breast ultrasound image classification task.
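For readers unfamiliar with FedAvg, the baseline named above, the following minimal sketch shows its aggregation step: the server averages client parameters, weighting each client by its number of training samples. This is a generic illustration, not the authors' code; the flat dictionary-of-lists parameter layout and the function name `fedavg` are assumptions chosen to keep the example dependency-free.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flat list of values


def fedavg(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Average client parameters, weighting each client by its sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    averaged: Weights = {}
    for name, reference in client_weights[0].items():
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(len(reference))
        ]
    return averaged


# Two hypothetical clients holding 1 and 3 samples respectively.
clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
global_weights = fedavg(clients, [1, 3])
```

In this toy case the larger client dominates the average, which is one reason heterogeneous client data can pull the global model toward a single institution's distribution.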
Towards LLM-based Root Cause Analysis of Hardware Design Failures
Qiu, Siyu, Wang, Muzhi, Afsharmazayejani, Raheel, Shahmiri, Mohammad Moradi, Tan, Benjamin, Pearce, Hammond
With advances in large language models (LLMs), new opportunities have emerged to develop tools that support the digital hardware design process. In this work, we explore how LLMs can assist with explaining the root cause of design issues and bugs that are revealed during synthesis and simulation, a necessary milestone on the pathway towards widespread use of LLMs in the hardware design process and for hardware security analysis. We find promising results: for our corpus of 34 different buggy scenarios, OpenAI's o3-mini reasoning model reached a correct determination 100% of the time under pass@5 scoring, with other state-of-the-art models and configurations usually achieving more than 80% performance and more than 90% when assisted with retrieval-augmented generation.
Encountering bugs, glitches, and faults is a normal part of the digital hardware design lifecycle. Ensuring that they are completely removed and repaired is a time-consuming process requiring a deep understanding of both the technical cause of the issue and any impacts on the broader hardware system, particularly as any missed repair may have severe downstream functional and/or security consequences [1] (if the bug is of an exploitable nature). However, as digital hardware grows in complexity, so do the frequency and nature of the bugs themselves.
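The pass@5 scoring mentioned above can be illustrated with the widely used unbiased pass@k estimator, which gives the probability that at least one of k samples drawn from n attempts is correct when c of the n attempts are correct. The text does not say how the authors computed their scores, so treat this as a sketch of the general metric rather than their procedure; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n attempts of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k-subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With exactly 5 attempts per scenario, pass@5 is 1.0 if any attempt is correct.
one_hit = pass_at_k(n=5, c=1, k=5)
no_hit = pass_at_k(n=5, c=0, k=5)
```

A per-scenario score like this would then be averaged over all scenarios to obtain a corpus-level percentage.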
Assessing the Prevalence of AI-assisted Cheating in Programming Courses: A Pilot Study
Tools that can generate computer code in response to inputs written in natural language, such as ChatGPT, pose an existential threat to Computer Science education in its current form, since students can now use these tools to solve assignments without much effort. While that risk has already been recognized by scholars, the proportion of the student body that is engaging in this new kind of plagiarism is still an open problem. We conducted a pilot study in a large CS class (n=120) to assess the feasibility of estimating AI plagiarism through anonymous surveys and interviews. More than 25% of the survey respondents admitted to committing AI plagiarism. In contrast, only one student agreed to be interviewed. Given the high levels of misconduct acknowledgment, we conclude that surveys are an effective method for studies on the matter, while interviews should be avoided or designed in a way that can entice participation.
1 INTRODUCTION
Generative artificial intelligence (GenAI, not to be confused with general [...] The generation is usually guided by an input text known as the "prompt". For example, giving the prompt "a vase of red flowers" to a GenAI model would generate an image depicting red flowers in a vase. Practical applications of GenAI are now mainstream thanks to advances in neural networks. In particular, the clever use of attention mechanisms and the subsequent development of the transformer architecture made efficient learning possible over large text corpora (Vaswani et al., 2023). [...] AI application based on an LLM, can convincingly engage in a conversation and answer questions across multiple subjects (OpenAI, 2022). Research on applications of LLMs in education is still in its infancy, but looks promising. Personal tutoring systems (Chang, 2022), content explanation (Leinonen et al., 2023) and assignment generation (Jury et al., 2024) are a few of the ideas that have been explored. From another perspective, LLMs are already a reality in schools.
The AI Industry is Funding A Massive AI Training Initiative for Teachers
AI tools have become deeply embedded in how many students learn and complete schoolwork, and that usage is only poised to increase. On Tuesday, the American Federation of Teachers announced an AI training hub for educators, backed by $23 million from Microsoft, OpenAI, and Anthropic. The AFT is the second-largest teachers' union, representing 1.8 million teachers and educational staffers across the country. Their training hub will open in New York City this fall, featuring workshops that will educate teachers on how to use AI tools for tasks like generating lesson plans and quizzes, or writing emails to parents. Microsoft is providing $12.5 million for AI teacher training over the next five years.