Education
Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum
Yang, Xinglong, Feng, Quan, Pan, Zhongying, Chen, Xiang, Tian, Yu, Li, Wentong, Qiao, Shuofei, Geng, Yuxia, Zhao, Xingyu, Huang, Sheng-Jun
The effectiveness of Multimodal Chain-of-Thought (MCoT) prompting is often limited by the use of randomly or manually selected examples. These examples fail to account for both model-specific knowledge distributions and the intrinsic complexity of the tasks, resulting in suboptimal and unstable model performance. To address this, we propose a novel framework inspired by the pedagogical principle of "tailored teaching with balanced difficulty". We reframe prompt selection as a prompt curriculum design problem: constructing a well ordered set of training examples that align with the model's current capabilities. Our approach integrates two complementary signals: (1) model-perceived difficulty, quantified through prediction disagreement in an active learning setup, capturing what the model itself finds challenging; and (2) intrinsic sample complexity, which measures the inherent difficulty of each question-image pair independently of any model. By jointly analyzing these signals, we develop a difficulty-balanced sampling strategy that ensures the selected prompt examples are diverse across both dimensions. Extensive experiments conducted on five challenging benchmarks and multiple popular Multimodal Large Language Models (MLLMs) demonstrate that our method yields substantial and consistent improvements and greatly reduces performance discrepancies caused by random sampling, providing a principled and robust approach for enhancing multimodal reasoning.
Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny
Yan, Chuanhao, Che, Fengdi, Huang, Xuhan, Xu, Xu, Li, Xin, Li, Yizhi, Qu, Xingwei, Shi, Jingzhe, Lin, Chenghua, Yang, Yaodong, Yuan, Binhang, Zhao, Hang, Qiao, Yu, Zhou, Bowen, Fu, Jie
Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models could hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is a common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, it becomes unacceptably all-consuming to provide such priors for supervising complex programming tasks. In this work, we systematically explore ways to reduce human priors with the formal language, Dafny, as the main environment for our pilot study. Our pipeline mainly relies on introducing an automatic and scalable data curation pipeline, and careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark.
Efficient Compositional Multi-tasking for On-device Large Language Models
Bohdal, Ondrej, Ozay, Mete, Moon, Jijoong, Lee, Kyeng-Hun, Ko, Hyeonmok, Michieli, Umberto
Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.
Manimator: Transforming Research Papers into Visual Explanations
P, Samarth, Jain, Vyoman, Golugula, Shiva, Sathvik, Motamarri Sai
Understanding complex scientific and mathematical concepts, particularly those presented in dense research papers, poses a significant challenge for learners. Dynamic visualizations can greatly enhance comprehension, but creating them manually is time-consuming and requires specialized knowledge and skills. We introduce manimator, an open-source system that leverages Large Language Models to transform research papers and natural language prompts into explanatory animations using the Manim engine. Manimator employs a pipeline where an LLM interprets the input text or research paper PDF to generate a structured scene description outlining key concepts, mathematical formulas, and visual elements and another LLM translates this description into executable Manim Python code. We discuss its potential as an educational tool for rapidly creating engaging visual explanations for complex STEM topics, democratizing the creation of high-quality educational content.
RedOne: Revealing Domain-specific LLM Post-Training in Social Networking Services
Zhao, Fei, Lu, Chonggang, Wang, Yue, Xie, Zheyong, Liu, Ziyan, Qian, Haofu, Huang, JianZhao, Shi, Fangcheng, Meng, Zijie, Guo, Hongcheng, He, Mingqian, Lyu, Xinze, Lu, Yiming, Xiang, Ziyang, Ye, Zheyu, Lu, Chengqiang, Xu, Zhe, Wu, Yi, Hu, Yao, Gao, Yan, Fan, Jun, Jiang, Xiaolong, Liu, Weiting, Wang, Boyang, Cao, Shaosheng
As a primary medium for modern information dissemination, social networking services (SNS) have experienced rapid growth, which has proposed significant challenges for platform content management and interaction quality improvement. Recently, the development of large language models (LLMs) has offered potential solutions but existing studies focus on isolated tasks, which not only encounter diminishing benefit from the data scaling within individual scenarios but also fail to flexibly adapt to diverse real-world context. To address these challenges, we introduce RedOne, a domain-specific LLM designed to break the performance bottleneck of single-task baselines and establish a comprehensive foundation for the SNS. RedOne was developed through a three-stage training strategy consisting of continue pretraining, supervised fine-tuning, and preference optimization, using a large-scale real-world dataset. Through extensive experiments, RedOne maintains strong general capabilities, and achieves an average improvement up to 14.02% across 8 major SNS tasks and 7.56% in SNS bilingual evaluation benchmark, compared with base models. Furthermore, through online testing, RedOne reduced the exposure rate in harmful content detection by 11.23% and improved the click page rate in post-view search by 14.95% compared with single-tasks finetuned baseline models. These results establish RedOne as a robust domain-specific LLM for SNS, demonstrating excellent generalization across various tasks and promising applicability in real-world scenarios.
Understanding and Improving Length Generalization in Recurrent Models
Ruiz, Ricardo Buitrago, Gu, Albert
In addition to matching the performance of Transformers (Vaswani et al. 2017) across many tasks, the recurrent mechanism brings two benefits: the ability to efficiently process long sequences thanks to its linear complexity, and the capacity to easily process tokens beyond their training context by simply rolling out the state. Nevertheless, in practice these benefits are often unrealized, given that their performance can drop considerably when the sequence length exceeds their training context (Ben-Kish et al. 2024; Waleffe et al. 2024; Ye et al. 2025; Yuan et al. 2024). This naturally leads to two questions: (1) why do these models fail to length generalize? and (2) how can we efficiently enable length generalization across several recurrent models? Recently, some works have studied the length generalization of Mamba (Dao and Gu 2024) and have proposed solutions such as forcing the model to forget previous context (Yingfa Chen et al. 2024) or skipping tokens in the state update to reduce the effective context of the processed sequence (Ben-Kish et al. 2024; Ye et al. 2025). However, these methods require changing the internal mechanism of Mamba and might not be easily transferable to other architectures. Other works have linked length generalization to state capacity and overfitting (Yingfa Chen et al. 2024; S. Wang 2024), proposing training on longer sequences and with Truncated Backpropagation Through Time (TBTT) (Sutskever 2013; Williams and J. Peng 1990) as a way to enable length generalization. In this work, we reason about the distribution of states that the model is trained on to introduce a precise hypothesis that explains why recurrent models fail to length generalize. Moreover, we perform comprehensive interventions that elucidate on what distributions recurrent models need to be trained to enable length generalization. 1
GradMetaNet: An Equivariant Architecture for Learning on Gradients
Gelberg, Yoav, Eitan, Yam, Navon, Aviv, Shamsian, Aviv, Theo, null, Putterman, null, Bronstein, Michael, Maron, Haggai
Gradients of neural networks encode valuable information for optimization, editing, and analysis of models. Therefore, practitioners often treat gradients as inputs to task-specific algorithms, e.g. for pruning or optimization. Recent works explore learning algorithms that operate directly on gradients but use architectures that are not specifically designed for gradient processing, limiting their applicability. In this paper, we present a principled approach for designing architectures that process gradients. Our approach is guided by three principles: (1) equivariant design that preserves neuron permutation symmetries, (2) processing sets of gradients across multiple data points to capture curvature information, and (3) efficient gradient representation through rank-1 decomposition. Based on these principles, we introduce GradMetaNet, a novel architecture for learning on gradients, constructed from simple equivariant blocks. We prove universality results for GradMetaNet, and show that previous approaches cannot approximate natural gradient-based functions that GradMetaNet can. We then demonstrate GradMetaNet's effectiveness on a diverse set of gradient-based tasks on MLPs and transformers, such as learned optimization, INR editing, and estimating loss landscape curvature.
A Community-driven vision for a new Knowledge Resource for AI
Chaudhri, Vinay K, Baru, Chaitan, Bennett, Brandon, Bhatt, Mehul, Cassel, Darion, Cohn, Anthony G, Dechter, Rina, Erdem, Esra, Ferrucci, Dave, Forbus, Ken, Gelfond, Gregory, Genesereth, Michael, Gordon, Andrew S., Grosof, Benjamin, Gupta, Gopal, Hendler, Jim, Israni, Sharat, Josephson, Tyler R., Kyllonen, Patrick, Lierler, Yuliya, Lifschitz, Vladimir, McFate, Clifton, McGinty, Hande K., Morgenstern, Leora, Oltramari, Alessandro, Paritosh, Praveen, Roth, Dan, Shepard, Blake, Shimzu, Cogan, Vrandeฤiฤ, Denny, Whiting, Mark, Witbrock, Michael
The Cyc project, started in 1984, created the first large-scale database of commonsense knowledge. The initiative continues to this day with its aim to provide a comprehensive ontology and knowledge base of commonsense knowledge to enable human-like reasoning for AI systems. In the concluding paragraph of his Communications of the Association of Computing Machinery (CACM) 1995 article A Large-Scale Investment in Knowledge Infrastructure [52], Cyc's founder Douglas B. Lenat wrote: Is Cyc necessary? How far would a user get with something simpler than Cyc but that lacks everyday commonsense knowledge? Nobody knows; the question will be settled empirically. Our guess is most of these applications will eventually tap the synergy in a suite of sources (including neural nets and decision theory), one of which will be Cyc. Although 30 years have passed since the above article was written, AI research community has not conclusively settled [10] the question "How far would a user get with something simpler than Cyc but that lacks everyday commonsense knowledge?" However, it is clear that significant strides have been made in addressing many of the tasks that were original Cyc use cases, including information retrieval, semi-automatically linking multiple heterogeneous external information sources, spelling and grammar correction, machine translation, natural language understanding and speech understanding.
Comparing human and LLM proofreading in L2 writing: Impact on lexical and syntactic features
Sung, Hakyung, Csuros, Karla, Sung, Min-Chang
This study examines the lexical and syntactic interventions of human and LLM proofreading aimed at improving overall intelligibility in identical second language writings, and evaluates the consistency of outcomes across three LLMs (ChatGPT-4o, Llama3.1-8b, Deepseek-r1-8b). Findings show that both human and LLM proofreading enhance bigram lexical features, which may contribute to better coherence and contextual connectedness between adjacent words. However, LLM proofreading exhibits a more generative approach, extensively reworking vocabulary and sentence structures, such as employing more diverse and sophisticated vocabulary and incorporating a greater number of adjective modifiers in noun phrases. The proofreading outcomes are highly consistent in major lexical and syntactic features across the three models.
Understanding the Impact of Sampling Quality in Direct Preference Optimization
Kim, Kyung Rok, Bai, Yumo, Wang, Chonghuan, Chen, Guanting
We study how data of higher quality can be leveraged to improve performance in Direct Preference Optimization (DPO), aiming to understand its impact on DPO training dynamics. Our analyses show that both the solution space and the convergence behavior of DPO depend on the support and quality of the data-generating distribution. We first analyze how data and reference policy influence policy updates during gradient descent, and how a practical phenomenon known as likelihood displacement can interfere with the desired dynamics. We then design a simplified yet well-structured alignment model as a proxy that preserves most of the beneficial properties of RLHF while avoiding likelihood displacement. Based on this model, we develop quantitative results showing how more frequent high-quality responses amplify the gradient signal and improve the optimization landscape, leading to more effective policy learning. Our theoretical findings are supported by empirical experiments and provide a principled justification for the online DPO framework in practice.