Semantic Bridge: Universal Multi-Hop Question Generation via AMR-Driven Graph Synthesis
Chen, Linqing, Zhong, Hanmeng, Wu, Wentao, Wang, Weilei
Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources such as PubMed papers or legal documents. Existing methods rely on surface patterns and fundamentally fail to generate controllable, complex multi-hop reasoning questions that test genuine understanding, which is essential for advancing LLM training paradigms. We present Semantic Bridge, the first universal framework for controllably generating sophisticated multi-hop reasoning questions from arbitrary sources. Our key innovation is semantic graph weaving: three complementary bridging mechanisms (entity bridging for role-varying shared entities, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains) that systematically construct complex reasoning pathways across documents, with fine-grained control over question complexity and type via AMR-driven analysis. Our multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, enabling production-ready controllable QA generation. Extensive evaluation demonstrates strong performance on both general-purpose data (Wikipedia) and specialized domains (biomedicine), with consistent 18.3%-25.4% gains over baselines across four languages (English, Chinese, French, German). Question pairs generated from 200 source documents outperform 600 natively human-annotated examples while using 67% less material. Human evaluation shows 23.4% higher complexity, 18.7% better answerability, and 31.2% improved pattern coverage. Semantic Bridge establishes a new paradigm for LLM training-data synthesis, enabling controllable generation of targeted reasoning questions from sparse sources. We will release our core code and Semantic Bridge model.
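The entity-bridging mechanism described in the abstract can be sketched in a few lines: two documents are linked when they share an entity that plays a different semantic role in each parse, which yields a candidate multi-hop path. This is a minimal illustrative sketch assuming AMR parses reduced to entity-to-role maps; the function name and data shapes are assumptions, not the authors' API.

```python
def entity_bridges(doc_a, doc_b):
    """Find role-varying shared entities between two documents.

    doc_a, doc_b: dicts mapping entity -> semantic role (e.g. AMR roles
    like "ARG0"/"ARG1"). Returns (entity, role_in_a, role_in_b) triples
    for entities shared by both documents but with different roles,
    i.e. candidate bridges for a multi-hop question.
    """
    bridges = []
    for entity, role_a in doc_a.items():
        role_b = doc_b.get(entity)
        if role_b is not None and role_b != role_a:
            bridges.append((entity, role_a, role_b))
    return bridges

# Toy example: "aspirin" is the agent in one sentence, the patient in another.
doc_a = {"aspirin": "ARG0", "inflammation": "ARG1"}
doc_b = {"aspirin": "ARG1", "clotting": "ARG0"}
print(entity_bridges(doc_a, doc_b))  # [('aspirin', 'ARG0', 'ARG1')]
```

A real pipeline would operate on full AMR graphs and also walk predicate chains, but the role-contrast test above is the core of how an entity bridge is detected.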
Large Language Model's Multi-Capability Alignment in Biomedical Domain
Wu, Wentao, Chen, Linqing, Zhong, Hanmeng, Wang, Weilei
We present BalancedBio, a theoretically grounded framework for parameter-efficient biomedical reasoning that addresses the fundamental challenge of multi-capability integration in domain-specific AI alignment. We establish the Biomedical Multi-Capability Convergence Theorem, proving that balanced development of domain expertise, reasoning, and instruction-following requires orthogonal gradient spaces to prevent capability interference, a critical requirement for safe biomedical AI deployment. Our approach introduces two key innovations: (1) Medical Knowledge-Grounded Synthetic Generation (MKGSG), which extends Source2Synth by incorporating clinical workflow constraints and medical ontology validation to ensure both factual accuracy and clinical safety; and (2) Capability-Aware Group Relative Policy Optimization, in which we theoretically derive optimal hybrid reward-weighting strategies that maintain capability orthogonality during reinforcement learning, incorporating a reward model that scores business data adapted to biomedical downstream tasks and achieving true multi-dimensional hybrid RL with both rule-based and model-based scores. Through rigorous mathematical analysis, we prove that our training objective achieves Pareto-optimal convergence, where improvements in one capability domain preserve performance in the others, addressing a fundamental alignment challenge in medical AI. BalancedBio demonstrates state-of-the-art performance within its parameter class: domain expertise (80.95% on BIOMED-MMLU, +15.32% over the best baseline), reasoning (61.94%, +7.75%), instruction-following (67.95%, +6.44%), and integration score (86.7%, +18.5%). Critically, we provide theoretical safety guarantees with formal bounds on capability preservation and clinical-accuracy maintenance. Real-world deployment across healthcare institutions validates the practical impact: 78% cost reduction, 23% improved diagnostic accuracy, and an 89% clinician acceptance rate.
Our work establishes a principled methodology for biomedical AI alignment, demonstrating that sophisticated reasoning capabilities can be achieved efficiently while maintaining the safety and reliability constraints essential for medical applications. We will release the 0.5B version of our model.
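The hybrid reward described in the abstract (rule-based plus model-based scores inside a group-relative policy optimization loop) can be sketched as follows. This is an illustrative sketch only: the mixing weight `alpha` and the function names are assumptions, and the group-relative normalization shown is the generic GRPO-style advantage, not the authors' derived optimal weighting.

```python
def hybrid_reward(rule_score, model_score, alpha=0.5):
    """Weighted mix of a rule-based score (e.g. exact-match or format
    checks) and a model-based score (e.g. a learned reward model).
    alpha in [0, 1] trades off the two signals."""
    return alpha * rule_score + (1 - alpha) * model_score

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within a group of
    sampled responses to the same prompt, so each response is scored
    relative to its siblings rather than on an absolute scale."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Toy group of 3 responses: mix the two reward signals, then normalize.
group = [hybrid_reward(1.0, 0.2), hybrid_reward(0.0, 0.8), hybrid_reward(1.0, 1.0)]
advantages = group_relative_advantages(group)
```

In the capability-aware setting, one would maintain separate reward channels per capability (domain knowledge, reasoning, instruction-following) and combine them so their gradient contributions stay orthogonal; the scalar mix above is the simplest degenerate case.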
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
Lupidi, Alisia, Gemmell, Carlos, Cancedda, Nicola, Dwivedi-Yu, Jane, Weston, Jason, Foerster, Jakob, Raileanu, Roberta, Lomeli, Maria
Large Language Models still struggle in challenging scenarios that leverage structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth: a new method that can be used for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps grounded in real-world sources. Source2Synth improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two challenging domains: we test reasoning abilities in multi-hop question answering (MHQA), and tool usage in tabular question answering (TQA). Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to the fine-tuned baselines.
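The curation step described above, discarding low-quality generations based on answerability, can be sketched as a simple filter: keep a synthetic (question, answer) pair only if a model answering from the grounding source reproduces the intended answer. This is a minimal sketch of the idea, not the authors' code; `answer_fn` is a placeholder for any QA model, and exact-match comparison stands in for whatever answerability criterion is actually used.

```python
def curate(pairs, source, answer_fn):
    """Keep only answerable (question, answer) pairs.

    pairs: list of (question, answer) strings generated from `source`.
    answer_fn(question, source) -> predicted answer string.
    A pair survives only if the prediction matches the intended answer
    (case-insensitive exact match here, as a stand-in criterion).
    """
    kept = []
    for question, answer in pairs:
        predicted = answer_fn(question, source)
        if predicted.strip().lower() == answer.strip().lower():
            kept.append((question, answer))
    return kept

# Toy demo with a stub QA model standing in for a real LLM.
def stub_qa(question, source):
    return "Paris" if "capital" in question else "unknown"

pairs = [("What is the capital of France?", "Paris"),
         ("Who discovered penicillin?", "Fleming")]
print(curate(pairs, source="(grounding text)", answer_fn=stub_qa))
```

The same filter applies unchanged to multi-hop QA and tabular QA; only `answer_fn` (and how the source is presented to it) changes per task.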