Semantic Bridge: Universal Multi-Hop Question Generation via AMR-Driven Graph Synthesis

Chen, Linqing, Zhong, Hanmeng, Wu, Wentao, Wang, Weilei

arXiv.org Artificial Intelligence

Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources such as PubMed papers or legal documents. Existing methods rely on surface patterns and fundamentally fail to generate controllable, complex multi-hop reasoning questions that test genuine understanding, which is essential for advancing LLM training paradigms. We present Semantic Bridge, the first universal framework for controllably generating sophisticated multi-hop reasoning questions from arbitrary sources. Our key innovation is semantic graph weaving: three complementary bridging mechanisms (entity bridging for role-varying shared entities, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains) that systematically construct complex reasoning pathways across documents, with fine-grained control over question complexity and type via AMR-driven analysis. Our multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, enabling production-ready controllable QA generation. Extensive evaluation demonstrates strong performance on both general-purpose data (Wikipedia) and specialized domains (biomedicine), with consistent 18.3%-25.4% gains over baselines across four languages (English, Chinese, French, German). Question pairs generated from 200 sources outperform 600 human-annotated examples while requiring 67% fewer source materials. Human evaluation shows 23.4% higher complexity, 18.7% better answerability, and 31.2% improved pattern coverage. Semantic Bridge establishes a new paradigm for LLM training-data synthesis, enabling controllable generation of targeted reasoning questions from sparse sources. We will release our core code and semantic bridge model.
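The entity-bridging mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each document's AMR graph has been reduced to (entity, role) pairs, and flags an entity as a bridge candidate when it appears in at least two documents under at least two different semantic roles (the "role-varying shared entities" the abstract mentions). The input format and the `entity_bridges` helper are hypothetical simplifications.

```python
from collections import defaultdict

def entity_bridges(doc_graphs):
    """Find entities shared across documents with differing semantic roles.

    doc_graphs: dict mapping doc_id -> list of (entity, role) pairs,
    with roles in AMR style (":ARG0", ":ARG1", ...). This is a
    hypothetical simplification of an AMR-driven pipeline.
    """
    occurrences = defaultdict(list)  # entity -> [(doc_id, role), ...]
    for doc_id, pairs in doc_graphs.items():
        for entity, role in pairs:
            occurrences[entity].append((doc_id, role))

    bridges = []
    for entity, occs in occurrences.items():
        docs = {d for d, _ in occs}
        roles = {r for _, r in occs}
        # Bridge candidate: links >= 2 documents AND plays
        # more than one role (role-varying shared entity).
        if len(docs) >= 2 and len(roles) >= 2:
            bridges.append((entity, sorted(occs)))
    return bridges

docs = {
    "d1": [("aspirin", ":ARG0"), ("inflammation", ":ARG1")],
    "d2": [("aspirin", ":ARG1"), ("trial", ":ARG0")],
}
print(entity_bridges(docs))  # -> [('aspirin', [('d1', ':ARG0'), ('d2', ':ARG1')])]
```

Each bridge candidate then anchors a multi-hop question whose answer requires composing the entity's different roles across the two documents.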


Large Language Model's Multi-Capability Alignment in Biomedical Domain

Wu, Wentao, Chen, Linqing, Zhong, Hanmeng, Wang, Weilei

arXiv.org Artificial Intelligence

We present BalancedBio, a theoretically-grounded framework for parameter-efficient biomedical reasoning that addresses the fundamental challenge of multi-capability integration in domain-specific AI alignment. We establish the Biomedical Multi-Capability Convergence Theorem, proving that balanced development of domain expertise, reasoning, and instruction-following requires orthogonal gradient spaces to prevent capability interference, a critical requirement for safe biomedical AI deployment. Our approach introduces two key innovations: (1) Medical Knowledge-Grounded Synthetic Generation (MKGSG), which extends Source2Synth by incorporating clinical workflow constraints and medical ontology validation to ensure both factual accuracy and clinical safety; and (2) Capability-Aware Group Relative Policy Optimization, where we theoretically derive optimal hybrid reward weighting strategies that maintain capability orthogonality during reinforcement learning, incorporating a reward model that scores business data adapted to biomedical downstream tasks and achieving true multi-dimensional hybrid RL with both rule-based and model-based scores. Through rigorous mathematical analysis, we prove that our training objective achieves Pareto-optimal convergence, where improvements in one capability domain preserve performance in the others, addressing a fundamental alignment challenge in medical AI. BalancedBio demonstrates state-of-the-art performance within its parameter class: domain expertise (80.95% BIOMED-MMLU, +15.32% over the best baseline), reasoning (61.94%, +7.75%), instruction-following (67.95%, +6.44%), and integration score (86.7%, +18.5%). Critically, we provide theoretical safety guarantees with formal bounds on capability preservation and clinical accuracy maintenance. Real-world deployment across healthcare institutions validates practical impact: 78% cost reduction, 23% improved diagnostic accuracy, and 89% clinician acceptance rate.
Our work establishes a principled methodology for biomedical AI alignment, demonstrating that sophisticated reasoning capabilities can be achieved efficiently while maintaining the safety and reliability constraints essential for medical applications. We will release the 0.5B version of our model.
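One standard way to keep two capability gradients from interfering, in the spirit of the orthogonal gradient spaces described above, is to project one gradient onto the complement of the other (a PCGrad-style projection). The sketch below is an assumption about the general technique, not the paper's exact objective; the function name `orthogonalize` and the toy vectors are illustrative.

```python
import numpy as np

def orthogonalize(g_task, g_ref, eps=1e-12):
    """Remove from g_task its component along g_ref.

    After projection, updates for the task capability no longer push
    against the reference capability's direction. A minimal sketch of
    gradient-space orthogonalization, not a specific paper's method.
    """
    denom = np.dot(g_ref, g_ref)
    if denom < eps:  # reference gradient is ~zero; nothing to remove
        return g_task
    return g_task - (np.dot(g_task, g_ref) / denom) * g_ref

g_domain = np.array([1.0, 1.0])   # toy "domain expertise" gradient
g_reason = np.array([1.0, 0.0])   # toy "reasoning" gradient
g_proj = orthogonalize(g_domain, g_reason)
print(g_proj)                      # component along g_reason removed
print(np.dot(g_proj, g_reason))    # ~0: no first-order interference
```

Applying such a projection per capability pair during training is one way to realize the "improvements in one capability preserve the others" property at the gradient level.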


Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Lupidi, Alisia, Gemmell, Carlos, Cancedda, Nicola, Dwivedi-Yu, Jane, Weston, Jason, Foerster, Jakob, Raileanu, Roberta, Lomeli, Maria

arXiv.org Artificial Intelligence

Large Language Models still struggle in challenging scenarios that leverage structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth: a new method that can be used for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps grounded in real-world sources. Source2Synth improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two challenging domains: we test reasoning abilities in multi-hop question answering (MHQA), and tool usage in tabular question answering (TQA). Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to the fine-tuned baselines.
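The curation step above, discarding generations based on answerability, can be sketched as a simple filter: a candidate QA pair is kept only if a model, given the grounding source, reproduces the intended answer. This is a minimal illustration under assumptions; `answer_fn` stands in for an LLM call, and the exact-match check is a stand-in for whatever agreement score the curation actually uses.

```python
def filter_by_answerability(candidates, answer_fn, threshold=1.0):
    """Keep synthetic QA pairs only when a model answers them correctly
    from the grounding source.

    candidates: list of dicts with "question", "answer", "source".
    answer_fn(question, source) -> predicted answer string
        (hypothetical interface standing in for an LLM call).
    """
    kept = []
    for c in candidates:
        pred = answer_fn(c["question"], c["source"])
        # Exact match as a toy answerability score; a real pipeline
        # could use any model-based agreement measure instead.
        score = 1.0 if pred.strip().lower() == c["answer"].strip().lower() else 0.0
        if score >= threshold:
            kept.append(c)
    return kept

cands = [
    {"question": "Q1", "answer": "Paris", "source": "Paris is the capital."},
    {"question": "Q2", "answer": "Mars", "source": "The capital is Paris."},
]
# Toy answerer: "answers" whatever it can find in the source text.
toy_answer = lambda q, src: "Paris" if "Paris" in src else "unknown"
print([c["question"] for c in filter_by_answerability(cands, toy_answer)])  # -> ['Q1']
```

The unanswerable pair (Q2) is dropped, which is the quality-improvement mechanism the abstract describes.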