Problem Solving
InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning
Liu, Zeyu, Hou, Zhitian, Zhu, Guanghao, Sang, Zhijie, Xie, Congkai, Yang, Hongxia
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in domains such as visual understanding and mathematical reasoning. However, their application in the medical domain is constrained by two key challenges: (1) multimodal medical datasets are scarce and often contain sparse information, limiting reasoning depth; and (2) Reinforcement Learning with Verifiable Rewards (RLVR), though effective in general domains, cannot reliably improve model performance in the medical domain. To overcome these challenges, during the supervised fine-tuning (SFT) stage, we incorporate high-quality textual reasoning data and general multimodal data alongside multimodal medical data to efficiently enhance foundational medical capabilities and restore the base model's reasoning ability. Moreover, considering that there are some multimodal medical datasets with sparse information, we further synthesize reflective-pattern-injected chain-of-thought (CoT) in addition to general CoT samples, equipping the model with initial reflective reasoning capabilities that provide a structured foundation for subsequent RLVR training. Finally, we introduce our InfiMed-Series models, InfiMed-SFT-3B and InfiMed-RL-3B, both of which deliver state-of-the-art performance across seven multimodal medical benchmarks. Notably, InfiMed-RL-3B achieves an average accuracy of 59.2%, outperforming even larger models like InternVL3-8B, which achieves 57.3%. Specifically, during the SFT phase, we utilized 188K samples, while the RLVR phase incorporated 36K samples, demonstrating the efficacy of both training strategies in achieving superior performance. We also conducted a series of extensive experiments, which provide valuable insights that contribute to advancing the performance of MLLMs in medical scenarios.
Controlled Agentic Planning & Reasoning for Mechanism Synthesis
Gandarela, João Pedro, Rios, Thiago, Menzel, Stefan, Freitas, André
This work presents a dual-agent \ac{llm}-based reasoning framework for automated planar mechanism synthesis that tightly couples linguistic specification with symbolic representation and simulation. From a natural-language task description, the system composes symbolic constraints and equations, generates and parametrises simulation code, and iteratively refines designs via critic-driven feedback, including symbolic regression and geometric distance metrics, closing an actionable linguistic/symbolic optimisation loop. To evaluate the approach, we introduce MSynth, a benchmark of analytically defined planar trajectories. Empirically, critic feedback and iterative refinement yield large improvements (up to 90\% on individual tasks) and statistically significant gains per the Wilcoxon signed-rank test. Symbolic-regression prompts provide deeper mechanistic insight primarily when paired with larger models or architectures with appropriate inductive biases (e.g., LRM).
HopWeaver: Cross-Document Synthesis of High-Quality and Authentic Multi-Hop Questions
Shen, Zhiyu, Liu, Jiyuan, Pang, Yunhe, Rao, Yanghui
Multi-Hop Question Answering (MHQA) is crucial for evaluating the model's capability to integrate information from diverse sources. However, creating extensive and high-quality MHQA datasets is challenging: (i) manual annotation is expensive, and (ii) current synthesis methods often produce simplistic questions or require extensive manual guidance. This paper introduces HopWeaver, the first cross-document framework synthesizing authentic multi-hop questions without human intervention. HopWeaver synthesizes bridge and comparison questions through an innovative pipeline that identifies complementary documents and constructs authentic reasoning paths to ensure true multi-hop reasoning. We further present a comprehensive system for evaluating the synthesized multi-hop questions. Empirical evaluations demonstrate that the synthesized questions achieve comparable or superior quality to human-annotated datasets at a lower cost. Our framework provides a valuable tool for the research community: it can automatically generate challenging benchmarks from any raw corpus, which opens new avenues for both evaluation and targeted training to improve the reasoning capabilities of advanced QA models, especially in domains with scarce resources.
Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting
Existing methods on visual knowledge extraction often rely on the predefined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation types), restricting the expressiveness of the extracted knowledge. In this work, we take a first exploration to a new paradigm of open visual knowledge extraction.