seed question
DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry
Wu, Changti, Lian, Shijie, Liu, Zihao, Zhang, Lei, Yang, Laurence Tianruo, Chen, Kai
However, most existing multimodal mathematical reasoning benchmarks focus primarily on 2D plane geometry, rely on static datasets prone to data contamination and memorization, and evaluate models solely by final answers, overlooking the reasoning process. To address these limitations, we introduce DynaSolidGeo, the first dynamic benchmark for evaluating genuine spatial reasoning in Vision-Language Models (VLMs). Constructed through a semi-automatic annotation pipeline, DynaSolidGeo contains 503 expert-curated seed questions that can, in principle, dynamically generate an unbounded number of diverse multimodal text-visual instances. Beyond answer accuracy, we incorporate process evaluation based on expert-annotated reasoning chains to measure logical validity and causal coherence. Experiments across representative open-source and closed-source VLMs reveal large performance gaps, severe degradation in dynamic settings, and poor performance on tasks requiring high-level spatial intelligence, such as mental rotation and visualization. The code and dataset are available at DynaSolidGeo.
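The abstract's core mechanism — seed questions that "dynamically generate an unbounded number of instances" — can be illustrated with a minimal sketch. The template, parameter ranges, and function name below are hypothetical stand-ins, not the paper's actual pipeline: a seeded generator resamples the numeric parameters of a solid-geometry question so every instance has fresh values and a freshly computed answer.

```python
import random

def generate_instance(seed=None):
    """Instantiate a hypothetical solid-geometry seed question with fresh
    parameters, so each sampled instance has a different surface form and
    answer while testing the same underlying spatial skill."""
    rng = random.Random(seed)
    r = rng.randint(2, 9)   # base radius of the cone
    h = rng.randint(2, 9)   # height of the cone
    question = (f"A cone has base radius {r} and height {h}. "
                f"What is its volume, in terms of pi?")
    answer = f"{r * r * h}/3 * pi"   # V = (1/3) * pi * r^2 * h
    return question, answer

q, a = generate_instance(seed=42)
```

Because instances are cheap to sample, a model that memorized any fixed test set gains nothing, which is the contamination-resistance argument the abstract makes.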
Composition-Grounded Instruction Synthesis for Visual Reasoning
Gu, Xinyi, Mao, Jiayuan, Hong, Zhang-Wei, Yu, Zhuoran, Li, Pengyuan, Joshi, Dhiraj, Feris, Rogerio, He, Zexue
Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages. Pretrained multi-modal large language models (MLLMs) have achieved impressive performance across a wide range of multimodal tasks (Liu et al., 2023c; Bai et al., 2025; Wang et al., 2025a; Agrawal et al., 2024; OpenAI et al., 2024; Comanici et al., 2025; Anthropic, 2024), yet advanced reasoning capabilities remain underdeveloped, especially in domains where user reasoning-intensive query-answer data is difficult to collect. 
In this work, we consider reasoning capability over artificial image domains, including charts, tables, information graphs, rendered documents, webpages, etc. While such images are abundant on the web, datasets containing reasoning questions over them are scarce.
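The decompose-and-recompose idea behind COGS can be sketched schematically. The factor lists and the `recompose` helper below are illustrative inventions, not the paper's code: a seed chart question is split into perception factors (what to find in the image) and reasoning templates (what to compute), and their cross-product over new images yields many synthetic question-answer pairs, each carrying a subquestion usable as a factor-level process reward signal.

```python
# Hypothetical sketch: factor a seed chart question into perception and
# reasoning parts, then recombine the factors to synthesize new questions,
# each paired with an intermediate subquestion.
perception_factors = ["the tallest bar", "the 2020 data point"]
reasoning_factors = ["what fraction of the total does {x} represent?",
                     "how much larger is {x} than the smallest value?"]

def recompose(perception, reasoning):
    """Pair every perception factor with every reasoning template."""
    out = []
    for p in perception:
        for r in reasoning:
            question = r.format(x=p)
            subquestion = f"First, identify {p}."
            out.append({"question": question, "subquestion": subquestion})
    return out

pairs = recompose(perception_factors, reasoning_factors)
```

Two perception factors and two reasoning templates already yield four distinct questions; with new images supplying fresh perception factors, the pool grows multiplicatively from a small seed set.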
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Dominican Republic (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
NanoFlux: Adversarial Dual-LLM Evaluation and Distillation For Multi-Domain Reasoning
Anantha, Raviteja, Hor, Soheil, Antoniu, Teodor Nicola, Price, Layne C.
We present NanoFlux, a novel adversarial framework for generating targeted training data to improve LLM reasoning, where adversarially-generated datasets of 200 examples outperform conventional fine-tuning approaches. The framework employs a competitive dynamic between models alternating as Attacker and Defender, supervised by a tool-augmented Judge, synthesizing multi-step questions with explanatory annotations that target specific reasoning capabilities. Fine-tuning a 4B-parameter model on NanoFlux-generated data yields performance gains across diverse domains compared to full-benchmark fine-tuning: +5.9% on mathematical reasoning (GSMHard), +3.6% on scientific reasoning (GenomeBench), and +16.6% on medical reasoning (MultiMedQA), while reducing computational requirements by a factor of 3-14. Ablation studies reveal a nonmonotonic relationship between dataset characteristics and model performance, uncovering domain-specific optimal points for question complexity and reasoning quality. NanoFlux automates training data generation through embedding-based novelty filtering, tool-augmented evaluation, and multi-hop reasoning, suggesting that future model improvements may lie in the intelligent synthesis of small, precisely targeted training datasets. As large language models (LLMs) rapidly approach and surpass human-level performance on established benchmarks, we confront a fundamental limitation: the finite nature of high-quality training data. Today's frontier models have effectively consumed the entirety of available text on the internet, yet continue to exhibit critical reasoning failures and knowledge gaps. This "benchmark exhaustion" phenomenon raises crucial questions about how to advance AI capabilities beyond the constraints of existing data.
While generating synthetic training examples represents one potential path forward, creating effective synthetic data remains challenging - naive generation approaches often produce low-information samples that fail to improve model performance, while synthesizing effective datasets typically requires precisely the kind of human expertise and curation that we seek to automate. Recent work, notably LIMO (Ye et al., 2025), has demonstrated that small, carefully curated datasets of high-quality chain-of-thought solutions can unlock strong reasoning performance, but still depends on human effort in curation. We introduce NanoFlux, a fully generative adversarial framework that reimagines data-efficient reasoning improvement. NanoFlux orchestrates a competitive dynamic between two models alternating as Attacker and Defender, supervised by a tool-augmented Judge that evaluates responses for accuracy, coherence, and safety (as shown in Figure 1).
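The Attacker/Defender/Judge dynamic described above can be sketched with stub models standing in for real LLM calls; the function names and the keep-only-failures policy below are illustrative assumptions, not NanoFlux's actual implementation. The attacker proposes a question with a reference answer, the defender attempts it, and the judge decides whether the answer is correct; questions the defender gets wrong are the hard examples worth keeping for fine-tuning.

```python
# Hypothetical sketch of one round of the Attacker/Defender/Judge loop.
def adversarial_round(attacker, defender, judge, pool):
    question, reference = attacker()       # Attacker proposes a challenge
    answer = defender(question)            # Defender attempts it
    if not judge(answer, reference):       # Defender failed -> hard example
        pool.append({"question": question, "reference": reference})
    return pool

# Stub models standing in for real LLM calls.
attacker = lambda: ("What is 17 * 23?", "391")
defender = lambda q: "381"                 # deliberately wrong answer
judge = lambda ans, ref: ans == ref        # real Judge is tool-augmented

dataset = adversarial_round(attacker, defender, judge, [])
```

Iterating this loop concentrates the dataset on the defender's failure modes, which is why a few hundred targeted examples can outperform fine-tuning on a full benchmark.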
- North America > United States (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
- Education (0.86)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.68)
- Health & Medicine > Diagnostic Medicine (0.49)
CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
Yu, Ping, Lanchantin, Jack, Wang, Tianlu, Yuan, Weizhe, Golovneva, Olga, Kulikov, Ilia, Sukhbaatar, Sainbayar, Weston, Jason, Xu, Jing
We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on given seed tasks, and then generate a new synthetic example of similar quality and complexity. This is followed by a filtering step to select high-quality data using automatic metrics, which are then used for LLM training. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, when evaluated on MATH500, AMC23, AIME24, and GPQA-Diamond. The transformative rise of Large Language Models (LLMs) has initiated a substantial paradigm shift in the domain of deep learning (Zhang et al., 2023; Guo et al., 2023; Long et al., 2024). The development of such models emphasizes scale, and relies heavily on large volumes of high-quality data (Gandhi et al., 2024; Abdin et al., 2024). However, acquiring such data from human sources can often be challenging or even impractical due to factors such as high costs, data scarcity, and privacy concerns (Kurakin et al., 2023). Furthermore, several studies (Hosking et al., 2023; Singh et al., 2023; Gilardi et al., 2023) have pointed out that human-generated data, being inherently prone to biases and errors, may not always be ideal for model training or evaluation. In this context, synthetic data emerges as a viable alternative for obtaining high-quality datasets. Synthetic data is artificially generated to replicate the characteristics and patterns of real-world data. One innovative approach to creating such data is the Self-Instruct method (Wang et al., 2022a), which utilizes LLMs themselves to generate instruction-following examples. This method begins by selecting a small set of seed instruction-following samples, which are then used to prompt LLMs to produce additional demonstrations in a similar format.
Since then, a number of variants have been introduced that increase the complexity of queries (Liu et al., 2023; Zeng et al., 2024), maintain semantic diversity (Ding et al., 2023), scale the synthetic data (Yuan et al., 2023), and use these methods in self-improvement loops (Yuan et al., 2024).
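The generate-then-filter recipe common to CoT-Self-Instruct and its predecessors reduces to a simple skeleton. Everything below is a stand-in sketch: the `generate` lambda replaces the CoT planning and generation step of a real LLM, and `quality_score` replaces the paper's automatic metrics; only the control flow (generate a candidate per seed, keep those above a quality threshold) reflects the method described.

```python
# Hypothetical sketch: generate one candidate per seed task, then keep only
# candidates that pass an automatic quality filter, as in the
# generate-then-filter recipe.
def synthesize(seeds, generate, quality_score, threshold=0.5):
    kept = []
    for seed in seeds:
        candidate = generate(seed)
        if quality_score(candidate) >= threshold:
            kept.append(candidate)
    return kept

seeds = ["Solve 2x + 3 = 11.", "hi"]
generate = lambda s: s.upper()                         # stand-in for the LLM
quality_score = lambda c: 1.0 if len(c) > 10 else 0.0  # stand-in metric

data = synthesize(seeds, generate, quality_score)
```

The filtering step is what separates this family of methods from naive self-generation: low-quality candidates are discarded before they can pollute the training set.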
XLQA: A Benchmark for Locale-Aware Multilingual Open-Domain Question Answering
Roh, Keon-Woo, Ju, Yeong-Joon, Lee, Seong-Whan
Large Language Models (LLMs) have shown significant progress in Open-domain question answering (ODQA), yet most evaluations focus on English and assume locale-invariant answers across languages. This assumption neglects the cultural and regional variations that affect question understanding and answering, leading to biased evaluation in multilingual benchmarks. To address these limitations, we introduce XLQA, a novel benchmark explicitly designed for locale-sensitive multilingual ODQA. XLQA contains 3,000 English seed questions expanded to eight languages, with careful filtering for semantic consistency and human-verified annotations distinguishing locale-invariant and locale-sensitive cases. Our evaluation of five state-of-the-art multilingual LLMs reveals notable failures on locale-sensitive questions, exposing gaps between English and other languages due to a lack of locale-grounding knowledge. We provide a systematic framework and scalable methodology for assessing multilingual QA under diverse cultural contexts, offering a critical resource to advance the real-world applicability of multilingual ODQA systems. Our findings suggest that disparities in training data distribution contribute to differences in both linguistic competence and locale-awareness across models.
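The locale-invariant vs. locale-sensitive distinction at the heart of XLQA can be illustrated with a toy annotation schema; the field names, example questions, and answer values below are hypothetical illustrations, not entries from the benchmark. A locale-invariant item has one gold answer everywhere, while a locale-sensitive item's gold answer depends on the answer locale.

```python
# Hypothetical sketch of locale annotations: the same question schema yields
# one answer for invariant items and locale-dependent answers otherwise.
dataset = [
    {"question": "What is the boiling point of water at sea level?",
     "locale_sensitive": False,
     "answers": {"default": "100 degrees Celsius"}},
    {"question": "What is the legal driving age?",
     "locale_sensitive": True,
     "answers": {"US": "16", "KR": "18"}},   # illustrative values
]

def gold_answer(item, locale):
    """Resolve the gold answer for a given locale."""
    if not item["locale_sensitive"]:
        return item["answers"]["default"]
    return item["answers"].get(locale)

ans = gold_answer(dataset[1], "KR")
```

Scoring a model against the locale-resolved gold answer, rather than a single English answer, is what makes the evaluation locale-aware.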
- North America > United States > Rocky Mountains (0.04)
- North America > Canada > Rocky Mountains (0.04)
- Asia > South Korea (0.04)
- Asia > China > Tianjin Province > Tianjin (0.04)
- Media (0.93)
- Leisure & Entertainment (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
- Europe > United Kingdom (0.46)
- Asia > Middle East > Jordan (0.04)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.68)
- Government (0.94)
- Health & Medicine > Consumer Health (0.46)
ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark
Shalyt, Michael, Elimelech, Rotem, Kaminer, Ido
Large language models (LLMs) are rapidly approaching the level of proficiency in university-level symbolic mathematics required for applications in advanced science and technology. However, existing benchmarks fall short in assessing the core skills of LLMs in symbolic mathematics-such as integration, differential equations, and algebraic simplification. To address this gap, we introduce ASyMOB, a novel assessment framework focused exclusively on symbolic manipulation, featuring 17,092 unique math challenges, organized by similarity and complexity. ASyMOB enables analysis of LLM generalization capabilities by comparing performance in problems that differ by simple numerical or symbolic `perturbations'. Evaluated LLMs exhibit substantial degradation in performance for all perturbation types (up to -70.3%), suggesting reliance on memorized patterns rather than deeper understanding of symbolic math, even among models achieving high baseline accuracy. Comparing LLM performance to computer algebra systems (CAS), we identify examples where CAS fail while LLMs succeed, as well as problems solved only by combining both approaches. Models capable of integrated code execution yielded higher accuracy compared to their performance without code, particularly stabilizing weaker models (up to +33.1% for certain perturbation types). Notably, the most advanced models (o4-mini, Gemini 2.5 Flash) demonstrate not only high symbolic math proficiency (scoring 96.8% and 97.6% on the unperturbed set), but also remarkable robustness against perturbations (-21.7% and -21.2% vs. an average of -50.4% for the other models). This may indicate a recent "phase transition" in the generalization capabilities of frontier LLMs. It remains to be seen whether the path forward lies in deeper integration with sophisticated external tools, or in developing models so capable that symbolic math systems like CAS become unnecessary.
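The "numerical or symbolic perturbations" the abstract describes can be sketched as simple rewrites of a base challenge; the `perturb` helper and the example problem below are illustrative inventions, not ASyMOB's generation code. A numerical perturbation swaps one constant for another, while a symbolic perturbation replaces the constant with a free symbol, so a model relying on a memorized answer to the base form fails on the variants.

```python
# Hypothetical sketch of perturbation variants for one symbolic challenge.
def perturb(challenge, constant, new_constant, new_symbol):
    """Derive a numerically and a symbolically perturbed variant of a
    base challenge by rewriting one constant."""
    numerical = challenge.replace(constant, new_constant)
    symbolic = challenge.replace(constant, new_symbol)
    return {"base": challenge, "numerical": numerical, "symbolic": symbolic}

variants = perturb("Integrate sin(3*x) with respect to x.",
                   constant="3", new_constant="7", new_symbol="a")
```

Comparing accuracy on `base` against the perturbed variants is what lets the benchmark separate memorization from genuine symbolic skill.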
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Illinois > Champaign County > Champaign (0.04)
- (3 more...)
MindGYM: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions
Xu, Zhe, Chen, Daoyuan, Ling, Zhenqing, Li, Yaliang, Shen, Ying
Large vision-language models (VLMs) face challenges in achieving robust, transferable reasoning abilities due to reliance on labor-intensive manual instruction datasets or computationally expensive self-supervised methods. To address these issues, we introduce MindGYM, a framework that enhances VLMs through synthetic self-challenging questions, consisting of three stages: (1) Seed Single-Hop Question Synthesis, generating cognitive questions across textual (e.g., logical deduction) and multimodal contexts (e.g., diagram-based queries) spanning eight semantic areas like ethical analysis; (2) Challenging Multi-Hop Question Synthesis, combining seed questions via diverse principles like bridging, visual-textual alignment, to create multi-step problems demanding deeper reasoning; and (3) Thinking-Induced Curriculum Fine-Tuning, a structured pipeline that progressively trains the model from scaffolded reasoning to standalone inference. By leveraging the model's self-synthesis capability, MindGYM achieves high data efficiency (e.g., +16% gains on MathVision-Mini with only 400 samples), computational efficiency (reducing both training and inference costs), and robust generalization across tasks. Extensive evaluations on seven benchmarks demonstrate superior performance over strong baselines, with notable improvements (+15.77% win rates) in reasoning depth and breadth validated via GPT-based scoring. MindGYM underscores the viability of self-challenging for refining VLM capabilities while minimizing human intervention and resource demands. Code and data are released to advance multimodal reasoning research.
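Stage (2)'s "bridging" composition can be sketched concretely; the `bridge` function, its slot convention, and the toy questions below are hypothetical illustrations of the principle rather than MindGYM's implementation. Two single-hop seed questions are chained so the answer to the first becomes the entity the second asks about, producing a multi-hop question that demands deeper reasoning.

```python
# Hypothetical sketch of bridging two single-hop seed questions into one
# multi-hop question: hop1's answer fills the {bridge} slot in hop2.
def bridge(hop1, hop2):
    """hop1 and hop2 are (question, answer) pairs; hop2's question template
    must contain a {bridge} slot referring back to hop1's answer."""
    q1, a1 = hop1
    q2_template, a2 = hop2
    multi_hop = f"{q1} Then, {q2_template.format(bridge='that value')}"
    return {"question": multi_hop, "answer": a2, "bridge_entity": a1}

item = bridge(("What is 6 * 7?", "42"),
              ("Is {bridge} an even number?", "yes"))
```

Because the intermediate answer (`bridge_entity`) is retained, the composed item also supports the scaffolded-reasoning supervision used in stage (3).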
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Asia > China (0.04)
Adaptable and Precise: Enterprise-Scenario LLM Function-Calling Capability Training Pipeline
Zeng, Guancheng, Ding, Wentao, Xu, Beining, Zhang, Chi, Han, Wenqiang, Li, Gang, Mo, Jingjing, Qiu, Pengxu, Tao, Xinran, Tao, Wang, Hu, Haowen
Enterprises possess a vast array of API assets scattered across various functions, forming the backbone of existing business processes. By leveraging these APIs as functional tools, enterprises can design diverse, scenario-specific agent applications, driven by on-premise function-calling models as the core engine. However, generic models often fail to meet enterprise requirements in terms of computational efficiency, output accuracy, and stability, necessitating scenario-specific adaptation. In this paper, we propose a training pipeline for function-calling capabilities tailored to real-world business scenarios. This pipeline includes the synthesis and augmentation of scenario-specific function-calling data, model fine-tuning, and performance evaluation and analysis. Using this pipeline, we generated 1,260 fully AI-generated samples and 1,035 augmented manually-labeled samples in a digital HR agent scenario. The Qwen2.5-Coder-7B-Instruct model was employed as the base model and fine-tuned using the LoRA method on four GPUs with 24GB VRAM. Our fine-tuned model demonstrated outstanding performance in evaluations and practical applications, surpassing GPT-4 and GPT-4o in accuracy on the test set. These results validate the reliability of the proposed pipeline for training scenario-specific function-calling models.
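What a scenario-specific function-calling training sample looks like can be sketched for the HR scenario the abstract mentions; the tool name, parameter schema, and validator below are illustrative assumptions, not the paper's actual data format. Each sample pairs a user query and the tool schemas exposed to the model with the target call the fine-tuned model should emit, and a simple validator checks that a predicted call names a known tool with only known parameters.

```python
# Hypothetical sketch of one HR-scenario function-calling sample plus a
# schema validator for predicted calls.
sample = {
    "query": "How many vacation days does employee E1024 have left?",
    "tools": [{
        "name": "get_leave_balance",
        "parameters": {"employee_id": {"type": "string"}},
    }],
    "target": {"name": "get_leave_balance",
               "arguments": {"employee_id": "E1024"}},
}

def is_valid_call(call, tools):
    """Check that a predicted call names a known tool and uses only
    parameters declared in that tool's schema."""
    schemas = {t["name"]: t["parameters"] for t in tools}
    return (call["name"] in schemas
            and set(call["arguments"]) <= set(schemas[call["name"]]))

valid = is_valid_call(sample["target"], sample["tools"])
```

Schema-level validation of this kind is a cheap first-pass evaluation signal for the "output accuracy and stability" requirements the pipeline targets.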
- Asia > China (0.28)
- North America (0.28)
- Education (0.66)
- Information Technology (0.46)
- Materials > Chemicals > Industrial Gases > Liquified Gas (0.30)
- (2 more...)
Detecting Redundant Health Survey Questions Using Language-agnostic BERT Sentence Embedding (LaBSE)
Kang, Sunghoon, Kim, Hyeoneui, Park, Hyewon, Taira, Ricky
The goal of this work was to compute the semantic similarity among publicly available health survey questions in order to facilitate the standardization of survey-based Person-Generated Health Data (PGHD). We compiled various health survey questions authored in both English and Korean from the NIH CDE Repository, PROMIS, Korean public health agencies, and academic publications. Questions were drawn from various health lifelog domains. A randomized question pairing scheme was used to generate a Semantic Text Similarity (STS) dataset consisting of 1758 question pairs. Similarity scores between each question pair were assigned by two human experts. The tagged dataset was then used to build three classifiers featuring: Bag-of-Words, SBERT with BERT-based embeddings, and SBERT with LaBSE embeddings. The algorithms were evaluated using traditional contingency statistics. Among the three algorithms, SBERT-LaBSE demonstrated the highest performance in assessing question similarity across both languages, achieving areas under the Receiver Operating Characteristic (ROC) and Precision-Recall curves of over 0.99. Additionally, it proved effective in identifying cross-lingual semantic similarities. The SBERT-LaBSE algorithm excelled at aligning semantically equivalent sentences across both languages but encountered challenges in capturing subtle nuances and maintaining computational efficiency. Future research should focus on testing with larger multilingual datasets and on calibrating and normalizing scores across the health lifelog domains to improve consistency. This study introduces the SBERT-LaBSE algorithm for calculating semantic similarity across two languages, showing it outperforms BERT-based models and the Bag of Words approach, highlighting its potential to improve semantic interoperability of survey-based PGHD across language barriers.
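The redundancy-detection step reduces to comparing sentence embeddings by cosine similarity. In the real pipeline the vectors would come from a LaBSE sentence encoder (e.g., via the sentence-transformers library); in the sketch below, tiny hand-made vectors stand in for encoder output so the similarity computation is self-contained, and the example questions are invented for illustration.

```python
import math

# Minimal sketch of the similarity step with mocked embeddings standing in
# for LaBSE encoder output.
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

embed = {
    "How many hours do you sleep per night?": [0.9, 0.1, 0.2],
    "What is your average nightly sleep duration?": [0.85, 0.15, 0.25],
    "Do you smoke cigarettes?": [0.05, 0.95, 0.1],
}

q1, q2, q3 = embed  # dict preserves insertion order
# The two sleep questions should score far higher than an unrelated pair.
redundant = cosine(embed[q1], embed[q2]) > cosine(embed[q1], embed[q3])
```

Thresholding such scores (calibrated against the expert-annotated pairs) is what flags a pair of survey questions as redundant, including across the English-Korean language boundary that LaBSE's multilingual training handles.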
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- Asia > South Korea > Seoul > Seoul (0.05)
- Questionnaire & Opinion Survey (1.00)
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.46)
- Health & Medicine > Health Care Providers & Services (1.00)
- Health & Medicine > Consumer Health (1.00)
- Health & Medicine > Therapeutic Area (0.94)