NanoFlux: Adversarial Dual-LLM Evaluation and Distillation For Multi-Domain Reasoning

Raviteja Anantha, Soheil Hor, Teodor Nicola Antoniu, Layne C. Price

arXiv.org Artificial Intelligence 

We present NanoFlux, a novel adversarial framework for generating targeted training data to improve LLM reasoning, where adversarially-generated datasets of as few as 200 examples outperform conventional fine-tuning approaches. The framework employs a competitive dynamic between models alternating as Attacker and Defender, supervised by a tool-augmented Judge, synthesizing multi-step questions with explanatory annotations that target specific reasoning capabilities. Fine-tuning a 4B-parameter model on NanoFlux-generated data yields performance gains across diverse domains compared to full-benchmark fine-tuning: +5.9% on mathematical reasoning (GSMHard), +3.6% on scientific reasoning (GenomeBench), and +16.6% on medical reasoning (MultiMedQA), while reducing computational requirements by a factor of 3-14. Ablation studies reveal a nonmonotonic relationship between dataset characteristics and model performance, uncovering domain-specific optimal points for question complexity and reasoning quality. NanoFlux automates training data generation through embedding-based novelty filtering, tool-augmented evaluation, and multi-hop reasoning, suggesting that future model improvements may lie in the intelligent synthesis of small, precisely targeted training datasets.

As large language models (LLMs) rapidly approach and surpass human-level performance on established benchmarks, we confront a fundamental limitation: the finite nature of high-quality training data. Today's frontier models have effectively consumed the entirety of available text on the internet, yet they continue to exhibit critical reasoning failures and knowledge gaps. This "benchmark exhaustion" phenomenon raises crucial questions about how to advance AI capabilities beyond the constraints of existing data.
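The embedding-based novelty filtering mentioned above can be sketched as a simple greedy pass: accept a candidate question only if its embedding is sufficiently dissimilar from everything already accepted. The function names and the similarity threshold below are illustrative assumptions, not NanoFlux's actual interface.

```python
import numpy as np

def is_novel(candidate_emb: np.ndarray,
             accepted_embs: list[np.ndarray],
             threshold: float = 0.85) -> bool:
    """Return True if the candidate's maximum cosine similarity to the
    accepted set is below `threshold`, i.e. it adds new coverage."""
    if not accepted_embs:
        return True
    c = candidate_emb / np.linalg.norm(candidate_emb)
    sims = [float(c @ (e / np.linalg.norm(e))) for e in accepted_embs]
    return max(sims) < threshold

def filter_novel(embs: list[np.ndarray], threshold: float = 0.85) -> list[int]:
    """Greedy novelty filter over a stream of candidate embeddings;
    returns the indices of the examples that were kept."""
    kept_idx: list[int] = []
    kept_embs: list[np.ndarray] = []
    for i, e in enumerate(embs):
        if is_novel(e, kept_embs, threshold):
            kept_idx.append(i)
            kept_embs.append(e)
    return kept_idx
```

A greedy threshold filter like this is order-dependent, but it is cheap and keeps the accepted set diverse without requiring a full pairwise clustering step.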
While generating synthetic training examples represents one potential path forward, creating effective synthetic data remains challenging: naive generation approaches often produce low-information samples that fail to improve model performance, while synthesizing effective datasets typically requires precisely the kind of human expertise and curation that we seek to automate. Recent work, notably LIMO (Ye et al., 2025), has demonstrated that small, carefully curated datasets of high-quality chain-of-thought solutions can unlock strong reasoning performance, but it still depends on human effort in curation. We introduce NanoFlux, a fully generative adversarial framework that reimagines data-efficient reasoning improvement. NanoFlux orchestrates a competitive dynamic between two models alternating as Attacker and Defender, supervised by a tool-augmented Judge that evaluates responses for accuracy, coherence, and safety (as shown in Figure 1).
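One round of the Attacker/Defender/Judge dynamic can be sketched as below. The three callables stand in for LLM calls, and their signatures and the verdict fields are assumptions for illustration, not the paper's actual interface: the Attacker synthesizes a question, the Defender answers it, and the tool-augmented Judge scores the exchange; questions the Defender fails on, but that the Judge can verify as safe and answerable, become candidate training examples.

```python
from typing import Callable, Optional

def adversarial_round(attacker: Callable[[str], str],
                      defender: Callable[[str], str],
                      judge: Callable[[str, str], dict],
                      domain: str) -> Optional[dict]:
    """Run one hypothetical NanoFlux-style round for `domain`.

    Returns a candidate training example when the Defender fails a
    question the Judge deems safe, otherwise None."""
    question = attacker(domain)              # Attacker: synthesize a multi-step question
    answer = defender(question)              # Defender: attempt the question
    verdict = judge(question, answer)        # Judge: e.g. {"correct": bool, "safe": bool, ...}
    if verdict.get("safe") and not verdict.get("correct"):
        # Defender failed a verifiable, safe question: keep it as training data.
        return {"question": question, "reference": verdict.get("reference")}
    return None
```

In the full framework the two models alternate roles across rounds, so each model is pushed to generate questions at the frontier of the other's capability.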