Multi-Model Synthetic Training for Mission-Critical Small Language Models
Nolan Platt, Pragyansmita Nayak
arXiv.org Artificial Intelligence
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, yet their application to specialized fields remains constrained by the scarcity and complexity of domain-specific training data. We present a novel approach that achieves a 261x cost reduction for maritime intelligence by using LLMs as one-time teachers rather than using them directly for inference. Our method transforms 3.2 billion Automatic Identification System (AIS) vessel-tracking records into 21,543 synthetic question-answer pairs through multi-model generation (GPT-4o and o3-mini), preventing overfitting and ensuring accurate reasoning. We show that smaller, cheaper models, when fine-tuned properly, can achieve accuracy comparable to larger models that are prohibitively expensive to run continuously. Our work contributes to the growing field of synthetic dataset generation for specialized AI applications and presents a highly reproducible framework for domains where manual annotation is infeasible. Beyond expanding research on specialized small language models, our approach has immediate applications in maritime safety, security operations, and vessel traffic management systems across industries.

In recent years, Large Language Models (LLMs) have proven successful across diverse natural language tasks, but their use in specialized domains faces a major challenge: the cost of continuous LLM inference, which can reach thousands of dollars per day for real-time systems [1].
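The abstract describes turning AIS vessel-tracking records into synthetic Q&A pairs via two teacher models. A minimal sketch of that pipeline shape might look as follows; the function names, the AIS record fields, and the deduplication step are illustrative assumptions, not the authors' code, and real generation would call the GPT-4o and o3-mini APIs rather than return pairs locally.

```python
# Hypothetical sketch: render an AIS record as a generation prompt, then
# merge Q&A pairs produced by two teacher models, dropping duplicate
# questions so no single model's phrasing dominates the dataset.

def record_to_prompt(record: dict) -> str:
    """Render one AIS vessel-tracking record as a Q&A generation prompt.
    Field names (mmsi, lat, lon, sog) are assumed, not from the paper."""
    return (
        f"Vessel MMSI {record['mmsi']} reported position "
        f"({record['lat']:.4f}, {record['lon']:.4f}) at "
        f"{record['sog']} knots. Write a question-answer pair that "
        f"tests reasoning about this track."
    )

def merge_model_outputs(pairs_a: list, pairs_b: list) -> list:
    """Combine (question, answer) pairs from two teacher models,
    keeping the first occurrence of each normalized question."""
    seen, merged = set(), []
    for question, answer in pairs_a + pairs_b:
        key = question.strip().lower()
        if key not in seen:
            seen.add(key)
            merged.append((question, answer))
    return merged
```

Merging outputs from two different model families is one plausible way to realize the paper's stated goal of preventing overfitting to a single teacher's style.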
Sep-17-2025