Towards a Small Language Model Lifecycle Framework

Parsa Miraghaei, Sergio Moreschini, Antti Kolehmainen, David Hästbacka

arXiv.org Artificial Intelligence 

Benchmark suites such as MMLU and HellaSwag measure core capabilities but are vulnerable to data contamination, making careful curation and transparent reporting essential [OS21], [OS2], [OS13], [OS6]. Trustworthiness evaluation covers robustness to adversarial inputs, privacy protection, reliability (including hallucination rates and output consistency), and safety concerns such as toxicity and bias [OS2], [OS6], all of which are vital for user-facing or high-stakes deployments. Resource efficiency, spanning computational cost, memory, energy, and deployment overhead, is particularly important for SLMs and shapes deployment strategies in constrained environments [OS5], [OS6]. Automated evaluation methods range from statistical scorers such as BLEU and ROUGE to model-based and hybrid approaches; the model-based and hybrid methods align more closely with human judgment while scaling far better than manual review [OS29], [OS30]. Ultimately, evaluation should be an integrated, continuous process that informs model iteration, balances performance with sustainability and safety, and supports real-world usability at scale.
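To make the contamination concern concrete, one common screening heuristic is to flag benchmark items whose word n-grams appear verbatim in the training corpus. The Python sketch below is a minimal version of that heuristic; the 13-gram window, the whitespace tokenizer, and the placeholder corpus are illustrative assumptions, not a procedure prescribed by the cited studies.

```python
from typing import Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Word-level n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(item: str, corpus_index: Set[Tuple[str, ...]], n: int = 13) -> bool:
    """Flag a benchmark item that shares any verbatim n-gram with the training corpus."""
    return not ngrams(item, n).isdisjoint(corpus_index)

# Build the corpus index once, then screen every benchmark item against it.
training_docs = ["... training document text ..."]        # placeholder corpus
corpus_index: Set[Tuple[str, ...]] = set()
for doc in training_docs:
    corpus_index |= ngrams(doc)

benchmark_items = ["... benchmark question text ..."]     # placeholder benchmark
flagged = [q for q in benchmark_items if is_contaminated(q, corpus_index)]
```

Verbatim overlap misses paraphrased leakage, which is one reason the text stresses transparent reporting rather than reliance on any single automated filter.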
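For the resource-efficiency dimension, per-request latency and memory headroom are the kinds of numbers that drive deployment decisions on constrained hardware. Below is a minimal, framework-agnostic sketch; `generate_fn` is a hypothetical stand-in for the SLM's inference call, and only the Python heap is traced (accelerator memory would need framework-specific counters such as torch.cuda.max_memory_allocated).

```python
import time
import tracemalloc
from statistics import mean, quantiles
from typing import Callable, Dict, List

def profile_inference(generate_fn: Callable[[str], str],
                      prompts: List[str], warmup: int = 2) -> Dict[str, float]:
    """Measure per-request latency and peak Python heap use for an inference callable."""
    for p in prompts[:warmup]:            # warm caches and lazy initialisation
        generate_fn(p)

    tracemalloc.start()
    latencies = []
    for p in prompts:
        t0 = time.perf_counter()
        generate_fn(p)
        latencies.append(time.perf_counter() - t0)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "mean_latency_s": mean(latencies),
        "p95_latency_s": quantiles(latencies, n=20)[18],  # 95th-percentile cut point
        "peak_heap_mb": peak_bytes / 1e6,
    }

# Example with a trivial stand-in model:
stats = profile_inference(lambda p: p.upper(), [f"prompt {i}" for i in range(40)])
print(stats)
```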
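The statistical scorers named above are cheap to run because they reduce evaluation to n-gram overlap with a reference. A minimal sketch using two widely used Python implementations (nltk for BLEU, Google's rouge-score package for ROUGE) follows; the example strings are placeholders.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the small model runs entirely on the edge device"
candidate = "the compact model runs fully on the edge device"

# BLEU: n-gram precision against the reference, smoothed for short sentences.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 and ROUGE-L: unigram and longest-common-subsequence F-measures.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```

Overlap scores are fast and reproducible but correlate only loosely with human judgment on open-ended generation, which is what motivates the model-based and hybrid approaches.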
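One simple reading of "hybrid" is a weighted blend of a statistical score and a model-based judgment. The sketch below assumes a hypothetical `judge` callable (any wrapper that sends a prompt to a stronger judge model and returns its text reply) and an illustrative 1-5 rubric; neither the blend weight nor the rubric comes from the cited surveys.

```python
from typing import Callable
from rouge_score import rouge_scorer

def hybrid_score(prompt: str, answer: str, reference: str,
                 judge: Callable[[str], str], alpha: float = 0.5) -> float:
    """Blend a statistical overlap score with a model-based rating on a 0-1 scale."""
    # Statistical component: ROUGE-L F-measure against the reference answer.
    overlap = rouge_scorer.RougeScorer(["rougeL"]).score(reference, answer)
    statistical = overlap["rougeL"].fmeasure

    # Model-based component: ask the judge model for a 1-5 rating, then normalise.
    verdict = judge(
        f"Question: {prompt}\nCandidate answer: {answer}\n"
        "Rate the answer from 1 (poor) to 5 (excellent). Reply with the number only."
    )
    rating = (min(max(int(verdict.strip()[0]), 1), 5) - 1) / 4

    return alpha * statistical + (1 - alpha) * rating
```

In practice the judge's prompt and rating scale dominate reliability, so the rubric itself should be reported alongside the scores, consistent with the transparency point above.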