SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification

Kanghoon Yoon, Minsub Kim, Sungjae Lee, Joonhyung Lee, Sunghyeon Woo, Yeonjun In, Se Jung Kwon, Chanyoung Park, Dongsoo Lee

arXiv.org Artificial Intelligence 

Empirical scaling laws establish a relationship between parameter count and model capability, as evidenced by models with hundreds of billions of parameters achieving state-of-the-art results on benchmarks (Kaplan et al., 2020; Grattafiori et al., 2024). However, autoregressive generation requires accessing all model parameters on every forward pass, creating a memory-bandwidth bottleneck that dominates token generation latency. Furthermore, current trends toward more sophisticated LLM applications, such as multi-hop reasoning (Wei et al., 2022), tool integration (Patil et al., 2024), and extended reasoning (Yang et al., 2025; DeepMind, 2025), produce longer output sequences, amplifying the computational burden of inference.

One prominent approach to reducing inference latency is Speculative Decoding (SD), which partially parallelizes the generation process (Leviathan et al., 2023; Chen et al., 2023). Standard SD deploys a computationally efficient draft model to propose candidate token sequences, which the target model (the model of interest) then validates in parallel. The acceptance criterion for draft tokens relies on probability-based alignment verification: a draft token is accepted when its likelihood under the target model meets or exceeds its likelihood under the draft model.
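As a concrete illustration, the full stochastic acceptance rule of Leviathan et al. (2023) extends the deterministic condition above: when the target likelihood p(x) meets or exceeds the draft likelihood q(x), the token is always accepted; otherwise it is accepted with probability p(x)/q(x), and on rejection a replacement token is sampled from the normalized residual max(0, p - q), which preserves the target model's output distribution. A minimal sketch (function names and the NumPy-based setup are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(p_target, q_draft, draft_tokens):
    """Speculative-decoding verification (Leviathan et al., 2023 rule).

    p_target, q_draft: per-position probability vectors over the vocabulary
    from the target and draft models; draft_tokens: the drafted token ids.
    Returns the accepted prefix, plus one corrected token on first rejection.
    """
    output = []
    for t, p, q in zip(draft_tokens, p_target, q_draft):
        # Accept token t with probability min(1, p(t)/q(t)).
        if rng.random() < min(1.0, p[t] / q[t]):
            output.append(t)
        else:
            # On rejection, resample from the residual max(0, p - q),
            # normalized; this keeps the overall output distributed
            # exactly as the target model alone would generate.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            output.append(int(rng.choice(len(p), p=residual)))
            break  # tokens after a rejection are discarded
    return output
```

Note that when p(t) >= q(t) the acceptance probability is exactly 1, recovering the deterministic criterion described above; the probabilistic branch handles the remaining case without biasing the output distribution.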