SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification
Kanghoon Yoon, Minsub Kim, Sungjae Lee, Joonhyung Lee, Sunghyeon Woo, Yeonjun In, Se Jung Kwon, Chanyoung Park, Dongsoo Lee
arXiv.org Artificial Intelligence
Empirical scaling laws establish a relationship between parameter count and model capability, as evidenced by models with hundreds of billions of parameters achieving state-of-the-art results on benchmarks (Kaplan et al., 2020; Grattafiori et al., 2024). However, the autoregressive generation process requires accessing all model parameters at each forward pass, creating a memory-bandwidth bottleneck that dominates token generation latency. Furthermore, current trends toward more sophisticated LLM applications, such as multi-hop reasoning (Wei et al., 2022), tool integration (Patil et al., 2024), and extended reasoning (Yang et al., 2025; DeepMind, 2025), produce longer output sequences, amplifying the computational burden of inference. One prominent approach to reducing inference latency is Speculative Decoding (SD), which partially parallelizes the generation process (Leviathan et al., 2023; Chen et al., 2023). Standard SD deploys a computationally efficient draft model to propose candidate token sequences, which are then validated in parallel by the target model (the model of interest). The acceptance criterion for draft tokens relies on probability-based alignment verification: a draft token is accepted when its likelihood under the target model meets or exceeds its likelihood under the draft model.
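The standard acceptance rule described above can be sketched as follows. This is a minimal illustration (not the paper's SelfJudge method) of the rejection-sampling verification from Leviathan et al. (2023): a draft token whose target-model probability meets or exceeds its draft-model probability is accepted outright; otherwise it is accepted stochastically with probability p_target / p_draft, which preserves the target distribution. The function names are hypothetical.

```python
import random

def accept_draft_token(p_target: float, p_draft: float) -> bool:
    """Standard SD acceptance check for a single draft token.

    Accepts deterministically when the target likelihood meets or
    exceeds the draft likelihood; otherwise accepts stochastically
    with probability p_target / p_draft.
    """
    if p_target >= p_draft:
        return True
    return random.random() < p_target / p_draft

def verify_drafts(target_probs, draft_probs):
    """Verify a drafted sequence left to right, stopping at the
    first rejected token; returns the number of accepted tokens."""
    accepted = 0
    for p_t, p_d in zip(target_probs, draft_probs):
        if accept_draft_token(p_t, p_d):
            accepted += 1
        else:
            break
    return accepted
```

In practice the target model scores all drafted positions in a single parallel forward pass, so every accepted token costs a fraction of a full autoregressive step.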
Oct-6-2025