fk-steering
Inference-Time Alignment of Diffusion Models via Trust-Region Iterative Twisted Sequential Monte Carlo
Wang, Weixin, Yang, Yu, Deng, Wei, Xu, Pan
We study inference-time alignment for diffusion-based generative models, aiming to steer a base model toward high-reward outputs without updating its weights. Recent Sequential Monte Carlo (SMC)-based steering methods approximate reward-tilted target distributions in a principled way, but their proposals remain largely tied to the base sampler. Since reward information is mainly used after propagation through particle reweighting and resampling, these methods can require large particle budgets and suffer from weight degeneracy and high-variance estimates. One way to reduce variance and improve particle efficiency is to iteratively learn twisting functions that provide look-ahead guidance, as in twisted SMC. However, existing learnable twisting methods are developed mainly for classical sequential inference and can be unstable when applied to diffusion-based alignment with high-dimensional state spaces and terminal, noisy, or black-box rewards. We propose Trust-Region Iterative Twisted Sequential Monte Carlo (TRI-TSMC), a trust-region framework for learning twisting functions in SMC-based inference-time alignment. Each iteration computes an exact KL-constrained update in path space, which admits a closed-form solution by tempered importance reweighting, and projects this target back to the parameterized twisted family by weighted maximum likelihood. Theoretically, we formalize the value-function interpretation of the optimal twisting function and show that it yields a zero-variance sampler. We prove that the trust-region update follows an escort path toward the target distribution, that the weighted maximum-likelihood update is a forward-KL projection, and that the path reduces residual importance-weight variance. Empirically, TRI-TSMC improves primary alignment objectives on discrete diffusion text generation and text-to-image generation under matched inference-time budgets.
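The trust-region step described above has two parts: tempered importance reweighting toward the target, then a weighted maximum-likelihood projection back to a parameterized family. Both can be illustrated on a one-dimensional toy. This is a minimal sketch under strong assumptions, not TRI-TSMC itself. A Gaussian family N(mu, sigma^2) stands in for the parameterized twisting functions. The target is a hypothetical N(0, 1) base tilted by exp(2x), which is proportional to N(2, 1). The KL budget `eps` is enforced on the sample estimate sum_i w_i log(N w_i) by bisecting the temperature beta. All function names are hypothetical. Tempering the importance weights (pi*/q_k)^beta turns samples from q_k into samples from q_k^(1-beta) pi*^beta, a geometric interpolation between the current proposal and the target.

```python
import math
import random


def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def log_target(x):
    """Hypothetical unnormalized reward-tilted target: N(0, 1) base times exp(r(x)), r(x) = 2x.
    This tilt is proportional to N(2, 1)."""
    return log_normal_pdf(x, 0.0, 1.0) + 2.0 * x


def tempered_weights(log_w, beta):
    """Self-normalized weights proportional to exp(beta * log_w), computed stably."""
    scaled = [beta * lw for lw in log_w]
    m = max(scaled)
    unnorm = [math.exp(s - m) for s in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def empirical_kl(weights):
    """KL(weighted empirical || uniform empirical) = sum_i w_i log(N w_i)."""
    n = len(weights)
    return sum(w * math.log(n * w) for w in weights if w > 0.0)


def trust_region_step(mu, sigma, n, eps, rng):
    """One hypothetical trust-region iteration: sample, temper, project by weighted MLE."""
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    log_w = [log_target(x) - log_normal_pdf(x, mu, sigma) for x in xs]
    # Largest beta in [0, 1] whose tempered weights stay within the KL budget eps.
    # The empirical KL is nondecreasing in beta (its derivative is beta * Var_w(log_w)),
    # so bisection applies.
    if empirical_kl(tempered_weights(log_w, 1.0)) <= eps:
        beta = 1.0
    else:
        lo, hi = 0.0, 1.0
        for _ in range(50):
            mid = 0.5 * (lo + hi)
            if empirical_kl(tempered_weights(log_w, mid)) <= eps:
                lo = mid
            else:
                hi = mid
        beta = lo
    w = tempered_weights(log_w, beta)
    # Weighted maximum likelihood for a Gaussian: weighted mean and weighted variance.
    new_mu = sum(wi * x for wi, x in zip(w, xs))
    new_var = sum(wi * (x - new_mu) ** 2 for wi, x in zip(w, xs))
    return new_mu, math.sqrt(max(new_var, 1e-12)), beta


if __name__ == "__main__":
    rng = random.Random(0)
    mu, sigma = 0.0, 1.0
    for it in range(10):
        mu, sigma, beta = trust_region_step(mu, sigma, n=2000, eps=0.1, rng=rng)
        print(f"iter {it}: beta={beta:.3f} mu={mu:.3f} sigma={sigma:.3f}")
```

With `eps = 0.1`, an early iteration can shift the mean by only about sqrt(2 * eps), roughly 0.45, because the KL between unit-variance Gaussians whose means differ by d is d^2 / 2. The proposal therefore approaches the target in several bounded steps instead of one high-variance jump.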
Simple Approximation and Derivative Free Inference-Time Scaling for Diffusion Models via Sequential Monte Carlo on Path Measures
Wang, Chenyang, Wang, Weizhong, Ren, Yinuo, Blanchet, Jose, Lu, Yiping
Diffusion-based generative models increasingly rely on inference-time guidance, adding a drift term or reweighting mixture of experts, to improve sample quality on task-specific objectives. However, most existing techniques require repeated score or gradient evaluations, introducing bias, high computational overhead, or both. We introduce URGE, approximation-free Resampling via Girsanov Estimation, a derivative-free inference-time scaling algorithm that performs pathwise importance reweighting via a Girsanov change of measure.
Modern generative models have emerged as a powerful paradigm for learning complex, high-dimensional data distributions. In particular, diffusion models (Ho et al., 2020; Sohl-Dickstein et al., 2015; Song and Ermon, 2019; Song et al., 2020) and flow-based methods (Zhang et al., 2018a; Lipman et al., 2022; Albergo and Vanden-Eijnden, 2022; Liu et al., 2022) provide a principled and scalable framework for generative modeling, achieving state-of-the-art performance across diverse applications, including video generation (Ho et al., 2022), protein design (Gruver et al., 2023), and large-scale text generation (Li et al., 2022; Nie et al., 2025). A unifying perspective underlying these approaches is their formulation in terms of stochastic differential equations …
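The Girsanov change of measure mentioned in the URGE abstract can be made concrete for Euler-Maruyama paths. Consider two SDEs dX = b(t, X) dt + sigma dW that share a constant diffusion coefficient sigma. A path simulated under a proposal drift b_prop can be reweighted toward a base drift b_base with log weight sum_k u_k dW_k - 0.5 * sum_k u_k^2 dt, where u_k = (b_base - b_prop) / sigma. On the discretization grid this equals the exact log ratio of the Gaussian transition densities. This is a textbook sketch, not URGE itself. The drifts, the terminal reward r(x) = x, the step counts, and the function names are all hypothetical, and multinomial resampling is only one standard choice among several.

```python
import math
import random


def simulate_with_girsanov_weight(x0, base_drift, proposal_drift, sigma, t1, steps, rng):
    """Simulate dX = proposal_drift(t, X) dt + sigma dW with Euler-Maruyama and
    accumulate the discretized Girsanov log-weight log dP_base/dP_proposal."""
    dt = t1 / steps
    x, t, log_w = x0, 0.0, 0.0
    for _ in range(steps):
        b_prop = proposal_drift(t, x)
        u = (base_drift(t, x) - b_prop) / sigma  # drift mismatch in noise units
        dw = rng.gauss(0.0, math.sqrt(dt))       # Brownian increment under the proposal
        log_w += u * dw - 0.5 * u * u * dt
        x += b_prop * dt + sigma * dw
        t += dt
    return x, log_w


def multinomial_resample(particles, log_weights, rng):
    """Draw len(particles) particles with probability proportional to exp(log_weights)."""
    m = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(lw - m) for lw in log_weights]
    return rng.choices(particles, weights=weights, k=len(particles))


if __name__ == "__main__":
    rng = random.Random(0)

    def base_drift(t, x):
        return 0.0  # hypothetical base: standard Brownian motion

    def guided_drift(t, x):
        return 0.5  # hypothetical, deliberately imperfect guided proposal

    def reward(x):
        return x  # hypothetical terminal reward; target is proportional to P_base * exp(r(X_T))

    paths = [simulate_with_girsanov_weight(0.0, base_drift, guided_drift, 1.0, 1.0, 50, rng)
             for _ in range(2000)]
    xs = [x for x, _ in paths]
    log_w = [lw + reward(x) for x, lw in paths]
    resampled = multinomial_resample(xs, log_w, rng)
    print("resampled mean of X_T:", sum(resampled) / len(resampled))  # target mean is 1
```

Take a standard Brownian base started at 0 with reward r(x) = x. If the proposal drift is exactly 1, the combined log weight log_w + X_T equals 0.5 for every path, so all particles carry equal weight. A less accurate drift, such as the 0.5 used above, yields unequal weights that resampling then corrects.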
Tree Reward-Aligned Search for TReASURe in Masked Diffusion Language Models
Yu, Zichao, Li, Ming, Zhang, Wenyi, Gao, Weiguo
Tree search has recently emerged as a powerful framework for aligning generative models with task-specific rewards at test time. Applying tree search to Masked Diffusion Language Models, however, introduces two key challenges: (i) parallel unmasking yields highly correlated branches, limiting exploration, and (ii) reward evaluation via sampled completions produces high-variance estimates, making pruning unstable. Theoretically, we quantify branching efficiency gains in NFEs (number of function evaluations), show that the scoring rule approximates the true reward with error bounded by predictive uncertainty, and prove improvements with larger tree widths.
Masked Diffusion Language Models (MDLMs) (Nie et al., 2025; Sahoo et al., 2024; Shi et al., 2024; Yang et al., 2025b) have emerged as a compelling alternative to autoregressive models (Brown et al., 2020; Radford et al., 2019; Touvron et al., 2023). They start with all-mask tokens and gradually reveal tokens through a sequence of discrete denoising steps. At each step, the model predicts token distributions for masked positions, conditioned on the current partially masked sequence and the diffusion timestep. This formulation enables flexible sampling schedules and broad conditioning patterns, making MDLMs well-suited for controllable generation tasks.
[Figure 1: Conceptual illustration of TR…]
However, this flexibility is not fully realized without mechanisms to align the model's outputs with user-defined objectives. Test-Time Alignment (TTA) enables guiding language model outputs toward task-specific goals without retraining. In applications such as toxicity avoidance (Logacheva et al., 2022), sentiment control (Barbieri et al., 2020), or enforcing linguistic acceptability (Warstadt et al., 2019), aligning generation with external reward functions at test time offers a flexible and training-free alternative to supervised fine-tuning.
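The MDLM sampling loop and the rollout-scored pruning discussed above can be combined in a toy sketch. Everything here is a hypothetical stand-in, not TReASURe:

- The "denoiser" reveals one random masked position per step with a fixed token distribution. Real MDLMs predict token distributions with a network and may unmask several positions in parallel.
- The reward simply counts 1 tokens.
- Each child is scored by the mean reward of `n_rollouts` sampled completions before the frontier is pruned to `width` nodes.

```python
import random

MASK = -1  # hypothetical mask token id


def denoise_step(seq, rng, p_one=0.3):
    """Toy stand-in for an MDLM step: reveal one random masked position."""
    masked = [i for i, tok in enumerate(seq) if tok == MASK]
    if not masked:
        return list(seq)
    i = rng.choice(masked)
    child = list(seq)
    child[i] = 1 if rng.random() < p_one else 0
    return child


def rollout(seq, rng):
    """Sample a full completion by repeatedly applying denoise_step."""
    while MASK in seq:
        seq = denoise_step(seq, rng)
    return seq


def reward(seq):
    """Hypothetical terminal reward: number of 1 tokens."""
    return sum(1 for tok in seq if tok == 1)


def score(seq, rng, n_rollouts):
    """Monte Carlo value estimate: mean reward of sampled completions."""
    return sum(reward(rollout(list(seq), rng)) for _ in range(n_rollouts)) / n_rollouts


def tree_search(length, width, branch, n_rollouts, rng):
    """Expand each frontier node into `branch` children, score them by rollouts,
    and keep the top `width`; one position is revealed per tree level."""
    frontier = [[MASK] * length]
    for _ in range(length):
        children = [denoise_step(node, rng) for node in frontier for _ in range(branch)]
        children.sort(key=lambda s: score(s, rng, n_rollouts), reverse=True)
        frontier = children[:width]
    return max(frontier, key=reward)


if __name__ == "__main__":
    rng = random.Random(0)
    best = tree_search(length=8, width=4, branch=4, n_rollouts=8, rng=rng)
    print(best, reward(best))
```

Increasing `n_rollouts` lowers the variance of each score, the problem noted in challenge (ii), at the price of more denoising calls per child.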