We then discuss, in 2.2, the challenges one confronts when attempting to address the above two problems directly with derivative-free PG methods that sample system trajectories. Fortunately, solving zero-sum LQ (stochastic) dynamic games, a benchmark setting in MARL, with the same derivative-free PG methods provides a workaround that addresses these problems in a unified way. This follows from the well-known equivalence between zero-sum LQ dynamic games and the two aforementioned classes of problems [25], which we also discuss in A.3.3.

A.3.1 Linear Exponential Quadratic Gaussian

We first consider a fundamental setting of risk-sensitive optimal control, known as the LEQG problem [22, 27, 28], in the finite-horizon setting. The time-varying (linear) system dynamics are described by

$$x_{t+1} = A_t x_t + B_t u_t + w_t, \quad t \in \{0, \dots, N-1\},$$

where $x_t \in \mathbb{R}^m$ represents the system state; $u_t \in \mathbb{R}^d$ is the control input; $w_t \in \mathbb{R}^m$ is Gaussian random noise, independent across time, drawn from $w_t \sim \mathcal{N}(0, W)$ for some $W > 0$; the initial state $x_0 \sim \mathcal{N}(0, X_0)$ is a Gaussian random vector for some $X_0 > 0$, independent of the sequence $\{w_t\}$; and $A_t$, $B_t$ are time-varying system matrices with appropriate dimensions.
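As a rough illustration of what "sampling system trajectories" involves for the dynamics above, the following sketch rolls out one trajectory in plain Python. The helper names (`matvec`, `sample_gaussian`, `rollout`), the example matrices, the linear feedback policy, and the restriction to diagonal covariances $W$ and $X_0$ are all assumptions made for brevity; none of them come from the source.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def sample_gaussian(diag_cov, rng):
    """Draw from N(0, diag(diag_cov)); diagonal covariance is an assumption."""
    return [rng.gauss(0.0, math.sqrt(var)) for var in diag_cov]


def rollout(A_seq, B_seq, policy, W_diag, X0_diag, rng):
    """Simulate x_{t+1} = A_t x_t + B_t u_t + w_t for t = 0, ..., N-1."""
    x = sample_gaussian(X0_diag, rng)  # x_0 ~ N(0, X_0)
    states = [x]
    for t, (A_t, B_t) in enumerate(zip(A_seq, B_seq)):
        u = policy(t, x)                     # control input u_t
        w = sample_gaussian(W_diag, rng)     # process noise w_t ~ N(0, W)
        Ax, Bu = matvec(A_t, x), matvec(B_t, u)
        x = [a + b + n for a, b, n in zip(Ax, Bu, w)]
        states.append(x)
    return states


# Hypothetical example with m = 2 states, d = 1 input, horizon N = 5.
N = 5
A_seq = [[[1.0, 0.1], [0.0, 1.0]] for _ in range(N)]
B_seq = [[[0.0], [0.1]] for _ in range(N)]


def policy(t, x):
    """Illustrative linear state feedback, not a policy from the source."""
    return [-0.5 * x[0] - 0.5 * x[1]]


states = rollout(A_seq, B_seq, policy, [0.01, 0.01], [1.0, 1.0], random.Random(0))
```

A derivative-free PG method would evaluate a cost along many such rollouts under perturbed policies; that estimation step is not shown here.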
Revealing and Protecting Labels in Distributed Training
Distributed learning paradigms such as federated learning often involve transmitting model updates, or gradients, over a network, thereby avoiding transmission of private data. However, sensitive information about the training data can be revealed from such gradients. Prior works have demonstrated that labels can be revealed analytically from the last layer of certain models (e.g., ResNet), or reconstructed jointly with model inputs by using Gradients Matching [1] with additional knowledge about the current state of the model. In this work, we propose a method to discover the set of labels of training samples from only the gradient of the last layer and the ID-to-label mapping. Our method is applicable to a wide variety of model architectures across multiple domains. We demonstrate the effectiveness of our method for model training in two domains: image classification and automatic speech recognition. Furthermore, we show that existing reconstruction techniques become more effective when used in conjunction with our method. Conversely, we demonstrate that gradient quantization and sparsification can significantly reduce the success of the attack.
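To illustrate why last-layer gradients can leak labels, the sketch below covers the simplest analytic case: a single training sample and a softmax cross-entropy loss. In that case the gradient with respect to the last-layer bias equals the softmax probabilities minus the one-hot label, so its only negative entry sits at the true class. This is a toy instance of the prior analytic approach, not the method proposed in the abstract, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def last_layer_bias_grad(logits, label):
    """Gradient of cross-entropy w.r.t. the last-layer bias for one sample.

    For softmax cross-entropy this is softmax(logits) - one_hot(label).
    """
    probs = softmax(logits)
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]


def infer_label(bias_grad):
    """Recover the label as the index of the most negative gradient entry."""
    return min(range(len(bias_grad)), key=lambda k: bias_grad[k])


# Hypothetical logits for a 4-class model; the true label is class 2.
grad = last_layer_bias_grad([2.0, -1.0, 0.5, 0.1], label=2)
recovered = infer_label(grad)
```

With batches larger than one, the per-sample gradients are aggregated and this sign argument no longer identifies every label directly, which is the harder setting the abstract addresses.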
Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance
Jin, Luozhijie; Qiu, Zijie; Liu, Jie; Diao, Zijie; Qiao, Lifeng; Ding, Ning; Lamb, Alex; Qiu, Xipeng
Denoising-based generative models, particularly diffusion and flow matching algorithms, have achieved remarkable success. However, aligning their output distributions with complex downstream objectives, such as human preferences, compositional accuracy, or data compressibility, remains challenging. While reinforcement learning (RL) fine-tuning methods, inspired by advances in RL from human feedback (RLHF) for large language models, have been adapted to these generative frameworks, current RL approaches are suboptimal for diffusion models and offer limited flexibility in controlling alignment strength after fine-tuning. In this work, we reinterpret RL fine-tuning for diffusion models through the lens of stochastic differential equations and implicit reward conditioning. We introduce Reinforcement Learning Guidance (RLG), an inference-time method that adapts Classifier-Free Guidance (CFG) by combining the outputs of the base and RL fine-tuned models via a geometric average. Our theoretical analysis shows that RLG's guidance scale is mathematically equivalent to adjusting the KL-regularization coefficient in standard RL objectives, enabling dynamic control over the alignment-quality trade-off without further training. Extensive experiments demonstrate that RLG consistently improves the performance of RL fine-tuned models across various architectures, RL algorithms, and downstream tasks, including human preferences, compositional control, compressibility, and text rendering. Furthermore, RLG supports both interpolation and extrapolation, thereby offering unprecedented flexibility in controlling generative alignment. Our approach provides a practical and theoretically sound solution for enhancing and controlling diffusion model alignment at inference time. The source code for RLG is publicly available on GitHub: https://github.com/jinluo12345/Reinforcement-learning-guidance.
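The abstract does not give the exact combination rule, so the following is only a sketch under a stated assumption: a geometric average of two densities corresponds to a linear combination of their scores, and therefore of the models' per-step predictions, in the same form as CFG. The function name `rlg_combine` and the toy values are hypothetical and are not taken from the linked repository.

```python
def rlg_combine(base_out, rl_out, scale):
    """Blend per-step predictions of the base and RL fine-tuned models.

    Assumed CFG-style rule: base + scale * (rl - base).
      scale = 0      -> base model output
      0 < scale < 1  -> interpolation between the two models
      scale = 1      -> RL fine-tuned model output
      scale > 1      -> extrapolation beyond the fine-tuned model
    """
    return [b + scale * (r - b) for b, r in zip(base_out, rl_out)]


# Toy predictions standing in for one denoising step of each model.
base_pred = [0.0, 1.0]
rl_pred = [1.0, 3.0]

interpolated = rlg_combine(base_pred, rl_pred, 0.5)   # [0.5, 2.0]
extrapolated = rlg_combine(base_pred, rl_pred, 2.0)   # [2.0, 5.0]
```

Under this reading, changing `scale` at inference time plays the role the abstract attributes to the guidance scale: it moves the sampler between the base and fine-tuned behaviour without retraining.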