Cai, Xu
FlexControl: Computation-Aware ControlNet with Differentiable Router for Text-to-Image Generation
Fang, Zheng, Xiang, Lichuan, Cai, Xu, Zhou, Kaicheng, Wen, Hongkai
ControlNet offers a powerful way to guide diffusion-based generative models, yet most implementations rely on ad-hoc heuristics to choose which network blocks to control, an approach that varies unpredictably across tasks. To address this gap, we propose FlexControl, a novel framework that copies all diffusion blocks during training and employs a trainable gating mechanism to dynamically select which blocks to activate at each denoising step. By introducing a computation-aware loss, we encourage control blocks to activate only when they benefit generation quality. Eliminating manual block selection enhances adaptability across diverse tasks and streamlines the design pipeline, with the computation-aware loss optimized in an end-to-end manner. Through comprehensive experiments on both UNet (e.g., SD1.5) and DiT (e.g., SD3.0) backbones, we show that our method outperforms existing ControlNet variants in key aspects of interest. As evidenced by both quantitative and qualitative evaluations, FlexControl preserves or enhances image fidelity while reducing computational overhead by selectively activating only the most relevant blocks. These results underscore the potential of a flexible, data-driven approach to controlled diffusion and open new avenues for efficient generative model design.
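To make the gating idea concrete, here is a minimal, hypothetical PyTorch sketch (not the paper's implementation; names such as ControlRouter and the form of the cost penalty are illustrative assumptions): a router predicts a per-block activation probability from the timestep embedding, a straight-through estimator keeps the hard on/off decision differentiable, and a computation-aware penalty discourages activating blocks that do not improve the denoising loss.

    import torch
    import torch.nn as nn

    class ControlRouter(nn.Module):
        """Hypothetical differentiable router: one gate per control block."""
        def __init__(self, num_blocks, cond_dim):
            super().__init__()
            self.gate_logits = nn.Linear(cond_dim, num_blocks)

        def forward(self, timestep_emb):
            probs = torch.sigmoid(self.gate_logits(timestep_emb))  # soft gates in (0, 1)
            hard = (probs > 0.5).float()                            # discrete on/off decisions
            # Straight-through estimator: hard decisions forward, soft gradients backward.
            return hard + probs - probs.detach()

    def computation_aware_loss(denoise_loss, gates, block_flops, lam=0.01):
        # Penalize the expected compute of activated control blocks, so a gate
        # opens only when doing so improves the denoising objective.
        return denoise_loss + lam * (gates * block_flops).sum(dim=-1).mean()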
K Nearest Neighbor-Guided Trajectory Similarity Learning
Chang, Yanchuan, Cai, Xu, Jensen, Christian S., Qi, Jianzhong
Trajectory similarity is fundamental to many spatio-temporal data mining applications. Recent studies propose deep learning models to approximate conventional trajectory similarity measures, exploiting their fast inference time once trained. Although efficient inference has been reported, challenges remain in similarity approximation accuracy due to difficulties in trajectory granularity modeling and in exploiting similarity signals in the training data. To fill this gap, we propose TSMini, a highly effective trajectory similarity model with a sub-view modeling mechanism capable of learning multi-granularity trajectory patterns and a k nearest neighbor-based loss that guides TSMini to learn not only absolute similarity values between trajectories but also their relative similarity ranks. Together, these two innovations enable highly accurate trajectory similarity approximation. Experiments show that TSMini can outperform the state-of-the-art models by 22% in accuracy on average when learning trajectory similarity measures.
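As a rough illustration of the kNN-guided objective described above (a hedged sketch assuming batch-wise pairwise similarity matrices, not the authors' code), one can combine a regression term on absolute similarity values with a hinge term that pushes each trajectory's true k nearest neighbors above the other candidates:

    import torch
    import torch.nn.functional as F

    def knn_guided_loss(pred_sim, true_sim, k=10, alpha=0.5, margin=0.1):
        """Hypothetical sketch. pred_sim, true_sim: (B, B) pairwise similarity
        matrices for a batch, where larger values mean more similar and the
        diagonal holds self-similarities."""
        # Absolute similarity values: regress onto the ground-truth measure.
        value_loss = F.mse_loss(pred_sim, true_sim)
        # Relative ranks: each anchor's true k nearest neighbors should score
        # higher than the other trajectories in the batch (pairwise hinge;
        # negatives are not masked for brevity).
        knn_idx = true_sim.topk(k + 1, dim=1).indices[:, 1:]   # drop the self-match
        pos = pred_sim.gather(1, knn_idx)                      # (B, k)
        gaps = pos.unsqueeze(2) - pred_sim.unsqueeze(1)        # (B, k, B)
        rank_loss = F.relu(margin - gaps).mean()
        return alpha * value_loss + (1.0 - alpha) * rank_loss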
No More Adam: Learning Rate Scaling at Initialization is All You Need
Xu, Minghao, Xiang, Lichuan, Cai, Xu, Wen, Hongkai
In this work, we question the necessity of adaptive gradient methods for training deep neural networks, and propose SGD-SaI, a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI applies learning rate Scaling at Initialization (SaI) to distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI helps prevent training imbalances from the very first iteration and cuts the optimizer's memory usage in half compared to AdamW. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW on a variety of Transformer-based tasks, effectively overcoming a long-standing challenge of using SGD to train Transformers. SGD-SaI excels in ImageNet-1K classification with Vision Transformers (ViT) and in GPT-2 pretraining for large language models (LLMs, transformer decoder-only), demonstrating robustness to hyperparameter variations and practicality for diverse applications. We further tested its robustness on tasks such as LoRA fine-tuning for LLMs and diffusion models, where it consistently outperforms state-of-the-art optimizers. From a memory efficiency perspective, SGD-SaI achieves substantial savings in optimizer state, reducing memory usage by 5.93 GB for GPT-2 (1.5B parameters) and 25.15 GB for Llama2-7B compared to AdamW in full-precision training settings.
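The following hedged sketch illustrates the idea of scaling learning rates once at initialization from a per-tensor gradient signal-to-noise ratio (the g-SNR formula below is an illustrative proxy, not the paper's exact definition, and it assumes a single backward pass has already populated .grad for every parameter):

    import torch

    def sgd_with_sai(named_params, base_lr=0.1, momentum=0.9, eps=1e-8):
        """Hypothetical sketch of learning-rate Scaling at Initialization (SaI)."""
        groups = []
        for name, p in named_params:
            g = p.grad.detach().flatten()
            snr = (g.abs().mean() / (g.std() + eps)).item()   # crude per-tensor g-SNR proxy
            groups.append({"params": [p], "lr": base_lr * snr})
        # Plain SGD with momentum afterwards: the per-group scales stay fixed for
        # the whole run and no second-order optimizer state is kept.
        return torch.optim.SGD(groups, lr=base_lr, momentum=momentum)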
Lower Bounds for Time-Varying Kernelized Bandits
Cai, Xu, Scarlett, Jonathan
The optimization of black-box functions with noisy observations is a fundamental problem with widespread applications, and has been widely studied under the assumption that the function lies in a reproducing kernel Hilbert space (RKHS). This problem has been studied extensively in the stationary setting, where near-optimal regret bounds are known via developments in both upper and lower bounds. In this paper, we consider non-stationary scenarios, which are crucial for certain applications but are currently less well understood. Specifically, we provide the first algorithm-independent lower bounds for this setting, where the time variations are constrained to satisfy a total variation budget with respect to a given function norm. Under $\ell_{\infty}$-norm variations, our bounds are found to be close to the state-of-the-art upper bound (Hong \emph{et al.}, 2023). Under RKHS-norm variations, the upper and lower bounds are still reasonably close but with a larger gap, raising the interesting open question of whether more substantial improvements in the upper bound are possible.
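For concreteness, a common way to formalize such a budget (stated in generic notation, which may differ from the paper's exact setup) is to require $\sum_{t=1}^{T-1} \|f_{t+1} - f_t\| \le B_T$, where $\|\cdot\|$ is either the $\ell_{\infty}$-norm or the RKHS norm, and regret is measured against the time-varying maximizer $x_t^* = \arg\max_x f_t(x)$.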
Kernelized Normalizing Constant Estimation: Bridging Bayesian Quadrature and Bayesian Optimization
Cai, Xu, Scarlett, Jonathan
In this paper, we study the problem of estimating the normalizing constant $\int e^{-\lambda f(x)}dx$ through queries to the black-box function $f$, where $f$ belongs to a reproducing kernel Hilbert space (RKHS) and $\lambda$ is a problem parameter. We show that to estimate the normalizing constant within a small relative error, the level of difficulty depends on the value of $\lambda$: when $\lambda$ approaches zero, the problem is similar to Bayesian quadrature (BQ), while when $\lambda$ approaches infinity, the problem is similar to Bayesian optimization (BO). More generally, the problem interpolates between BQ and BO. We find that this pattern holds even when the function evaluations are noisy, bringing new aspects to this topic. Our findings are supported by both algorithm-independent lower bounds and algorithmic upper bounds, as well as simulation studies conducted on a variety of benchmark functions.
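The two regimes can be seen from a back-of-the-envelope expansion (an illustrative aside, not taken from the paper): as $\lambda \to 0$, $\int e^{-\lambda f(x)}dx \approx \int (1 - \lambda f(x))dx$, so the problem amounts to integrating $f$ (the BQ regime), whereas as $\lambda \to \infty$, $\frac{1}{\lambda}\log \int e^{-\lambda f(x)}dx \to -\min_x f(x)$ by the Laplace principle, so the problem amounts to optimizing $f$ (the BO regime).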
On Average-Case Error Bounds for Kernel-Based Bayesian Quadrature
Cai, Xu, Lam, Chi Thanh, Scarlett, Jonathan
In this paper, we study error bounds for {\em Bayesian quadrature} (BQ), with an emphasis on noisy settings, randomized algorithms, and average-case performance measures. We seek to approximate the integral of functions in a {\em Reproducing Kernel Hilbert Space} (RKHS), particularly focusing on the Mat\'ern-$\nu$ and squared exponential (SE) kernels, with samples from the function potentially being corrupted by Gaussian noise. We provide a two-step meta-algorithm that serves as a general tool for relating the average-case quadrature error with the $L^2$-function approximation error. When specialized to the Mat\'ern kernel, we recover an existing near-optimal error rate while avoiding the existing method of repeatedly sampling points. When specialized to other settings, we obtain new average-case results for settings including the SE kernel with noise and the Mat\'ern kernel with misspecification. Finally, we present algorithm-independent lower bounds that have greater generality and/or give distinct proofs compared to existing ones.
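In standard notation (hedged; the paper's exact definitions may differ), the goal is to estimate $I(f) = \int f(x)d\mu(x)$ from noisy samples $y_i = f(x_i) + \epsilon_i$, and the second step of such a meta-algorithm can rest on the elementary bound $|I(\hat{f}) - I(f)| = |\int (\hat{f} - f)d\mu| \le \|\hat{f} - f\|_{L^2(\mu)}$ for a probability measure $\mu$ (Cauchy-Schwarz), which converts an $L^2$-function approximation guarantee for a surrogate $\hat{f}$ into an average-case quadrature error bound.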
Lenient Regret and Good-Action Identification in Gaussian Process Bandits
Cai, Xu, Gomes, Selwyn, Scarlett, Jonathan
In this paper, we study the problem of Gaussian process (GP) bandits under relaxed optimization criteria stating that any function value above a certain threshold is "good enough". On the theoretical side, we study various \emph{lenient regret} notions in which all near-optimal actions incur zero penalty, and provide upper bounds on the lenient regret for GP-UCB and an elimination algorithm, circumventing the usual $O(\sqrt{T})$ term (with time horizon $T$) resulting from zooming extremely close towards the function maximum. In addition, we complement these upper bounds with algorithm-independent lower bounds. On the practical side, we consider the problem of finding a single "good action" according to a known pre-specified threshold, and introduce several good-action identification algorithms that exploit knowledge of the threshold. We experimentally find that such algorithms can often find a good action faster than standard optimization-based approaches.
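As a representative example (the paper studies several variants; notation here is generic), a lenient regret with threshold $\epsilon$ can be written as $R_T^{\epsilon} = \sum_{t=1}^{T} \phi_\epsilon(f(x^*) - f(x_t))$ with, e.g., $\phi_\epsilon(r) = \max(0, r - \epsilon)$ or $\phi_\epsilon(r) = \mathbb{1}\{r > \epsilon\}$, so that any action whose value is within $\epsilon$ of the maximum incurs zero penalty.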
On Lower Bounds for Standard and Robust Gaussian Process Bandit Optimization
Cai, Xu, Scarlett, Jonathan
The use of Gaussian process (GP) methods for black-box function optimization has seen significant advances in recent years, with applications including hyperparameter tuning, robotics, molecular design, and many more. On the theoretical side, a variety of algorithms have been developed with provable regret bounds [9,11,15,23,41,45,46], and algorithm-independent lower bounds have been given in several settings of interest [11,14,36,37,44]. These theoretical works can be broadly categorized into one of two types: in the Bayesian setting, one adopts a Gaussian process prior according to some kernel function, whereas in the non-Bayesian setting, the function is assumed to lie in some Reproducing Kernel Hilbert Space (RKHS) and to be upper bounded in terms of the corresponding RKHS norm. In this paper, we focus on the non-Bayesian setting and seek to broaden the existing understanding of algorithm-independent lower bounds on the regret, which have received relatively less attention than algorithmic upper bounds. Among our main contributions, in the standard noisy GP optimization setting we provide an alternative proof strategy for the existing lower bounds of [37], strengthen the dependence on the error probability, and illustrate scenarios in which our approach is simpler and/or more versatile.
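For reference (a standard setup for such lower bounds, stated in generic notation rather than quoted from the paper), the function is assumed to satisfy $\|f\|_k \le B$ in the RKHS of a kernel $k$, observations are $y_t = f(x_t) + z_t$ with $z_t \sim \mathcal{N}(0, \sigma^2)$, and the quantities being bounded are the cumulative regret $R_T = \sum_{t=1}^{T}(\max_x f(x) - f(x_t))$ and the simple regret $r_T = \max_x f(x) - f(\hat{x}_T)$ of a final reported point $\hat{x}_T$.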
FRAME Revisited: An Interpretation View Based on Particle Evolution
Cai, Xu, Wu, Yang, Li, Guanbin, Chen, Ziliang, Lin, Liang
FRAME (Filters, Random fields, And Maximum Entropy) is an energy-based descriptive model that synthesizes visual realism by capturing mutual patterns from structural input signals. Maximum likelihood estimation (MLE) is applied by default, yet it conventionally causes unstable training energy that wrecks the generated structures, a phenomenon that has remained unexplained. In this paper, we provide a new theoretical insight for analyzing FRAME from a perspective of particle physics, ascribing this phenomenon to a KL-vanishing issue. To stabilize the energy dissipation, we propose an alternative Wasserstein distance in discrete time, based on the conclusion that the Jordan-Kinderlehrer-Otto (JKO) discrete flow approximates the KL discrete flow as the time step size tends to 0. Moreover, this metric still maintains the model's statistical consistency. Quantitative and qualitative experiments on several widely used datasets evidence the effectiveness and superiority of our method.
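The discrete-time construction invoked above is the standard JKO scheme (stated here in generic notation, not quoted from the paper): $\rho_{k+1} = \arg\min_{\rho} \{\frac{1}{2\tau} W_2^2(\rho, \rho_k) + F(\rho)\}$, which recovers the Wasserstein gradient flow of the free energy $F$ as the step size $\tau \to 0$; taking $F$ to be the KL divergence to the model's Gibbs distribution links the particle updates to the Fokker-Planck dynamics underlying energy-based training.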