Xin, Yi
TIDE: Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation
Huang, Victor Shea-Jay, Zhuo, Le, Xin, Yi, Wang, Zhaokai, Gao, Peng, Li, Hongsheng
Diffusion Transformers (DiTs) are a powerful yet underexplored class of generative models compared to U-Net-based diffusion models. To bridge this gap, we introduce TIDE (Temporal-aware Sparse Autoencoders for Interpretable Diffusion transformErs), a novel framework that enhances temporal reconstruction within DiT activation layers across denoising steps. TIDE employs Sparse Autoencoders (SAEs) with a sparse bottleneck layer to extract interpretable and hierarchical features, revealing that diffusion models inherently learn hierarchical features at multiple levels (e.g., 3D, semantic, class) during generative pre-training. Our approach achieves state-of-the-art reconstruction performance, with a mean squared error (MSE) of 1e-3 and a cosine similarity of 0.97, demonstrating superior accuracy in capturing activation dynamics along the denoising trajectory. Beyond interpretability, we showcase TIDE's potential in downstream applications such as sparse activation-guided image editing and style transfer, enabling improved controllability for generative systems. By providing a comprehensive training and evaluation protocol tailored for DiTs, TIDE contributes to developing more interpretable, transparent, and trustworthy generative models.
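The abstract ships no code, but the core mechanism it names is the standard sparse-autoencoder pattern: a wide, sparsity-penalized bottleneck trained to reconstruct transformer activations. Below is a minimal sketch of that generic pattern, assuming illustrative layer sizes and an L1 sparsity penalty; it is not TIDE's actual implementation.

```python
# Minimal sketch (not TIDE's implementation): a sparse autoencoder trained to
# reconstruct DiT activations, with an L1 penalty on the bottleneck to
# encourage sparsely firing, interpretable features. Sizes are assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 1152, d_hidden: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        # acts: (batch, d_model) activations gathered from a DiT layer,
        # collected across denoising timesteps.
        codes = torch.relu(self.encoder(acts))  # sparse, non-negative features
        recon = self.decoder(codes)
        return recon, codes

def sae_loss(recon, acts, codes, l1_coeff: float = 1e-4):
    # Reconstruction fidelity (the MSE the abstract reports) plus sparsity
    # pressure; l1_coeff trades fidelity against sparsity.
    mse = torch.mean((recon - acts) ** 2)
    return mse + l1_coeff * codes.abs().mean()
```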
High-Fidelity 3D Lung CT Synthesis in ARDS Swine Models Using Score-Based 3D Residual Diffusion Models
Yoon, Siyeop, Oh, Yujin, Li, Xiang, Xin, Yi, Cereda, Maurizio, Li, Quanzheng
Acute respiratory distress syndrome (ARDS) is a severe condition characterized by lung inflammation and respiratory failure, with a high mortality rate of approximately 40%. Traditional imaging methods, such as chest X-rays, provide only two-dimensional views, limiting their effectiveness in fully assessing lung pathology. Three-dimensional (3D) computed tomography (CT) offers a more comprehensive visualization, enabling detailed analysis of lung aeration, atelectasis, and the effects of therapeutic interventions. However, the routine use of CT in ARDS management is constrained by practical challenges and risks associated with transporting critically ill patients to remote scanners. In this study, we synthesize high-fidelity 3D lung CT volumes from generated 2D X-ray images and associated physiological parameters using a score-based 3D residual diffusion model. Our preliminary results demonstrate that this approach can produce high-quality 3D CT images that are validated against ground truth, offering a promising solution for enhancing ARDS management.
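As a rough illustration of the model family involved, the following sketch shows generic Euler-Maruyama sampling for a score-based model along a variance-exploding noise schedule, conditioned on features derived from a 2D X-ray. The network interface, volume shape, and schedule values are assumptions, and the paper's residual formulation is not reproduced here.

```python
# Illustrative sketch only: generic reverse-time sampling for a score-based
# (VE-SDE) model. score_net, xray_cond, and all shapes are assumptions.
import math
import torch

@torch.no_grad()
def sample_volume(score_net, xray_cond, shape=(1, 1, 64, 128, 128),
                  sigma_min=0.01, sigma_max=50.0, n_steps=500, device="cpu"):
    # Geometric noise schedule decreasing from sigma_max to sigma_min.
    sigmas = torch.exp(torch.linspace(math.log(sigma_max), math.log(sigma_min),
                                      n_steps, device=device))
    x = torch.randn(shape, device=device) * sigmas[0]   # start from pure noise
    for i in range(n_steps - 1):
        step = sigmas[i] ** 2 - sigmas[i + 1] ** 2      # VE-SDE discretization
        score = score_net(x, sigmas[i], xray_cond)      # approx. grad log p(x | cond)
        x = x + step * score                            # drift toward the data
        x = x + torch.sqrt(step) * torch.randn_like(x)  # re-inject diffusion noise
    return x                                            # synthesized 3D volume
```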
D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models
Wan, Zhongwei, Wu, Xinjian, Zhang, Yu, Xin, Yi, Tao, Chaofan, Zhu, Zhihong, Wang, Xin, Luo, Siqi, Xiong, Jing, Zhang, Mi
Efficient inference in Large Language Models (LLMs) is impeded by the growing memory demands of key-value (KV) caching, especially for longer sequences. Traditional KV cache eviction strategies, which discard less critical KV pairs based on attention scores, often degrade generation quality, leading to issues such as context loss or hallucinations. To address this, we introduce Dynamic Discriminative Operations (D2O), a novel method that utilizes two-level discriminative strategies to optimize KV cache size without fine-tuning, while preserving essential context. First, observing that the density of attention weights varies between shallow and deep layers, we use this insight to determine which layers should avoid excessive eviction so as to minimize information loss. Second, within each layer's eviction strategy, D2O incorporates a compensation mechanism that maintains a similarity threshold to re-discriminate the importance of previously discarded tokens, determining whether they should be recalled and merged with similar tokens. Our approach not only achieves significant memory savings and improves inference throughput by more than 3x, but also maintains high-quality long-text generation. Extensive experiments across various benchmarks and LLM architectures demonstrate that D2O significantly enhances performance under a constrained KV cache budget.
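The two mechanisms the abstract describes, attention-score-based eviction plus a similarity-thresholded merge of discarded tokens, can be sketched roughly as follows. This is an illustrative approximation, not the official D2O code; the shapes, the averaging merge, and the threshold value are assumptions.

```python
# Illustrative sketch, not the official D2O code: score cached tokens by
# accumulated attention, evict the lowest-scoring ones, and merge an evicted
# token into its most similar kept token when cosine similarity clears a
# threshold (the "compensation" idea). Shapes and names are assumptions.
import torch
import torch.nn.functional as F

def evict_and_merge(keys, values, attn_scores, budget, sim_thresh=0.5):
    # keys, values: (seq, d); attn_scores: (seq,) accumulated attention mass.
    keep_idx = attn_scores.topk(budget).indices.sort().values
    evict_mask = torch.ones(keys.size(0), dtype=torch.bool)
    evict_mask[keep_idx] = False
    k_keep, v_keep = keys[keep_idx].clone(), values[keep_idx].clone()
    # Compensation: recall evicted tokens that closely match a kept token.
    for k_ev, v_ev in zip(keys[evict_mask], values[evict_mask]):
        sims = F.cosine_similarity(k_keep, k_ev.unsqueeze(0), dim=-1)
        j = sims.argmax()
        if sims[j] > sim_thresh:
            # Merge by simple averaging; a weighted merge is another option.
            k_keep[j] = (k_keep[j] + k_ev) / 2
            v_keep[j] = (v_keep[j] + v_ev) / 2
    return k_keep, v_keep
```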
Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model
Yi, Mingyang, Li, Aoxue, Xin, Yi, Li, Zhenguo
Recently, the strong latent Diffusion Probabilistic Model (DPM) has been applied to high-quality Text-to-Image (T2I) generation (e.g., Stable Diffusion) by injecting the encoded target text prompt into the gradually denoised diffusion image generator. Despite the practical success of DPM, the mechanism behind it remains to be explored. To fill this blank, we begin by examining the intermediate states during the gradual denoising generation process in DPM. Our empirical observations indicate that the shape of the image is reconstructed within the first few denoising steps, after which the image is filled with details (e.g., texture). This occurs because the low-frequency (shape-relevant) signal of the noisy image is not corrupted until the final stage of the forward noising process, which corresponds to the initial stage of generation. Inspired by these observations, we proceed to explore the influence of each token in the text prompt during the two stages. After a series of T2I generation experiments conditioned on a set of text prompts, we conclude that in the earlier generation stage the image is mostly decided by the special token [EOS] in the text prompt, and the information in the text prompt is already conveyed in this stage. After that, the diffusion model completes the details of the generated images using information from the images themselves. Finally, we apply this observation to accelerate T2I generation by properly removing text guidance, speeding up sampling by up to 25%.
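The proposed acceleration follows directly from the two-stage observation: text guidance matters early, so the conditional and guidance passes can be dropped late. Below is a minimal sketch under a diffusers-style interface; the `unet`/`scheduler` signatures and the cutoff fraction are assumptions, not the paper's code.

```python
# Illustrative sketch of the acceleration idea: keep text conditioning (and
# classifier-free guidance) only for the early, shape-forming steps, then run
# unconditional denoising for the detail-filling steps. Interfaces assumed.
import torch

@torch.no_grad()
def sample(unet, scheduler, text_emb, null_emb, latents,
           guidance=7.5, text_cutoff=0.4):
    n = len(scheduler.timesteps)
    for i, t in enumerate(scheduler.timesteps):
        if i < text_cutoff * n:
            # Early stage: full classifier-free guidance with the prompt.
            eps_c = unet(latents, t, text_emb)
            eps_u = unet(latents, t, null_emb)
            eps = eps_u + guidance * (eps_c - eps_u)
        else:
            # Late stage: details are filled in from the image itself, so a
            # single unconditional pass halves the remaining per-step compute.
            eps = unet(latents, t, null_emb)
        latents = scheduler.step(eps, t, latents).prev_sample
    return latents
```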
Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey
Xin, Yi, Luo, Siqi, Zhou, Haodi, Du, Junlong, Liu, Xiaohong, Fan, Yue, Li, Qing, Du, Yuntao
Large-scale pre-trained vision models (PVMs) have shown great potential for adaptability across various downstream vision tasks. However, with state-of-the-art PVMs growing to billions or even trillions of parameters, the standard full fine-tuning paradigm is becoming unsustainable due to high computational and storage demands. In response, researchers are exploring parameter-efficient fine-tuning (PEFT), which seeks to exceed the performance of full fine-tuning with minimal parameter modifications. This survey provides a comprehensive overview and future directions for visual PEFT, offering a systematic review of the latest advancements.

As a promising solution, PEFT, which was originally proposed in NLP, overcomes the above challenges by updating a minimal number of parameters while potentially achieving comparable or superior performance to full fine-tuning [Hu et al., 2021; Yu et al., 2022]. These approaches hinge on recent advances showing that large pre-trained models trained with rich data have strong generalizability and that most parameters in PVMs can be shared for new tasks [Kornblith et al., 2019; Yu et al., 2022]. PEFT methods reduce the number of learnable parameters, which not only facilitates more effective adaptation to novel tasks but also safeguards pre-existing knowledge.
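As a concrete example of the PEFT idea the survey covers, here is a minimal sketch of one representative method family, LoRA-style low-rank adaptation [Hu et al., 2021]: the pre-trained weight is frozen and only a small low-rank update is trained. The rank, scaling, and class name are illustrative assumptions, not code from the survey.

```python
# Minimal sketch of LoRA-style parameter-efficient fine-tuning: freeze the
# pre-trained linear layer and train only a low-rank additive update.
# Rank and scaling are illustrative assumptions.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # frozen pre-trained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # update starts at zero, so the
        self.scale = alpha / rank            # module initially equals the base

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Wrapping selected layers this way leaves the backbone's knowledge intact while training only the small `lora_a`/`lora_b` matrices, which is the parameter-savings mechanism the survey's abstract describes.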