Goto

Collaborating Authors

 Mao, Qi


S$^2$AG-Vid: Enhancing Multi-Motion Alignment in Video Diffusion Models via Spatial and Syntactic Attention-Based Guidance

arXiv.org Artificial Intelligence

Recent advancements in text-to-video (T2V) generation using diffusion models have garnered significant attention. However, existing T2V models primarily focus on simple scenes featuring a single object performing a single motion. Challenges arise in scenarios involving multiple objects with distinct motions, often leading to incorrect video-text alignment between subjects and their corresponding motions. To address this challenge, we propose \textbf{S$^2$AG-Vid}, a training-free inference-stage optimization method that improves the alignment of multiple objects with their corresponding motions in T2V models. S$^2$AG-Vid initially applies a spatial position-based, cross-attention (CA) constraint in the early stages of the denoising process, facilitating multiple nouns distinctly attending to the correct subject regions. To enhance the motion-subject binding, we implement a syntax-guided contrastive constraint in the subsequent denoising phase, aimed at improving the correlations between the CA maps of verbs and their corresponding nouns.Both qualitative and quantitative evaluations demonstrate that the proposed framework significantly outperforms baseline approaches, producing higher-quality videos with improved subject-motion consistency.


Unifying Generation and Compression: Ultra-low bitrate Image Coding Via Multi-stage Transformer

arXiv.org Artificial Intelligence

Recent progress in generative compression technology has significantly improved the perceptual quality of compressed data. However, these advancements primarily focus on producing high-frequency details, often overlooking the ability of generative models to capture the prior distribution of image content, thus impeding further bitrate reduction in extreme compression scenarios (<0.05 bpp). Motivated by the capabilities of predictive language models for lossless compression, this paper introduces a novel Unified Image Generation-Compression (UIGC) paradigm, merging the processes of generation and compression. A key feature of the UIGC framework is the adoption of vector-quantized (VQ) image models for tokenization, alongside a multi-stage transformer designed to exploit spatial contextual information for modeling the prior distribution. As such, the dual-purpose framework effectively utilizes the learned prior for entropy estimation and assists in the regeneration of lost tokens. Extensive experiments demonstrate the superiority of the proposed UIGC framework over existing codecs in perceptual quality and human perception, particularly in ultra-low bitrate scenarios (<=0.03 bpp), pioneering a new direction in generative compression.


Scalable Face Image Coding via StyleGAN Prior: Towards Compression for Human-Machine Collaborative Vision

arXiv.org Artificial Intelligence

The accelerated proliferation of visual content and the rapid development of machine vision technologies bring significant challenges in delivering visual data on a gigantic scale, which shall be effectively represented to satisfy both human and machine requirements. In this work, we investigate how hierarchical representations derived from the advanced generative prior facilitate constructing an efficient scalable coding paradigm for human-machine collaborative vision. Our key insight is that by exploiting the StyleGAN prior, we can learn three-layered representations encoding hierarchical semantics, which are elaborately designed into the basic, middle, and enhanced layers, supporting machine intelligence and human visual perception in a progressive fashion. With the aim of achieving efficient compression, we propose the layer-wise scalable entropy transformer to reduce the redundancy between layers. Based on the multi-task scalable rate-distortion objective, the proposed scheme is jointly optimized to achieve optimal machine analysis performance, human perception experience, and compression ratio. We validate the proposed paradigm's feasibility in face image compression. Extensive qualitative and quantitative experimental results demonstrate the superiority of the proposed paradigm over the latest compression standard Versatile Video Coding (VVC) in terms of both machine analysis as well as human perception at extremely low bitrates ($<0.01$ bpp), offering new insights for human-machine collaborative compression.


MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance

arXiv.org Artificial Intelligence

However, localized editing in complex the other hand, mask-free methods that utilize attention injection scenarios has not been well-studied in the literature, despite mechanisms such as Prompt-to-Prompt (P2P) [10] its growing real-world demands. Existing mask-based and Plug-and-Play (PnP) [28] can preserve the original image's inpainting methods fall short of retaining the underlying structure and layout. Nevertheless, they struggle to structure within the edit region. Meanwhile, mask-free precisely align the local editing region with the intended attention-based methods often exhibit editing leakage and text in intricate scenarios, largely due to their reliance on the misalignment in more complex compositions. In this work, text prompts' localization capabilities. As a result, editing we develop MAG-Edit, a training-free, inference-stage optimization effects often extend beyond the intended area and impact method, which enables localized image editing in incorrect regions, as shown in the fourth column of Figure 1.


Latent Smooth Skeleton Embedding

AAAI Conferences

Learning a smooth skeleton in a low-dimensional space from noisy data becomes important in computer vision and computational biology. Existing methods assume that the manifold constructed from the data is smooth, but they lack the ability to model skeleton structures from noisy data. To overcome this issue, we propose a novel probabilistic structured learning model to learn the density of latent embedding given high-dimensional data and its neighborhood graph. The embedded points that form a smooth skeleton structure are obtained by maximum a posteriori (MAP) estimation. Our analysis shows that the resulting similarity matrix is sparse and unique, and its associated kernel has eigenvalues that follow a power law distribution, which leads to the embeddings of a smooth skeleton. The model is extended to learn a sparse similarity matrix when the graph structure is unknown. Extensive experiments demonstrate the effectiveness of the proposed methods on various datasets by comparing them with existing methods.


A Novel Regularized Principal Graph Learning Framework on Explicit Graph Representation

arXiv.org Machine Learning

Many scientific datasets are of high dimension, and the analysis usually requires visual manipulation by retaining the most important structures of data. Principal curve is a widely used approach for this purpose. However, many existing methods work only for data with structures that are not self-intersected, which is quite restrictive for real applications. A few methods can overcome the above problem, but they either require complicated human-made rules for a specific task with lack of convergence guarantee and adaption flexibility to different tasks, or cannot obtain explicit structures of data. To address these issues, we develop a new regularized principal graph learning framework that captures the local information of the underlying graph structure based on reversed graph embedding. As showcases, models that can learn a spanning tree or a weighted undirected $\ell_1$ graph are proposed, and a new learning algorithm is developed that learns a set of principal points and a graph structure from data, simultaneously. The new algorithm is simple with guaranteed convergence. We then extend the proposed framework to deal with large-scale data. Experimental results on various synthetic and six real world datasets show that the proposed method compares favorably with baselines and can uncover the underlying structure correctly.


A Split-Merge Framework for Comparing Clusterings

arXiv.org Machine Learning

Clustering evaluation measures are frequently used to evaluate the performance of algorithms. However, most measures are not properly normalized and ignore some information in the inherent structure of clusterings. We model the relation between two clusterings as a bipartite graph and propose a general component-based decomposition formula based on the components of the graph. Most existing measures are examples of this formula. In order to satisfy consistency in the component, we further propose a split-merge framework for comparing clusterings of different data sets. Our framework gives measures that are conditionally normalized, and it can make use of data point information, such as feature vectors and pairwise distances. We use an entropy-based instance of the framework and a coreference resolution data set to demonstrate empirically the utility of our framework over other measures.


Parameter-Free Spectral Kernel Learning

arXiv.org Machine Learning

Due to the growing ubiquity of unlabeled data, learning with unlabeled data is attracting increasing attention in machine learning. In this paper, we propose a novel semi-supervised kernel learning method which can seamlessly combine manifold structure of unlabeled data and Regularized Least-Squares (RLS) to learn a new kernel. Interestingly, the new kernel matrix can be obtained analytically with the use of spectral decomposition of graph Laplacian matrix. Hence, the proposed algorithm does not require any numerical optimization solvers. Moreover, by maximizing kernel target alignment on labeled data, we can also learn model parameters automatically with a closed-form solution. For a given graph Laplacian matrix, our proposed method does not need to tune any model parameter including the tradeoff parameter in RLS and the balance parameter for unlabeled data. Extensive experiments on ten benchmark datasets show that our proposed two-stage parameter-free spectral kernel learning algorithm can obtain comparable performance with fine-tuned manifold regularization methods in transductive setting, and outperform multiple kernel learning in supervised setting.