Luo, Xiangyang
SUMI-IFL: An Information-Theoretic Framework for Image Forgery Localization with Sufficiency and Minimality Constraints
Sheng, Ziqi, Lu, Wei, Luo, Xiangyang, Zhou, Jiantao, Cao, Xiaochun
Image forgery localization (IFL) is a crucial technique for preventing tampered image misuse and protecting social safety. However, due to the rapid development of image tampering technologies, extracting more comprehensive and accurate forgery clues remains an urgent challenge. To address these challenges, we introduce a novel information-theoretic IFL framework named SUMI-IFL that imposes sufficiency-view and minimality-view constraints on forgery feature representation. First, grounded in the theoretical analysis of mutual information, the sufficiency-view constraint is enforced on the feature extraction network to ensure that the latent forgery feature contains comprehensive forgery clues. Considering that forgery clues obtained from a single aspect alone may be incomplete, we construct the latent forgery feature by integrating several individual forgery features from multiple perspectives. Second, based on the information bottleneck, the minimality-view constraint is imposed on the feature reasoning network to achieve an accurate and concise forgery feature representation that counters the interference of task-unrelated features. Extensive experiments show the superior performance of SUMI-IFL to existing state-of-the-art methods, not only on in-dataset comparisons but also on cross-dataset comparisons.
PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based Talking Head Synthesis
Xie, Yifan, Feng, Tao, Zhang, Xin, Luo, Xiangyang, Guo, Zixuan, Yu, Weijiang, Chang, Heng, Ma, Fei, Yu, Fei Richard
Talking head synthesis with arbitrary speech audio is a crucial challenge in the field of digital humans. Recently, methods based on radiance fields have received increasing attention due to their ability to synthesize high-fidelity and identity-consistent talking heads from just a few minutes of training video. However, due to the limited scale of the training data, these methods often exhibit poor performance in audio-lip synchronization and visual quality. In this paper, we propose a novel 3D Gaussian-based method called PointTalk, which constructs a static 3D Gaussian field of the head and deforms it in sync with the audio. It also incorporates an audio-driven dynamic lip point cloud as a critical component of the conditional information, thereby facilitating the effective synthesis of talking heads. Specifically, the initial step involves generating the corresponding lip point cloud from the audio signal and capturing its topological structure. The design of the dynamic difference encoder aims to capture the subtle nuances inherent in dynamic lip movements more effectively. Furthermore, we integrate the audio-point enhancement module, which not only ensures the synchronization of the audio signal with the corresponding lip point cloud within the feature space, but also facilitates a deeper understanding of the interrelations among cross-modal conditional features. Extensive experiments demonstrate that our method achieves superior high-fidelity and audio-lip synchronization in talking head synthesis compared to previous methods.
ID-Guard: A Universal Framework for Combating Facial Manipulation via Breaking Identification
Qu, Zuomin, Lu, Wei, Luo, Xiangyang, Wang, Qian, Cao, Xiaochun
The misuse of deep learning-based facial manipulation poses a potential threat to civil rights. To prevent this fraud at its source, proactive defense technology was proposed to disrupt the manipulation process by adding invisible adversarial perturbations into images, making the forged output unconvincing to the observer. However, their non-directional disruption of the output may result in the retention of identity information of the person in the image, leading to stigmatization of the individual. In this paper, we propose a novel universal framework for combating facial manipulation, called ID-Guard. Specifically, this framework requires only a single forward pass of an encoder-decoder network to generate a cross-model universal adversarial perturbation corresponding to a specific facial image. To ensure anonymity in manipulated facial images, a novel Identity Destruction Module (IDM) is introduced to destroy the identifiable information in forged faces targetedly. Additionally, we optimize the perturbations produced by considering the disruption towards different facial manipulations as a multi-task learning problem and design a dynamic weights strategy to improve cross-model performance. The proposed framework reports impressive results in defending against multiple widely used facial manipulations, effectively distorting the identifiable regions in the manipulated facial images. In addition, our experiments reveal the ID-Guard's ability to enable disrupted images to avoid face inpaintings and open-source image recognition systems.
Efficient Sparse Attention needs Adaptive Token Release
Zhang, Chaoran, Zou, Lixin, Luo, Dan, Tang, Min, Luo, Xiangyang, Li, Zihao, Li, Chenliang
In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide array of text-centric tasks. However, their `large' scale introduces significant computational and storage challenges, particularly in managing the key-value states of the transformer, which limits their wider applicability. Therefore, we propose to adaptively release resources from caches and rebuild the necessary key-value states. Particularly, we accomplish this by a lightweight controller module to approximate an ideal top-$K$ sparse attention. This module retains the tokens with the highest top-$K$ attention weights and simultaneously rebuilds the discarded but necessary tokens, which may become essential for future decoding. Comprehensive experiments in natural language generation and modeling reveal that our method is not only competitive with full attention in terms of performance but also achieves a significant throughput improvement of up to 221.8%. The code for replication is available on the https://github.com/WHUIR/ADORE.
Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning
Liu, Guisheng, Li, Yi, Fei, Zhengcong, Fu, Haiyan, Luo, Xiangyang, Guo, Yanqing
While impressive performance has been achieved in image captioning, the limited diversity of the generated captions and the large parameter scale remain major barriers to the real-word application of these systems. In this work, we propose a lightweight image captioning network in combination with continuous diffusion, called Prefix-diffusion. To achieve diversity, we design an efficient method that injects prefix image embeddings into the denoising process of the diffusion model. In order to reduce trainable parameters, we employ a pre-trained model to extract image features and further design an extra mapping network. Prefix-diffusion is able to generate diverse captions with relatively less parameters, while maintaining the fluency and relevance of the captions benefiting from the generative capabilities of the diffusion model. Our work paves the way for scaling up diffusion models for image captioning, and achieves promising performance compared with recent approaches.
GNN-Geo: A Graph Neural Network-based Fine-grained IP geolocation Framework
Ding, Shichang, Luo, Xiangyang, Wang, Jinwei, Fu, Xiaoming
Abstract--Rule-based fine-grained IP geolocation methods are hard to generalize in computer networks which do not follow hypothetical rules. Recently, deep learning methods, like multi-layer perceptron (MLP), are tried to increase generalization capabilities. However, MLP is not so suitable for graph-structured data like networks. MLP treats IP addresses as isolated instances and ignores the connection information, which limits geolocation accuracy. In this work, we research how to increase the generalization capability with an emerging graph deep learning method - Graph Neural Network (GNN). First, IP geolocation is re-formulated as an attributed graph node regression problem. Then, we propose a GNN-based IP geolocation framework named GNN-Geo. GNN-Geo consists of a preprocessor, an encoder, messaging passing (MP) layers and a decoder. The preprocessor and encoder transform measurement data into the initial node embeddings. MP layers refine the initial node embeddings by modeling the connection information. The decoder maps the refined embeddings to nodes' locations and relieves the convergence problem by considering prior knowledge. The experiments in 8 real-world IPv4/IPv6 networks in North America, Europe and Asia show the proposed GNN-Geo clearly outperforms the state-of-art rule-based and learning-based baselines. P geolocation is to obtain the geographic location (geolocation) of an IP address [1, 2]. It is widely used in localized IP after the probing host [2] obtains the measurement data online advertisement, location-based content restriction, cybercrime (e.g., delay) to the target IP and landmarks [2, 5]. IP geolocation is important are IP addresses with known locations [2]. Generally, existing especially for networking devices without GPS (Global Positioning measurement-based fine-grained IP geolocation methods can be System) function, such as PCs and routers. IP geolocation categorized into two kinds: rule-based and deep learning-based can be categorized into two kinds by accuracy: coarse-grained methods (referred as learning-based methods in this paper).
DLocRL: A Deep Learning Pipeline for Fine-Grained Location Recognition and Linking in Tweets
Xu, Canwen, Li, Jing, Luo, Xiangyang, Pei, Jiaxin, Li, Chenliang, Ji, Donghong
In recent years, with the prevalence of social media and smart devices, people causally reveal their locations such as shops, hotels, and restaurants in their tweets. Recognizing and linking such fine-grained location mentions to well-defined location profiles are beneficial for retrieval and recommendation systems. In this paper, we propose DLocRL, a new deep learning pipeline for fine-grained location recognition and linking in tweets, and verify its effectiveness on a real-world Twitter dataset.
NEXT: A Neural Network Framework for Next POI Recommendation
Zhang, Zhiqian, Li, Chenliang, Wu, Zhiyong, Sun, Aixin, Ye, Dengpan, Luo, Xiangyang
The task of next POI recommendation has been studied extensively in recent years. However, developing an unified recommendation framework to incorporate multiple factors associated with both POIs and users remains challenging, because of the heterogeneity nature of these information. Further, effective mechanisms to handle cold-start and endow the system with interpretability are also difficult topics. Inspired by the recent success of neural networks in many areas, in this paper, we present a simple but effective neural network framework for next POI recommendation, named NEXT. NEXT is an unified framework to learn the hidden intent regarding user's next move, by incorporating different factors in an unified manner. Specifically, in NEXT, we incorporate meta-data information and two kinds of temporal contexts (i.e., time interval and visit time). To leverage sequential relations and geographical influence, we propose to adopt DeepWalk, a network representation learning technique, to encode such knowledge. We evaluate the effectiveness of NEXT against state-of-the-art alternatives and neural networks based solutions. Experimental results over three publicly available datasets demonstrate that NEXT significantly outperforms baselines in real-time next POI recommendation. Further experiments demonstrate the superiority of NEXT in handling cold-start. More importantly, we show that NEXT provides meaningful explanation of the dimensions in hidden intent space.