Cai, Zhiping
MetaTrading: An Immersion-Aware Model Trading Framework for Vehicular Metaverse Services
Wu, Hongjia, Zeng, Hui, Xiong, Zehui, Kang, Jiawen, Cai, Zhiping, Chan, Tse-Tin, Niyato, Dusit, Han, Zhu
Updates of extensive Internet of Things (IoT) data are critical to the immersion of vehicular metaverse services. However, providing high-quality and sustainable data in unstable and resource-constrained vehicular networks remains a significant challenge. To address this problem, we propose a novel immersion-aware model trading framework that incentivizes metaverse users (MUs) to contribute learning models trained on their latest local data for augmented reality (AR) services in the vehicular metaverse, while preserving their privacy through federated learning. To comprehensively evaluate the contribution of locally trained learning models provided by MUs to AR services, we design a new immersion metric that captures service immersion by considering the freshness and accuracy of learning models, as well as the amount and potential value of raw data used for training. We model the trading interactions between metaverse service providers (MSPs) and MUs as an equilibrium problem with equilibrium constraints (EPEC) to analyze and balance their costs and gains. Moreover, considering dynamic network conditions and privacy concerns, we formulate the reward decisions of MSPs as a multi-agent Markov decision process. We then present a fully distributed dynamic reward method based on deep reinforcement learning, which operates without any private information about MUs or other MSPs. Experimental results on real AR-related vehicle datasets demonstrate that, compared to benchmark schemes, the proposed framework effectively provides higher-value models for object detection and classification in AR services.
RARE: Robust Masked Graph Autoencoder
Tu, Wenxuan, Liao, Qing, Zhou, Sihang, Peng, Xin, Ma, Chuan, Liu, Zhe, Liu, Xinwang, Cai, Zhiping
Masked graph autoencoder (MGAE) has emerged as a promising self-supervised graph pre-training (SGP) paradigm due to its simplicity and effectiveness. However, existing efforts perform the mask-then-reconstruct operation in the raw data space, as is done in computer vision (CV) and natural language processing (NLP), while neglecting the important non-Euclidean property of graph data. As a result, the highly unstable local connection structures largely increase the uncertainty in inferring masked data and decrease the reliability of the exploited self-supervision signals, leading to inferior representations for downstream evaluations. To address this issue, we propose a novel SGP method termed Robust mAsked gRaph autoEncoder (RARE) to improve the certainty in inferring masked data and the reliability of the self-supervision mechanism by further masking and reconstructing node samples in the high-order latent feature space. Through both theoretical and empirical analyses, we find that performing a joint mask-then-reconstruct strategy in both the latent feature and raw data spaces can yield improved stability and performance. To this end, we design a masked latent feature completion scheme, which predicts latent features of masked nodes under the guidance of high-order sample correlations that are hard to observe from the raw data perspective. Specifically, we first adopt a latent feature predictor to predict the masked latent features from the visible ones. Next, we encode the raw data of masked samples with a momentum graph encoder and subsequently employ the resulting representations to improve the predicted results through latent feature matching. Extensive experiments on seventeen datasets demonstrate the effectiveness and robustness of RARE against state-of-the-art (SOTA) competitors across three downstream tasks.
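As a rough illustration of two ingredients named in this abstract, the sketch below splits node indices into visible and masked sets and applies an exponential-moving-average update to a momentum encoder's parameters. The function names, the uniform random masking, and the EMA rule are assumptions made for illustration; the abstract does not state how RARE implements either step.

```python
import random


def split_masked_nodes(num_nodes, mask_ratio, seed=0):
    """Randomly split node indices into (visible, masked) lists.

    Hypothetical helper: uniform random masking is an assumption.
    """
    rng = random.Random(seed)
    num_masked = round(num_nodes * mask_ratio)
    masked = sorted(rng.sample(range(num_nodes), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_nodes) if i not in masked_set]
    return visible, masked


def momentum_update(momentum_params, online_params, m=0.99):
    """Assumed EMA rule: theta_m <- m * theta_m + (1 - m) * theta."""
    return [m * pm + (1.0 - m) * po
            for pm, po in zip(momentum_params, online_params)]


if __name__ == "__main__":
    visible, masked = split_masked_nodes(num_nodes=10, mask_ratio=0.5)
    print("visible:", visible)
    print("masked:", masked)
    print(momentum_update([1.0, 2.0], [0.0, 0.0], m=0.5))
```

In a pipeline of this kind, the online encoder would see only the visible nodes, while the momentum encoder would encode the masked nodes to produce targets for latent feature matching.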
Revisiting Initializing Then Refining: An Incomplete and Missing Graph Imputation Network
Tu, Wenxuan, Xiao, Bin, Liu, Xinwang, Zhou, Sihang, Cai, Zhiping, Cheng, Jieren
With the development of various applications, such as social networks and knowledge graphs, graph data has become ubiquitous in the real world. Unfortunately, graph data is often partially absent due to privacy-protecting policies or copyright restrictions during data collection. The absence of graph data can be roughly categorized into attribute-incomplete and attribute-missing circumstances. Specifically, attribute-incomplete means that some entries of the attribute vectors of all nodes are absent, while attribute-missing means that the entire attribute vectors of some nodes are absent. Although many efforts have been devoted to this problem, none is designed for the common situation in which both types of graph data absence exist simultaneously. To fill this gap, we develop a novel network termed Revisiting Initializing Then Refining (RITR), which completes both attribute-incomplete and attribute-missing samples under the guidance of a novel initializing-then-refining imputation criterion. Specifically, to complete attribute-incomplete samples, we first initialize the incomplete attributes using Gaussian noise before network learning, and then introduce a structure-attribute consistency constraint to refine incomplete values by approximating a structure-attribute correlation matrix to a high-order structural matrix. To complete attribute-missing samples, we first adopt structure embeddings of attribute-missing samples as the embedding initialization, and then refine these initial values by adaptively aggregating the reliable information of attribute-incomplete samples according to a dynamic affinity structure. To the best of our knowledge, this newly designed method is the first unsupervised framework dedicated to handling hybrid-absent graphs. Extensive experiments on four datasets verify that our method consistently outperforms existing state-of-the-art competitors.
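The two kinds of absence described above, and the Gaussian-noise initialization applied to attribute-incomplete samples, can be sketched as follows. This is a minimal illustration under stated assumptions: the boolean `observed` mask, the label strings, and both function names are hypothetical, and the refinement stages of RITR are not shown.

```python
import random


def classify_nodes(observed):
    """Label each node by how much of its attribute vector is observed.

    observed[i][j] is True when attribute j of node i is available.
    """
    labels = []
    for row in observed:
        if all(row):
            labels.append("complete")
        elif not any(row):
            labels.append("attribute-missing")
        else:
            labels.append("attribute-incomplete")
    return labels


def gaussian_init_incomplete(attrs, observed, seed=0):
    """Fill absent entries of partially observed rows with N(0, 1) noise.

    Fully absent rows are left untouched, since the abstract initializes
    them from structure embeddings instead (not implemented here).
    """
    rng = random.Random(seed)
    filled = []
    for row, mask in zip(attrs, observed):
        if any(mask) and not all(mask):
            filled.append([x if seen else rng.gauss(0.0, 1.0)
                           for x, seen in zip(row, mask)])
        else:
            filled.append(list(row))
    return filled


if __name__ == "__main__":
    observed = [[True, True], [True, False], [False, False]]
    attrs = [[1.0, 2.0], [3.0, None], [None, None]]
    print(classify_nodes(observed))
    print(gaussian_init_incomplete(attrs, observed))
```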
Video Abnormal Event Detection by Learning to Complete Visual Cloze Tests
Wang, Siqi, Yu, Guang, Cai, Zhiping, Liu, Xinwang, Zhu, En, Yin, Jianping, Liao, Qing
Video abnormal event detection (VAD) is a vital semi-supervised task that requires learning with only roughly labeled normal videos, as anomalies are often practically unavailable. Although deep neural networks (DNNs) enable great progress in VAD, existing solutions typically suffer from two issues: (1) The precise and comprehensive localization of video events is ignored. (2) The video semantics and temporal context are under-explored. To address these issues, we are motivated by the prevalent cloze test in education and propose a novel approach named visual cloze completion (VCC), which performs VAD by learning to complete "visual cloze tests" (VCTs). Specifically, VCC first localizes each video event and encloses it into a spatio-temporal cube (STC). To achieve both precise and comprehensive localization, appearance and motion are used as mutually complementary cues to mark the object region associated with each video event. For each marked region, a normalized patch sequence is extracted from temporally adjacent frames and stacked into the STC. By comparing each patch and the patch sequence of an STC to a visual "word" and "sentence" respectively, we can deliberately erase a certain "word" (patch) to yield a VCT. DNNs are then trained to infer the erased patch from video semantics, so as to complete the VCT. To fully exploit the temporal context, each patch in the STC is alternately erased to create multiple VCTs, and the erased patch's optical flow is also inferred to integrate richer motion clues. Meanwhile, a new DNN architecture is designed as a model-level solution to utilize video semantics and temporal context. Extensive experiments demonstrate that VCC achieves state-of-the-art VAD performance. Our code and results are available at https://github.com/yuguangnudt/VEC_VAD/tree/VCC.
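The step of erasing each patch in turn to create multiple VCTs from one STC can be sketched as below. This is only an illustration of the erasure step: the function name and the tuple layout are hypothetical, patches are stood in for by strings, and the optical-flow target and the DNN itself are omitted.

```python
def make_cloze_tests(stc):
    """Build one cloze test per patch of a spatio-temporal cube.

    stc is the temporally ordered patch sequence. Each test is a tuple
    (context_patches, erased_patch, erased_index): the model would be
    trained to infer erased_patch from context_patches.
    """
    tests = []
    for index, erased_patch in enumerate(stc):
        context = stc[:index] + stc[index + 1:]
        tests.append((context, erased_patch, index))
    return tests


if __name__ == "__main__":
    cube = ["patch_t0", "patch_t1", "patch_t2"]
    for context, target, index in make_cloze_tests(cube):
        print(index, context, "->", target)
```

An STC with n patches yields n cloze tests, so every temporal position serves once as the prediction target.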
Multi-View Spectral Clustering with High-Order Optimal Neighborhood Laplacian Matrix
Liang, Weixuan, Zhou, Sihang, Xiong, Jian, Liu, Xinwang, Wang, Siwei, Zhu, En, Cai, Zhiping, Xu, Xin
Multi-view spectral clustering can effectively reveal the intrinsic cluster structure among data by performing clustering on the learned optimal embedding across views. Despite demonstrating promising performance in various applications, most existing methods linearly combine a group of pre-specified first-order Laplacian matrices to construct the optimal Laplacian matrix, which may result in limited representation capability and insufficient information exploitation. In addition, storing and performing complex operations on the $n\times n$ Laplacian matrices incurs intensive storage and computation complexity. To address these issues, this paper first proposes a multi-view spectral clustering algorithm that learns a high-order optimal neighborhood Laplacian matrix, and then extends it to a late fusion version for accurate and efficient multi-view clustering. Specifically, our proposed algorithm generates the optimal Laplacian matrix by searching the neighborhood of the linear combination of both the first-order and high-order base Laplacian matrices simultaneously. In this way, the representation capacity of the learned optimal Laplacian matrix is enhanced, which helps to better utilize the hidden high-order connection information among data and leads to improved clustering performance. We design an efficient algorithm with proven convergence to solve the resultant optimization problem. Extensive experimental results on nine datasets demonstrate the superiority of our algorithm over state-of-the-art methods.
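The linear combination of first-order and high-order base matrices mentioned above can be sketched as follows. This is an illustration only: the abstract does not define the high-order base matrices, so the sketch assumes the order-k matrix of a view is the k-th power of its first-order matrix. The weights are fixed inputs here rather than learned, and the neighborhood search is not shown.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def combine_base_matrices(first_order, weights, max_order=2):
    """Weighted sum over views and orders of base matrices.

    first_order[v] is the n x n first-order matrix of view v, and
    weights[v][o - 1] is the coefficient of its order-o matrix.
    ASSUMPTION: the order-o matrix is the o-th power of first_order[v].
    """
    n = len(first_order[0])
    combined = [[0.0] * n for _ in range(n)]
    for base, view_weights in zip(first_order, weights):
        power = base
        for order in range(1, max_order + 1):
            w = view_weights[order - 1]
            for i in range(n):
                for j in range(n):
                    combined[i][j] += w * power[i][j]
            if order < max_order:
                power = matmul(power, base)
    return combined


if __name__ == "__main__":
    view_a = [[1.0, 0.0], [0.0, 2.0]]
    view_b = [[0.0, 1.0], [1.0, 0.0]]
    print(combine_base_matrices([view_a, view_b], [[0.5, 0.5], [1.0, 0.0]]))
```

Setting every high-order weight to zero recovers the first-order-only combination that the abstract attributes to most existing methods.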