Xiao, Jimin
Image Fusion for Cross-Domain Sequential Recommendation
Wu, Wangyu, Song, Siqi, Qiu, Xianglin, Huang, Xiaowei, Ma, Fei, Xiao, Jimin
Cross-Domain Sequential Recommendation (CDSR) aims to predict future user interactions based on historical interactions across multiple domains. The key challenge in CDSR is effectively capturing cross-domain user preferences by fully leveraging both intra-sequence and inter-sequence item interactions. In this paper, we propose a novel method, Image Fusion for Cross-Domain Sequential Recommendation (IFCDSR), which incorporates item image information to better capture visual preferences. Our approach integrates a frozen CLIP model to generate image embeddings, enriching the original item embeddings with visual data from both intra-sequence and inter-sequence interactions. Additionally, we employ multiple attention layers to capture cross-domain interests, enabling joint learning of single-domain and cross-domain user preferences. To validate the effectiveness of IFCDSR, we re-partitioned four e-commerce datasets and conducted extensive experiments. Results demonstrate that IFCDSR significantly outperforms existing methods.
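The fusion step described above can be pictured as follows; this is a minimal sketch, assuming a frozen CLIP image encoder whose features are precomputed, with a gated combination of ID and visual embeddings (the gate and all dimensions are illustrative, not the paper's exact design):

```python
import torch
import torch.nn as nn

class ImageFusedItemEmbedding(nn.Module):
    """Enrich ID-based item embeddings with frozen CLIP image features (illustrative sketch)."""

    def __init__(self, num_items: int, dim: int = 512, clip_dim: int = 512):
        super().__init__()
        self.id_emb = nn.Embedding(num_items, dim)   # learnable item-ID embeddings
        self.proj = nn.Linear(clip_dim, dim)         # map CLIP space to item space
        self.gate = nn.Linear(2 * dim, dim)          # hypothetical fusion gate

    def forward(self, item_ids: torch.Tensor, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: precomputed with a frozen CLIP image encoder, shape (B, L, clip_dim)
        e_id = self.id_emb(item_ids)                 # (B, L, dim)
        e_img = self.proj(clip_feats)                # (B, L, dim)
        g = torch.sigmoid(self.gate(torch.cat([e_id, e_img], dim=-1)))
        return g * e_id + (1 - g) * e_img            # gated fusion of ID and visual cues

# usage on dummy data
fuser = ImageFusedItemEmbedding(num_items=1000)
ids = torch.randint(0, 1000, (2, 10))
feats = torch.randn(2, 10, 512)
print(fuser(ids, feats).shape)  # torch.Size([2, 10, 512])
```

The fused sequence embeddings would then feed the attention layers that model single-domain and cross-domain preferences.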
CNC: Cross-modal Normality Constraint for Unsupervised Multi-class Anomaly Detection
Wang, Xiaolei, Wang, Xiaoyang, Bai, Huihui, Lim, Eng Gee, Xiao, Jimin
Existing unsupervised distillation-based methods rely on the differences between encoded and decoded features to locate abnormal regions in test images. However, a decoder trained only on normal samples still reconstructs abnormal patch features well, degrading performance. This issue is particularly pronounced in unsupervised multi-class anomaly detection tasks. We attribute this behavior to over-generalization (OG) of the decoder: the significantly increased diversity of patch patterns in multi-class training enhances the model's generalization on normal patches, but also inadvertently broadens its generalization to abnormal patches. To mitigate OG, we propose a novel approach that leverages class-agnostic learnable prompts to capture common textual normality across various visual patterns, and then applies them to guide the decoded features towards a normal textual representation, suppressing the decoder's over-generalization on abnormal patterns. To further improve performance, we also introduce a gated mixture-of-experts module that specializes in handling diverse patch patterns and reduces their mutual interference in multi-class training. Our method achieves competitive performance on the MVTec AD and VisA datasets, demonstrating its effectiveness.
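The normality constraint can be illustrated as a cosine-alignment loss that pulls decoded patch features toward a prompt-derived text embedding; in this minimal sketch a learnable vector stands in for the CLIP-encoded class-agnostic prompts, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def normality_constraint_loss(decoded: torch.Tensor, text_normality: torch.Tensor) -> torch.Tensor:
    """Pull decoded patch features toward a shared textual normality direction.

    decoded:        (B, N, D) patch features from the distillation decoder
    text_normality: (D,) embedding of class-agnostic learnable prompts
                    (in the paper this would come from a text encoder; here
                    it is just a learnable vector for illustration)
    """
    decoded = F.normalize(decoded, dim=-1)
    target = F.normalize(text_normality, dim=-1)
    # 1 - cosine similarity, averaged over all patches
    return (1.0 - decoded @ target).mean()

# dummy usage
feats = torch.randn(4, 196, 256)
prompt_vec = torch.randn(256, requires_grad=True)  # learnable normality embedding
loss = normality_constraint_loss(feats, prompt_vec)
loss.backward()
```

Because the target direction is shared across classes, abnormal patches that drift away from it yield larger encoder-decoder differences at test time.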
Event USKT: U-State Space Model in Knowledge Transfer for Event Cameras
Lin, Yuhui, Zhang, Jiahao, Li, Siyuan, Xiao, Jimin, Xu, Ding, Wu, Wenjun, Lu, Jiaxuan
Event cameras, as an emerging imaging technology, offer distinct advantages over traditional RGB cameras, including reduced energy consumption and higher frame rates. However, the limited quantity of available event data presents a significant challenge, hindering their broader development. To alleviate this issue, we introduce a tailored U-shaped State Space Model Knowledge Transfer (USKT) framework for Event-to-RGB knowledge transfer. This framework generates inputs compatible with RGB frames, enabling event data to effectively reuse pre-trained RGB models and achieve competitive performance with minimal parameter tuning. Within the USKT architecture, we also propose a Bidirectional Reverse State Space Model (BiR-SSM). Unlike conventional bidirectional scanning mechanisms, BiR-SSM leverages a shared-weight strategy, which facilitates efficient modeling while conserving computational resources. In terms of effectiveness, integrating USKT with a ResNet50 backbone improves model performance by 0.95%, 3.57%, and 2.9% on the DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets, respectively, underscoring USKT's adaptability and effectiveness. The code will be made available upon acceptance.
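The shared-weight bidirectional idea can be sketched independently of the exact SSM block: one sequence module processes the tokens forward and, with the same parameters, the reversed tokens, and the two passes are merged. In this illustrative sketch a GRU stands in for the state space block; the real BiR-SSM uses a state space model:

```python
import torch
import torch.nn as nn

class SharedWeightBiScan(nn.Module):
    """Bidirectional scan with one shared sequence module (stand-in for BiR-SSM)."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.scan = nn.GRU(dim, dim, batch_first=True)  # stand-in for an SSM block
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fwd, _ = self.scan(x)                           # forward pass over the sequence
        bwd, _ = self.scan(torch.flip(x, dims=[1]))     # same weights on the reversed sequence
        bwd = torch.flip(bwd, dims=[1])                 # re-align reversed outputs
        return self.out(torch.cat([fwd, bwd], dim=-1))  # merge both directions

x = torch.randn(2, 16, 64)   # (batch, tokens, dim), e.g. event-frame tokens
print(SharedWeightBiScan()(x).shape)  # torch.Size([2, 16, 64])
```

Sharing one set of scan weights for both directions roughly halves the scan parameters relative to two independent directional modules.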
High-Frequency Enhanced Hybrid Neural Representation for Video Compression
Yu, Li, Li, Zhihui, Xiao, Jimin, Gabbouj, Moncef
In 2023, more than 65% of total Internet traffic was video content (Corporation, 2023), and this percentage is expected to keep increasing. In the past, video compression was usually achieved by traditional codecs such as H.264/AVC (Wiegand et al., 2003), H.265/HEVC (Sullivan et al., 2012), H.266/VVC (Bross et al., 2021), and AVS (Zhang et al., 2019). However, the handcrafted algorithms in these traditional codecs limit their compression efficiency. With the rise of deep learning, many neural video codec (NVC) technologies have been proposed (Lu et al., 2019; Li et al., 2021; Agustsson et al., 2020; Wang et al., 2024b). These approaches replace handcrafted components with deep learning modules, achieving impressive rate-distortion performance. However, NVC approaches have not yet achieved widespread adoption in practical applications. One reason is that they often require a large network to achieve generalized compression over the entire data distribution, which is computationally intensive and frequently leads to slower decoding than traditional codecs. Moreover, the generalization capability of the network depends on the dataset used for training, leading to poor performance on out-of-distribution (OOD) data from different domains (Zhang et al., 2021a), or even when the resolution changes. To overcome these challenges associated with NVCs, researchers have turned to implicit neural representations (INRs) as a promising alternative.
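The INR alternative sidesteps the generalization problem by overfitting a small network to one specific video and treating the network weights as the bitstream; decoding is then a single forward pass. A minimal NeRV-style sketch, with illustrative architecture and sizes:

```python
import torch
import torch.nn as nn

class TinyVideoINR(nn.Module):
    """Map a frame index t to an RGB frame; compressing the video = compressing these weights."""

    def __init__(self, num_frames: int, h: int = 32, w: int = 32):
        super().__init__()
        self.num_frames, self.h, self.w = num_frames, h, w
        self.net = nn.Sequential(
            nn.Linear(1, 256), nn.GELU(),
            nn.Linear(256, 256), nn.GELU(),
            nn.Linear(256, 3 * h * w),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        t = t.float().unsqueeze(-1) / self.num_frames   # normalized frame index
        return self.net(t).view(-1, 3, self.h, self.w)  # decoded frame

# overfit to one specific clip (8 random frames as a stand-in for a video)
video = torch.rand(8, 3, 32, 32)
model = TinyVideoINR(num_frames=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    pred = model(torch.arange(8))
    loss = ((pred - video) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

Since the network is fit to a single video, there is no training-set distribution to fall outside of, which is exactly the OOD weakness of generalized NVCs.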
SFC: Shared Feature Calibration in Weakly Supervised Semantic Segmentation
Zhao, Xinqiao, Tang, Feilong, Wang, Xiaoyang, Xiao, Jimin
Image-level weakly supervised semantic segmentation has received increasing attention due to its low annotation cost. Existing methods mainly rely on Class Activation Mapping (CAM) to obtain pseudo-labels for training semantic segmentation models. In this work, we are the first to demonstrate that a long-tailed distribution in the training data can cause the CAM calculated through classifier weights to be over-activated for head classes and under-activated for tail classes, due to features shared between head and tail classes. This degrades pseudo-label quality and in turn hurts final semantic segmentation performance. To address this issue, we propose a Shared Feature Calibration (SFC) method for CAM generation. Specifically, we leverage the class prototypes that carry positive shared features and propose a Multi-Scaled Distribution-Weighted (MSDW) consistency loss that narrows the gap between the CAMs generated through classifier weights and those generated through class prototypes during training. The MSDW loss counterbalances over-activation and under-activation by calibrating the shared features in head-/tail-class classifier weights. Experimental results show that SFC significantly improves CAM boundaries and achieves new state-of-the-art performance. The project is available at https://github.com/Barrett-python/SFC.
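The core consistency idea can be sketched as follows: one CAM is computed from classifier weights, another from class prototypes, and a weighted L1 term narrows the gap between them. This is a simplification in which the multi-scaled, per-class distribution weighting of the actual MSDW loss is reduced to a single scalar; shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def cam(features: torch.Tensor, class_vecs: torch.Tensor) -> torch.Tensor:
    """Class activation maps from feature maps and per-class vectors.

    features:   (B, D, H, W) backbone feature maps
    class_vecs: (C, D) classifier weights or class prototypes
    """
    return F.relu(torch.einsum('bdhw,cd->bchw', features, class_vecs))

def consistency_loss(features, classifier_w, prototypes, weight=1.0):
    """Weighted L1 gap between classifier-CAM and prototype-CAM (simplified MSDW-style term)."""
    cam_w = cam(features, classifier_w)
    cam_p = cam(features, prototypes)
    return weight * (cam_w - cam_p).abs().mean()

feats = torch.randn(2, 128, 28, 28)
w = torch.randn(20, 128)       # classifier weights
protos = torch.randn(20, 128)  # class prototypes
print(consistency_loss(feats, w, protos))
```

Minimizing this gap pushes the shared-feature components of head-/tail-class classifier weights toward the prototype-defined activations, which is the calibration effect the abstract describes.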
Trajectory Poisson multi-Bernoulli mixture filter for traffic monitoring using a drone
García-Fernández, Ángel F., Xiao, Jimin
This paper proposes a multi-object tracking (MOT) algorithm for traffic monitoring using a drone equipped with optical and thermal cameras. Object detections on the images are obtained using a neural network for each type of camera. The cameras are modelled as direction-of-arrival (DOA) sensors. Each DOA detection follows a von Mises-Fisher distribution, whose mean direction is obtained by projecting a vehicle's position on the ground to the camera. We then use the trajectory Poisson multi-Bernoulli mixture (TPMBM) filter, a Bayesian MOT algorithm, to optimally estimate the set of vehicle trajectories. We have also developed a parameter estimation algorithm for the measurement model. We have tested the accuracy of the resulting TPMBM filter on synthetic and experimental data sets.
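The measurement model can be sketched directly: a vehicle's ground-plane position is projected into a unit direction from the camera, and a detection is scored with a von Mises-Fisher log-likelihood around that direction. A minimal sketch with illustrative numbers; the full method adds the TPMBM filtering machinery on top:

```python
import numpy as np

def doa_mean_direction(vehicle_xy: np.ndarray, camera_pos: np.ndarray) -> np.ndarray:
    """Unit direction from the camera to a vehicle on the ground plane (z = 0)."""
    target = np.array([vehicle_xy[0], vehicle_xy[1], 0.0])
    d = target - camera_pos
    return d / np.linalg.norm(d)

def vmf_logpdf(x: np.ndarray, mu: np.ndarray, kappa: float) -> float:
    """von Mises-Fisher log-density on the unit sphere (3-D case)."""
    # log C_3(kappa) with a numerically stable log(sinh(kappa))
    log_sinh = kappa + np.log1p(-np.exp(-2.0 * kappa)) - np.log(2.0)
    log_c = np.log(kappa) - np.log(4.0 * np.pi) - log_sinh
    return log_c + kappa * float(mu @ x)

camera = np.array([0.0, 0.0, 50.0])           # drone camera at 50 m altitude
mu = doa_mean_direction(np.array([10.0, 5.0]), camera)
z = mu + 0.01 * np.random.randn(3)            # noisy detection direction
z /= np.linalg.norm(z)
print(vmf_logpdf(z, mu, kappa=500.0))         # larger kappa = more concentrated sensor
```

The concentration parameter kappa is one of the quantities the paper's parameter estimation algorithm would fit from data.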
Synchronize Feature Extracting and Matching: A Single Branch Framework for 3D Object Tracking
Ma, Teli, Wang, Mengmeng, Xiao, Jimin, Wu, Huifeng, Liu, Yong
The Siamese network has been the de facto framework for 3D LiDAR object tracking, with a shared-parameter encoder extracting features from the template and the search region separately. This paradigm relies heavily on an additional matching network to model the cross-correlation/similarity between the template and the search region. In this paper, we forsake the conventional Siamese paradigm and propose a novel single-branch framework, SyncTrack, which synchronizes feature extraction and matching, avoiding both a second encoder pass for the template and search region and the extra parameters of a matching network. The synchronization mechanism is based on the dynamic affinity of the Transformer, and we provide an in-depth theoretical analysis of its relevance. Moreover, building on this synchronization, we introduce a novel Attentive Points-Sampling strategy into the Transformer layers (APST), replacing random/Farthest Points Sampling (FPS) with sampling supervised by the attentive relations between the template and the search region. This ties point-wise sampling to feature learning, which helps aggregate more distinctive and geometric features for tracking with sparse points. Extensive experiments on two benchmark datasets (KITTI and NuScenes) show that SyncTrack achieves state-of-the-art performance in real-time tracking.
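The single-branch idea can be pictured as joint attention over the concatenated template and search tokens, so feature extraction and template-search matching happen in the same layers. A minimal sketch with standard multi-head attention; the attentive points-sampling and all dimensions are illustrative and omitted or simplified here:

```python
import torch
import torch.nn as nn

class JointExtractMatchLayer(nn.Module):
    """One Transformer layer over concatenated template+search tokens (single-branch sketch)."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, template: torch.Tensor, search: torch.Tensor):
        x = torch.cat([template, search], dim=1)   # one forward pass, no Siamese twin
        h = self.n1(x)
        x = x + self.attn(h, h, h)[0]              # cross-attention terms act as matching
        x = x + self.ffn(self.n2(x))
        return x[:, :template.size(1)], x[:, template.size(1):]

tmpl = torch.randn(2, 128, 64)   # template point tokens
srch = torch.randn(2, 512, 64)   # search-region point tokens
t_out, s_out = JointExtractMatchLayer()(tmpl, srch)
print(t_out.shape, s_out.shape)
```

Because the attention matrix already contains template-to-search affinities, no separate cross-correlation head is needed, which is the parameter saving the abstract refers to.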
Generative Adversarial Classifier for Handwriting Characters Super-Resolution
Qian, Zhuang, Huang, Kaizhu, Wang, Qiufeng, Xiao, Jimin, Zhang, Rui
Generative Adversarial Networks (GANs) have recently received great attention due to their excellent performance in image generation, transformation, and super-resolution. However, GANs have rarely been studied and trained for classification, so the generated images may not be appropriate for classification. In this paper, we propose a novel Generative Adversarial Classifier (GAC) specifically for low-resolution handwriting character recognition. By additionally involving a classifier in the training process of a normal GAN, GAC is calibrated to learn suitable structures and restore character images that benefit classification. Experimental results show that our proposed method achieves remarkable performance in 8x super-resolution of handwritten characters, approximately 10% and 20% higher than the present state-of-the-art methods on the CASIA-HWDB1.1 and MNIST benchmarks, respectively.
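The extra classification signal can be sketched as a third loss term on the generator: besides fooling the discriminator and matching the high-resolution target, the restored image must be correctly classified. A minimal sketch with toy networks and illustrative loss weights; none of this is the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy 8x super-resolution generator, discriminator, and classifier
G = nn.Sequential(nn.Upsample(scale_factor=8), nn.Conv2d(1, 16, 3, padding=1),
                  nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))
D = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2), nn.ReLU(), nn.Flatten(),
                  nn.LazyLinear(1))
C = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2), nn.ReLU(), nn.Flatten(),
                  nn.LazyLinear(10))

lr_img = torch.rand(4, 1, 4, 4)          # 4x4 low-resolution characters
labels = torch.randint(0, 10, (4,))      # character classes
hr_img = torch.rand(4, 1, 32, 32)        # stand-in for ground-truth HR characters

sr = G(lr_img)                           # 32x32 restored characters
adv = F.binary_cross_entropy_with_logits(D(sr), torch.ones(4, 1))  # fool discriminator
cls = F.cross_entropy(C(sr), labels)     # restored image must stay classifiable
rec = F.mse_loss(sr, hr_img)             # stay close to the HR target
g_loss = rec + 0.01 * adv + 0.1 * cls    # loss weights are illustrative
g_loss.backward()
```

The classification term is what pushes the generator toward restorations with structures that a recognizer can exploit, rather than merely photorealistic ones.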