Goto

Collaborating Authors

 feature pyramid



Synesthesia of Machines (SoM)-Based Task-Driven MIMO System for Image Transmission

arXiv.org Artificial Intelligence

--T o support cooperative perception (CP) of networked mobile agents in dynamic scenarios, the efficient and robust transmission of sensory data is a critical challenge. Deep learning-based joint source-channel coding (JSCC) has demonstrated promising results for image transmission under adverse channel conditions, outperforming traditional rule-based codecs. While recent works have explored to combine JSCC with the widely adopted multiple-input multiple-output (MIMO) technology, these approaches are still limited to the discrete-time analog transmission (DT A T) model and simple tasks. Given the limited performance of existing MIMO JSCC schemes in supporting complex CP tasks for networked mobile agents with digital MIMO communication systems, this paper presents a Synesthesia of Machines (SoM)-based task-driven MIMO system for image transmission, referred to as SoM-MIMO. By leveraging the structural properties of the feature pyramid for perceptual tasks and the channel properties of the closed-loop MIMO communication system, SoM-MIMO enables efficient and robust digital MIMO transmission of images. Experimental results have shown that compared with two JSCC baseline schemes, our approach achieves average mAP improvements of 6.30 and 10.48 across all SNR levels, while maintaining identical communication overhead. N the era of beyond fifth generation (B5G) and sixth generation (6G), a large number of mobile agents, including autonomous vehicles, unmanned aerial vehicles, and humanoid robots, etc., will interact in real-time and execute diverse intelligent functions, revolutionizing industries and daily life. To enable diverse intelligent functionalities, such as decision-making and task execution, accurate environmental perception--encompassing the acquisition of object position, size, and category--is essential. Manuscript received 24 April 2025; revised 20 July 2025; accepted 26 August 2025. This work was supported in part by the by the National Natural Science Foundation of China under Grant 62125101, Grant 62341101, and Grant 62271351; in part by the New Cornerstone Science Foundation through the XPLORER PRIZE. Rongqing Zhang is with Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China (email: rongqingz@tongji.edu.cn).


Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token Migration

arXiv.org Artificial Intelligence

We propose integrally pre-trained transformer pyramid network (iTPN), towards jointly optimizing the network backbone and the neck, so that transfer gap between representation models and downstream tasks is minimal. iTPN is born with two elaborated designs: 1) The first pre-trained feature pyramid upon vision transformer (ViT). 2) Multi-stage supervision to the feature pyramid using masked feature modeling (MFM). iTPN is updated to Fast-iTPN, reducing computational memory overhead and accelerating inference through two flexible designs. 1) Token migration: dropping redundant tokens of the backbone while replenishing them in the feature pyramid without attention operations. 2) Token gathering: reducing computation cost caused by global attention by introducing few gathering tokens. The base/large-level Fast-iTPN achieve 88.75%/89.5% top-1 accuracy on ImageNet-1K. With 1x training schedule using DINO, the base/large-level Fast-iTPN achieves 58.4%/58.8% box AP on COCO object detection, and a 57.5%/58.7% mIoU on ADE20K semantic segmentation using MaskDINO. Fast-iTPN can accelerate the inference procedure by up to 70%, with negligible performance loss, demonstrating the potential to be a powerful backbone for downstream vision tasks. The code is available at: github.com/sunsmarterjie/iTPN.


Temporal Action Localization with Enhanced Instant Discriminability

arXiv.org Artificial Intelligence

Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video. The unclear boundaries of actions in videos often result in imprecise predictions of action boundaries by existing methods. To resolve this issue, we propose a one-stage framework named TriDet. First, we propose a Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. Then, we analyze the rank-loss problem (i.e. instant discriminability deterioration) in transformer-based methods and propose an efficient scalable-granularity perception (SGP) layer to mitigate this issue. To further push the limit of instant discriminability in the video backbone, we leverage the strong representation capability of pretrained large models and investigate their performance on TAD. Last, considering the adequate spatial-temporal context for classification, we design a decoupled feature pyramid network with separate feature pyramids to incorporate rich spatial context from the large model for localization. Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets, including hierarchical (multilabel) TAD datasets.


UnLoc: A Unified Framework for Video Localization Tasks

arXiv.org Artificial Intelligence

While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task. We design a new approach for this called UnLoc, which uses pretrained image and text towers, and feeds tokens to a video-text fusion model. The output of the fusion module are then used to construct a feature pyramid in which each level connects to a head to predict a per-frame relevancy score and start/end time displacements. Unlike previous works, our architecture enables Moment Retrieval, Temporal Localization, and Action Segmentation with a single stage model, without the need for action proposals, motion based pretrained features or representation masking. Unlike specialized models, we achieve state of the art results on all three different localization tasks with a unified approach. Code will be available at: \url{https://github.com/google-research/scenic}.


End-to-End Entity Detection with Proposer and Regressor

arXiv.org Artificial Intelligence

Named entity recognition is a traditional task in natural language processing. In particular, nested entity recognition receives extensive attention for the widespread existence of the nesting scenario. The latest research migrates the well-established paradigm of set prediction in object detection to cope with entity nesting. However, the manual creation of query vectors, which fail to adapt to the rich semantic information in the context, limits these approaches. An end-to-end entity detection approach with proposer and regressor is presented in this paper to tackle the issues. First, the proposer utilizes the feature pyramid network to generate high-quality entity proposals. Then, the regressor refines the proposals for generating the final prediction. The model adopts encoder-only architecture and thus obtains the advantages of the richness of query semantics, high precision of entity localization, and easiness of model training. Moreover, we introduce the novel spatially modulated attention and progressive refinement for further improvement. Extensive experiments demonstrate that our model achieves advanced performance in flat and nested NER, achieving a new state-of-the-art F1 score of 80.74 on the GENIA dataset and 72.38 on the WeiboNER dataset.


Hierarchical Visual Localization Based on Sparse Feature Pyramid for Adaptive Reduction of Keypoint Map Size

arXiv.org Artificial Intelligence

Visual localization is a fundamental task for a wide range of applications in the field of robotics. Yet, it is still a complex problem with no universal solution, and the existing approaches are difficult to scale: most state-of-the-art solutions are unable to provide accurate localization without a significant amount of storage space. We propose a hierarchical, low-memory approach to localization based on keypoints with different descriptor lengths. It becomes possible with the use of the developed unsupervised neural network, which predicts a feature pyramid with different descriptor lengths for images. This structure allows applying coarse-to-fine paradigms for localization based on keypoint map, and varying the accuracy of localization by changing the type of the descriptors used in the pipeline. Our approach achieves comparable results in localization accuracy and a significant reduction in memory consumption (up to 16 times) among state-of-the-art methods.


TriDet: Temporal Action Detection with Relative Boundary Modeling

arXiv.org Artificial Intelligence

In this paper, we present a one-stage framework TriDet for temporal action detection. Existing methods often suffer from imprecise boundary predictions due to the ambiguous action boundaries in videos. To alleviate this problem, we propose a novel Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose an efficient Scalable-Granularity Perception (SGP) layer to mitigate the rank loss problem of self-attention that takes place in the video features and aggregate information across different temporal granularities. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks: THUMOS14, HACS and EPIC-KITCHEN 100, with lower computational costs, compared to previous methods. For example, TriDet hits an average mAP of $69.3\%$ on THUMOS14, outperforming the previous best by $2.5\%$, but with only $74.6\%$ of its latency. The code is released to https://github.com/sssste/TriDet.


PCRLv2: A Unified Visual Information Preservation Framework for Self-supervised Pre-training in Medical Image Analysis

arXiv.org Artificial Intelligence

Recent advances in self-supervised learning (SSL) in computer vision are primarily comparative, whose goal is to preserve invariant and discriminative semantics in latent representations by comparing siamese image views. However, the preserved high-level semantics do not contain enough local information, which is vital in medical image analysis (e.g., image-based diagnosis and tumor segmentation). To mitigate the locality problem of comparative SSL, we propose to incorporate the task of pixel restoration for explicitly encoding more pixel-level information into high-level semantics. We also address the preservation of scale information, a powerful tool in aiding image understanding but has not drawn much attention in SSL. The resulting framework can be formulated as a multi-task optimization problem on the feature pyramid. Specifically, we conduct multi-scale pixel restoration and siamese feature comparison in the pyramid. In addition, we propose non-skip U-Net to build the feature pyramid and develop sub-crop to replace multi-crop in 3D medical imaging. The proposed unified SSL framework (PCRLv2) surpasses its self-supervised counterparts on various tasks, including brain tumor segmentation (BraTS 2018), chest pathology identification (ChestX-ray, CheXpert), pulmonary nodule detection (LUNA), and abdominal organ segmentation (LiTS), sometimes outperforming them by large margins with limited annotations.


Precise Single-stage Detector

arXiv.org Artificial Intelligence

There are still two problems in SDD causing some inaccurate results: (1) In the process of feature extraction, with the layer-by-layer acquisition of semantic information, local information is gradually lost, resulting into less representative feature maps; (2) During the Non-Maximum Suppression (NMS) algorithm due to inconsistency in classification and regression tasks, the classification confidence and predicted detection position cannot accurately indicate the position of the prediction boxes. Methods: In order to address these aforementioned issues, we propose a new architecture, a modified version of Single Shot Multibox Detector (SSD), named Precise Single Stage Detector (PSSD). Firstly, we improve the features by adding extra layers to SSD. Secondly, we construct a simple and effective feature enhancement module to expand the receptive field step by step for each layer and enhance its local and semantic information. Finally, we design a more efficient loss function to predict the IOU between the prediction boxes and ground truth boxes, and the threshold IOU guides classification training and attenuates the scores, which are used by the NMS algorithm. Main Results: Benefiting from the above optimization, the proposed model PSSD achieves exciting performance in real-time. Specifically, with the hardware of Titan Xp and the input size of 320 pix, PSSD achieves 33.8 mAP at 45 FPS speed on MS COCO benchmark and 81.28 mAP at 66 FPS speed on Pascal VOC 2007 outperforming state-of-the-art object detection models. Besides, the proposed model performs significantly well with larger input size. Under 512 pix, PSSD can obtain 37.2 mAP with 27 FPS on MS COCO and 82.82 mAP with 40 FPS on Pascal VOC 2007. The experiment results prove that the proposed model has a better trade-off between speed and accuracy.