Yang Yang
The Solution for Temporal Action Localisation Task of Perception Test Challenge 2024
Han, Yinan, Jiang, Qingyuan, Mei, Hongming, Yang, Yang, Tang, Jinhui
This report presents our method for Temporal Action Localisation (TAL), which focuses on identifying and classifying actions within specific time intervals throughout a video sequence. We employ a data augmentation technique by expanding the training dataset using overlapping labels from the Something-SomethingV2 dataset, enhancing the model's ability to generalize across various action classes. For feature extraction, we utilize state-of-the-art models, including UMT and VideoMAEv2 for video features, and BEATs and CAV-MAE for audio features. Our approach involves

Each action is represented by start and end timestamps along with its corresponding class label, as illustrated in Figure 1. This task is critical for various applications, including video surveillance, content analysis, and human-computer interaction. The dataset provided for this challenge is derived from the Perception Test, comprising high-resolution videos (up to 35 seconds long, 30 fps, and a maximum resolution of 1080p). Each video contains multiple action segment annotations. To facilitate experimentation, both video and audio features are provided, along with detailed annotations for the training and validation phases.
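As a rough illustration of how features from several backbones might be combined per video, the sketch below concatenates temporally aligned snippet features along the channel axis. The feature dimensions are illustrative stand-ins, not the challenge's actual shapes:

```python
import numpy as np

# Hypothetical sketch: per-snippet features from several backbones are
# aligned on the temporal axis and concatenated along the channel axis.
# All dimensions are illustrative, not the models' actual output sizes.
T = 8                                   # temporal snippets in one clip
umt      = np.random.rand(T, 1024)      # UMT video features (assumed dim)
videomae = np.random.rand(T, 768)       # VideoMAEv2 video features (assumed dim)
beats    = np.random.rand(T, 768)       # BEATs audio features (assumed dim)
cavmae   = np.random.rand(T, 512)       # CAV-MAE audio features (assumed dim)

fused = np.concatenate([umt, videomae, beats, cavmae], axis=1)
print(fused.shape)  # (8, 3072)
```

The concatenated features would then feed a localisation head; the report itself does not specify the fusion scheme, so this is only one plausible arrangement.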
Solution for Temporal Sound Localisation Task of ECCV Second Perception Test Challenge 2024
Gu, Haowei, Zhu, Weihao, Yang, Yang
This report proposes an improved method for the Temporal Sound Localisation (TSL) task, which localizes and classifies the sound events occurring in a video according to a predefined set of sound classes. The champion solution from last year's first competition explored TSL by fusing audio and video modalities with equal weight. Since the TSL task aims to localize sound events, we conducted experiments that demonstrate the superiority of sound features (Section 3). Based on these findings, we employ various models to extract stronger audio features, such as InterVideo, CaVMAE, and VideoMAE. Our approach ranks first in the final test with a score of 0.4925.
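The finding that sound features outweigh video features suggests moving from a fixed 50/50 fusion to a weight chosen on validation data. A minimal sketch, assuming synthetic predictions and a toy validation metric in place of the challenge's actual score:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch: sweep the audio weight on validation data and keep
# the best, rather than fixing a 50/50 fusion. `score` is a stand-in for
# the challenge metric; the toy ground truth is deliberately audio-heavy.
audio_pred = rng.random(100)
video_pred = rng.random(100)
target     = 0.7 * audio_pred + 0.3 * video_pred   # toy ground truth

def score(pred, target):
    return -np.mean((pred - target) ** 2)          # higher is better

best_w, best_s = None, -np.inf
for w in np.linspace(0, 1, 21):                    # candidate audio weights
    s = score(w * audio_pred + (1 - w) * video_pred, target)
    if s > best_s:
        best_w, best_s = w, s

print(best_w)  # the audio-leaning weight wins on this toy setup
```

On this synthetic data the sweep recovers the audio-heavy mixture; on real validation data the chosen weight would depend on the actual metric.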
The Solution for the 5th GCAIAC Zero-shot Referring Expression Comprehension Challenge
Huang, Longfei, Yu, Feng, Guan, Zhihao, Wan, Zhonghua, Yang, Yang
This report presents a solution for the zero-shot referring expression comprehension task. Visual-language multimodal base models (such as CLIP and SAM) have gained significant attention in recent years as a cornerstone of mainstream research. One of the key applications of multimodal base models lies in their ability to generalize to zero-shot downstream tasks. Unlike traditional referring expression comprehension, zero-shot referring expression comprehension aims to apply pre-trained visual-language models directly to the task without task-specific training. Recent studies have enhanced the zero-shot performance of multimodal base models in referring expression comprehension by introducing visual prompts. To address this challenge, we combined visual prompts with carefully considered textual prompts and employed joint prediction tailored to the data characteristics. Ultimately, our approach achieved accuracy rates of 84.825 on the A leaderboard and 71.460 on the B leaderboard, securing first place.
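One way to picture the joint prediction over visual and textual prompts is to score each candidate box with both a visually prompted image embedding and a rephrased textual prompt, then take the argmax of the combined score. The sketch below uses random vectors as stand-ins for CLIP embeddings; the weighting and prompt design are assumptions, not the report's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical sketch of joint prediction: each candidate box gets one score
# against the raw expression embedding and one against a rephrased textual
# prompt embedding; the box maximizing the combined score is selected.
# Random vectors stand in for CLIP image/text embeddings.
text_emb        = rng.standard_normal(512)   # embedding of the expression
text_prompt_emb = rng.standard_normal(512)   # embedding of a rephrased prompt
boxes = [rng.standard_normal(512) for _ in range(5)]  # visually prompted crops

scores = [0.5 * cosine(b, text_emb) + 0.5 * cosine(b, text_prompt_emb)
          for b in boxes]
best_box = int(np.argmax(scores))
print(best_box)
```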
The Solution for the sequential task continual learning track of the 2nd Greater Bay Area International Algorithm Competition
Pan, Sishun, Wu, Xixian, Li, Tingmin, Huang, Longfei, Feng, Mingxu, Wan, Zhonghua, Yang, Yang
This paper presents a data-free, parameter-isolation-based continual learning algorithm we developed for the sequential task continual learning track of the 2nd Greater Bay Area International Algorithm Competition. The method learns an independent parameter subspace for each task within the network's convolutional and linear layers and freezes the batch normalization layers after the first task. Specifically, for the domain incremental setting where all domains share a classification head, we freeze the shared classification head after the first task is completed, effectively solving the issue of catastrophic forgetting. Additionally, because the domain incremental setting provides no task identity at inference, we designed a task-identity inference strategy that selects an appropriate mask matrix for each sample. Furthermore, we introduced a gradient supplementation strategy to enhance the importance of unselected parameters for the current task, facilitating learning for new tasks. We also implemented an adaptive importance scoring strategy that dynamically adjusts the number of parameters to optimize single-task performance while reducing parameter usage. Moreover, considering the limitations of storage space and inference time, we designed a mask matrix compression strategy to save storage space and speed up encryption and decryption of the mask matrix. Our approach does not require expanding the core network or using external auxiliary networks or data, and performs well under both task incremental and domain incremental settings. This solution ultimately won a second-place prize in the competition.
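The mask-compression idea can be illustrated with bit-packing: a per-task binary mask over parameters stored one entry per byte shrinks 8x when packed to one entry per bit. This is only a minimal sketch of the storage-saving principle, not the paper's actual compression scheme:

```python
import numpy as np

# Hypothetical sketch of mask compression: a per-task binary mask over the
# network's parameters is bit-packed, cutting storage 8x versus one byte
# per entry. The mask size and sparsity are illustrative.
mask = (np.random.rand(1_000_000) > 0.9).astype(np.uint8)  # ~10% selected

packed   = np.packbits(mask)                  # 8 mask entries per byte
restored = np.unpackbits(packed)[:mask.size]  # decode before inference

assert np.array_equal(mask, restored)         # lossless round trip
print(mask.nbytes, packed.nbytes)             # 1000000 125000
```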
The Solution for the GAIIC2024 RGB-TIR object detection Challenge
Wu, Xiangyu, Xu, Jinling, Huang, Longfei, Yang, Yang
This report introduces a solution to the task of RGB-TIR object detection from the perspective of unmanned aerial vehicles. Unlike traditional object detection methods, RGB-TIR object detection aims to utilize both RGB and TIR images for complementary information during detection. The challenges of RGB-TIR object detection from the perspective of unmanned aerial vehicles include highly complex image backgrounds, frequent changes in lighting, and uncalibrated RGB-TIR image pairs. To address these challenges at the model level, we utilized a lightweight YOLOv9 model with extended multi-level auxiliary branches that enhance the model's robustness, making it more suitable for practical applications in unmanned aerial vehicle scenarios. For image fusion in RGB-TIR detection, we incorporated a fusion module into the backbone network to fuse images at the feature level, implicitly addressing calibration issues. Our proposed method achieved mAP scores of 0.516 and 0.543 on the A and B benchmarks respectively, while maintaining the highest inference speed among all models.
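Feature-level fusion of the two modalities might look like the sketch below: RGB and TIR feature maps of matching spatial size are each passed through a 1x1 projection and summed, so misalignment between the raw image pair is absorbed in feature space rather than requiring explicit calibration. Shapes and weights are illustrative stand-ins, not the actual fusion module:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sketch of feature-level fusion: project each modality with a
# 1x1 convolution (a per-pixel matrix multiply) and add the results.
C, H, W = 64, 16, 16
rgb_feat = rng.standard_normal((C, H, W))
tir_feat = rng.standard_normal((C, H, W))
w_rgb = rng.standard_normal((C, C)) / np.sqrt(C)   # 1x1 conv weights (RGB)
w_tir = rng.standard_normal((C, C)) / np.sqrt(C)   # 1x1 conv weights (TIR)

fused = np.einsum('oc,chw->ohw', w_rgb, rgb_feat) + \
        np.einsum('oc,chw->ohw', w_tir, tir_feat)
print(fused.shape)  # (64, 16, 16)
```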
First Place Solution of 2023 Global Artificial Intelligence Technology Innovation Competition Track 1
Wu, Xiangyu, Zhang, Hailiang, Yang, Yang, Lu, Jianfeng
In this paper, we present our champion solution to the Global Artificial Intelligence Technology Innovation Competition Track 1: Medical Imaging Diagnosis Report Generation. We select CPT-BASE as our base model for the text generation task. During the pre-training stage, we discard the masked language modeling task of CPT-BASE and instead reconstruct the vocabulary, adopting a span mask strategy and gradually increasing the masking ratio to perform the denoising auto-encoder pre-training task. In the fine-tuning stage, we design iterative retrieval augmentation and noise-aware similarity bucket prompt strategies. The retrieval augmentation constructs a mini-knowledge base, enriching the input information of the model, while the similarity bucket further perceives the noise information within the mini-knowledge base, guiding the model to generate higher-quality diagnostic reports based on the similarity prompts. Surprisingly, our single model has achieved a score of 2.321 on leaderboard A, and the multi-model fusion scores are 2.362 and 2.320 on the A and B leaderboards respectively, securing first place in the rankings.
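The span mask strategy can be sketched as follows: instead of masking isolated tokens, contiguous spans are replaced with a mask symbol until a target ratio is reached, and that ratio can be raised across pre-training stages. Tokens, span length, and ratio below are illustrative assumptions:

```python
import random

random.seed(0)

# Hypothetical sketch of span masking for denoising pre-training: mask
# contiguous spans of tokens until the target ratio is reached.
def span_mask(tokens, mask_ratio=0.3, span_len=3):
    n_mask = int(len(tokens) * mask_ratio)
    out = list(tokens)
    masked = 0
    while masked < n_mask:
        start = random.randrange(len(out))          # random span start
        for i in range(start, min(start + span_len, len(out))):
            if out[i] != '[MASK]':
                out[i] = '[MASK]'
                masked += 1
            if masked >= n_mask:
                break
    return out

toks = [f'tok{i}' for i in range(20)]
print(span_mask(toks))  # 6 of 20 tokens masked, mostly in contiguous runs
```

Raising `mask_ratio` over successive stages would reproduce the gradual-increase schedule described above.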
The Solution for The PST-KDD-2024 OAG-Challenge
Zhong, Shupeng, Li, Xinger, Jin, Shushan, Yang, Yang
In this paper, we introduce the second-place solution in the KDD-2024 OAG-Challenge paper source tracing track. Our solution is mainly based on two methods, BERT and GCN, and combines the reasoning results of BERT and GCN in the final submission to achieve complementary performance. In the BERT solution, we focus on processing the fragments that appear in the references of the paper, and use a variety of operations to reduce the redundant interference in the fragments, so that the information received by BERT is more refined. In the GCN solution, we map information such as paper fragments, abstracts, and titles to a high-dimensional semantic space through an embedding model, and try to build edges between titles, abstracts, and fragments to integrate contextual relationships for judgment. In the end, our solution achieved a remarkable score of 0.47691 in the competition.
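The graph construction for the GCN branch, where edges connect titles, abstracts, and fragments, might be sketched as embedding each text unit and linking pairs whose similarity crosses a threshold. The embeddings below are random stand-ins and the threshold is an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sketch of graph construction: embed titles, abstracts and
# reference fragments, then add an edge wherever cosine similarity exceeds
# a threshold. Random vectors stand in for real embedding-model outputs.
nodes = {f'frag{i}': rng.standard_normal(64) for i in range(4)}
nodes['title']    = rng.standard_normal(64)
nodes['abstract'] = rng.standard_normal(64)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

names = list(nodes)
edges = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
         if cos(nodes[a], nodes[b]) > 0.1]          # assumed threshold
print(len(edges))
```

A GCN over such a graph could then aggregate contextual signals across fragments; the final submission would combine its scores with the BERT branch.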
Complementary Fusion of Deep Network and Tree Model for ETA Prediction
Huang, YuRui, Zhang, Jie, Bao, HengDa, Yang, Yang, Yang, Jian
Estimated time of arrival (ETA) is a very important factor in the transportation system. It has attracted increasing attention and has been widely used as a basic service in navigation systems and intelligent transportation systems. In this paper, we propose a novel solution to the ETA estimation problem: an ensemble of tree models and neural networks. We demonstrated the accuracy and robustness of the solution on the A/B leaderboards and ultimately won first place in the SIGSPATIAL 2021 GISCUP competition.
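One common way to ensemble a tree model with a neural network, sketched here on synthetic data, is to fit blend weights on validation predictions by least squares and apply them at test time. All numbers are synthetic stand-ins, and the blend rule is an assumption rather than the paper's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical sketch of the ensemble idea: given validation predictions
# from a tree model and a neural network, fit blend weights by least
# squares. The synthetic "predictions" are ground truth plus noise.
y_true   = rng.uniform(300, 1800, size=200)          # travel times (s)
tree_val = y_true + rng.normal(0, 60, size=200)      # GBDT predictions
nn_val   = y_true + rng.normal(0, 90, size=200)      # NN predictions

X = np.stack([tree_val, nn_val], axis=1)
w, *_ = np.linalg.lstsq(X, y_true, rcond=None)       # blend weights

blend = X @ w
print(round(float(np.mean(np.abs(blend - y_true))), 1),
      round(float(np.mean(np.abs(nn_val - y_true))), 1))
```

On this toy data the fitted blend leans toward the lower-noise tree model, which is the intuition behind combining complementary predictors.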
The Solution for the ICCV 2023 Perception Test Challenge -- Task 6 -- Grounded videoQA
Zhang, Hailiang, Chao, Dian, Guan, Zhihao, Yang, Yang
In this paper, we introduce a grounded video question-answering solution. Our research reveals that the fixed official baseline method for video question answering involves two main steps: visual grounding and object tracking. However, a significant challenge emerges during the initial step, where selected frames may lack clearly identifiable target objects. Furthermore, single images cannot address questions like "Track the container from which the person pours the first time." To tackle this issue, we propose an alternative two-stage approach: (1) first, we leverage the VALOR model to answer questions based on video information; (2) then we concatenate the answered questions with their respective answers. Finally, we employ TubeDETR to generate bounding boxes for the targets.
Proposal Report for the 2nd SciCAP Competition 2024
Li, Pengpeng, Li, Tingmin, Wang, Jingyuan, Wang, Boyuan, Yang, Yang
In this paper, we propose a method for document summarization using auxiliary information. This approach effectively summarizes descriptions related to specific images, tables, and appendices within lengthy texts. Our experiments demonstrate that leveraging high-quality OCR data and information initially extracted from the original text enables efficient summarization of the content related to the described objects. Based on these findings, we enhanced popular text generation models by incorporating additional auxiliary branches to improve summarization performance. Our method achieved top scores of 4.33 and 4.66 in the long caption and short caption tracks, respectively, of the 2024 SciCAP competition, ranking highest in both categories.