Gao, Jianjun
CL-HOI: Cross-Level Human-Object Interaction Distillation from Vision Large Language Models
Gao, Jianjun, Cai, Chen, Wang, Ruoyu, Liu, Wenyang, Yap, Kim-Hui, Garg, Kratika, Han, Boon-Siew
Human-object interaction (HOI) detection has advanced with Vision Language Models (VLMs), but these methods often depend on extensive manual annotations. Vision Large Language Models (VLLMs) can inherently recognize and reason about interactions at the image level but are computationally heavy and not designed for instance-level HOI detection. To overcome these limitations, we propose a Cross-Level HOI distillation (CL-HOI) framework, which distills instance-level HOIs from VLLMs' image-level understanding without the need for manual annotations. Our approach involves two stages: context distillation, where a Visual Linguistic Translator (VLT) converts visual information into linguistic form, and interaction distillation, where an Interaction Cognition Network (ICN) reasons about spatial, visual, and context relations. We design contrastive distillation losses to transfer image-level context and interaction knowledge from the teacher to the student model, enabling instance-level HOI detection. Evaluations on the HICO-DET and V-COCO datasets demonstrate that CL-HOI surpasses existing weakly supervised and VLLM-supervised methods, showing its efficacy in detecting HOIs without manual labels.
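A minimal sketch, not the authors' code, of one plausible form of the contrastive distillation described above: an InfoNCE-style loss that pulls pooled student instance features toward the teacher's image-level embedding, with other images in the batch serving as negatives. Tensor names, shapes, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_feats, teacher_feats, temperature=0.07):
    """student_feats, teacher_feats: (B, D) embeddings from matching images."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    logits = s @ t.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    # Diagonal pairs (same image) are positives; all other pairs are negatives.
    return F.cross_entropy(logits, targets)

# Example: a batch of 4 images with 256-dim embeddings
loss = contrastive_distillation_loss(torch.randn(4, 256), torch.randn(4, 256))
```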
Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting
Cai, Chen, Wang, Zheng, Gao, Jianjun, Liu, Wenyang, Lu, Ye, Zhang, Runzhong, Yap, Kim-Hui
In recent years, the rapid increase in online video content has underscored the limitations of static Video Question Answering (VideoQA) models trained on fixed datasets, as they struggle to adapt to new questions or tasks posed by newly available content. In this paper, we explore the novel challenge of VideoQA within a continual learning framework, and empirically identify a critical issue: fine-tuning a large language model (LLM) for a sequence of tasks often results in catastrophic forgetting. To address this, we propose Collaborative Prompting (ColPro), which integrates specific question constraint prompting, knowledge acquisition prompting, and visual temporal awareness prompting. These prompts aim to capture textual question context, visual content, and video temporal dynamics in VideoQA, a perspective underexplored in prior research. Experimental results on the NExT-QA and DramaQA datasets show that ColPro outperforms existing approaches, achieving 55.14% accuracy on NExT-QA and 71.24% accuracy on DramaQA, highlighting its practical relevance and effectiveness.
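A minimal sketch, not the authors' implementation, of the general prompt-based continual learning pattern the abstract builds on: learnable prompt tokens are prepended to the frozen LLM's input embeddings, so only the prompts are updated per task. Dimensions, the number of tasks, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PromptPool(nn.Module):
    def __init__(self, num_tasks=8, prompt_len=16, dim=4096):
        super().__init__()
        # One set of prompt vectors per task; only these are trained for each task.
        self.prompts = nn.Parameter(torch.randn(num_tasks, prompt_len, dim) * 0.02)

    def forward(self, token_embeds, task_id):
        # token_embeds: (B, L, dim) embeddings of the question and video tokens
        b = token_embeds.size(0)
        p = self.prompts[task_id].unsqueeze(0).expand(b, -1, -1)
        return torch.cat([p, token_embeds], dim=1)  # prepend prompts to the sequence

pool = PromptPool()
out = pool(torch.randn(2, 32, 4096), task_id=0)     # -> shape (2, 48, 4096)
```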
CM2-Net: Continual Cross-Modal Mapping Network for Driver Action Recognition
Wang, Ruoyu, Cai, Chen, Wang, Wenqian, Gao, Jianjun, Lin, Dan, Liu, Wenyang, Yap, Kim-Hui
Driver action recognition has advanced significantly in enhancing driver-vehicle interaction and ensuring driving safety by integrating multiple modalities, such as infrared and depth. Nevertheless, compared with the RGB modality alone, collecting extensive data for all types of non-RGB modalities in car-cabin environments is laborious and costly. Therefore, previous works have suggested independently learning each non-RGB modality by fine-tuning a model pre-trained on RGB videos, but these methods are less effective at extracting informative features from newly-incoming modalities due to large domain gaps. In contrast, we propose a Continual Cross-Modal Mapping Network (CM2-Net) to continually learn each newly-incoming modality with instructive prompts from the previously-learned modalities. Specifically, we develop Accumulative Cross-modal Mapping Prompting (ACMP) to map the discriminative and informative features learned from previous modalities into the feature space of newly-incoming modalities. When a new modality arrives, these mapped features provide effective prompts indicating which features should be extracted and prioritized. These prompts accumulate throughout the continual learning process, further boosting recognition performance. Extensive experiments on the Drive&Act dataset demonstrate the superiority of CM2-Net on both uni- and multi-modal driver action recognition.
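A minimal sketch, not the authors' code, of the accumulative cross-modal prompting idea as described above: features from each previously learned modality are passed through a small mapping network into the new modality's feature space and attached as prompts, with one mapper added per learned modality. All module names and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossModalPrompter(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.mappers = nn.ModuleList()   # one mapper per previously learned modality
        self.feat_dim = feat_dim

    def add_modality(self):
        # Called after finishing a modality: register a mapper for its features.
        self.mappers.append(nn.Sequential(
            nn.Linear(self.feat_dim, self.feat_dim), nn.GELU(),
            nn.Linear(self.feat_dim, self.feat_dim)))

    def forward(self, new_feats, prev_feats_list):
        # new_feats: (B, D) features of the newly-incoming modality
        # prev_feats_list: list of (B, D) features from earlier modalities
        prompts = [m(f) for m, f in zip(self.mappers, prev_feats_list)]
        return torch.cat([new_feats] + prompts, dim=-1)  # accumulated prompt features

prompter = CrossModalPrompter()
prompter.add_modality()                                  # e.g. after learning RGB
fused = prompter(torch.randn(4, 512), [torch.randn(4, 512)])  # -> (4, 1024)
```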
Video sentence grounding with temporally global textual knowledge
Cai, Chen, Zhang, Runzhong, Gao, Jianjun, Wu, Kejun, Yap, Kim-Hui, Wang, Yi
Temporal sentence grounding aims to retrieve the video moment described by a natural language query. Many existing works directly incorporate the given video and the temporally localized query for grounding, overlooking the inherent domain gap between the two modalities. In this paper, we utilize pseudo-query features containing extensive temporally global textual knowledge, sourced from the same video-query pair, to bridge the domain gap and attain a higher level of similarity between multi-modal features. Specifically, we propose a Pseudo-query Intermediary Network (PIN) to improve the alignment of visual and comprehensive pseudo-query features within the feature space through contrastive learning. Subsequently, we utilize learnable prompts to encapsulate the knowledge of pseudo-queries and propagate them into the textual encoder and the multi-modal fusion module, further enhancing the alignment between visual and language features for better temporal grounding. Extensive experiments on the Charades-STA and ActivityNet-Captions datasets demonstrate the effectiveness of our method.
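A minimal sketch, not the authors' code, of one plausible form of the contrastive alignment step: a symmetric objective that pulls video features toward the pseudo-query features from the same pair while pushing apart mismatched pairs in the batch. Shapes and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def video_pseudoquery_alignment(video_feats, pseudo_query_feats, tau=0.07):
    """video_feats, pseudo_query_feats: (B, D) features from the same video-query pairs."""
    v = F.normalize(video_feats, dim=-1)
    q = F.normalize(pseudo_query_feats, dim=-1)
    logits = v @ q.t() / tau
    labels = torch.arange(v.size(0), device=v.device)
    # Symmetric alignment: video-to-query and query-to-video directions.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

loss = video_pseudoquery_alignment(torch.randn(8, 512), torch.randn(8, 512))
```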
Price Interpretability of Prediction Markets: A Convergence Analysis
Yu, Dian, Gao, Jianjun, Wu, Weiping, Wang, Zizhuo
Prediction markets have long been known for their prediction accuracy. This study systematically explores the fundamental properties of prediction markets, addressing questions about their information aggregation process and the factors contributing to their remarkable efficacy. We propose a novel multivariate utility (MU) based mechanism that unifies several existing automated market-making schemes. Using this mechanism, we establish convergence results for markets composed of risk-averse traders who have heterogeneous beliefs and repeatedly interact with the market maker. We demonstrate that the resulting limiting wealth distribution aligns with the Pareto efficient frontier defined by the utilities of all market participants. Building on this result, we establish analytical and numerical results for the limiting price in different market models. Specifically, we show that the limiting price converges to the geometric mean of agent beliefs in exponential utility-based markets. In risk-measure-based markets, we construct a family of risk measures that satisfy the convergence criteria and prove that the price converges to a unique level represented by the weighted power mean of agent beliefs. In broader markets with Constant Relative Risk Aversion (CRRA) utilities, we show that the limiting price can be characterized by systems of equations that encapsulate agent beliefs, risk parameters, and wealth. Although traders' trading sequences can affect the limiting price, we establish a price invariance result for markets with a large trader population. Using this result, we propose an efficient approximation scheme for the limiting price.
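A minimal numerical illustration, not taken from the paper, of the stated limiting result for exponential-utility markets: under the simplifying assumption of equal weighting across agents, the limiting price of each outcome is proportional to the geometric mean of the agents' beliefs, renormalized to sum to one. The example beliefs are arbitrary.

```python
import numpy as np

def geometric_mean_price(beliefs):
    """beliefs: (n_agents, n_outcomes) array, each row a probability vector."""
    g = np.exp(np.log(beliefs).mean(axis=0))   # per-outcome geometric mean of beliefs
    return g / g.sum()                          # renormalize into a price vector

beliefs = np.array([[0.6, 0.4],
                    [0.8, 0.2],
                    [0.5, 0.5]])
print(geometric_mean_price(beliefs))            # approximately [0.645, 0.355]
```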
OccluTrack: Rethinking Awareness of Occlusion for Enhancing Multiple Pedestrian Tracking
Gao, Jianjun, Wang, Yi, Yap, Kim-Hui, Garg, Kratika, Han, Boon-Siew
Multiple pedestrian tracking faces the challenge of reliably tracking pedestrians in the presence of occlusion. Existing methods suffer from inaccurate motion estimation, unreliable appearance feature extraction, and unfair association caused by occlusion, leading to inadequate Identification F1-Score (IDF1), excessive ID switches (IDSw), and insufficient association accuracy and recall (AssA and AssR). We find that the main cause is abnormal detections produced by partial occlusion. In this paper, we argue that the key lies in explicit motion estimation, reliable appearance features, and fair association in occlusion scenes. Specifically, we propose an adaptive occlusion-aware multiple pedestrian tracker, OccluTrack. We first introduce an abnormal motion suppression mechanism into the Kalman Filter to adaptively detect and suppress outlier motions caused by partial occlusion. Second, we propose a pose-guided re-ID module to extract discriminative part features for partially occluded pedestrians. Last, we design a new occlusion-aware association method for fair IoU and appearance-embedding distance measurement for occluded pedestrians. Extensive evaluations demonstrate that OccluTrack outperforms state-of-the-art methods on the MOT-Challenge datasets. In particular, the improvements in IDF1, IDSw, AssA, and AssR demonstrate the effectiveness of OccluTrack for tracking and association.
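A minimal sketch, not the authors' implementation, of the abnormal-motion-suppression idea: when a detection implies a displacement far from the Kalman prediction, as tends to happen under partial occlusion, the update falls back to the motion model instead of trusting the detection. The gate threshold and expected step size are illustrative assumptions.

```python
import numpy as np

def suppressed_update(kf_predicted_center, detected_center,
                      velocity_gate=3.0, expected_step=5.0):
    """Return the center to use for the Kalman measurement update."""
    displacement = np.linalg.norm(detected_center - kf_predicted_center)
    if displacement > velocity_gate * expected_step:
        # Abnormal jump: trust the motion model, not the (likely occluded) detection.
        return kf_predicted_center
    return detected_center

center = suppressed_update(np.array([100.0, 50.0]), np.array([160.0, 50.0]))
print(center)   # large jump suppressed -> falls back to the predicted center
```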