Ji, Yatai
CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space
Zhao, Yong, Xu, Kai, Zhu, Zhengqiu, Hu, Yue, Zheng, Zhiheng, Chen, Yingfeng, Ji, Yatai, Gao, Chen, Li, Yong, Huang, Jincai
Embodied Question Answering (EQA) has primarily focused on indoor environments, leaving the complexities of urban settings - spanning environment, action, and perception - largely unexplored. To bridge this gap, we introduce CityEQA, a new task where an embodied agent answers open-vocabulary questions through active exploration in dynamic city spaces. To support this task, we present CityEQA-EC, the first benchmark dataset featuring 1,412 human-annotated tasks across six categories, grounded in a realistic 3D urban simulator. Moreover, we propose Planner-Manager-Actor (PMA), a novel agent tailored for CityEQA. PMA enables long-horizon planning and hierarchical task execution: the Planner breaks down the question answering into sub-tasks, the Manager maintains an object-centric cognitive map for spatial reasoning during the process control, and the specialized Actors handle navigation, exploration, and collection sub-tasks. Experiments demonstrate that PMA achieves 60.7% of human-level answering accuracy, significantly outperforming frontier-based baselines. While promising, the performance gap compared to humans highlights the need for enhanced visual reasoning in CityEQA. This work paves the way for future advancements in urban spatial intelligence. Dataset and code are available at https://github.com/BiluYong/CityEQA.git.
AutoS$^2$earch: Unlocking the Reasoning Potential of Large Models for Web-based Source Search
Zhu, Zhengqiu, Ji, Yatai, Huang, Jiaheng, Zhao, Yong, Qiu, Sihang, Ju, Rusheng
Web-based management systems have been widely used in risk control and industrial safety. However, effectively integrating source search capabilities into these systems, to enable decision-makers to locate and address the hazard (e.g., gas leak detection) remains a challenge. While prior efforts have explored using web crowdsourcing and AI algorithms for source search decision support, these approaches suffer from overheads in recruiting human participants and slow response times in time-sensitive situations. To address this, we introduce AutoS$^2$earch, a novel framework leveraging large models for zero-shot source search in web applications. AutoS$^2$earch operates on a simplified visual environment projected through a web-based display, utilizing a chain-of-thought prompt designed to emulate human reasoning. The multi-modal large language model (MLLMs) dynamically converts visual observations into language descriptions, enabling the LLM to perform linguistic reasoning on four directional choices. Extensive experiments demonstrate that AutoS$^2$earch achieves performance nearly equivalent to human-AI collaborative source search while eliminating dependency on crowdsourced labor. Our work offers valuable insights in using web engineering to design such autonomous systems in other industrial applications.
Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM
Ji, Yatai, Zhang, Jiacheng, Wu, Jie, Zhang, Shilong, Chen, Shoufa, GE, Chongjian, Sun, Peize, Chen, Weifeng, Shao, Wenqi, Xiao, Xuefeng, Huang, Weilin, Luo, Ping
Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs, where the textual prompts play a pivotal role in determining quality of output videos. However, achieving the desired output often entails multiple revisions and iterative inference to refine user-provided prompts. Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware when applied to text-to-video diffusion models. To address these problem, we introduce an LLM-based prompt adaptation framework, termed as Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to specific video diffusion model. Our approach involves a meticulously crafted two-stage optimization and alignment system. Initially, we conduct a reward-guided prompt evolution pipeline to automatically create optimal prompts pool and leverage them for supervised fine-tuning (SFT) of the LLM. Then multi-dimensional rewards are employed to generate pairwise data for the SFT model, followed by the direct preference optimization (DPO) algorithm to further facilitate preference alignment. Through extensive experimentation and comparative analyses, we validate the effectiveness of Prompt-A-Video across diverse generation models, highlighting its potential to push the boundaries of video generation.
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
Ji, Yatai, Zhang, Shilong, Wu, Jie, Sun, Peize, Chen, Weifeng, Xiao, Xuefeng, Yang, Sidi, Yang, Yujiu, Luo, Ping
The rapid advancement of Large Vision-Language models (LVLMs) has demonstrated a spectrum of emergent capabilities. Nevertheless, current models only focus on the visual content of a single scenario, while their ability to associate instances across different scenes has not yet been explored, which is essential for understanding complex visual content, such as movies with multiple characters and intricate plots. Towards movie understanding, a critical initial step for LVLMs is to unleash the potential of character identities memory and recognition across multiple visual scenarios. To achieve the goal, we propose visual instruction tuning with ID reference and develop an ID-Aware Large Vision-Language Model, IDA-VLM. Furthermore, our research introduces a novel benchmark MM-ID, to examine LVLMs on instance IDs memory and recognition across four dimensions: matching, location, question-answering, and captioning. Our findings highlight the limitations of existing LVLMs in recognizing and associating instance identities with ID reference. This paper paves the way for future artificial intelligence systems to possess multi-identity visual inputs, thereby facilitating the comprehension of complex visual narratives like movies.
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
Wang, Junjie, Zhang, Yin, Ji, Yatai, Zhang, Yuxiang, Jiang, Chunyang, Wang, Yubo, Zhu, Kang, Wang, Zekun, Wang, Tiezhen, Huang, Wenhao, Fu, Jie, Chen, Bei, Lin, Qunshu, Liu, Minghao, Zhang, Ge, Chen, Wenhu
Recent advancements in Large Multimodal Models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent challenges in perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. Addressing these issues, we introduce a novel dataset format, PIN (Paired and INterleaved multimodal documents), designed to significantly improve both the depth and breadth of multimodal training. The PIN format is built on three foundational principles: knowledge intensity, scalability, and support for diverse training modalities. This innovative format combines markdown files and comprehensive images to enrich training data with a dense knowledge structure and versatile training strategies. We present PIN-14M, an open-source dataset comprising 14 million samples derived from a diverse range of Chinese and English sources, tailored to include complex web and scientific content. This dataset is constructed meticulously to ensure data quality and ethical integrity, aiming to facilitate advanced training strategies and improve model robustness against common multimodal training pitfalls. Our initial results, forming the basis of this technical report, suggest significant potential for the PIN format in refining LMM performance, with plans for future expansions and detailed evaluations of its impact on model capabilities.
Taming Lookup Tables for Efficient Image Retouching
Yang, Sidi, Huang, Binxiao, Cao, Mingdeng, Ji, Yatai, Guo, Hanzhong, Wong, Ngai, Yang, Yujiu
The widespread use of high-definition screens in edge devices, such as end-user cameras, smartphones, and televisions, is spurring a significant demand for image enhancement. Existing enhancement models often optimize for high performance while falling short of reducing hardware inference time and power consumption, especially on edge devices with constrained computing and storage resources. To this end, we propose Image Color Enhancement Lookup Table (ICELUT) that adopts LUTs for extremely efficient edge inference, without any convolutional neural network (CNN). During training, we leverage pointwise (1x1) convolution to extract color information, alongside a split fully connected layer to incorporate global information. Both components are then seamlessly converted into LUTs for hardware-agnostic deployment. ICELUT achieves near-state-of-the-art performance and remarkably low power consumption. We observe that the pointwise network structure exhibits robust scalability, upkeeping the performance even with a heavily downsampled 32x32 input image. These enable ICELUT, the first-ever purely LUT-based image enhancer, to reach an unprecedented speed of 0.4ms on GPU and 7ms on CPU, at least one order faster than any CNN solution. Codes are available at https://github.com/Stephen0808/ICELUT.
Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models
Chen, Weifeng, Ji, Yatai, Wu, Jie, Wu, Hefeng, Xie, Pan, Li, Jiashi, Xia, Xin, Xiao, Xuefeng, Lin, Liang
Recent advancements in diffusion models have unlocked unprecedented abilities in visual creation. However, current text-to-video generation models struggle with the trade-off among movement range, action coherence and object consistency. To mitigate this issue, we present a controllable text-to-video (T2V) diffusion model, called Control-A-Video, capable of maintaining consistency while customizable video synthesis. Based on a pre-trained conditional text-to-image (T2I) diffusion model, our model aims to generate videos conditioned on a sequence of control signals, such as edge or depth maps. For the purpose of improving object consistency, Control-A-Video integrates motion priors and content priors into video generation. We propose two motion-adaptive noise initialization strategies, which are based on pixel residual and optical flow, to introduce motion priors from input videos, producing more coherent videos. Moreover, a first-frame conditioned controller is proposed to generate videos from content priors of the first frame, which facilitates the semantic alignment with text and allows longer video generation in an auto-regressive manner. With the proposed architecture and strategies, our model achieves resource-efficient convergence and generate consistent and coherent videos with fine-grained control. Extensive experiments demonstrate its success in various video generative tasks such as video editing and video style transfer, outperforming previous methods in terms of consistency and quality.
Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection
Xiao, Yicheng, Luo, Zhuoyan, Liu, Yong, Ma, Yue, Bian, Hengwei, Ji, Yatai, Yang, Yujiu, Li, Xiu
Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis. Recent approaches treat MR and HD as similar video grounding problems and address them together with transformer-based architecture. However, we observe that the emphasis of MR and HD differs, with one necessitating the perception of local relationships and the other prioritizing the understanding of global contexts. Consequently, the lack of task-specific design will inevitably lead to limitations in associating the intrinsic specialty of two tasks. To tackle the issue, we propose a Unified Video COMprehension framework (UVCOM) to bridge the gap and jointly solve MR and HD effectively. By performing progressive integration on intra and inter-modality across multi-granularity, UVCOM achieves the comprehensive understanding in processing a video. Moreover, we present multi-aspect contrastive learning to consolidate the local relation modeling and global knowledge accumulation via well aligned multi-modal space. Extensive experiments on QVHighlights, Charades-STA, TACoS , YouTube Highlights and TVSum datasets demonstrate the effectiveness and rationality of UVCOM which outperforms the state-of-the-art methods by a remarkable margin.
Multimodal Prototype-Enhanced Network for Few-Shot Action Recognition
Ni, Xinzhe, Liu, Yong, Wen, Hao, Ji, Yatai, Xiao, Jing, Yang, Yujiu
Current methods for few-shot action recognition mainly fall into the metric learning framework following ProtoNet, which demonstrates the importance of prototypes. Although they achieve relatively good performance, the effect of multimodal information is ignored, e.g. label texts. In this work, we propose a novel MultimOdal PRototype-ENhanced Network (MORN), which uses the semantic information of label texts as multimodal information to enhance prototypes. A CLIP visual encoder and a frozen CLIP text encoder are introduced to obtain features with good multimodal initialization. Then in the visual flow, visual prototypes are computed by a Temporal-Relational CrossTransformer (TRX) module for example. In the text flow, a semantic-enhanced (SE) module and an inflating operation are used to obtain text prototypes. The final multimodal prototypes are then computed by a multimodal prototype-enhanced (MPE) module. Besides, we define a PRototype SImilarity DiffErence (PRIDE) to evaluate the quality of prototypes, which is used to verify our improvement on the prototype level and effectiveness of MORN. We conduct extensive experiments on four popular datasets, and MORN achieves state-of-the-art results on HMDB51, UCF101, Kinetics and SSv2. When plugging PRIDE into the training stage, the performance can be further improved.
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Ji, Yatai, Wang, Junjie, Gong, Yuan, Zhang, Lin, Zhu, Yanru, Wang, Hongfa, Zhang, Jiaxing, Sakai, Tetsuya, Yang, Yujiu
Multimodal semantic understanding often has to deal with uncertainty, which means the obtained messages tend to refer to multiple targets. Such uncertainty is problematic for our interpretation, including inter- and intra-modal uncertainty. Little effort has studied the modeling of this uncertainty, particularly in pre-training on unlabeled datasets and fine-tuning in task-specific downstream datasets. In this paper, we project the representations of all modalities as probabilistic distributions via a Probability Distribution Encoder (PDE) by utilizing sequence-level interactions. Compared to the existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information and more complex relationships. Furthermore, we integrate uncertainty modeling with popular pre-training frameworks and propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). The fine-tuned models are applied to challenging downstream tasks, including image-text retrieval, visual question answering, visual reasoning, and visual entailment, and achieve state-of-the-art results.