Li, Hongliang
Challenges and Trends in Egocentric Vision: A Survey
Li, Xiang, Qiu, Heqian, Wang, Lanxiao, Zhang, Hanwen, Qi, Chenghao, Han, Linfeng, Xiong, Huiyu, Li, Hongliang
With the rapid development of artificial intelligence technologies and wearable devices, egocentric vision understanding has emerged as a new and challenging research direction, gradually attracting widespread attention from both academia and industry. Egocentric vision captures visual and multimodal data through cameras or sensors worn on the human body, offering a unique perspective that simulates human visual experiences. This paper provides a comprehensive survey of research on egocentric vision understanding, systematically analyzing the components of egocentric scenes and categorizing the tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding. We explore the sub-tasks within each category in detail and summarize the main challenges and trends currently shaping the field. Furthermore, this paper presents an overview of high-quality egocentric vision datasets, offering valuable resources for future research. By summarizing the latest advancements, we anticipate broad applications of egocentric vision technologies in fields such as augmented reality, virtual reality, and embodied intelligence, and propose future research directions based on these developments.
Smaller But Better: Unifying Layout Generation with Smaller Large Language Models
Zhang, Peirong, Zhang, Jiaxin, Cao, Jiahuan, Li, Hongliang, Jin, Lianwen
We propose LGGPT, an LLM-based model tailored for unified layout generation. First, we propose Arbitrary Layout Instruction (ALI) and Universal Layout Response (ULR) as a uniform I/O template. ALI accommodates arbitrary layout generation task inputs across multiple layout domains, enabling LGGPT to unify both task-generic and domain-generic layout generation, which had hitherto been unexplored. Collectively, ALI and ULR boast a succinct structure that forgoes the superfluous tokens typically found in existing HTML-based formats, facilitating efficient instruction tuning and boosting unified generation performance. In addition, we propose an Interval Quantization Encoding (IQE) strategy that compresses ALI into a more condensed structure. IQE precisely preserves valid layout clues while eliminating less informative placeholders, helping LGGPT capture complex and variable layout generation conditions during unified training. Experimental results demonstrate that LGGPT achieves performance that is superior or on par with that of existing methods. Notably, LGGPT strikes a prominent balance between proficiency and efficiency with a compact 1.5B-parameter LLM, which beats prior 7B and 175B models even in the most extensive and challenging unified scenario. Furthermore, we underscore the necessity of employing LLMs for unified layout generation and, by comparing LLMs of varying scales, suggest that 1.5B could be an optimal parameter size. Code is available at https://github.com/NiceRingNode/LGGPT.
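As a rough illustration of the general idea of a compact, coordinate-quantized layout serialization (in contrast to verbose HTML-style markup), consider the sketch below. The format shown, an element type followed by binned coordinates, is hypothetical and is not the actual ALI/ULR template or the IQE scheme defined in the paper.

```python
def quantize(value: float, num_bins: int = 128) -> int:
    """Map a normalized coordinate in [0, 1] to a discrete bin index."""
    return min(int(value * num_bins), num_bins - 1)

def serialize_layout(elements: list[dict], num_bins: int = 128) -> str:
    """Serialize layout elements as compact 'type x y w h' tuples, dropping the
    tag/attribute boilerplate an HTML-style format would carry.
    (Hypothetical format for illustration only.)"""
    parts = []
    for e in elements:
        coords = (quantize(e[k], num_bins) for k in ("x", "y", "w", "h"))
        parts.append(e["type"] + " " + " ".join(str(c) for c in coords))
    return " | ".join(parts)

# Example:
# serialize_layout([{"type": "title", "x": 0.1, "y": 0.05, "w": 0.8, "h": 0.1}])
# -> 'title 12 6 102 12'
```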
Towards Cost-Effective Reward Guided Text Generation
Rashid, Ahmad, Wu, Ruotian, Fan, Rongqi, Li, Hongliang, Kristiadi, Agustinus, Poupart, Pascal
Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback (RLHF). RGTG methods can align baseline language models to human preferences without the further training required by standard RLHF methods. However, they rely on a reward model to score each candidate token generated by the language model at inference time, incurring significant test-time overhead. Additionally, the reward model is usually trained only to score full sequences, which can lead to sub-optimal choices for partial sequences. In this work, we present a novel reward model architecture that is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a \emph{single call} to the reward model at each step of the generation process. That is, scores for all possible candidate tokens are generated simultaneously, leading to efficient inference. We theoretically analyze various RGTG reward models and demonstrate that prior techniques prefer sub-optimal sequences compared to our method during inference. Empirically, our reward model leads to significantly faster inference than other RGTG methods: it requires fewer calls to the reward model and performs competitively compared to previous RGTG and offline RLHF methods.
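A minimal sketch (PyTorch; module and tensor names are hypothetical) of the mechanism the abstract describes: a reward head that scores every candidate next token in one forward pass over the prefix, trained with a Bradley-Terry preference loss. It illustrates the general idea rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenLevelRewardHead(nn.Module):
    """Maps the prefix's final hidden state to a reward for every vocabulary
    token, so a single call scores all candidate expansions of a sequence."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, vocab_size)

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, hidden_size) -> rewards: (batch, vocab_size)
        return self.score(last_hidden)

def bradley_terry_loss(rewards: torch.Tensor,
                       chosen_ids: torch.Tensor,
                       rejected_ids: torch.Tensor) -> torch.Tensor:
    """Prefer the chosen continuation token over the rejected one."""
    r_chosen = rewards.gather(1, chosen_ids.unsqueeze(1)).squeeze(1)
    r_rejected = rewards.gather(1, rejected_ids.unsqueeze(1)).squeeze(1)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# At inference, the per-token rewards could be blended with the base LM's
# logits before sampling, e.g. guided = lm_logits + beta * reward_head(h).
```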
DesCLIP: Robust Continual Adaptation via General Attribute Descriptions for Pretrained Vision-Language Models
He, Chiyuan, Qiu, Zihuan, Meng, Fanman, Xu, Linfeng, Wu, Qingbo, Li, Hongliang
Continual adaptation of vision-language models (VLMs) focuses on leveraging cross-modal pretrained knowledge to incrementally adapt to expanding downstream tasks and datasets while tackling the challenge of knowledge forgetting. Existing research often focuses on connecting visual features with specific class text in downstream tasks, overlooking the latent relationships between general and specialized knowledge. Our findings reveal that forcing models to optimize inappropriate visual-text matches exacerbates forgetting in VLMs. To tackle this issue, we propose DesCLIP, which leverages general attribute (GA) descriptions to guide the understanding of specific class objects, enabling VLMs to establish robust \textit{vision-GA-class} trilateral associations rather than relying solely on \textit{vision-class} connections. Specifically, we introduce a language assistant to generate concrete GA description candidates via proper request prompts. Then, an anchor-based embedding filter is designed to obtain highly relevant GA description embeddings, which are used as the paired text embeddings for visual-textual instance matching, thereby tuning the visual encoder. Correspondingly, the class text embeddings are gradually calibrated to align with these shared GA description embeddings. Extensive experiments demonstrate the efficacy of our proposed method, with comprehensive empirical evaluations highlighting its superior performance over existing pretrained and VLM-based continual learning methods.
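A minimal sketch (PyTorch; all names hypothetical) of an anchor-based embedding filter in the spirit of the abstract: candidate general-attribute (GA) description embeddings are ranked by cosine similarity to an anchor embedding and only the most relevant ones are kept. This is illustrative only, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def filter_ga_embeddings(ga_embeds: torch.Tensor,
                         anchor_embed: torch.Tensor,
                         top_k: int = 5) -> torch.Tensor:
    """Select the GA description embeddings most relevant to an anchor.

    ga_embeds:    (num_candidates, dim) embeddings of GA description candidates
    anchor_embed: (dim,) anchor embedding, e.g. the class text embedding
    returns:      (top_k, dim) filtered GA description embeddings
    """
    ga = F.normalize(ga_embeds, dim=-1)
    anchor = F.normalize(anchor_embed, dim=-1)
    sims = ga @ anchor                                  # cosine similarity
    idx = sims.topk(min(top_k, ga_embeds.size(0))).indices
    return ga_embeds[idx]

# The filtered embeddings could then serve as the paired text embeddings for
# visual-textual instance matching, while class text embeddings are gradually
# nudged toward them, e.g. with a momentum-style update (illustrative only):
#   class_embed = (1 - m) * class_embed + m * filtered.mean(dim=0)
```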
Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt
Lin, Xingtao, Qiu, Heqian, Wang, Lanxiao, Wang, Ruihang, Xu, Linfeng, Li, Hongliang
Recent advancements in prompt tuning have successfully adapted large-scale models such as Contrastive Language-Image Pre-training (CLIP) to downstream tasks like scene text detection. Typically, a text prompt complements the text encoder's input, focusing on global features while neglecting fine-grained details, which causes fine-grained text to be overlooked in scene text detection. In this paper, we propose the region prompt tuning (RPT) method for fine-grained scene text detection, in which a proposed region text prompt helps the model focus on fine-grained features. RPT decomposes the region text prompt into individual characters and splits the visual feature map into region visual tokens, creating a one-to-one correspondence between characters and tokens. This allows each character to match the local features of a token, avoiding the omission of detailed features and fine-grained text. To achieve this, we introduce a shared position embedding that links each character with its corresponding token and employ a bidirectional distance loss to align each region text prompt character with the target ``text''. To refine information at the fine-grained level, we apply character-token interactions before and after encoding. Our method combines a general score map from the image-text process with a region score map derived from character-token matching, producing a final score map that balances global and local features and is fed into DBNet to detect text. Experiments on benchmarks such as ICDAR2015, TotalText, and CTW1500 demonstrate RPT's impressive performance, underscoring its effectiveness for scene text detection.
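A minimal sketch (PyTorch; function and tensor names hypothetical) of the score-map fusion idea described above: region visual tokens are scored by their best-matching prompt character, and the resulting local map is blended with a global image-text score map before detection. It illustrates the general mechanism, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def region_score_map(char_embeds: torch.Tensor,
                     region_tokens: torch.Tensor) -> torch.Tensor:
    """Score each region visual token by its best-matching prompt character.

    char_embeds:   (num_chars, dim) embeddings of region text prompt characters
    region_tokens: (H, W, dim)      region visual tokens from the feature map
    returns:       (H, W)           per-region text score
    """
    tokens = F.normalize(region_tokens, dim=-1).flatten(0, 1)   # (H*W, dim)
    chars = F.normalize(char_embeds, dim=-1)                    # (C, dim)
    sims = tokens @ chars.t()                                   # (H*W, C)
    return sims.max(dim=-1).values.view(region_tokens.shape[:2])

def fuse_score_maps(global_map: torch.Tensor,
                    region_map: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    """Blend the global (image-text) and local (character-token) score maps
    before handing the result to a detector head such as DBNet."""
    return alpha * global_map + (1.0 - alpha) * region_map
```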
HiHa: Introducing Hierarchical Harmonic Decomposition to Implicit Neural Compression for Atmospheric Data
Xu, Zhewen, Pan, Baoxiang, Li, Hongliang, Wei, Xiaohui
The rapid development of large climate models has created a need to store and transfer massive amounts of atmospheric data worldwide. Data compression is therefore essential for meteorological research, yet an efficient compression scheme that maintains high accuracy together with high compressibility is still lacking. As an emerging technique, Implicit Neural Representation (INR) has recently gained impressive momentum and shows great promise for compressing diverse natural data. However, INR-based compression encounters a bottleneck due to the sophisticated spatio-temporal properties and variability of atmospheric data. To address this issue, we propose Hierarchical Harmonic decomposition implicit neural compression (HiHa) for atmospheric data. HiHa first segments the data into multi-frequency signals through decomposition into multiple complex harmonics, and then handles each harmonic with a frequency-based hierarchical compression module consisting of sparse storage, multi-scale INR, and iterative decomposition sub-modules. We additionally design a temporal residual compression module that exploits temporal continuity to accelerate compression. Experiments show that HiHa outperforms both mainstream compressors and other INR-based methods in compression fidelity and capability, and demonstrate that using the compressed data in existing data-driven models achieves the same accuracy as using raw data.
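A minimal sketch (NumPy/PyTorch; all names hypothetical) of the frequency-then-INR idea: a 2D field is split into frequency bands, and a small coordinate MLP is fitted per band. The FFT-mask band split here merely stands in for the paper's complex harmonic decomposition, and the MLP is a generic single-scale INR, not HiHa's hierarchical module.

```python
import numpy as np
import torch
import torch.nn as nn

def frequency_bands(field: np.ndarray, cutoffs=(0.1, 0.3)) -> list[np.ndarray]:
    """Split a 2D field into low/mid/high-frequency components via FFT masks."""
    spec = np.fft.fftshift(np.fft.fft2(field))
    h, w = field.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)
    bands, used = [], np.zeros_like(radius, dtype=bool)
    for c in (*cutoffs, np.inf):
        mask = (radius <= c) & ~used          # ring of frequencies for this band
        bands.append(np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask))))
        used |= mask
    return bands                               # bands sum back to the input field

class BandINR(nn.Module):
    """A small coordinate MLP fitted to one frequency band of the field."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 2) normalized (y, x) positions -> (N, 1) band values
        return self.net(coords)
```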
On the Hidden Mystery of OCR in Large Multimodal Models
Liu, Yuliang, Li, Zhang, Li, Hongliang, Yu, Wenwen, Liu, Yang, Yang, Biao, Huang, Mingxin, Peng, Dezhi, Liu, Mingyu, Chen, Mingrui, Li, Chunyuan, Yin, Xucheng, Liu, Cheng-lin, Jin, Lianwen, Bai, Xiang
Large models have recently played a dominant role in natural language processing and multimodal vision-language learning, but their efficacy in text-related visual tasks remains underexplored. We conducted a comprehensive study of existing publicly available multimodal models, evaluating their performance on text recognition (document text, artistic text, handwritten text, scene text), text-based visual question answering (document text, scene text, and bilingual text), key information extraction (receipts, documents, and nutrition facts), and handwritten mathematical expression recognition. Our findings reveal strengths and weaknesses in these models, which primarily rely on semantic understanding for word recognition and exhibit inferior perception of individual character shapes. They also display indifference towards text length and have limited capabilities in detecting fine-grained features in images. Consequently, these results demonstrate that even the currently most powerful large multimodal models cannot match domain-specific methods on traditional text tasks and face greater challenges on more complex tasks. Most importantly, the baseline results presented in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline is available at https://github.com/Yuliang-Liu/MultimodalOCR.
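A minimal sketch of the kind of text-recognition scoring such a benchmark might use: exact-match word accuracy and a normalized edit-distance similarity between model predictions and ground truth. This is a generic illustration, not the repository's actual evaluation pipeline, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i              # prev holds the previous row's value
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,            # deletion
                                     dp[j - 1] + 1,        # insertion
                                     prev + (ca != cb))    # substitution
    return dp[-1]

def ocr_scores(predictions: list[str], ground_truths: list[str]) -> dict:
    """Case-insensitive word accuracy and mean (1 - normalized edit distance)."""
    exact, ned = 0, 0.0
    for pred, gt in zip(predictions, ground_truths):
        p, g = pred.strip().lower(), gt.strip().lower()
        exact += int(p == g)
        ned += edit_distance(p, g) / max(len(p), len(g), 1)
    n = max(len(ground_truths), 1)
    return {"word_accuracy": exact / n, "norm_edit_sim": 1.0 - ned / n}
```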
Deep Learning in Ultrasound Elastography Imaging
Li, Hongliang, Bhatt, Manish, Qu, Zhen, Zhang, Shiming, Hartel, Martin C., Khademhosseini, Ali, Cloutier, Guy
It is known that changes in the mechanical properties of tissues are associated with the onset and progression of certain diseases. Ultrasound elastography is a technique to characterize tissue stiffness using ultrasound imaging, either by measuring tissue strain (quasi-static or natural organ pulsation elastography) or by tracking a propagating shear wave induced by an external source or natural vibration (dynamic elastography). In recent years, deep learning has begun to emerge in ultrasound elastography research. In this review, several common deep learning frameworks from the computer vision community, such as the multilayer perceptron, convolutional neural network, and recurrent neural network, are described. Then, recent advances in ultrasound elastography using such deep learning techniques are reviewed in terms of algorithm development and clinical diagnosis. Finally, the current challenges and future directions of deep learning in ultrasound elastography are discussed.
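As a toy illustration of the CNN-based strain-imaging setting this review covers, the sketch below regresses a displacement field from a pre-/post-compression frame pair; strain can then be approximated by spatial differentiation. The architecture and names are hypothetical and do not correspond to any specific network discussed in the review.

```python
import torch
import torch.nn as nn

class DisplacementCNN(nn.Module):
    """Toy fully convolutional network: takes a pre- and post-compression
    ultrasound frame pair (2 input channels) and regresses a 2-channel
    axial/lateral displacement field."""
    def __init__(self, width: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, 2, 3, padding=1))

    def forward(self, frame_pair: torch.Tensor) -> torch.Tensor:
        # frame_pair: (batch, 2, H, W) -> displacement: (batch, 2, H, W)
        return self.net(frame_pair)

# Axial strain is then roughly the vertical gradient of the axial component:
#   strain = torch.gradient(displacement[:, 0], dim=1)[0]
```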