AITopics | Yao, Linli

Yao, Linli

Temporal Reasoning Transfer from Text to Video

Li, Lei, Liu, Yuanxin, Yao, Linli, Zhang, Peiyuan, An, Chenxin, Wang, Lean, Sun, Xu, Kong, Lingpeng, Liu, Qi

arXiv.org Artificial Intelligence, Oct-8-2024

Video Large Language Models (Video LLMs) have shown promising capabilities in video comprehension, yet they struggle with tracking temporal changes and reasoning about temporal relationships. While previous research attributed this limitation to the ineffective temporal encoding of visual inputs, our diagnostic study reveals that video representations contain sufficient information for even small probing classifiers to achieve perfect accuracy. Surprisingly, we find that the key bottleneck in Video LLMs' temporal reasoning capability stems from the underlying LLM's inherent difficulty with temporal concepts, as evidenced by poor performance on textual temporal question-answering tasks. Building on this discovery, we introduce the Textual Temporal reasoning Transfer (T3). T3 synthesizes diverse temporal reasoning tasks in pure text format from existing image-text datasets, addressing the scarcity of video samples with complex temporal scenarios. Remarkably, without using any video data, T3 enhances LongVA-7B's temporal understanding, yielding a 5.3 absolute accuracy improvement on the challenging TempCompass benchmark, which enables our model to outperform ShareGPT4Video-8B trained on 28,000 video samples. Additionally, the enhanced LongVA-7B model achieves competitive performance on comprehensive video benchmarks. For example, it achieves a 49.7 accuracy on the Temporal Reasoning task of Video-MME, surpassing powerful large-scale models such as InternVL-Chat-V1.5-20B and VILA1.5-40B. Further analysis reveals a strong correlation between textual and video temporal task performance, validating the efficacy of transferring temporal reasoning abilities from text to video domains.

caption, large language model, machine learning

arXiv: 2410.06166

Country:

North America > United States > California (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Temporal Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos

Mei, Yuting, Yao, Linli, Jin, Qin

arXiv.org Artificial Intelligence, Jun-23-2024

With the surge in the amount of video data, video summarization techniques, including visual-modal (VM) and textual-modal (TM) summarization, are attracting more and more attention. However, unimodal summarization inevitably loses the rich semantics of the video. In this paper, we focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV). Specifically, we first construct a large-scale dataset, BIDS, in (video, VM-Summary, TM-Summary) triplet format. Unlike traditional processing methods, our construction procedure contains a VM-Summary extraction algorithm aiming to preserve the most salient content within long videos. Based on BIDS, we propose a unified framework, UBiSS, for the BiSSV task, which models the saliency information in the video and generates a TM-Summary and a VM-Summary simultaneously. We further optimize our model with a list-wise ranking-based objective to improve its capacity to capture highlights. Lastly, we propose a metric, $NDCG_{MS}$, to provide a joint evaluation of the bimodal summary. Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines. Code and data are available at https://github.com/MeiYutingg/UBiSS.
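
The abstract names the metric $NDCG_{MS}$ but does not define it. As background only, the following is a minimal sketch of standard NDCG, the ranking measure the name suggests it builds on (an assumption; the bimodal "MS" extension is not reproduced here). The function names `dcg` and `ndcg` and the toy saliency scores are hypothetical and do not come from the UBiSS code base.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in ranked order."""
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(predicted_scores, true_relevances, k=None):
    """Standard NDCG@k (assumed baseline, not the paper's NDCG_MS).

    Items are ranked by predicted score; the DCG of that ranking is
    divided by the DCG of the ideal ranking of the true relevances.
    """
    order = sorted(
        range(len(predicted_scores)),
        key=lambda i: predicted_scores[i],
        reverse=True,
    )
    ranked = [true_relevances[i] for i in order][:k]
    ideal = sorted(true_relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    # If no item has positive relevance, define the score as 0.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical per-segment saliency: model predictions vs. annotations.
    predicted = [0.2, 0.9, 0.4, 0.1]
    annotated = [1, 3, 2, 0]
    print(ndcg(predicted, annotated, k=3))
```

A score of 1.0 means the predicted ordering of segments matches the ideal ordering; lower values mean salient segments were ranked too low.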

artificial intelligence, machine learning, natural language

doi: 10.1145/3652583.3658038

arXiv: 2406.16301

Country:

Asia (0.71)
North America > United States (0.28)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision (0.96)
Information Technology > Data Science (0.93)

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

Wang, Yuchi, Ren, Shuhuai, Gao, Rundong, Yao, Linli, Guo, Qingyan, An, Kaikai, Bai, Jianhong, Sun, Xu

arXiv.org Artificial Intelligence, Apr-16-2024

Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these benefits, diffusion models can alleviate the inherent limitations of AR methods, including their slow inference speed, error propagation, and unidirectional constraints. Furthermore, we attribute the prior underperformance of diffusion models to the absence of an effective latent space for image-text alignment and to the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions and integrates a regularization module to manage varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique to enhance token interactivity during inference. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating exceptional performance without pre-training or ancillary modules. This indicates strong competitiveness with AR models, revealing the previously untapped potential of diffusion models in image-to-text generation.

diffusion model, machine learning, natural language

arXiv: 2404.10763

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Sports > Baseball (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.88)

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Ren, Shuhuai, Yao, Linli, Li, Shicheng, Sun, Xu, Hou, Lu

arXiv.org Artificial Intelligence, Dec-4-2023

This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths to accommodate videos of various durations. Additionally, we construct an instruction-tuning dataset, encompassing 6 tasks and a total of 125K instances, to further enhance TimeChat's instruction-following performance. Experimental results across various video understanding tasks, such as dense captioning, temporal grounding, and highlight detection, demonstrate TimeChat's strong zero-shot temporal localization and reasoning capabilities. For example, compared to state-of-the-art video large language models, it improves F1 score by 9.2 and CIDEr by 2.8 on YouCook2, HIT@1 by 5.8 on QVHighlights, and R@1 (IoU=0.5) by 27.5 on Charades-STA. TimeChat thus holds the potential to serve as a versatile video assistant for long-form video comprehension tasks and to satisfy realistic user requirements.
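
The abstract says the sliding video Q-Former yields a token sequence whose length varies with video duration, but gives no parameters. The sketch below illustrates only the generic sliding-window idea under stated assumptions: frames are grouped into fixed-size windows and each window contributes a fixed number of tokens. The names `sliding_windows` and `video_token_count`, and every numeric setting, are hypothetical and are not taken from the TimeChat implementation.

```python
def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame index pairs covering the whole video.

    `end` is exclusive. The last window is truncated at the video end
    instead of being padded (an assumption made for this sketch).
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break  # the video is fully covered
        start += stride
    return windows


def video_token_count(num_frames, window, stride, tokens_per_window):
    """Total tokens if every window is compressed to a fixed token budget."""
    return len(sliding_windows(num_frames, window, stride)) * tokens_per_window


if __name__ == "__main__":
    # Hypothetical settings: 32-frame windows, no overlap, 32 tokens each.
    for frames in (32, 96, 200):
        print(frames, video_token_count(frames, 32, 32, 32))
```

Under these assumptions the token count grows with the number of windows, so longer videos receive proportionally longer token sequences rather than being squeezed into one fixed-length representation.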

large language model, machine learning, natural language

arXiv: 2312.02051

Country: Europe > Netherlands (0.14)

Genre: Research Report (0.64)

Industry:

Consumer Products & Services (0.47)
Media > Film (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
