
Collaborating Authors

Wan, Zihao


StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding

arXiv.org Artificial Intelligence

The rapid development of Multimodal Large Language Models (MLLMs) has expanded their capabilities from image comprehension to video understanding. However, most of these MLLMs focus primarily on offline video comprehension, necessitating extensive processing of all video frames before any queries can be made. This presents a significant gap compared to the human ability to watch, listen, think, and respond to streaming inputs in real time, highlighting the limitations of current MLLMs. In this paper, we introduce StreamingBench, the first comprehensive benchmark designed to evaluate the streaming video understanding capabilities of MLLMs. The benchmark consists of 18 tasks, featuring 900 videos and 4,500 human-curated QA pairs. Each video features five questions presented at different time points to simulate a continuous streaming scenario. We conduct experiments on StreamingBench with 13 open-source and proprietary MLLMs and find that even the most advanced proprietary MLLMs like Gemini 1.5 Pro and GPT-4o perform significantly below human-level streaming video understanding capabilities. We hope our work can facilitate further advancements for MLLMs, empowering them to approach human-level video comprehension and interaction in more realistic scenarios.

The rapid evolution of Multimodal Large Language Models (MLLMs) has significantly reshaped the field of Artificial Intelligence (Yang et al., 2023; Reid et al., 2024; Liu et al., 2024c;a). Current advanced MLLMs (Reid et al., 2024; Wang et al., 2024a; Yao et al., 2024) have already demonstrated exceptional performance in video understanding tasks, excelling on existing video benchmarks (Fu et al., 2024; Wang et al., 2024b; Zhou et al., 2024; Ataallah et al., 2024). Moreover, several pioneering studies (Chen et al., 2024a; Zhang et al., 2024a; Wu et al., 2024) have focused on improving the ability of MLLMs to comprehend real-time online video streams, pushing the boundaries of their applicability and efficiency in dynamic environments. In industry, streaming video understanding has also attracted significant attention, with OpenAI's GPT-4o (OpenAI, 2024) as a prominent example that demonstrates human-like perception and understanding of streaming inputs. Despite the recognized importance of streaming video understanding for MLLMs, most existing video understanding benchmarks (Fu et al., 2024; Wang et al., 2024b; Zhou et al., 2024) are designed for the offline setting, in which questions are posed with the entire video visible. In contrast, StreamingBench presents questions at specific moments, with three main task categories specifically designed to evaluate fundamental capabilities in streaming video understanding.
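To make the offline-versus-streaming distinction concrete, the sketch below contrasts the two evaluation protocols: in the streaming setting a model may only use the frames observed up to each question's timestamp, while the offline setting exposes the whole video for every question. This is a minimal illustration; the data layout and the model.answer interface are hypothetical placeholders, not the StreamingBench harness.

from dataclasses import dataclass
from typing import List

@dataclass
class StreamingQuery:
    question: str
    timestamp_s: float  # moment in the video at which the question is asked
    answer: str

def evaluate_streaming(model, frames: List, fps: float, queries: List[StreamingQuery]) -> float:
    # Streaming protocol: each question sees only the frames up to its timestamp.
    correct = 0
    for q in sorted(queries, key=lambda q: q.timestamp_s):
        visible = frames[: int(q.timestamp_s * fps)]    # no access to future frames
        prediction = model.answer(visible, q.question)  # hypothetical model interface
        correct += int(prediction.strip().lower() == q.answer.strip().lower())
    return correct / len(queries)

def evaluate_offline(model, frames: List, queries: List[StreamingQuery]) -> float:
    # Offline protocol: every question sees the entire video.
    correct = 0
    for q in queries:
        prediction = model.answer(frames, q.question)
        correct += int(prediction.strip().lower() == q.answer.strip().lower())
    return correct / len(queries)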


Results of the Big ANN: NeurIPS'23 competition

arXiv.org Artificial Intelligence

The 2023 Big ANN Challenge, held at NeurIPS 2023, focused on advancing the state-of-the-art in indexing data structures and search algorithms for practical variants of Approximate Nearest Neighbor (ANN) search that reflect the growing complexity and diversity of workloads. Unlike prior challenges that emphasized scaling up classical ANN search (Simhadri et al., 2021), this competition addressed filtered search, out-of-distribution data, and sparse and streaming variants of ANN search. Participants developed and submitted innovative solutions that were evaluated on new standard datasets with constrained computational resources. The results showcased significant improvements in search accuracy and efficiency over industry-standard baselines, with notable contributions from both academic and industrial teams. This paper summarizes the competition tracks, datasets, evaluation metrics, and the innovative approaches of the top-performing submissions, providing insights into the current advancements and future directions in the field of approximate nearest neighbor search.
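As a rough illustration of how search accuracy is commonly measured in this setting, the snippet below computes recall@k, the fraction of the true k nearest neighbors that an approximate index returns, against a brute-force ground truth. It is a minimal sketch of the standard metric, not the competition's evaluation harness.

import numpy as np

def exact_knn(base: np.ndarray, queries: np.ndarray, k: int) -> np.ndarray:
    # Brute-force ground truth: Euclidean distance from every query to every base vector.
    dists = np.linalg.norm(queries[:, None, :] - base[None, :, :], axis=-1)
    return np.argsort(dists, axis=1)[:, :k]

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray, k: int) -> float:
    # Fraction of the true k nearest neighbors recovered by the approximate search,
    # averaged over all queries. Both inputs are (num_queries, k) arrays of indices.
    hits = 0
    for approx, exact in zip(approx_ids[:, :k], exact_ids[:, :k]):
        hits += len(set(approx.tolist()) & set(exact.tolist()))
    return hits / (approx_ids.shape[0] * k)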


CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models

arXiv.org Artificial Intelligence

Multimodal large language models (MLLMs) have demonstrated promising results in a variety of tasks that combine vision and language. As these models become more integral to research and applications, conducting comprehensive evaluations of their capabilities has grown increasingly important. However, most existing benchmarks fail to consider that, in certain situations, images need to be interpreted within a broader context. In this work, we introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. Our findings indicate that MLLMs consistently fall short of human performance on this benchmark. Further analysis confirms that these models struggle to effectively extract and utilize contextual information to improve their understanding of images. This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner. View our project website at https://thunlp-mt.github.io/CODIS.
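A minimal sketch of the comparison this benchmark motivates is shown below: the same visual question is asked with and without its accompanying free-form context, and the score records whether the context turns a wrong answer into a right one or vice versa. The prompt template and the query_mllm callable are assumptions for illustration, not the CODIS protocol.

from typing import Callable, Optional

def build_prompt(question: str, context: Optional[str] = None) -> str:
    # Prepend the free-form textual context to the visual question, if provided.
    if context:
        return f"Context: {context}\nQuestion: {question}"
    return f"Question: {question}"

def context_gain(query_mllm: Callable, image, question: str, context: str, reference: str) -> int:
    # +1 if adding context fixes an otherwise wrong answer,
    # -1 if it breaks a previously correct one, 0 otherwise.
    without_ctx = query_mllm(image, build_prompt(question))
    with_ctx = query_mllm(image, build_prompt(question, context))
    return int(with_ctx == reference) - int(without_ctx == reference)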