AITopics | Radhakrishnan, Subhashree

Collaborating Authors

Radhakrishnan, Subhashree

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models

Li, Zhiqi, Chen, Guo, Liu, Shilong, Wang, Shihao, VS, Vibashan, Ji, Yishen, Lan, Shiyi, Zhang, Hao, Zhao, Yilin, Radhakrishnan, Subhashree, Chang, Nadine, Sapra, Karan, Deshmukh, Amala Sanjay, Rintamaki, Tuomas, Le, Matthieu, Karmanov, Ilia, Voegtle, Lukas, Fischer, Philipp, Huang, De-An, Roman, Timo, Lu, Tong, Alvarez, Jose M., Catanzaro, Bryan, Kautz, Jan, Tao, Andrew, Liu, Guilin, Yu, Zhiding

arXiv.org Artificial IntelligenceJan-20-2025

Recently, promising progress has been made by open-source vision-language models (VLMs) in bringing their capabilities closer to those of proprietary frontier models. However, most open-source models only publish their final model weights, leaving the critical details of data strategies and implementation largely opaque. In this work, we address VLM post-training from a data-centric perspective, showing the key role of data strategy in developing frontier VLMs. By studying and building our post-training data strategy from scratch, we share detailed insights into the development processes, aiming to benefit the development of competitive models for the open-source community. Our introduced data strategy, together with training recipes and model design, leads to a family of performant VLMs named Eagle2. Specifically, Eagle2-9B achieves state-of-the-art results across various multimodal benchmarks, matching certain competitive models with up to 70B parameters.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2501.14818

Country: Asia (0.14)

Genre: Research Report > New Finding (0.45)

Industry: Education (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Add feedback

LITA: Language Instructed Temporal-Localization Assistant

Huang, De-An, Liao, Shijia, Radhakrishnan, Subhashree, Yin, Hongxu, Molchanov, Pavlo, Yu, Zhiding, Kautz, Jan

arXiv.org Artificial IntelligenceMar-27-2024

There has been tremendous progress in multimodal Large Language Models (LLMs). Recent works have extended these models to video input with promising instruction following capabilities. However, an important missing piece is temporal localization. These models cannot accurately answer the "When?" questions. We identify three key aspects that limit their temporal localization capabilities: (i) time representation, (ii) architecture, and (iii) data. We address these shortcomings by proposing Language Instructed Temporal-Localization Assistant (LITA) with the following features: (1) We introduce time tokens that encode timestamps relative to the video length to better represent time in videos. (2) We introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution. (3) We emphasize temporal localization data for LITA. In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for learning and evaluating this task. Reasoning temporal localization requires both the reasoning and temporal localization of Video LLMs. LITA demonstrates strong performance on this challenging task, nearly doubling the temporal mean intersection-over-union (mIoU) of baselines. In addition, we show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs, including a 36% relative improvement of Temporal Understanding. Code is available at: https://github.com/NVlabs/LITA

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2403.19046

Genre: Research Report (0.82)

Industry: Health & Medicine > Consumer Health (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback