Niebles, Juan Carlos
ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
Zhang, Jianguo, Hoang, Thai, Zhu, Ming, Liu, Zuxin, Wang, Shiyu, Awalgaonkar, Tulika, Prabhakar, Akshara, Chen, Haolin, Yao, Weiran, Liu, Zhiwei, Tan, Juntao, Niebles, Juan Carlos, Heinecke, Shelby, Wang, Huan, Savarese, Silvio, Xiong, Caiming
Action models are essential for enabling autonomous agents to perform complex tasks. However, training large action models remains challenging due to the diversity of agent environments and the complexity of agentic data. Despite growing interest, existing infrastructure provides limited support for scalable, agent-specific fine-tuning. We present ActionStudio, a lightweight and extensible data and training framework designed for large action models. ActionStudio unifies heterogeneous agent trajectories through a standardized format, supports diverse training paradigms including LoRA, full fine-tuning, and distributed setups, and integrates robust preprocessing and verification tools. We validate its effectiveness across both public and realistic industry benchmarks, demonstrating strong performance and practical scalability. We open-sourced code and data at https://github.com/SalesforceAIResearch/xLAM to facilitate research in the community.
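To make the standardized-format idea concrete, here is a minimal, hypothetical sketch of a unified multi-turn trajectory record and its conversion to a chat-style training sample. The dataclasses and field names below are illustrative assumptions, not ActionStudio's actual schema (see the linked repository for that).

```python
# Hypothetical unified trajectory format for heterogeneous agent data.
# All names are illustrative, not ActionStudio's real schema.
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Turn:
    role: str                              # "user", "assistant", or "tool"
    content: str                           # text or a serialized tool call
    metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Trajectory:
    env: str                               # source agent environment
    task_id: str
    turns: List[Turn]
    reward: float = 0.0                    # optional episode-level signal

def to_training_example(traj: Trajectory) -> Dict[str, Any]:
    """Flatten a trajectory into a chat-style sample for fine-tuning."""
    return {
        "messages": [{"role": t.role, "content": t.content} for t in traj.turns],
        "source": traj.env,
    }
```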
Unifying Specialized Visual Encoders for Video Language Models
Chung, Jihoon, Zhu, Tyler, Saez-Diez, Max Gonzalez, Niebles, Juan Carlos, Zhou, Honglu, Russakovsky, Olga
The recent advent of Large Language Models (LLMs) has ushered sophisticated reasoning capabilities into the realm of video through Video Large Language Models (VideoLLMs). However, VideoLLMs currently rely on a single vision encoder for all of their visual processing, which limits the amount and type of visual information that can be conveyed to the LLM. Our method, MERV, Multi-Encoder Representation of Videos, instead leverages multiple frozen visual encoders to create a unified representation of a video, providing the VideoLLM with a comprehensive set of specialized visual knowledge. Spatio-temporally aligning the features from each encoder allows us to tackle a wider range of open-ended and multiple-choice video understanding questions and outperform prior state-of-the-art works. MERV is up to 3.7% better in accuracy than Video-LLaVA across the standard suite of video understanding benchmarks, while also achieving a better Video-ChatGPT score. We also improve upon SeViLA, the previous best on zero-shot Perception Test accuracy, by 2.2%. MERV introduces minimal extra parameters and trains faster than equivalent single-encoder methods while parallelizing the visual processing. Finally, we provide qualitative evidence that MERV successfully captures domain knowledge from each of its encoders. Our results offer promising directions in utilizing multiple vision encoders for comprehensive video understanding.
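As a rough illustration of the multi-encoder idea, the PyTorch sketch below runs several frozen encoders, projects their features to a shared dimension, aligns their token counts, and averages them. The class, the encoder interface, and the simple projection-plus-averaging fusion are assumptions for illustration and may differ from MERV's actual alignment and fusion.

```python
# Illustrative multi-encoder fusion; not MERV's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiEncoderFusion(nn.Module):
    def __init__(self, encoders, feat_dims, d_model=1024, target_len=256):
        super().__init__()
        self.encoders = encoders           # list of frozen feature extractors
        self.projs = nn.ModuleList([nn.Linear(d, d_model) for d in feat_dims])
        self.target_len = target_len

    def forward(self, video):              # video: (B, T, C, H, W)
        aligned = []
        for enc, proj in zip(self.encoders, self.projs):
            with torch.no_grad():          # encoders stay frozen
                feats = enc(video)         # (B, L_i, D_i); L_i varies per encoder
            feats = proj(feats)            # (B, L_i, d_model)
            feats = F.interpolate(         # align token counts across encoders
                feats.transpose(1, 2), size=self.target_len, mode="linear"
            ).transpose(1, 2)
            aligned.append(feats)
        return torch.stack(aligned).mean(dim=0)    # (B, target_len, d_model)
```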
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models
Zhang, Jieyu, Xue, Le, Song, Linxin, Wang, Jun, Huang, Weikai, Shu, Manli, Yan, An, Ma, Zixian, Niebles, Juan Carlos, Savarese, Silvio, Xiong, Caiming, Chen, Zeyuan, Krishna, Ranjay, Xu, Ran
With the rise of multimodal applications, instruction data has become critical for training multimodal language models capable of understanding complex image-based queries. Existing practices rely on powerful but costly large language models (LLMs) or multimodal language models (MLMs) to produce instruction data. These models are often prone to hallucinations and licensing issues, and the generation process is hard to scale and interpret. In this work, we present a programmatic approach that employs scene graphs as symbolic representations of images and human-written programs to systematically synthesize vision-centric instruction data. Our approach ensures the interpretability and controllability of the data generation process and scales efficiently while maintaining factual accuracy. By implementing a suite of 24 single-image instruction generators, 14 multi-image instruction generators, and a scene graph generation pipeline, we build ProVision, a scalable, cost-effective system that produces diverse question-answer pairs concerning objects, attributes, relations, depth, etc., for any given image. Applied to the Visual Genome and DataComp datasets, we generate over 10 million instruction data points, ProVision-10M, and leverage them in both the pretraining and instruction tuning stages of MLMs. When adopted in the instruction tuning stage, our single-image instruction data yields up to a 7% improvement on the 2D split and 8% on the 3D split of CVBench, along with a 3% increase in performance on QBench2, RealWorldQA, and MMMU. Our multi-image instruction data leads to an 8% improvement on Mantis-Eval. Incorporating our data in both the pretraining and fine-tuning stages of xGen-MM-4B leads to an average improvement of 1.6% across 11 benchmarks.
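The following toy sketch illustrates the programmatic idea: human-written generators turn a symbolic scene graph into question-answer pairs. The scene-graph schema and generator functions are hypothetical, not the paper's implementation.

```python
# Toy programmatic QA generation from a scene graph; illustrative only.
from typing import Dict, List, Tuple

scene_graph = {
    "objects": {"o1": {"name": "dog", "attributes": ["brown"]},
                "o2": {"name": "frisbee", "attributes": ["red"]}},
    "relations": [("o1", "catching", "o2")],
}

def attribute_questions(graph: Dict) -> List[Tuple[str, str]]:
    qa = []
    for obj in graph["objects"].values():
        for attr in obj["attributes"]:
            qa.append((f"What color is the {obj['name']}?", attr))
    return qa

def relation_questions(graph: Dict) -> List[Tuple[str, str]]:
    names = {k: v["name"] for k, v in graph["objects"].items()}
    return [(f"What is the {names[s]} doing with the {names[o]}?", rel)
            for s, rel, o in graph["relations"]]

print(attribute_questions(scene_graph) + relation_questions(scene_graph))
```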
SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs
Kokane, Shirley, Zhu, Ming, Awalgaonkar, Tulika, Zhang, Jianguo, Hoang, Thai, Prabhakar, Akshara, Liu, Zuxin, Lan, Tian, Yang, Liangwei, Tan, Juntao, Murthy, Rithesh, Yao, Weiran, Liu, Zhiwei, Niebles, Juan Carlos, Wang, Huan, Heinecke, Shelby, Xiong, Caiming, Savarese, Silvio
Evaluating the output of Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since LLM outputs propagate to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically only report a success rate without any explanation of the failure cases. To solve this problem, we introduce SpecTool, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark dataset comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using SpecTool, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.
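As a hedged illustration of what checking for tool-use error patterns can look like, the snippet below applies simple rule-based detectors to a model's tool call. The pattern names and checks are invented examples and do not correspond to the benchmark's actual seven categories.

```python
# Toy rule-based detectors for tool-use errors; illustrative, not SpecTool's checks.
import json

def detect_errors(llm_output: str, tool_specs: dict) -> list:
    errors = []
    try:
        call = json.loads(llm_output)       # expect {"tool": ..., "args": {...}}
    except json.JSONDecodeError:
        return ["malformed_output"]
    if not isinstance(call, dict):
        return ["malformed_output"]
    tool = call.get("tool")
    if tool not in tool_specs:
        return ["nonexistent_tool"]
    required = set(tool_specs[tool]["required"])
    provided = set(call.get("args", {}))
    if not required <= provided:
        errors.append("missing_required_argument")
    if provided - set(tool_specs[tool]["parameters"]):
        errors.append("hallucinated_argument")
    return errors

specs = {"get_weather": {"parameters": ["city", "unit"], "required": ["city"]}}
print(detect_errors('{"tool": "get_weather", "args": {"units": "C"}}', specs))
```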
IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos
Liu, Yunong, Eyzaguirre, Cristobal, Li, Manling, Khanna, Shubh, Niebles, Juan Carlos, Ravi, Vineeth, Mishra, Saumitra, Liu, Weiyu, Wu, Jiajun
Shape assembly is a ubiquitous task in daily life, integral for constructing complex 3D structures like IKEA furniture. While significant progress has been made in developing autonomous agents for shape assembly, existing datasets have not yet tackled the 4D grounding of assembly instructions in videos, essential for a holistic understanding of assembly in 3D space over time. We introduce IKEA Video Manuals, a dataset that features 3D models of furniture parts, instructional manuals, assembly videos from the Internet, and most importantly, annotations of dense spatio-temporal alignments between these data modalities. To demonstrate the utility of IKEA Video Manuals, we present five applications essential for shape assembly: assembly plan generation, part-conditioned segmentation, part-conditioned pose estimation, video object segmentation, and furniture assembly based on instructional video manuals. For each application, we provide evaluation metrics and baseline methods. Through experiments on our annotated data, we highlight many challenges in grounding assembly instructions in videos to improve shape assembly, including handling occlusions, varying viewpoints, and extended assembly sequences.
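To give a sense of what dense spatio-temporal alignment annotations might contain, here is a hypothetical record linking a manual step, a video frame, and the pose of a 3D part. Every field name is illustrative, not the dataset's actual schema.

```python
# Hypothetical 4D grounding record for one assembly step; illustrative only.
annotation = {
    "furniture_id": "ikea_table_001",
    "video_url": "https://example.com/assembly_video",    # placeholder
    "manual_step": 3,
    "frames": [
        {
            "timestamp_s": 42.5,
            "parts": [
                {"part_id": "leg_2",
                 "mask_rle": "RLE-encoded 2D segmentation (placeholder)",
                 "rotation_quat": [0.0, 0.0, 0.0, 1.0],    # pose of the 3D model
                 "translation_m": [0.12, 0.03, 0.40]},
            ],
        },
    ],
}
```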
PRACT: Optimizing Principled Reasoning and Acting of LLM Agent
Liu, Zhiwei, Yao, Weiran, Zhang, Jianguo, Murthy, Rithesh, Yang, Liangwei, Liu, Zuxin, Lan, Tian, Zhu, Ming, Tan, Juntao, Kokane, Shirley, Hoang, Thai, Niebles, Juan Carlos, Heinecke, Shelby, Wang, Huan, Savarese, Silvio, Xiong, Caiming
We introduce the Principled Reasoning and Acting (PRAct) framework, a novel method for learning and enforcing action principles from trajectory data. Central to our approach is the use of text gradients from a reflection and optimization engine to derive these action principles. To adapt action principles to specific task requirements, we propose a new optimization framework, Reflective Principle Optimization (RPO). After execution, RPO employs a reflector to critique current action principles and an optimizer to update them accordingly. We develop the RPO framework under two scenarios: Reward-RPO, which uses environmental rewards for reflection, and Self-RPO, which conducts self-reflection without external rewards. Additionally, two RPO methods, RPO-Traj and RPO-Batch, are introduced to adapt to different settings. Experimental results across four environments demonstrate that the PRAct agent, leveraging the RPO framework, effectively learns and applies action principles to enhance performance.
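A minimal sketch of a reflect-then-optimize step, assuming a generic text-in/text-out `llm` callable, is shown below; the prompts and function are illustrative stand-ins for the reflector and optimizer described in the abstract, not the paper's code.

```python
# Illustrative reflect-then-optimize step; `llm` is an assumed text callable.
from typing import Optional

def rpo_step(llm, principles: str, trajectory: str, reward: Optional[float]) -> str:
    feedback_source = (
        f"environment reward: {reward}" if reward is not None
        else "self-reflection only (no external reward)"
    )
    # Reflector: critique how the current principles guided the agent.
    critique = llm(
        f"Action principles:\n{principles}\n\nTrajectory:\n{trajectory}\n\n"
        f"Feedback signal: {feedback_source}\n"
        "Critique how well the principles guided the agent's actions."
    )
    # Optimizer: rewrite the principles to address the critique.
    return llm(
        f"Current principles:\n{principles}\n\nCritique:\n{critique}\n\n"
        "Rewrite the principles to address the critique. Return only the new principles."
    )
```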
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Ryoo, Michael S., Zhou, Honglu, Kendre, Shrikant, Qin, Can, Xue, Le, Shu, Manli, Savarese, Silvio, Xu, Ran, Xiong, Caiming, Niebles, Juan Carlos
We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video incorporates a 'temporal encoder' in addition to the conventional visual tokenizer; the temporal encoder maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use far fewer visual tokens than competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html
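One way to picture a temporal encoder that compresses many per-frame tokens into a small fixed budget is learnable pooling via cross-attention, sketched below in PyTorch. This is an illustrative stand-in under assumed dimensions, not BLIP-3-Video's actual module.

```python
# Illustrative learnable pooling that compresses frame tokens to 32 video tokens.
import torch
import torch.nn as nn

class LearnablePooling(nn.Module):
    def __init__(self, d_model=1152, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, frame_tokens):        # (B, T * N, d_model) flattened frames
        b = frame_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, frame_tokens, frame_tokens)
        return pooled                        # (B, num_queries, d_model)

tokens = torch.randn(2, 8 * 576, 1152)       # 8 frames x 576 tokens per frame
print(LearnablePooling()(tokens).shape)      # torch.Size([2, 32, 1152])
```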
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
Liu, Zuxin, Hoang, Thai, Zhang, Jianguo, Zhu, Ming, Lan, Tian, Kokane, Shirley, Tan, Juntao, Yao, Weiran, Liu, Zhiwei, Feng, Yihao, Murthy, Rithesh, Yang, Liangwei, Savarese, Silvio, Niebles, Juan Carlos, Wang, Huan, Heinecke, Shelby, Xiong, Caiming
The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable, high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each entry in our dataset is verified through three hierarchical stages: format checking, actual function execution, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agents.
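The three-stage verification can be pictured as in the sketch below, with format checking, execution, and a semantic check applied in sequence. The helper names, registry, and LLM-judge interface are assumptions for illustration rather than APIGen's exact implementation.

```python
# Illustrative three-stage verification of a generated function call.
import json

def verify_sample(sample: dict, api_registry: dict, llm_judge) -> bool:
    # Stage 1: format checking -- the generated call must be valid JSON with
    # a known function name.
    try:
        call = json.loads(sample["generated_call"])
        fn = api_registry[call["name"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False

    # Stage 2: actual function execution -- the call must run without errors.
    try:
        result = fn(**call.get("arguments", {}))
    except Exception:
        return False

    # Stage 3: semantic verification -- the result must actually answer the
    # original query (here delegated to an assumed LLM-based judge).
    return llm_judge(query=sample["query"], call=call, result=result)
```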
Artificial Intelligence Index Report 2024
Maslej, Nestor, Fattorini, Loredana, Perrault, Raymond, Parli, Vanessa, Reuel, Anka, Brynjolfsson, Erik, Etchemendy, John, Ligett, Katrina, Lyons, Terah, Manyika, James, Niebles, Juan Carlos, Shoham, Yoav, Wald, Russell, Clark, Jack
The 2024 Index is our most comprehensive to date and arrives at an important moment when AI's influence on society has never been more pronounced. This year, we have broadened our scope to more extensively cover essential trends such as technical advancements in AI, public perceptions of the technology, and the geopolitical dynamics surrounding its development. Featuring more original data than ever before, this edition introduces new estimates on AI training costs, detailed analyses of the responsible AI landscape, and an entirely new chapter dedicated to AI's impact on science and medicine. The AI Index report tracks, collates, distills, and visualizes data related to artificial intelligence (AI). Our mission is to provide unbiased, rigorously vetted, broadly sourced data so that policymakers, researchers, executives, journalists, and the general public can develop a more thorough and nuanced understanding of the complex field of AI. The AI Index is recognized globally as one of the most credible and authoritative sources for data and insights on artificial intelligence. Previous editions have been cited in major newspapers, including The New York Times, Bloomberg, and The Guardian, have amassed hundreds of academic citations, and have been referenced by high-level policymakers in the United States, the United Kingdom, and the European Union, among other places. This year's edition surpasses all previous ones in size, scale, and scope, reflecting the growing significance that AI is coming to hold in all of our lives.
AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning
Zhang, Jianguo, Lan, Tian, Murthy, Rithesh, Liu, Zhiwei, Yao, Weiran, Tan, Juntao, Hoang, Thai, Yang, Liangwei, Feng, Yihao, Liu, Zuxin, Awalgaonkar, Tulika, Niebles, Juan Carlos, Savarese, Silvio, Heinecke, Shelby, Wang, Huan, Xiong, Caiming
Autonomous agents powered by large language models (LLMs) have garnered significant research attention. However, fully harnessing the potential of LLMs for agent-based tasks presents inherent challenges due to the heterogeneous nature of diverse data sources featuring multi-turn trajectories. In this paper, we introduce AgentOhana as a comprehensive solution to address these challenges. Leveraging data unification, our training pipeline maintains equilibrium across different data sources and preserves independent randomness across devices during dataset partitioning and model training. Additionally, we present xLAM-v0.1, a large action model tailored for AI agents, which demonstrates exceptional performance across various benchmarks.
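One simple way to picture "equilibrium across data sources" with "independent randomness across devices" is per-rank seeded sampling over a dictionary of sources, as in the assumed sketch below; this is an illustration of the stated idea, not AgentOhana's code.

```python
# Illustrative balanced sampling across heterogeneous agent data sources.
import random
from typing import Dict, List

def balanced_batch(sources: Dict[str, List[dict]], batch_size: int,
                   rank: int, step: int) -> List[dict]:
    rng = random.Random(rank * 1_000_003 + step)   # independent, reproducible per device
    names = list(sources)
    # Pick the source uniformly for each element so no single dataset dominates.
    return [rng.choice(sources[rng.choice(names)]) for _ in range(batch_size)]
```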