Xiong, Caiming
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Xie, Tianbao, Zhang, Danyang, Chen, Jixuan, Li, Xiaochuan, Zhao, Siheng, Cao, Ruisheng, Hua, Toh Jing, Cheng, Zhoujun, Shin, Dongchan, Lei, Fangyu, Liu, Yitao, Xu, Yiheng, Zhou, Shuyan, Savarese, Silvio, Xiong, Caiming, Zhong, Victor, Yu, Tao
Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.
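To make the notion of execution-based evaluation concrete, here is a minimal Python sketch of how a task configuration and its post-execution check might look; the config schema, file path, and field names are illustrative assumptions, not OSWorld's actual task format.

# Hypothetical sketch of an execution-based check for a desktop task:
# the task asks the agent to export a spreadsheet as CSV, and the
# evaluator inspects the resulting file state rather than the agent's
# text output. Paths and the config schema are illustrative only.
import csv
import os

task_config = {
    "instruction": "Export the budget sheet to /home/user/budget.csv",
    "expected_path": "/home/user/budget.csv",
    "expected_header": ["item", "cost"],
}

def evaluate(config: dict) -> bool:
    """Return True if the post-execution file state satisfies the task."""
    path = config["expected_path"]
    if not os.path.isfile(path):
        return False
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return bool(rows) and rows[0] == config["expected_header"]

if __name__ == "__main__":
    print("task success:", evaluate(task_config))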
Investigating the prompt leakage effect and black-box defenses for multi-turn LLM interactions
Agarwal, Divyansh, Fabbri, Alexander R., Laban, Philippe, Risher, Ben, Joty, Shafiq, Xiong, Caiming, Wu, Chien-Sheng
Prompt leakage in large language models (LLMs) poses a significant security and privacy threat, particularly in retrieval-augmented generation (RAG) systems. However, leakage in multi-turn LLM interactions along with mitigation strategies has not been studied in a standardized manner. This paper investigates LLM vulnerabilities against prompt leakage across 4 diverse domains and 10 closed- and open-source LLMs. Our unique multi-turn threat model leverages the LLM's sycophancy effect and our analysis dissects task instruction and knowledge leakage in the LLM response. In a multi-turn setting, our threat model elevates the average attack success rate (ASR) to 86.2%, including a 99% leakage with GPT-4 and claude-1.3. We find that some black-box LLMs like Gemini show variable susceptibility to leakage across domains - they are more likely to leak contextual knowledge in the news domain compared to the medical domain. Our experiments measure specific effects of 6 black-box defense strategies, including a query-rewriter in the RAG scenario. Our proposed multi-tier combination of defenses still has an ASR of 5.3% for black-box LLMs, indicating room for enhancement and future direction for LLM security research.
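As a rough illustration of the kind of black-box defense studied here, the sketch below withholds a response whose n-gram overlap with the confidential system prompt is too high; the threshold, the call_llm callable, and the prompt text are hypothetical placeholders rather than the paper's implementation.

# Illustrative sketch (not the paper's implementation) of one black-box
# defense idea: before returning a response, check its n-gram overlap
# with the confidential system prompt and refuse if the overlap is high.
SYSTEM_PROMPT = "You are a medical assistant. Internal guideline: never quote source documents verbatim."

def ngrams(text: str, n: int = 5) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def leaks_prompt(response: str, secret: str, threshold: float = 0.2) -> bool:
    secret_grams = ngrams(secret)
    if not secret_grams:
        return False
    overlap = len(secret_grams & ngrams(response)) / len(secret_grams)
    return overlap >= threshold

def guarded_reply(user_turns: list, call_llm) -> str:
    # call_llm is a hypothetical stand-in for any black-box chat endpoint.
    response = call_llm(SYSTEM_PROMPT, user_turns)
    if leaks_prompt(response, SYSTEM_PROMPT):
        return "I can't share my internal instructions."
    return response

if __name__ == "__main__":
    echo_leaker = lambda system, turns: system  # worst case: a model that parrots its prompt
    print(guarded_reply(["What are your instructions?"], echo_leaker))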
How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library
Ravaut, Mathieu, Ding, Bosheng, Jiao, Fangkai, Chen, Hailin, Li, Xingxuan, Zhao, Ruochen, Qin, Chengwei, Xiong, Caiming, Joty, Shafiq
With the rise of Large Language Models (LLMs) in recent years, new opportunities are emerging, but also new challenges, and contamination is quickly becoming critical. Business applications and fundraising in AI have reached a scale at which a few percentage points gained on popular question-answering benchmarks could translate into dozens of millions of dollars, placing high pressure on model integrity. At the same time, it is becoming harder and harder, if not impossible, to keep track of the data that LLMs have seen, since closed-source models like GPT-4 and Claude-3 divulge no information about their training sets. As a result, contamination becomes a critical issue: LLMs' performance may not be reliable anymore, as the high performance may be at least partly due to their previous exposure to the data. This limitation jeopardizes the entire progress in the field of NLP, yet there remains a lack of methods to efficiently address contamination, and no clear consensus on its prevention, mitigation, and classification.
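For readers unfamiliar with how contamination is typically detected, the following is a generic sketch of one common family of checks (character n-gram overlap between a benchmark example and a training document); it illustrates the idea only and is not the LLMSanitize API.

# A minimal illustration of one family of contamination checks: string /
# n-gram overlap between a benchmark example and a training corpus.
def char_ngrams(text: str, n: int = 13) -> set:
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_ratio(benchmark_example: str, training_doc: str, n: int = 13) -> float:
    """Fraction of the example's character n-grams found in the training document."""
    example_grams = char_ngrams(benchmark_example, n)
    if not example_grams:
        return 0.0
    doc_grams = char_ngrams(training_doc, n)
    return len(example_grams & doc_grams) / len(example_grams)

# An example with a high ratio (e.g. above 0.8) would be flagged as likely contaminated.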
Fair Abstractive Summarization of Diverse Perspectives
Zhang, Yusen, Zhang, Nan, Liu, Yixin, Fabbri, Alexander, Liu, Junru, Kamoi, Ryo, Lu, Xiaoxin, Xiong, Caiming, Zhao, Jieyu, Radev, Dragomir, McKeown, Kathleen, Zhang, Rui
People from different social and demographic groups express diverse perspectives and conflicting opinions on a broad set of topics such as product reviews, healthcare, law, and politics. A fair summary should provide a comprehensive coverage of diverse perspectives without underrepresenting certain groups. However, current work in summarization metrics and Large Language Models (LLMs) evaluation has not explored fair abstractive summarization. In this paper, we systematically investigate fair abstractive summarization for user-generated data. We first formally define fairness in abstractive summarization as not underrepresenting perspectives of any groups of people, and we propose four reference-free automatic metrics by measuring the differences between target and source perspectives. We evaluate nine LLMs, including three GPT models, four LLaMA models, PaLM 2, and Claude, on six datasets collected from social media, online reviews, and recorded transcripts. Experiments show that both the model-generated and the human-written reference summaries suffer from low fairness. We conduct a comprehensive analysis of the common factors influencing fairness and propose three simple but effective methods to alleviate unfair summarization. Our dataset and code are available at https://github.com/psunlpgroup/FairSumm.
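The sketch below illustrates the general shape of a reference-free fairness measure: compare the distribution of group perspectives in the source documents against the generated summary. The label_perspective classifier, the toy labeler, and the total-variation formulation are assumptions for illustration; the paper's four metrics differ in detail.

# Hedged sketch: fairness as the gap between source and summary
# perspective distributions (0 = perfectly proportional coverage).
from collections import Counter

def perspective_distribution(sentences: list, label_perspective) -> dict:
    counts = Counter(label_perspective(s) for s in sentences)
    total = sum(counts.values())
    return {group: c / total for group, c in counts.items()}

def fairness_gap(source_sents: list, summary_sents: list, label_perspective) -> float:
    """Total variation distance between source and summary perspective distributions."""
    p = perspective_distribution(source_sents, label_perspective)
    q = perspective_distribution(summary_sents, label_perspective)
    groups = set(p) | set(q)
    return 0.5 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in groups)

if __name__ == "__main__":
    toy_labeler = lambda s: "positive" if "good" in s else "negative"
    src = ["good phone", "good battery", "bad screen", "bad price"]
    summ = ["good phone overall"]
    print(fairness_gap(src, summ, toy_labeler))  # 0.5: the summary over-represents the positive group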
AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning
Zhang, Jianguo, Lan, Tian, Murthy, Rithesh, Liu, Zhiwei, Yao, Weiran, Tan, Juntao, Hoang, Thai, Yang, Liangwei, Feng, Yihao, Liu, Zuxin, Awalgaonkar, Tulika, Niebles, Juan Carlos, Savarese, Silvio, Heinecke, Shelby, Wang, Huan, Xiong, Caiming
Autonomous agents powered by large language models (LLMs) have garnered significant research attention. However, fully harnessing the potential of LLMs for agent-based tasks presents inherent challenges due to the heterogeneous nature of diverse data sources featuring multi-turn trajectories. In this paper, we introduce AgentOhana as a comprehensive solution to address these challenges. AgentOhana standardizes heterogeneous agent trajectories into a unified format; leveraging this data unification, our training pipeline maintains equilibrium across different data sources and preserves independent randomness across devices during dataset partitioning and model training. Additionally, we present xLAM-v0.1, a large action model tailored for AI agents, which demonstrates exceptional performance across various benchmarks.
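To illustrate what unifying heterogeneous trajectory sources can involve, here is a hypothetical sketch that maps two made-up environment log formats into one shared multi-turn schema; the field names and converters are assumptions, not AgentOhana's actual format.

# Hypothetical data-unification step: each source's raw trajectory format
# is mapped into a shared {task, turns, reward} schema before training.
def from_env_a(raw: dict) -> dict:
    return {
        "task": raw["goal"],
        "turns": [{"role": "assistant" if s["actor"] == "agent" else "user",
                   "content": s["text"]} for s in raw["steps"]],
        "reward": float(raw["success"]),
    }

def from_env_b(raw: dict) -> dict:
    return {
        "task": raw["instruction"],
        "turns": [{"role": r, "content": c} for r, c in raw["dialogue"]],
        "reward": raw.get("score", 0.0),
    }

CONVERTERS = {"env_a": from_env_a, "env_b": from_env_b}

def unify(source: str, raw: dict) -> dict:
    """Map a raw trajectory from any registered source into the shared schema."""
    return CONVERTERS[source](raw)

if __name__ == "__main__":
    raw = {"goal": "book a flight",
           "steps": [{"actor": "user", "text": "NYC to SF"},
                     {"actor": "agent", "text": "Booked."}],
           "success": True}
    print(unify("env_a", raw)["turns"][1])  # {'role': 'assistant', 'content': 'Booked.'}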
FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability
Xia, Congying, Xing, Chen, Du, Jiangshu, Yang, Xinyi, Feng, Yihao, Xu, Ran, Yin, Wenpeng, Xiong, Caiming
This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats, a crucial yet underexamined capability for their application as AI agents. Despite LLMs' advancements, existing benchmarks fail to assess their format-following proficiency adequately. FoFo fills this gap with a diverse range of real-world formats and instructions, developed through an AI-Human collaborative method. Our evaluation across both open-source (e.g., Llama 2, WizardLM) and closed-source (e.g., GPT-4, PALM2, Gemini) LLMs highlights three key findings: open-source models significantly lag behind closed-source ones in format adherence; LLMs' format-following performance is independent of their content generation quality; and LLMs' format proficiency varies across different domains. These insights suggest the need for specialized tuning for format-following skills and highlight FoFo's role in guiding the selection of domain-specific AI agents. FoFo is released here at https://github.com/SalesforceAIResearch/FoFo.
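The distinction between format adherence and content quality can be made concrete with a small sketch: the check below asks only whether an output parses as JSON with the required fields, regardless of whether its content is correct. This is a generic illustration, not FoFo's evaluation protocol.

# Illustrative format check, separate from content quality: does the
# model output parse as JSON and contain the required fields?
import json

def follows_format(output: str, required_keys: set) -> bool:
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())

if __name__ == "__main__":
    good = '{"diagnosis": "flu", "confidence": 0.9}'
    bad = "Diagnosis: flu (confidence 0.9)"
    print(follows_format(good, {"diagnosis", "confidence"}))  # True
    print(follows_format(bad, {"diagnosis", "confidence"}))   # False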
AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System
Liu, Zhiwei, Yao, Weiran, Zhang, Jianguo, Yang, Liangwei, Liu, Zuxin, Tan, Juntao, Choubey, Prafulla K., Lan, Tian, Wu, Jason, Wang, Huan, Heinecke, Shelby, Xiong, Caiming, Savarese, Silvio
The booming success of LLMs has initiated rapid development of LLM agents. Though the foundation of an LLM agent is the generative model, it is critical to devise the optimal reasoning strategies and agent architectures. Accordingly, LLM agent research has advanced from simple chain-of-thought prompting to more complex ReAct and Reflection reasoning strategies; agent architectures have also evolved from single-agent generation to multi-agent conversation, as well as multi-LLM multi-agent group chat. However, with the existing intricate frameworks and libraries, creating and evaluating new reasoning strategies and agent architectures has become a complex challenge, which hinders research investigation into LLM agents. Thus, we open-source a new AI agent library, AgentLite, which simplifies this process by offering a lightweight, user-friendly platform for innovating LLM agent reasoning, architectures, and applications with ease. AgentLite is a task-oriented framework designed to enhance the ability of agents to break down tasks and facilitate the development of multi-agent systems. Furthermore, we introduce multiple practical applications developed with AgentLite to demonstrate its convenience and flexibility. Get started now at: https://github.com/SalesforceAIResearch/AgentLite.
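As a rough, framework-agnostic sketch of the task-decomposition pattern such a library targets, the code below has a manager split a task into subtasks and dispatch each to a worker; plan and execute are stand-ins for LLM-backed calls and do not use the AgentLite API.

# Generic manager/worker decomposition loop; plan() and execute() are
# placeholders for LLM or tool calls, not library functions.
def plan(task: str) -> list:
    # In practice an LLM would propose subtasks; here we return a fixed plan.
    return [f"research: {task}", f"draft answer for: {task}"]

def execute(subtask: str) -> str:
    # In practice an LLM or tool call would carry out the subtask.
    return f"[result of '{subtask}']"

def manager(task: str) -> str:
    results = [execute(sub) for sub in plan(task)]
    return "\n".join(results)

if __name__ == "__main__":
    print(manager("summarize last quarter's sales"))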
Unified Training of Universal Time Series Forecasting Transformers
Woo, Gerald, Liu, Chenghao, Kumar, Akshat, Xiong, Caiming, Savarese, Silvio, Sahoo, Doyen
Deep learning for time series forecasting has traditionally operated within a one-model-per-dataset framework, limiting its potential to leverage the game-changing impact of large pre-trained models. The concept of universal forecasting, emerging from pre-training on a vast collection of time series datasets, envisions a single Large Time Series Model capable of addressing diverse downstream forecasting tasks. However, constructing such a model poses unique challenges specific to time series data: i) cross-frequency learning, ii) accommodating an arbitrary number of variates for multivariate time series, and iii) addressing the varying distributional properties inherent in large-scale data. To address these challenges, we present novel enhancements to the conventional time series Transformer architecture, resulting in our proposed Masked Encoder-based Universal Time Series Forecasting Transformer (Moirai). Trained on our newly introduced Large-scale Open Time Series Archive (LOTSA) featuring over 27B observations across nine domains, Moirai achieves competitive or superior performance as a zero-shot forecaster when compared to full-shot models. Code, model weights, and data will be released.
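A small numerical sketch of the masked-encoder setup may help: the series is split into patches, the future patches are replaced with a mask token, and the model would be trained to reconstruct them. The patch length and toy tensors are assumptions; Moirai's architecture and probabilistic output head are considerably more involved.

# Toy masked-encoder forecasting input: patchify a series, replace the
# horizon patches with a mask token, keep the originals as the target.
import numpy as np

def make_masked_input(series: np.ndarray, patch_len: int, horizon_patches: int):
    patches = series[: len(series) // patch_len * patch_len].reshape(-1, patch_len)
    context = patches[:-horizon_patches]
    target = patches[-horizon_patches:]
    mask_token = np.zeros((horizon_patches, patch_len))
    encoder_input = np.concatenate([context, mask_token], axis=0)
    return encoder_input, target

if __name__ == "__main__":
    series = np.sin(np.linspace(0, 12 * np.pi, 192))
    enc_in, target = make_masked_input(series, patch_len=16, horizon_patches=2)
    print(enc_in.shape, target.shape)  # (12, 16) (2, 16)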
Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning
Tu, Lifu, Qu, Jin, Yavuz, Semih, Joty, Shafiq, Liu, Wenhao, Xiong, Caiming, Zhou, Yingbo
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks, but focus on conversational tasks has been rather limited. This is partly due to the high cost of obtaining non-English conversational data, which results in limited coverage. In this work, we introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset that we created by translating the English-only Schema-Guided Dialogue (SGD) dataset (Rastogi et al., 2020) into 105 other languages. XSGD contains approximately 330k utterances per language. To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts. We also investigate two different classifiers: NLI-based and vanilla classifiers, and test cross-lingual capability enabled by the aligned prompts. We evaluate our model's cross-lingual generalization capabilities on two conversation tasks: slot-filling and intent classification. Our results demonstrate the strong and efficient modeling ability of NLI-based classifiers and the large cross-lingual transfer improvements achieved by our aligned prompts, particularly in few-shot settings. In addition, we highlight the strong performance of our approach compared to LLMs such as text-davinci-003 and ChatGPT in both zero-shot and few-shot settings. While LLMs exhibit impressive performance in English, their cross-lingual capabilities in other languages, particularly low-resource languages, are limited.
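The sketch below shows the basic prompt-tuning mechanism that such an alignment method builds on: the backbone is frozen and only a small matrix of soft-prompt embeddings prepended to the input is trained. The tiny backbone and dimensions are placeholders, not the paper's model or training objective.

# Minimal prompt-tuning skeleton: freeze the backbone, learn only the
# soft prompt that is prepended to the input embeddings.
import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, prompt_len: int = 20):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # freeze the language model
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompt, input_embeds], dim=1))

if __name__ == "__main__":
    # Placeholder backbone; only the prompt_len x embed_dim prompt receives gradients.
    backbone = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))
    model = SoftPromptModel(backbone, embed_dim=32)
    out = model(torch.randn(4, 10, 32))
    print(out.shape)  # torch.Size([4, 30, 32])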
TrustLLM: Trustworthiness in Large Language Models
Sun, Lichao, Huang, Yue, Wang, Haoran, Wu, Siyuan, Zhang, Qihui, Gao, Chujie, Huang, Yixin, Lyu, Wenhan, Zhang, Yixuan, Li, Xiner, Liu, Zhengliang, Liu, Yixin, Wang, Yijue, Zhang, Zhikun, Kailkhura, Bhavya, Xiong, Caiming, Xiao, Chaowei, Li, Chunyuan, Xing, Eric, Huang, Furong, Liu, Hao, Ji, Heng, Wang, Hongyi, Zhang, Huan, Yao, Huaxiu, Kellis, Manolis, Zitnik, Marinka, Jiang, Meng, Bansal, Mohit, Zou, James, Pei, Jian, Liu, Jian, Gao, Jianfeng, Han, Jiawei, Zhao, Jieyu, Tang, Jiliang, Wang, Jindong, Mitchell, John, Shu, Kai, Xu, Kaidi, Chang, Kai-Wei, He, Lifang, Huang, Lifu, Backes, Michael, Gong, Neil Zhenqiang, Yu, Philip S., Chen, Pin-Yu, Gu, Quanquan, Xu, Ran, Ying, Rex, Ji, Shuiwang, Jana, Suman, Chen, Tianlong, Liu, Tianming, Zhou, Tianyi, Wang, William, Li, Xiang, Zhang, Xiangliang, Wang, Xiao, Xie, Xing, Chen, Xun, Wang, Xuyu, Liu, Yan, Ye, Yanfang, Cao, Yinzhi, Chen, Yong, Zhao, Yue
Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness.