A Datasheet for Datasets

A.1 Motivation

For what purpose was the dataset created?
We create GTA (a benchmark for General Tool Agents) to evaluate the general tool-use ability of LLMs in real-world scenarios. The benchmark has human-written queries with simple real-world objectives but implicit tool-use, an evaluation platform equipped with executable tools across diverse categories, and authentic image files as context input.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

Who funded the creation of the dataset?
This work is supported by the National Key R&D Program of China (No. 2022ZD0161600), and the National Natural Science Foundation of China under Grants 62422311 and 62176152.

A.2 Composition

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
Each instance in GTA is in the JSON format. It contains natural language queries, image file inputs, tool descriptions, a reference tool chain, and a final answer.

How many instances are there in total (of each type, if appropriate)?
There are 229 instances in GTA, with 252 image files.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
We will provide all instances in our GitHub repository for GTA.

What data does each instance consist of?
Each instance contains a natural language query, image file inputs, tool descriptions, a reference tool chain, and a final answer.

Is there a label or target associated with each instance?
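The instance fields described above can be illustrated with a minimal sketch. Note that the field names, the example query, the tool schema, and the answer below are all assumptions for illustration; the benchmark's actual JSON keys may differ:

```python
# Hypothetical sketch of one GTA instance; every field name and value here
# is an illustrative assumption, not the benchmark's actual JSON schema.
instance = {
    "query": "How much would it cost to buy one of each fruit on the table?",
    "image_inputs": ["image/0001.jpg"],  # authentic image files as context
    "tool_descriptions": [
        {"name": "ObjectDetection", "takes": "image", "returns": "object list"},
        {"name": "Calculator", "takes": "expression", "returns": "number"},
    ],
    "reference_tool_chain": ["ObjectDetection", "Calculator"],  # gold solution steps
    "final_answer": "7.5",
}

# Pairing each query with a reference tool chain and a final answer lets an
# evaluator score both the agent's intermediate tool calls and its answer.
required = {"query", "image_inputs", "tool_descriptions",
            "reference_tool_chain", "final_answer"}
assert required <= instance.keys()
```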
GTA: A Benchmark for General Tool Agents
Jize Wang 1,2, Zerun Ma 2, Yining Li 2
Significant focus has been placed on integrating large language models (LLMs) with various tools to develop general-purpose agents, which poses a challenge to LLMs' tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios: current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only interactions, failing to effectively reveal the agents' real-world problem-solving abilities. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool-use, requiring the LLM to reason about the suitable tools and plan the solution steps.
We thank the reviewers for their careful reading, feedback, and helpful comments, and address specific concerns below.
Tasks are best retained using the double-sided approach.
In the revised paper, we will make it more explicit that the analyses of Sections 5.3-5.5 were
We agree that saying "were allowed" in line 195 is
R1 and R3 noted that our set of tasks was limited. We view this as an important first step, but agree that real-world applications (e.g.
Fixed-point structures were highly overlapping upon visual inspection in TDR subspaces.
Pricing and Competition for Generative AI
Compared to classical machine learning (ML) models, generative models offer a new usage paradigm where (i) a single model can be used for many different tasks out-of-the-box; (ii) users interact with this model over a series of natural language prompts; and (iii) the model is ideally evaluated on binary user satisfaction with respect to model outputs. Given these characteristics, we explore the problem of how developers of new generative AI software can release and price their technology. We first develop a comparison of two different models for a specific task with respect to user cost-effectiveness. We then model the pricing problem of generative AI software as a game between two different companies that sequentially release their models before users choose their preferred model for each task. Here, the price optimization problem becomes piecewise continuous, where the companies must choose a subset of the tasks on which to be cost-effective and forgo revenue for the remaining tasks. In particular, we reveal the value of market information by showing that a company that deploys later, after knowing its competitor's price, can always secure cost-effectiveness on at least one task, whereas the company that is first-to-market must price its model in a way that incentivizes higher prices from the latecomer in order to gain revenue. Most importantly, we find that if the different tasks are sufficiently similar, the first-to-market model may become cost-ineffective on all tasks regardless of how this technology is priced.
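The cost-effectiveness comparison and the second mover's advantage can be sketched numerically. The model below is an illustrative assumption, not the paper's formal setup: each model charges a per-prompt price, satisfies a user on task t with probability q[t], and (with independent retries under binary satisfaction) the expected spend per satisfactory output is price / q[t], a geometric-trials argument.

```python
# Illustrative sketch (assumed setup, not the paper's exact model):
# expected cost to obtain one satisfactory output on task t is
# price / q[t], since the number of prompts needed is geometric.

def expected_cost(price, q):
    return {t: price / qt for t, qt in q.items()}

def cost_effective_tasks(price_a, q_a, price_b, q_b):
    """Tasks on which model A is strictly cheaper per satisfactory output."""
    ca, cb = expected_cost(price_a, q_a), expected_cost(price_b, q_b)
    return {t for t in q_a if ca[t] < cb[t]}

# Two tasks; the incumbent prices first.
q_first = {"summarize": 0.9, "translate": 0.5}
q_second = {"summarize": 0.6, "translate": 0.8}
p_first = 1.0

# Knowing p_first, the latecomer can always win at least one task by
# charging just under p_first * q_second[t] / q_first[t] for the task t
# where that quality ratio is largest.
best_ratio = max(q_second[t] / q_first[t] for t in q_first)
p_second = 0.99 * p_first * best_ratio

wins = cost_effective_tasks(p_second, q_second, p_first, q_first)
assert wins  # the second mover secures cost-effectiveness somewhere
```

Here the latecomer wins the "translate" task (it is cheaper per satisfactory output there) while conceding "summarize", mirroring the paper's observation that market information lets the later entrant always secure at least one task.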
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback
Diffusion-based text-to-video (T2V) models have achieved significant success but continue to be hampered by the slow sampling speed of their iterative sampling processes. To address the challenge, consistency models have been proposed to facilitate fast inference, albeit at the cost of sample quality. In this work, we aim to break the quality bottleneck of a video consistency model (VCM) to achieve both fast and high-quality video generation. We introduce T2V-Turbo, which integrates feedback from a mixture of differentiable reward models into the consistency distillation (CD) process of a pre-trained T2V model. Notably, we directly optimize rewards associated with single-step generations that arise naturally from computing the CD loss, effectively bypassing the memory constraints imposed by backpropagating gradients through an iterative sampling process. Remarkably, the 4-step generations from our T2V-Turbo achieve the highest total score on VBench [Huang et al., 2024], even surpassing Gen-2 [Esser et al., 2023] and
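The key trick described above, reusing the single-step generation that the CD loss already produces as the input to the reward model, can be sketched in a heavily simplified toy form. Everything below (the linear one-step "student", the quadratic "reward model", the finite-difference optimizer) is an assumed stand-in for illustration, not the T2V-Turbo implementation:

```python
# Toy sketch (assumed, simplified) of mixing reward feedback into
# consistency distillation: the one-step sample already computed for the
# CD loss is reused for the reward term, so no gradient ever flows
# through an iterative sampling loop.
import numpy as np

rng = np.random.default_rng(0)

def student_one_step(theta, noise):
    # Stand-in "student": a linear one-step generator.
    return theta * noise

def cd_loss(sample, teacher_target):
    # Stand-in consistency-distillation loss: match the teacher's target.
    return float(np.mean((sample - teacher_target) ** 2))

def reward(sample):
    # Stand-in differentiable reward model: prefers samples near 1.
    return float(-np.mean((sample - 1.0) ** 2))

def total_loss(theta, noise, teacher_target, lam=0.5):
    x0 = student_one_step(theta, noise)  # single-step generation
    return cd_loss(x0, teacher_target) - lam * reward(x0)

# Finite-difference gradient descent on the combined objective.
theta, lr, eps = 0.0, 0.1, 1e-4
noise = rng.standard_normal(64)
teacher = 0.8 * noise  # pretend teacher's denoised target
for _ in range(200):
    g = (total_loss(theta + eps, noise, teacher)
         - total_loss(theta - eps, noise, teacher)) / (2 * eps)
    theta -= lr * g
```

The converged parameter sits between the CD-only solution and the reward-only solution, which is the qualitative effect of adding the reward term to the distillation objective.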
Efficient Pure Exploration in Adaptive Round Model
Tianyuan Jin, Jieming Shi, Xiaokui Xiao, Enhong Chen
In the adaptive setting, many multi-armed bandit applications allow the learner to adaptively draw samples and adjust the sampling strategy in rounds. In many real applications, not only the query complexity but also the round complexity must be optimized. In this paper, we study both the PAC and exact top-k arm identification problems and design efficient algorithms that account for both round complexity and query complexity.
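The round/query trade-off can be illustrated with a classic successive-halving-style elimination scheme; this is a generic textbook strategy sketched under assumed parameters, not the algorithm proposed in the paper. Each adaptive round samples all surviving arms in parallel and discards the worse half, so the number of rounds grows only logarithmically in the number of arms:

```python
# Hedged sketch of round-based elimination for top-k arm identification
# (successive-halving style; not the paper's algorithm). Each round pulls
# every surviving arm equally, then keeps the better half, so both round
# complexity and total query complexity stay small.
import random

random.seed(1)

def top_k_by_elimination(means, k, queries_per_round=2000):
    arms = list(range(len(means)))
    rounds = total_queries = 0
    while len(arms) > k:
        rounds += 1
        # One adaptive round: sample all surviving arms in parallel.
        est = {}
        for a in arms:
            pulls = [random.gauss(means[a], 1.0) for _ in range(queries_per_round)]
            est[a] = sum(pulls) / len(pulls)
            total_queries += queries_per_round
        # Keep the better half (but never fewer than k arms).
        arms = sorted(arms, key=lambda a: est[a], reverse=True)
        arms = arms[: max(k, len(arms) // 2)]
    return set(arms), rounds, total_queries

means = [0.1, 0.2, 0.9, 0.4, 0.8, 0.3, 0.05, 0.7]
best, rounds, queries = top_k_by_elimination(means, k=2)
```

With 8 arms and k = 2, this uses exactly two adaptive rounds, whereas a fully sequential strategy could need far more rounds for the same query budget; that gap is what motivates optimizing round complexity explicitly.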
We address some common concerns raised by the reviewers by providing additional experiment results.
We thank the reviewers for their insightful comments, which we will incorporate into the revised version. We adopt s2v in our paper since it satisfies these requirements; we will elaborate on the details in our revision. We use 2 layers of GNN by default, or append -k to the name in Table 1 to denote a k-layer design. The results are presented in Table 2. Despite the noisiness of the full USPTO set relative to the schneider-50k subset, our method still outperforms the two best baselines in top-k accuracies.
Towards an Information Theoretic Framework of Context-Based Offline Meta-Reinforcement Learning
As a marriage between offline RL and meta-RL, the advent of offline meta-reinforcement learning (OMRL) has shown great promise in enabling RL agents to multi-task and quickly adapt while acquiring knowledge safely. Among these, context-based OMRL (COMRL), a popular paradigm, aims to learn a universal policy conditioned on effective task representations. In this work, by examining several key milestones in the field of COMRL, we propose to integrate these seemingly independent methodologies into a unified framework. Most importantly, we show that the pre-existing COMRL algorithms are essentially optimizing the same mutual information objective between the task variable M and its latent representation Z by implementing various approximate bounds. Such theoretical insight offers ample design freedom for novel algorithms. As demonstrations, we propose a supervised and a self-supervised implementation of I(Z; M), and empirically show that the corresponding optimization algorithms exhibit remarkable generalization across a broad spectrum of RL benchmarks, context shift scenarios, data qualities and deep learning architectures. This work lays the information theoretic foundation for COMRL methods, leading to a better understanding of task representation learning in the context of reinforcement learning. Given its generality, we envision our framework as a promising offline pre-training paradigm of foundation models for decision making.
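The "supervised implementation of I(Z; M)" idea can be sketched in toy form: training a classifier to predict the task label M from the latent Z yields a variational lower bound on their mutual information (up to H(M)). The encoder, classifier, and data below are assumed stand-ins for illustration, not the paper's actual architecture:

```python
# Toy sketch (assumed): a softmax classifier over latents Z acts as a
# variational predictor of the task variable M; log(n_tasks) minus its
# cross-entropy is a lower-bound proxy for I(Z; M) in nats.
import numpy as np

rng = np.random.default_rng(0)
n_tasks, dim, n = 4, 8, 400

# Toy "context encoder" output: latents clustered by task identity.
task_means = rng.standard_normal((n_tasks, dim))
M = rng.integers(0, n_tasks, size=n)                  # task variable
Z = task_means[M] + 0.3 * rng.standard_normal((n, dim))

W = np.zeros((dim, n_tasks))
for _ in range(300):                                   # plain gradient ascent
    logits = Z @ W
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    onehot = np.eye(n_tasks)[M]
    W += 0.1 * Z.T @ (onehot - p) / n                  # cross-entropy gradient

# For well-separated task clusters, the bound approaches log(n_tasks).
ce = -np.mean(np.log(p[np.arange(n), M] + 1e-12))
mi_lower_bound = np.log(n_tasks) - ce
```

When the latents carry task information, as here, the proxy is clearly positive; if Z were independent of M, the classifier could do no better than chance and the bound would collapse toward zero, which is the sense in which COMRL objectives reward informative task representations.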