Wang, Zora Zhiruo
Benchmarking Failures in Tool-Augmented Language Models
Treviño, Eduardo, Contant, Hugo, Ngai, James, Neubig, Graham, Wang, Zora Zhiruo
The integration of tools has extended the capabilities of language models (LMs) beyond vanilla text generation to versatile scenarios. However, tool-augmented language models (TaLMs) often assume 'perfect' information access and tool availability, which may not hold in the real world. To systematically study TaLMs' imperfections, we introduce the FAIL-TALMS benchmark, featuring two major failures: under-specified user queries and non-available tools. FAIL-TALMS contains 1,749 examples using 906 tools across 21 categories, covering both single- and multi-tool usage. We evaluate top-performing proprietary and open-source models, and find that all current models except Claude struggle to recognize missing tools or information. Further, to study possible mitigations of these failures, we enable real-time human interaction through the Ask-and-Help (AAH) method, in which users provide missing information or replace non-functional tools. While AAH helps models solve tasks more correctly when queries are under-specified, it brings minimal benefit when complex tools are broken.
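Below is a minimal, hypothetical sketch of an Ask-and-Help style loop: the model asks the user for under-specified fields and falls back to the human when a required tool is unavailable. All names (`TaskState`, `detect_missing`, `ask_user`) are illustrative and not part of the FAIL-TALMS implementation.

```python
# Hypothetical Ask-and-Help (AAH) style loop; names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class TaskState:
    query: str
    required_fields: list[str]
    provided: dict[str, str] = field(default_factory=dict)


def detect_missing(state: TaskState) -> list[str]:
    """Return required fields the user has not supplied yet."""
    return [f for f in state.required_fields if f not in state.provided]


def ask_user(prompt: str) -> str:
    """Stand-in for real-time human interaction."""
    return input(f"Please provide {prompt}: ")


def ask_and_help(state: TaskState, tools: dict) -> str:
    # 1) Fill in under-specified parts of the query by asking the human.
    for missing in detect_missing(state):
        state.provided[missing] = ask_user(f"'{missing}'")
    # 2) If the needed tool is unavailable, ask the human to act in its place.
    tool = tools.get("search")
    if tool is None:
        return ask_user("the search result (tool unavailable)")
    return tool(state.query, **state.provided)
```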
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Huq, Faria, Wang, Zora Zhiruo, Xu, Frank F., Ou, Tianyue, Zhou, Shuyan, Bigham, Jeffrey P., Neubig, Graham
While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and at modeling user preferences. This presents an opportunity for humans to collaborate with the agent and leverage the agent's capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, with evaluation of both task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users can pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent's by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent on its own drives up to half of task success. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research on how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html
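The following is an illustrative sketch of such a collaborative navigation loop in the spirit of CowPilot: the agent proposes each step, and the human can accept, override, or stop. The `propose_step`, `execute`, and `task_done` methods and the command names are assumptions for exposition, not the framework's actual API.

```python
# Illustrative human-agent collaborative loop; method names are assumed, not CowPilot's API.
def collaborative_loop(agent, env, max_steps: int = 30) -> bool:
    """Agent proposes each step; the human may accept, override, or stop."""
    for _ in range(max_steps):
        action = agent.propose_step(env.observation())  # agent suggestion
        choice = input(f"Agent proposes: {action}. [a]ccept / [o]verride / [s]top: ")
        if choice == "s":                               # human pauses or ends the episode
            return env.task_done()
        if choice == "o":                               # human takes an alternative action
            action = input("Enter your action: ")
        env.execute(action)                             # interleaved human/agent control
        if env.task_done():
            return True
    return False
```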
AutoPresent: Designing Structured Visuals from Scratch
Ge, Jiaxin, Wang, Zora Zhiruo, Zhou, Xuhui, Peng, Yi-Hao, Subramanian, Sanjay, Tan, Qinyue, Sap, Maarten, Suhr, Alane, Fried, Daniel, Neubig, Graham, Darrell, Trevor
Designing structured visuals such as presentation slides is essential for communicative needs, requiring both content creation and visual planning skills. In this work, we tackle the challenge of automated slide generation, where models produce slide presentations from natural language (NL) instructions. We first introduce SlidesBench, the first benchmark for slide generation, with 7k training and 585 testing examples derived from 310 slide decks across 10 domains. SlidesBench supports evaluations that are (i) reference-based, measuring similarity to a target slide, and (ii) reference-free, measuring the design quality of generated slides alone. We benchmark end-to-end image generation and program generation methods with a variety of models, and find that programmatic methods produce higher-quality slides in user-interactable formats. Building on the success of program generation, we create AutoPresent, an 8B Llama-based model trained on 7k instruction-code pairs for slide generation, which achieves results comparable to the closed-source model GPT-4o. We further explore iterative design refinement, where the model is tasked to self-refine its own output, and find that this process improves slide quality. We hope that our work will provide a basis for future work on generating structured visuals.
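As a concrete illustration of the program generation route, a model can emit slide-building code of the kind below, here written with the python-pptx library; the file name and slide content are placeholders, not examples from SlidesBench or AutoPresent.

```python
# Sketch of programmatic slide generation with python-pptx; content is illustrative.
from pptx import Presentation
from pptx.util import Pt

prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[1])   # "Title and Content" layout
slide.shapes.title.text = "Quarterly Results"

body = slide.placeholders[1].text_frame
body.text = "Revenue grew 12% year over year"        # first bullet
para = body.add_paragraph()                           # second bullet
para.text = "Churn decreased to 3.1%"
para.font.size = Pt(18)

prs.save("generated_slide.pptx")                      # user-interactable output format
```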
CodeRAG-Bench: Can Retrieval Augment Code Generation?
Wang, Zora Zhiruo, Asai, Akari, Yu, Xinyan Velocity, Xu, Frank F., Xie, Yiqing, Neubig, Graham, Fried, Daniel
While language models (LMs) have proven remarkably adept at generating code, many programs are challenging for LMs to generate using their parametric knowledge alone. Providing external contexts, such as library documentation, can facilitate generating accurate and functional code. Despite the success of retrieval-augmented generation (RAG) in various text-oriented tasks, its potential for improving code generation remains under-explored. In this work, we conduct a systematic, large-scale analysis by asking: in what scenarios can retrieval benefit code generation models, and what challenges remain? We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks: basic programming, open-domain, and repository-level problems. We aggregate documents from five sources for models to retrieve contexts from: competition solutions, online tutorials, library documentation, StackOverflow posts, and GitHub repositories. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources. While retrieving high-quality contexts yields notable gains in final code generation across various settings, our analysis reveals room for improvement: current retrievers still struggle to fetch useful contexts, especially when lexical overlap is limited, and generators fail to improve given limited context lengths or limited ability to integrate additional contexts. We hope CodeRAG-Bench serves as an effective testbed to encourage further development of advanced code-oriented RAG methods.
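A hedged sketch of one such retrieval-augmented code generation step appears below: a lexical (BM25) retriever over documentation snippets whose top hits are prepended to the generation prompt. The corpus contents, `build_prompt` helper, and the final generation call are placeholders, not part of CodeRAG-Bench itself.

```python
# Minimal retrieval-augmented code generation sketch using BM25 (rank_bm25).
from rank_bm25 import BM25Okapi

docs = [
    "pandas.DataFrame.merge: merge DataFrame objects with a database-style join.",
    "numpy.argsort: returns the indices that would sort an array.",
    "requests.get(url, params=None): sends a GET request and returns a Response.",
]
bm25 = BM25Okapi([d.lower().split() for d in docs])


def build_prompt(task: str, k: int = 2) -> str:
    """Retrieve the top-k documents for the task and prepend them as context."""
    hits = bm25.get_top_n(task.lower().split(), docs, n=k)
    context = "\n".join(f"# doc: {h}" for h in hits)
    return f"{context}\n\n# Task: {task}\n# Write the solution below.\n"


prompt = build_prompt("merge two DataFrames on a shared key")
# `prompt` would then be passed to a code generation model, e.g. model.generate(prompt).
```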