A Large-Scale Human-Centric Benchmark for Referring Expression Comprehension in the LMM Era

Neural Information Processing Systems

Prior research in human-centric AI has primarily addressed single-modality tasks like pedestrian detection, action recognition, and pose estimation. However, the emergence of large multimodal models (LMMs) such as GPT-4V has redirected attention towards integrating language with visual content. Referring expression comprehension (REC) represents a prime example of this multimodal approach.
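
As a rough illustration of how REC predictions are conventionally scored (a standard criterion in the field, not something specific to this benchmark), a predicted bounding box counts as correct when its intersection-over-union with the annotated box reaches 0.5:

    from typing import Tuple

    Box = Tuple[float, float, float, float]  # (left, top, right, bottom)

    def iou(a: Box, b: Box) -> float:
        """Intersection-over-union of two axis-aligned boxes."""
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def rec_correct(pred: Box, gt: Box, thresh: float = 0.5) -> bool:
        """Usual REC criterion: IoU between prediction and annotation >= 0.5."""
        return iou(pred, gt) >= thresh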


VideoGUI: A Benchmark for GUI Automation from Instructional Videos

Neural Information Processing Systems

Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex activities (e.g., video editing). VideoGUI evaluates GUI assistants through a hierarchical process, allowing identification of the specific levels at which they fail: (i) high-level planning: reconstruct procedural subtasks from visual conditions without language descriptions; (ii) middle-level planning: generate sequences of precise action narrations based on the visual state (i.e., screenshot) and goals; (iii) atomic action execution: perform specific actions, such as accurately clicking designated elements. For each level, we design evaluation metrics across individual dimensions to provide clear signals, such as individual performance on clicking, dragging, typing, and scrolling in atomic action execution. Our evaluation on VideoGUI reveals that even the state-of-the-art large multimodal model GPT-4o performs poorly on visual-centric GUI tasks, especially high-level planning.
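
To make the atomic-action level concrete, here is a minimal sketch of a click-accuracy metric of the kind the abstract describes; the class and function names are illustrative, not VideoGUI's actual API:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class ClickAction:
        x: float                                       # predicted click (pixels)
        y: float
        target_box: Tuple[float, float, float, float]  # (left, top, right, bottom)

    def click_success(a: ClickAction) -> bool:
        """A click is correct if it lands inside the target element's box."""
        left, top, right, bottom = a.target_box
        return left <= a.x <= right and top <= a.y <= bottom

    def atomic_click_accuracy(actions: List[ClickAction]) -> float:
        """Per-dimension signal: fraction of clicks that hit their targets."""
        return sum(click_success(a) for a in actions) / max(len(actions), 1)

    print(atomic_click_accuracy([ClickAction(105, 42, (100, 30, 180, 55))]))  # 1.0

Analogous per-dimension scores for dragging, typing, and scrolling would localize exactly which atomic skill an assistant lacks.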



Quantitatively Understanding the Bitter Lesson Through Capabilities Trajectories

Neural Information Processing Systems

Performance on popular ML benchmarks is highly correlated with model scale, suggesting that most benchmarks tend to measure a similar underlying factor of general model capabilities. However, substantial research effort remains devoted to designing new benchmarks, many of which claim to measure novel phenomena. In the spirit of the Bitter Lesson, we leverage spectral analysis to measure an underlying capabilities component, the direction in benchmark-performance-space which explains most variation in model performance. In an extensive analysis of existing safety benchmarks, we find that variance in model performance on many safety benchmarks is largely explained by the capabilities component. In response, we argue that safety research should prioritize metrics which are not highly correlated with scale. Our work provides a lens to analyze both novel safety benchmarks and novel safety methods, which we hope will enable future work to make differential progress on safety.
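
A minimal sketch of the spectral analysis the abstract describes, on made-up numbers: stack model scores into a models-by-benchmarks matrix, take the top singular direction as the capabilities component, and inspect how much variance it explains:

    import numpy as np

    scores = np.array([            # rows: models ordered by scale; cols: benchmarks
        [0.42, 0.38, 0.51, 0.30],  # all values here are invented for illustration
        [0.55, 0.49, 0.60, 0.33],
        [0.68, 0.61, 0.72, 0.35],
        [0.81, 0.74, 0.83, 0.36],
    ])

    centered = scores - scores.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)

    explained = s**2 / (s**2).sum()
    print(f"variance explained by first component: {explained[0]:.2%}")

    # Benchmarks that load heavily on this component mostly track scale;
    # by the abstract's argument, a useful safety metric should load weakly.
    print("benchmark loadings on capabilities component:", np.round(vt[0], 2))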


ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

Neural Information Processing Systems

The KV cache stores key and value states from previous tokens to avoid re-computation, yet it demands substantial storage space, especially for long sequences. Adaptive KV cache compression seeks to discern the saliency of tokens, preserving vital information while aggressively compressing tokens of less importance.
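
A toy sketch of that general adaptive-compression recipe (not ZipCache's specific saliency metric or quantizer): score each cached token by the attention it has accumulated, keep the top fraction at high precision, and quantize the rest aggressively:

    import numpy as np

    def quantize(x: np.ndarray, bits: int) -> np.ndarray:
        """Uniform quantization to 2**bits levels, then dequantize to floats."""
        lo, hi = x.min(), x.max()
        scale = (hi - lo) / (2 ** bits - 1) or 1.0   # guard against constant input
        return np.round((x - lo) / scale) * scale + lo

    def compress_kv(kv: np.ndarray, attn: np.ndarray, keep_ratio: float = 0.2):
        """kv: (tokens, dim) cached states; attn: (queries, tokens) attention probs."""
        saliency = attn.sum(axis=0)                    # total attention each token got
        k = max(1, int(len(saliency) * keep_ratio))
        salient = np.argsort(saliency)[-k:]            # indices of top-k salient tokens
        out = quantize(kv, bits=2)                     # aggressive low-bit default
        out[salient] = quantize(kv[salient], bits=8)   # salient tokens near-lossless
        return out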


96% of IT pros say AI agents are a security risk, but they're deploying them anyway

ZDNet

AI agents are being rapidly deployed within organizations even as they sow security fears, according to a new report from data governance firm SailPoint. Based on a global survey of more than 350 IT professionals, the report found that the widespread embrace of agents -- AI systems capable of formulating plans and taking action without human oversight -- is taking place within a security vacuum. Of IT pros who responded, 84% said their organizations already use agents internally, but just over half that number (44%) currently have policies in place to control the agents' behavior. Even more strikingly, 96% of respondents said they view agents as a security risk, yet 98% also said their employers plan to expand their use of agents in the coming year. Agents are the latest wave in a flood of innovation surrounding generative AI, which began in earnest following OpenAI's release of ChatGPT in late 2022.


Generating a Diversity of Challenging Programming Puzzles with Autotelic Generative Models

Neural Information Processing Systems

The ability to invent novel and interesting problems is a remarkable feature of human intelligence that drives innovation, art, and science. We propose a method that aims to automate this process by harnessing the power of state-of-the-art generative models to produce a diversity of challenging yet solvable problems, here in the context of Python programming puzzles. Inspired by the intrinsic motivation literature, Autotelic CodE Search (ACES) jointly optimizes for the diversity and difficulty of generated problems. We represent problems in a space of LLM-generated semantic descriptors describing the programming skills required to solve them (e.g.
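
A schematic of the autotelic loop the abstract sketches; the generator and solver calls are stubbed placeholders for LLM queries, and the two-skill descriptor space is invented for illustration:

    import random

    SKILLS = ["recursion", "string manipulation", "dynamic programming", "graph search"]

    archive = {}   # frozenset of skills -> a solvable puzzle exercising them

    def generate_puzzle(target_skills: frozenset) -> str:
        """Stub for the LLM generator; a real system would prompt a model here."""
        return f"# puzzle exercising: {sorted(target_skills)}"

    def is_solvable(puzzle: str) -> bool:
        """Stub for the solver check: does a solver model pass the puzzle's tests?"""
        return True

    def aces_step() -> None:
        # Target a skill combination, keeping the puzzle only if it is a
        # novel descriptor cell AND a solver can actually crack it.
        target = frozenset(random.sample(SKILLS, k=2))
        if target not in archive:
            puzzle = generate_puzzle(target)
            if is_solvable(target and puzzle):
                archive[target] = puzzle

    for _ in range(20):
        aces_step()
    print(len(archive), "descriptor cells filled")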


UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis

Neural Information Processing Systems

The use of Retrieval-Augmented Generation (RAG) has improved the ability of Large Language Models (LLMs) to work with external data, yet significant challenges remain in real-world scenarios. In areas such as academic literature and finance question answering, data are often found as raw text and tables in HTML or PDF formats, which can be lengthy and highly unstructured. In this paper, we introduce a benchmark suite, namely Unstructured Document Analysis (UDA), that involves 2,965 real-world documents and 29,590 expert-annotated Q&A pairs. We revisit popular LLM- and RAG-based solutions for document analysis and evaluate the design choices and answer quality across multiple document domains and diverse query types. Our evaluation yields interesting findings and highlights the importance of data parsing and retrieval. We hope our benchmark can shed light on, and better serve, real-world document analysis applications.
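
For concreteness, a bare-bones retrieve-then-read skeleton of the kind such benchmarks exercise; the fixed chunk size and lexical-overlap scoring are illustrative stand-ins for real document parsing and BM25 or dense retrieval:

    import re
    from collections import Counter

    def chunk(document: str, size: int = 200) -> list:
        """Split a parsed document into fixed-size word windows."""
        words = document.split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

    def score(query: str, passage: str) -> int:
        """Toy lexical overlap; real systems use BM25 or embedding similarity."""
        q = Counter(re.findall(r"\w+", query.lower()))
        p = Counter(re.findall(r"\w+", passage.lower()))
        return sum((q & p).values())

    def retrieve(query: str, document: str, k: int = 3) -> list:
        passages = chunk(document)
        return sorted(passages, key=lambda p: score(query, p), reverse=True)[:k]

    # The top-k passages would then be packed into the LLM reader's prompt;
    # UDA's finding is that the parsing and retrieval steps matter a great deal.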


Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation

Neural Information Processing Systems

Despite recent progress made by large language models in code generation, they still struggle with programs that must meet complex requirements. Recent work uses plan-and-solve decomposition to reduce complexity and leverages self-tests to refine the generated program. Yet, planning in advance for requirements buried deep inside the task can be challenging, and the tests need to be accurate for self-improvement to succeed.
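
A small sketch of a consensus-style selection rule in the spirit the title suggests (not necessarily the paper's exact mechanism): among several candidate implementations of a decomposed sub-function, keep the one whose outputs agree with the most peers on shared probe inputs:

    def consensus_select(candidates: list, probe_inputs: list):
        """Return the candidate whose outputs agree with the most peers."""
        def run(f, x):
            try:
                return f(x)
            except Exception:
                return None        # a crashing candidate agrees with nothing

        outputs = [[run(f, x) for x in probe_inputs] for f in candidates]

        def agreement(i: int) -> int:
            return sum(
                a == b and a is not None
                for j in range(len(outputs)) if j != i
                for a, b in zip(outputs[i], outputs[j])
            )

        return candidates[max(range(len(candidates)), key=agreement)]

    # Example: three candidate abs() implementations, one buggy.
    cands = [lambda x: abs(x), lambda x: x if x > 0 else -x, lambda x: x]
    print(consensus_select(cands, [-2, 0, 3])(-5))   # a correct variant wins -> 5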


cPAPERS: A Dataset of Situated and Multimodal Interactive Conversations in Scientific Papers

Neural Information Processing Systems

An emerging area of research in situated and multimodal interactive conversations (SIMMC) concerns interactions grounded in scientific papers. Since scientific papers are primarily composed of text, equations, figures, and tables, SIMMC methods must be developed specifically for each component to support the depth of inquiry and interaction that research scientists require.
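
One plausible record layout for such component-grounded Q&A pairs (hypothetical field names, not the dataset's actual schema):

    from dataclasses import dataclass

    @dataclass
    class SituatedQA:
        paper_id: str        # e.g., an arXiv identifier
        component: str       # "equation" | "figure" | "table"
        component_src: str   # LaTeX source or caption of the grounded component
        context: str         # neighboring text the question refers to
        question: str
        answer: str

Keeping the grounded component's source alongside the surrounding text is what lets a model answer questions that text alone cannot resolve.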