collage
- North America > Canada > Ontario > Toronto (0.86)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Vision (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Representational Similarity via Interpretable Visual Concepts
Kondapaneni, Neehar; Mac Aodha, Oisin; Perona, Pietro
How do two deep neural networks differ in how they arrive at a decision? Measuring the similarity of deep networks has been a long-standing open question. Most existing methods provide a single number to measure the similarity of two networks at a given layer, but give no insight into what makes them similar or dissimilar. We introduce an interpretable representational similarity method (RSVC) to compare two networks. We use RSVC to discover shared and unique visual concepts between two models. We show that some aspects of model differences can be attributed to unique concepts discovered by one model that are not well represented in the other. Finally, we conduct an extensive evaluation across different vision model architectures and training protocols to demonstrate its effectiveness.
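The abstract contrasts RSVC with methods that reduce layer similarity to a single number. For orientation only, here is a minimal sketch of one such single-number baseline, linear CKA, computed between two layers' activations; the function, sizes, and toy data are illustrative assumptions and this is not the paper's RSVC method.

```python
# Minimal sketch of a single-number layer-similarity baseline (linear CKA),
# the kind of score the paper argues gives no insight into *what* differs.
# This is not RSVC; names, sizes, and toy data are illustrative.
import numpy as np

def linear_cka(X, Y):
    """X: (n, d1) and Y: (n, d2) activations for the same n inputs."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

rng = np.random.default_rng(0)
acts_a = rng.normal(size=(2048, 64))              # model A, some layer
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))    # random rotation matrix
acts_b = acts_a @ q                               # model B: a rotated copy
print(linear_cka(acts_a, acts_b))                 # ~1.0: identical up to rotation
print(linear_cka(acts_a, rng.normal(size=(2048, 64))))  # near 0 for unrelated features
```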
- Europe > Austria > Vienna (0.04)
- Africa > Mozambique > Gaza Province > Xai-Xai (0.04)
CompCap: Improving Multimodal Large Language Models with Composite Captions
Chen, Xiaohui; Shukla, Satya Narayan; Azab, Mahmoud; Singh, Aashu; Wang, Qifan; Yang, David; Peng, ShengYun; Yu, Hanchao; Yan, Shen; Zhang, Xuewen; He, Baosheng
How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K by supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs' understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.
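As context for what "synthesizing CIs with accurate and detailed captions" can look like mechanically, the sketch below tiles a few visual elements onto one canvas and derives a layout-grounded caption string. The file names, grid layout, and caption template are assumptions for illustration, not the CompCap framework itself.

```python
# A hypothetical sketch of assembling a composite image plus a grounded caption.
# File names, layout, and the caption template are placeholders, not CompCap's pipeline.
from PIL import Image

def make_composite(panel_paths, tile_size=(512, 512), cols=2):
    """Paste individual visual elements onto one canvas in a simple grid."""
    rows = (len(panel_paths) + cols - 1) // cols
    canvas = Image.new("RGB", (cols * tile_size[0], rows * tile_size[1]), "white")
    placements = []
    for i, path in enumerate(panel_paths):
        panel = Image.open(path).convert("RGB").resize(tile_size)
        x, y = (i % cols) * tile_size[0], (i // cols) * tile_size[1]
        canvas.paste(panel, (x, y))
        placements.append((path, (x, y)))
    return canvas, placements

def caption_from_layout(placements):
    """Turn the known layout into a caption string an LLM could later expand."""
    parts = [f"{path} at position {pos}" for path, pos in placements]
    return "A composite image containing: " + "; ".join(parts) + "."

composite, layout = make_composite(["chart.png", "poster.png", "screenshot.png"])
composite.save("composite.png")
print(caption_from_layout(layout))
```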
- North America > Mexico (0.14)
- North America > The Bahamas (0.14)
- Europe > Switzerland > Zürich > Zürich (0.14)
- (122 more...)
- Government (0.93)
- Transportation > Passenger (0.45)
Collage: Decomposable Rapid Prototyping for Information Extraction on Scientific PDFs
Gururaja, Sireesh; Zhang, Yueheng; Tang, Guannan; Zhang, Tianhao; Murphy, Kevin; Yi, Yu-Tsen; Seo, Junwon; Rollett, Anthony; Strubell, Emma
Recent years in NLP have seen the continued development of domain-specific information extraction tools for scientific documents, alongside the release of increasingly multimodal pretrained transformer models. While the opportunity for scientists outside of NLP to evaluate and apply such systems to their own domains has never been clearer, these models are difficult to compare: they accept different input formats, are often black-box and give little insight into processing failures, and rarely handle PDF documents, the most common format of scientific publication. In this work, we present Collage, a tool designed for rapid prototyping, visualization, and evaluation of different information extraction models on scientific PDFs. Collage allows the use and evaluation of any HuggingFace token classifier, several LLMs, and multiple other task-specific models out of the box, and provides extensible software interfaces to accelerate experimentation with new models. Further, we enable both developers and users of NLP-based tools to inspect, debug, and better understand modeling pipelines by providing granular views of intermediate states of processing. We demonstrate our system in the context of information extraction to assist with literature review in materials science.
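For readers outside NLP, the kind of building block Collage exposes can be sketched as follows: text is pulled from a PDF and fed to an arbitrary HuggingFace token classifier. The choice of pypdf and of the dslim/bert-base-NER checkpoint here are illustrative assumptions, not Collage's internal pipeline.

```python
# A rough sketch of the kind of step Collage wraps: run a HuggingFace token
# classifier over text pulled from a scientific PDF. The model name and the
# use of pypdf are illustrative choices, not Collage's actual implementation.
from pypdf import PdfReader
from transformers import pipeline

def extract_text(pdf_path):
    """Concatenate the text layer of every page (no layout analysis)."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Any token-classification checkpoint can be dropped in here.
ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")

text = extract_text("paper.pdf")
for entity in ner(text[:1000]):  # keep the demo input within the model's context
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```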
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (4 more...)
- Research Report (0.50)
- Overview (0.35)
COLLAGE: Collaborative Human-Agent Interaction Generation using Hierarchical Latent Diffusion and Language Models
Daiya, Divyanshu; Conover, Damon; Bera, Aniket
We propose COLLAGE, a novel framework for generating collaborative agent-object-agent interactions by leveraging large language models (LLMs) and hierarchical motion-specific vector-quantized variational autoencoders (VQ-VAEs). Our model addresses the lack of rich datasets in this domain by incorporating the knowledge and reasoning abilities of LLMs to guide a generative diffusion model. The hierarchical VQ-VAE architecture captures different motion-specific characteristics at multiple levels of abstraction, avoiding redundant concepts and enabling efficient multi-resolution representation. We introduce a diffusion model that operates in the latent space and incorporates LLM-generated motion planning cues to guide the denoising process, resulting in prompt-specific motion generation with greater control and diversity. Experimental results on the CORE-4D and InterHuman datasets demonstrate the effectiveness of our approach in generating realistic and diverse collaborative human-object-human interactions, outperforming state-of-the-art methods. Our work opens up new possibilities for modeling complex interactions in various domains, such as robotics, graphics and computer vision.
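The core operation inside any VQ-VAE, including hierarchical variants like the one described here, is snapping encoder outputs to their nearest codebook entries. The sketch below shows that generic quantization step with a straight-through gradient; shapes, names, and the single-codebook setup are assumptions, not the paper's architecture.

```python
# A generic sketch of the vector-quantization step at the heart of a VQ-VAE:
# each encoder output is replaced by its nearest codebook entry. Shapes and the
# straight-through estimator are standard choices, not the paper's exact setup.
import torch

def vector_quantize(z_e, codebook):
    """z_e: (batch, dim) encoder outputs; codebook: (num_codes, dim)."""
    distances = torch.cdist(z_e, codebook) ** 2  # squared distance to every code
    indices = distances.argmin(dim=1)            # nearest code per input
    z_q = codebook[indices]                      # quantized latents
    # Straight-through estimator: gradients flow back to the encoder unchanged.
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, indices

codebook = torch.randn(512, 64)                  # 512 codes of dimension 64
z_e = torch.randn(8, 64)                         # a batch of encoder outputs
z_q, codes = vector_quantize(z_e, codebook)
print(codes.tolist())
```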
- Asia > Middle East > Saudi Arabia > Northern Borders Province > Arar (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)
Collage is the New Writing: Exploring the Fragmentation of Text and User Interfaces in AI Tools
This essay proposes and explores the concept of Collage for the design of AI writing tools, transferred from avant-garde literature with four facets: 1) fragmenting text in writing interfaces, 2) juxtaposing voices (content vs command), 3) integrating material from multiple sources (e.g. text suggestions), and 4) shifting from manual writing to editorial and compositional decision-making, such as selecting and arranging snippets. The essay then employs Collage as an analytical lens to analyse the user interface design of recent AI writing tools, and as a constructive lens to inspire new design directions. Finally, a critical perspective relates the concerns that writers historically expressed through literary collage to AI writing tools. In a broad view, this essay explores how literary concepts can help advance design theory around AI writing tools. It encourages creators of future writing tools to engage not only with new technological possibilities, but also with past writing innovations.
- North America > United States > California > San Francisco County > San Francisco (0.28)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States > New York > New York County > New York City (0.06)
- (25 more...)
- Information Technology > Human Computer Interaction > Interfaces (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
SPLICE -- Streamlining Digital Pathology Image Processing
Alsaafin, Areej; Nejat, Peyman; Shafique, Abubakr; Khan, Jibran; Alfasly, Saghir; Alabtah, Ghazal; Tizhoosh, H. R.
Digital pathology and the integration of artificial intelligence (AI) models have revolutionized histopathology, opening new opportunities. With the increasing availability of Whole Slide Images (WSIs), there is a growing demand for efficient retrieval, processing, and analysis of relevant images from vast biomedical archives. However, processing WSIs presents challenges due to their large size and content complexity. Fully processing an entire WSI is impractical, and processing all patches individually is prohibitively expensive. In this paper, we propose an unsupervised patching algorithm, Sequential Patching Lattice for Image Classification and Enquiry (SPLICE). This novel approach condenses a histopathology WSI into a compact set of representative patches, forming a "collage" of the WSI while minimizing redundancy. SPLICE prioritizes patch quality and uniqueness by sequentially analyzing a WSI and selecting non-redundant representative features. We evaluated SPLICE for search and match applications, demonstrating improved accuracy and reduced computation time and storage requirements compared to existing state-of-the-art methods. As an unsupervised method, SPLICE reduces the storage required to represent tissue images by 50%. This reduction enables numerous algorithms in computational pathology to operate much more efficiently, paving the way for accelerated adoption of digital pathology.
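The general idea of sequentially selecting non-redundant representative patches can be illustrated with a greedy distance filter over patch embeddings, as sketched below. The distance threshold, the random stand-in features, and the selection rule are assumptions for illustration, not SPLICE's actual criteria.

```python
# A toy sketch of sequential non-redundant patch selection: walk through a
# WSI's patches in scan order and keep a patch only if its feature vector is
# far enough from everything kept so far. Threshold and features are
# placeholders, not SPLICE's criteria.
import numpy as np

def select_representative_patches(patch_features, threshold):
    """patch_features: (num_patches, dim) array, in scan order."""
    kept_indices, kept_features = [], []
    for i, feat in enumerate(patch_features):
        if not kept_features:
            kept_indices.append(i)
            kept_features.append(feat)
            continue
        dists = np.linalg.norm(np.stack(kept_features) - feat, axis=1)
        if dists.min() > threshold:          # novel enough, so keep it
            kept_indices.append(i)
            kept_features.append(feat)
    return kept_indices

features = np.random.rand(1000, 128)          # stand-in for patch embeddings
collage = select_representative_patches(features, threshold=4.5)
print(f"kept {len(collage)} of {len(features)} patches")
```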
- Research Report > Promising Solution (1.00)
- Research Report > New Finding (0.68)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Diagnostic Medicine (1.00)