Goto

Collaborating Authors

 Xiao, Shishi


ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map

arXiv.org Artificial Intelligence

Multi-modal embeddings form the foundation for vision-language models, such as CLIP embeddings, the most widely used text-image embeddings. However, these embeddings are vulnerable to subtle misalignment of cross-modal features, resulting in decreased model performance and diminished generalization. To address this problem, we design ModalChorus, an interactive system for visual probing and alignment of multi-modal embeddings. ModalChorus primarily offers a two-stage process: 1) embedding probing with Modal Fusion Map (MFM), a novel parametric dimensionality reduction method that integrates both metric and nonmetric objectives to enhance modality fusion; and 2) embedding alignment that allows users to interactively articulate intentions for both point-set and set-set alignments. Quantitative and qualitative comparisons for CLIP embeddings with existing dimensionality reduction (e.g., t-SNE and MDS) and data fusion (e.g., data context map) methods demonstrate the advantages of MFM in showcasing cross-modal features over common vision-language datasets. Case studies reveal that ModalChorus can facilitate intuitive discovery of misalignment and efficient re-alignment in scenarios ranging from zero-shot classification to cross-modal retrieval and generation.


Generative AI for Visualization: State of the Art and Future Directions

arXiv.org Artificial Intelligence

Generative AI (GenAI) has witnessed remarkable progress in recent years and demonstrated impressive performance in various generation tasks in different domains such as computer vision and computational design. Many researchers have attempted to integrate GenAI into visualization framework, leveraging the superior generative capacity for different operations. Concurrently, recent major breakthroughs in GenAI like diffusion model and large language model have also drastically increase the potential of GenAI4VIS. From a technical perspective, this paper looks back on previous visualization studies leveraging GenAI and discusses the challenges and opportunities for future research. Specifically, we cover the applications of different types of GenAI methods including sequence, tabular, spatial and graph generation techniques for different tasks of visualization which we summarize into four major stages: data enhancement, visual mapping generation, stylization and interaction. For each specific visualization sub-task, we illustrate the typical data and concrete GenAI algorithms, aiming to provide in-depth understanding of the state-of-the-art GenAI4VIS techniques and their limitations. Furthermore, based on the survey, we discuss three major aspects of challenges and research opportunities including evaluation, dataset, and the gap between end-to-end GenAI and generative algorithms. By summarizing different generation algorithms, their current applications and limitations, this paper endeavors to provide useful insights for future GenAI4VIS research.


TypeDance: Creating Semantic Typographic Logos from Image through Personalized Generation

arXiv.org Artificial Intelligence

One notable application is the semantic typographic logo, which symbolizes a unique identity in a concise yet informative manner. Due to its expressiveness and memorability [7], semantic typographic logo has been widely used as visual signatures for individuals [28], brand logos with commercial values [15, 20], and symbols for significant events and city promotions [3, 43]. However, crafting a semantic typographic logo presents a formidable challenge, requiring seamless blending of typeface and imagery while preserving readability. Experienced designers often rely on professional software like Adobe Illustrator to manually adjust the outline of the typeface to incorporate specific imagery, which is a time-consuming and error-prone process. They often experiment with different strokes or letters of typeface and various imageries to find a visually appealing and memorable representation, intensifying the lengthy process. This requires creative thinking, practical skills, and the ability to persist through continuous trial and error.


The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model

arXiv.org Artificial Intelligence

Image search is an essential and user-friendly method to explore vast galleries of digital images. However, existing image search methods heavily rely on proximity measurements like tag matching or image similarity, requiring precise user inputs for satisfactory results. To meet the growing demand for a contemporary image search engine that enables accurate comprehension of users' search intentions, we introduce an innovative user intent expansion framework. Our framework leverages visual-language models to parse and compose multi-modal user inputs to provide more accurate and satisfying results. It comprises two-stage processes: 1) a parsing stage that incorporates a language parsing module with large language models to enhance the comprehension of textual inputs, along with a visual parsing module that integrates an interactive segmentation module to swiftly identify detailed visual elements within images; and 2) a logic composition stage that combines multiple user search intents into a unified logic expression for more sophisticated operations in complex searching scenarios. Moreover, the intent expansion framework enables users to perform flexible contextualized interactions with the search results to further specify or adjust their detailed search intents iteratively. We implemented the framework into an image search system for NFT (non-fungible token) search and conducted a user study to evaluate its usability and novel properties. The results indicate that the proposed framework significantly improves users' image search experience. Particularly the parsing and contextualized interactions prove useful in allowing users to express their search intents more accurately and engage in a more enjoyable iterative search experience.


Let the Chart Spark: Embedding Semantic Context into Chart with Text-to-Image Generative Model

arXiv.org Artificial Intelligence

Pictorial visualization seamlessly integrates data and semantic context into visual representation, conveying complex information in a manner that is both engaging and informative. Extensive studies have been devoted to developing authoring tools to simplify the creation of pictorial visualizations. However, mainstream works mostly follow a retrieving-and-editing pipeline that heavily relies on retrieved visual elements from a dedicated corpus, which often compromise the data integrity. Text-guided generation methods are emerging, but may have limited applicability due to its predefined recognized entities. In this work, we propose ChartSpark, a novel system that embeds semantic context into chart based on text-to-image generative model. ChartSpark generates pictorial visualizations conditioned on both semantic context conveyed in textual inputs and data information embedded in plain charts. The method is generic for both foreground and background pictorial generation, satisfying the design practices identified from an empirical research into existing pictorial visualizations. We further develop an interactive visual interface that integrates a text analyzer, editing module, and evaluation module to enable users to generate, modify, and assess pictorial visualizations. We experimentally demonstrate the usability of our tool, and conclude with a discussion of the potential of using text-to-image generative model combined with interactive interface for visualization design.