AITopics | Feng, Weixi

Collaborating Authors

Feng, Weixi

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations

Feng, Weixi, Liu, Chao, Liu, Sifei, Wang, William Yang, Vahdat, Arash, Nie, Weili

arXiv.org Artificial IntelligenceJan-13-2025

Existing video generation models struggle to follow complex text prompts and synthesize multiple objects, raising the need for additional grounding input for improved controllability. In this work, we propose to decompose videos into visual primitives - blob video representation, a general representation for controllable video generation. Based on blob conditions, we develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. In particular, we introduce a masked 3D attention module that effectively improves regional consistency across frames. In addition, we introduce a learnable module to interpolate text embeddings so that users can control semantics in specific frames and obtain smooth object transitions. We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models. Extensive experimental results show that BlobGEN-Vid achieves superior zero-shot video generation ability and state-of-the-art layout controllability on multiple benchmarks. When combined with an LLM for layout planning, our framework even outperforms proprietary text-to-video generators in terms of compositional accuracy.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2501.07647

Genre: Research Report (0.84)

Industry:

Transportation > Ground > Road (0.46)
Automobiles & Trucks (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Add feedback

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

He, Xuehai, Feng, Weixi, Zheng, Kaizhi, Lu, Yujie, Zhu, Wanrong, Li, Jiachen, Fan, Yue, Wang, Jianfeng, Li, Linjie, Yang, Zhengyuan, Lin, Kevin, Wang, William Yang, Wang, Lijuan, Wang, Xin Eric

arXiv.org Artificial IntelligenceJun-13-2024

Multimodal Language Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4V performs the best with only 52.3\% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings such as models' different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.

large language model, machine learning, question answering, (21 more...)

arXiv.org Artificial Intelligence

2406.08407

Country: Europe > Middle East > Malta (0.14)

Genre: Research Report (0.82)

Industry:

Health & Medicine (1.00)
Information Technology (0.93)
Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(4 more...)

Add feedback

Reward Guided Latent Consistency Distillation

Li, Jiachen, Feng, Weixi, Chen, Wenhu, Wang, William Yang

arXiv.org Artificial IntelligenceMar-16-2024

Latent Consistency Distillation (LCD) has emerged as a promising paradigm for efficient text-to-image synthesis. By distilling a latent consistency model (LCM) from a pre-trained teacher latent diffusion model (LDM), LCD facilitates the generation of high-fidelity images within merely 2 to 4 inference steps. However, the LCM's efficient inference is obtained at the cost of the sample quality. In this paper, we propose compensating the quality loss by aligning LCM's output with human preference during training. Specifically, we introduce Reward Guided LCD (RG-LCD), which integrates feedback from a reward model (RM) into the LCD process by augmenting the original LCD loss with the objective of maximizing the reward associated with LCM's single-step generation. As validated through human evaluation, when trained with the feedback of a good RM, the 2-step generations from our RG-LCM are favored by humans over the 50-step DDIM samples from the teacher LDM, representing a 25 times inference acceleration without quality loss. As directly optimizing towards differentiable RMs can suffer from over-optimization, we overcome this difficulty by proposing the use of a latent proxy RM (LRM). This novel component serves as an intermediary, connecting our LCM with the RM. Empirically, we demonstrate that incorporating the LRM into our RG-LCD successfully avoids high-frequency noise in the generated images, contributing to both improved FID on MS-COCO and a higher HPSv2.1 score on HPSv2's test set, surpassing those achieved by the baseline LCM.

arxiv preprint arxiv, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2403.11027

Country: Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

Feng, Weixi, Zhu, Wanrong, Fu, Tsu-jui, Jampani, Varun, Akula, Arjun, He, Xuehai, Basu, Sugato, Wang, Xin Eric, Wang, William Yang

arXiv.org Artificial IntelligenceOct-28-2023

Attaining a high degree of user controllability in visual generation often requires intricate, fine-grained inputs like layouts. However, such inputs impose a substantial burden on users when compared to simple text inputs. To address the issue, we study how Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions, and thus collaborate with visual generative models. We propose LayoutGPT, a method to compose in-context visual demonstrations in style sheet language to enhance the visual planning skills of LLMs. LayoutGPT can generate plausible layouts in multiple domains, ranging from 2D images to 3D indoor scenes. LayoutGPT also shows superior performance in converting challenging language concepts like numerical and spatial relations to layout arrangements for faithful text-to-image generation. When combined with a downstream image generation model, LayoutGPT outperforms text-to-image models/systems by 20-40% and achieves comparable performance as human users in designing visual layouts for numerical and spatial correctness. Lastly, Layout-GPT achieves comparable performance to supervised methods in 3D indoor scene synthesis, demonstrating its effectiveness and potential in multiple visual domains.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2305.15393

Country:

North America > United States > California (0.28)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (1.00)

Industry: Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

EDIS: Entity-Driven Image Search over Multimodal Web Content

Liu, Siqi, Feng, Weixi, Fu, Tsu-jui, Chen, Wenhu, Wang, William Yang

arXiv.org Artificial IntelligenceOct-23-2023

Making image retrieval methods practical for real-world search applications requires significant progress in dataset scales, entity comprehension, and multimodal information fusion. In this work, we introduce \textbf{E}ntity-\textbf{D}riven \textbf{I}mage \textbf{S}earch (EDIS), a challenging dataset for cross-modal image search in the news domain. EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description. Unlike datasets that assume a small set of single-modality candidates, EDIS reflects real-world web image search scenarios by including a million multimodal image-text pairs as candidates. EDIS encourages the development of retrieval models that simultaneously address cross-modal information fusion and matching. To achieve accurate ranking results, a model must: 1) understand named entities and events from text queries, 2) ground entities onto images or text descriptions, and 3) effectively fuse textual and visual representations. Our experimental results show that EDIS challenges state-of-the-art methods with dense entities and a large-scale candidate set. The ablation study also proves that fusing textual features with visual features is critical in improving retrieval results.

information retrieval, machine learning, pattern recognition, (18 more...)

arXiv.org Artificial Intelligence

2305.13631

Country:

North America > United States (1.00)
Asia (1.00)
Europe (0.68)

Genre: Research Report > New Finding (0.68)

Industry:

Government > Voting & Elections (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Banking & Finance (0.93)
(2 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(2 more...)

Add feedback

VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

Schumann, Raphael, Zhu, Wanrong, Feng, Weixi, Fu, Tsu-Jui, Riezler, Stefan, Wang, William Yang

arXiv.org Artificial IntelligenceJul-12-2023

Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation~(VLN) which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in observations of a real-world environment like Street View. Despite the impressive results of LLMs in other research areas, it is an ongoing problem of how to best connect them with an interactive visual environment. In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as contextual prompt for the next action. Visual information is verbalized by a pipeline that extracts landmarks from the human written navigation instructions and uses CLIP to determine their visibility in the current panorama view. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2307.06082

Country:

North America > United States > California (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre:

Research Report (0.64)
Workflow (0.47)

Industry:

Consumer Products & Services > Restaurants (0.46)
Transportation > Ground > Road (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

Feng, Weixi, He, Xuehai, Fu, Tsu-Jui, Jampani, Varun, Akula, Arjun, Narayana, Pradyumna, Basu, Sugato, Wang, Xin Eric, Wang, William Yang

arXiv.org Artificial IntelligenceFeb-28-2023

Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we observe that attribution-binding and compositional capabilities are still considered major challenging issues, especially when involving multiple objects. Attribute-binding requires the model to associate objects with the correct attribute descriptions, and compositional skills require the model to combine and generate multiple concepts into a single image. In this work, we improve these two aspects of T2I models to achieve more accurate image compositions. To do this, we incorporate linguistic structures with the diffusion guidance process based on the controllable properties of manipulating cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers have strong semantic meanings associated with object layouts and content. Therefore, by manipulating the cross-attention representations based on linguistic insights, we can better preserve the compositional semantics in the generated image. Built upon Stable Diffusion, a SOTA T2I model, our structured cross-attention design is efficient that requires no additional training samples. We achieve better compositional skills in qualitative and quantitative results, leading to a significant 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image compositions and justify the properties of cross-attention layers in the generation process. Text-to-Image Synthesis (T2I) is to generate natural and faithful images given a text prompt as input. Recently, there has been a significant advancement in the quality of generated images by extremely large-scale vision-language models, such as DALL-E 2 (Ramesh et al., 2022), Imagen (Saharia et al., 2022), and Parti (Yu et al., 2022).

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2212.05032

Country: North America > United States > California (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.54)

Add feedback

Neuro-Symbolic Procedural Planning with Commonsense Prompting

Lu, Yujie, Feng, Weixi, Zhu, Wanrong, Xu, Wenda, Wang, Xin Eric, Eckstein, Miguel, Wang, William Yang

arXiv.org Artificial IntelligenceFeb-16-2023

Procedural planning aims to implement complex high-level goals by decomposition into sequential simpler low-level steps. Although procedural planning is a basic skill set for humans in daily life, it remains a challenge for large language models (LLMs) that lack a deep understanding of the cause-effect relations in procedures. Previous methods require manual exemplars to acquire procedural planning knowledge from LLMs in the zero-shot setting. However, such elicited pre-trained knowledge in LLMs induces spurious correlations between goals and steps, which impair the model generalization to unseen tasks. In contrast, this paper proposes a neuro-symbolic procedural PLANner (PLAN) that elicits procedural planning knowledge from the LLMs with commonsense-infused prompting. To mitigate spurious goal-step correlations, we use symbolic program executors on the latent procedural representations to formalize prompts from commonsense knowledge bases as a causal intervention toward the Structural Causal Model. Both automatic and human evaluations on WikiHow and RobotHow show the superiority of PLAN on procedural planning without further training or manual exemplars.

artificial intelligence, expert system, natural language, (18 more...)

arXiv.org Artificial Intelligence

2206.02928

Country:

Europe (0.92)
North America > United States > California (0.67)

Genre:

Workflow (0.94)
Research Report > Experimental Study (0.92)

Industry:

Health & Medicine (0.67)
Leisure & Entertainment > Games > Computer Games (0.31)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback