Patel, Maitreya
TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark
Fallah, Forouzan, Patel, Maitreya, Chatterjee, Agneet, Morariu, Vlad I., Baral, Chitta, Yang, Yezhou
Generating images with embedded text is crucial for the automatic production of visual and multimodal documents, such as educational materials and advertisements. However, existing diffusion-based text-to-image models often struggle to accurately embed text within images, facing challenges in spelling accuracy, contextual relevance, and visual coherence. Evaluating the ability of such models to embed text within a generated image is complicated due to the lack of comprehensive benchmarks. In this work, we introduce TextInVision, a large-scale, text- and prompt-complexity-driven benchmark designed to evaluate the ability of diffusion models to effectively integrate visual text into images. We crafted a diverse set of prompts and texts that consider various attributes and text characteristics. Additionally, we prepared an image dataset to test Variational Autoencoder (VAE) models across different character representations, highlighting that VAE architectures can also pose challenges in text generation within diffusion frameworks. Through extensive analysis of multiple models, we identify common errors and highlight issues such as spelling inaccuracies and contextual mismatches. By pinpointing the failure points across different prompts and texts, our research lays the foundation for future advancements in AI-generated multimodal content.
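A minimal sketch of the kind of spelling-accuracy check such a benchmark implies, assuming OCR via the pytesseract library and a normalized edit-distance similarity; the function name and scoring choice are illustrative, not the benchmark's actual protocol:

    # Hedged sketch: score how faithfully a generated image renders a target
    # string. Assumes pytesseract (Tesseract OCR) and Pillow are installed;
    # the OCR engine and normalization used by TextInVision may differ.
    from difflib import SequenceMatcher

    from PIL import Image
    import pytesseract

    def text_render_score(image_path: str, target_text: str) -> float:
        """Return a 0-1 similarity between OCR'd text and the intended text."""
        ocr_text = pytesseract.image_to_string(Image.open(image_path))
        norm = lambda s: " ".join(s.lower().split())
        return SequenceMatcher(None, norm(ocr_text), norm(target_text)).ratio()

    # Example: a perfectly rendered prompt text should score close to 1.0.
    # score = text_render_score("generated.png", "GRAND OPENING SALE")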
Steering Rectified Flow Models in the Vector Field for Controlled Image Generation
Patel, Maitreya, Wen, Song, Metaxas, Dimitris N., Yang, Yezhou
Diffusion models (DMs) excel in photorealism, image editing, and solving inverse problems, aided by classifier-free guidance and image inversion techniques. However, rectified flow models (RFMs) remain underexplored for these tasks. Existing DM-based methods often require additional training, lack generalization to pretrained latent models, underperform, and demand significant computational resources due to extensive backpropagation through ODE solvers and inversion processes. In this work, we first develop a theoretical and empirical understanding of the vector field dynamics of RFMs in efficiently guiding the denoising trajectory. Our findings reveal that we can navigate the vector field in a deterministic and gradient-free manner. Utilizing this property, we propose FlowChef, which leverages the vector field to steer the denoising trajectory for controlled image generation tasks, facilitated by gradient skipping. FlowChef is a unified framework for controlled image generation that, for the first time, simultaneously addresses classifier guidance, linear inverse problems, and image editing without the need for extra training, inversion, or intensive backpropagation. Finally, we perform extensive evaluations and show that FlowChef significantly outperforms baselines in terms of performance, memory, and time requirements, achieving new state-of-the-art results. Project Page: \url{https://flowchef.github.io}.
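A rough sketch of the gradient-skipping idea, assuming a pretrained rectified-flow velocity model velocity(x, t) and a differentiable task cost (e.g., a data-fidelity term for a linear inverse problem); the time convention, step sizes, and guidance scale below are illustrative assumptions, not the paper's exact algorithm:

    # Hedged sketch of vector-field steering with gradient skipping: the model
    # is never backpropagated through; only the cheap clean-sample estimate
    # receives a cost gradient. Assumes the convention x_t = (1-t)*noise +
    # t*data, so the velocity v approximates data - noise.
    import torch

    @torch.no_grad()
    def steered_euler_sampling(velocity, cost_fn, x, steps=50, guidance=1.0):
        ts = torch.linspace(0.0, 1.0, steps + 1)
        for i in range(steps):
            t, dt = ts[i], ts[i + 1] - ts[i]
            v = velocity(x, t)                 # model forward pass only
            x1_hat = x + (1.0 - t) * v         # one-step clean-data estimate
            with torch.enable_grad():          # gradient on the estimate,
                x1_hat = x1_hat.detach().requires_grad_(True)
                grad = torch.autograd.grad(cost_fn(x1_hat), x1_hat)[0]
            x = x + dt * v - guidance * grad   # steered Euler update
        return x

For an inverse problem one might pass cost_fn = lambda x1: ((A(x1) - y) ** 2).sum(), where A and y are the (assumed) forward operator and measurements.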
Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model
Cheng, Sheng, Patel, Maitreya, Yang, Yezhou
Despite advancements in text-to-image models, generating images that precisely align with textual descriptions remains challenging due to misalignment in training data. In this paper, we analyze the critical role of caption precision and recall in text-to-image model training. Our analysis of human-annotated captions shows that both precision and recall are important for text-image alignment, but precision has a more significant impact. Leveraging these insights, we utilize Large Vision Language Models to generate synthetic captions for training. Models trained with these synthetic captions show similar behavior to those trained on human-annotated captions, underscoring the potential of synthetic data for text-to-image training.
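As a hedged illustration of the precision/recall framing only (not the paper's annotation protocol), one can treat a caption as a set of atomic visual facts and compare it against facts annotated for the image:

    # Hedged sketch: caption precision = fraction of stated facts that hold in
    # the image; caption recall = fraction of the image's facts the caption
    # mentions. Fact extraction/annotation is assumed, not implemented.
    def caption_precision_recall(caption_facts: set[str], image_facts: set[str]):
        true_positives = caption_facts & image_facts
        precision = len(true_positives) / len(caption_facts) if caption_facts else 0.0
        recall = len(true_positives) / len(image_facts) if image_facts else 0.0
        return precision, recall

    # Example: a precise but low-recall caption.
    p, r = caption_precision_recall(
        {"a dog", "dog is brown"},
        {"a dog", "dog is brown", "dog on a beach", "sunset"},
    )
    # p == 1.0 (nothing false is stated), r == 0.5 (half the scene described)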
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
Patel, Maitreya, Kusumba, Abhiram, Cheng, Sheng, Kim, Changhoon, Gokhale, Tejas, Baral, Chitta, Yang, Yezhou
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for downstream tasks. However, the lack of compositional diversity in contemporary image-text datasets limits the compositional reasoning ability of CLIP. We show that generating ``hard'' negative captions via in-context learning and synthesizing corresponding negative images with text-to-image generators offers a solution. We introduce a novel contrastive pre-training strategy that leverages these hard negative captions and images in an alternating fashion to train CLIP. We demonstrate that our method, named TripletCLIP, when applied to existing datasets such as CC3M and CC12M, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark on an equal computational budget, as well as improvements in zero-shot image classification and image retrieval. Our code, models, and data are available at: https://tripletclip.github.io
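A minimal sketch of how a synthetic hard-negative caption can be folded into a CLIP-style contrastive loss; the paper alternates hard-negative captions and hard-negative images across training, whereas this sketch shows only the caption-negative side, with embedding extraction and the temperature value as assumptions:

    # Hedged sketch of a CLIP-style loss augmented with synthetic hard
    # negatives; inputs are assumed to be L2-normalized embeddings.
    import torch
    import torch.nn.functional as F

    def triplet_clip_loss(img, txt, neg_txt, temperature=0.07):
        """img, txt, neg_txt: [batch, dim] embeddings; neg_txt holds the
        synthetic hard-negative caption embedding for each image."""
        logits = img @ txt.t() / temperature                  # in-batch pairs
        hard = (img * neg_txt).sum(-1, keepdim=True) / temperature
        logits = torch.cat([logits, hard], dim=1)             # extra negative column
        labels = torch.arange(img.size(0), device=img.device)
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(txt @ img.t() / temperature, labels))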
$\lambda$-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
Patel, Maitreya, Jung, Sangmin, Baral, Chitta, Yang, Yezhou
Despite the recent advances in personalized text-to-image (P-T2I) generative models, subject-driven T2I remains challenging. The primary bottlenecks include 1) intensive training resource requirements, 2) hyper-parameter sensitivity leading to inconsistent outputs, and 3) balancing the intricacies of novel visual concept alignment and composition alignment. We start by reiterating the core philosophy of T2I diffusion models to address the above limitations. Predominantly, contemporary subject-driven T2I approaches hinge on Latent Diffusion Models (LDMs), which facilitate T2I mapping through cross-attention layers. While LDMs offer distinct advantages, P-T2I methods' reliance on the latent space of these diffusion models significantly escalates resource demands, leading to inconsistent results and necessitating numerous iterations for a single desired image. Recently, ECLIPSE has demonstrated a more resource-efficient pathway for training UnCLIP-based T2I models, circumventing the need for diffusion text-to-image priors. Building on this, we introduce $\lambda$-ECLIPSE. Our method illustrates that effective P-T2I does not necessarily depend on the latent space of diffusion models. $\lambda$-ECLIPSE achieves single-subject, multi-subject, and edge-guided T2I personalization with just 34M parameters and is trained in a mere 74 GPU hours using 1.6M image-text interleaved data. Through extensive experiments, we also establish that $\lambda$-ECLIPSE surpasses existing baselines in composition alignment while preserving concept alignment performance, even with significantly lower resource utilization.
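A speculative sketch of what a non-diffusion prior in CLIP latent space could look like: a small network mapping text and subject-image embeddings to a predicted target image embedding. The dimensions, depth, and objective below are illustrative assumptions, not $\lambda$-ECLIPSE's actual architecture:

    # Hedged sketch: a tiny prior that operates entirely in CLIP latent space,
    # avoiding any diffusion latent. Hyperparameters are placeholders.
    import torch
    import torch.nn as nn

    class CLIPLatentPrior(nn.Module):
        def __init__(self, dim=768, hidden=2048):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * dim, hidden), nn.GELU(),
                nn.Linear(hidden, hidden), nn.GELU(),
                nn.Linear(hidden, dim),
            )

        def forward(self, text_emb, subject_emb):
            pred = self.net(torch.cat([text_emb, subject_emb], dim=-1))
            return nn.functional.normalize(pred, dim=-1)  # predicted image embedding

    # Training could regress pred onto the real CLIP image embedding, e.g.
    # loss = 1 - (pred * target_img_emb).sum(-1).mean()   # cosine objective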
ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models
Patel, Maitreya, Gokhale, Tejas, Baral, Chitta, Yang, Yezhou
The ability to understand visual concepts and replicate and compose these concepts from images is a central goal for computer vision. Recent advances in text-to-image (T2I) models have led to high-fidelity, photorealistic image generation by learning from large databases of images and their descriptions. However, the evaluation of T2I models has focused on photorealism and limited qualitative measures of visual understanding. To quantify the ability of T2I models in learning and synthesizing novel visual concepts, we introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts, 5K unique concept compositions, and 33K composite text prompts. Along with the dataset, we propose an evaluation metric, Concept Confidence Deviation (CCD), that uses the confidence of oracle concept classifiers to measure the alignment between concepts generated by T2I generators and concepts contained in ground truth images. We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions. Our human study shows that CCD is highly correlated with human understanding of concepts. Our results point to a trade-off between learning the concepts and preserving the compositionality which existing approaches struggle to overcome.
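One plausible reading of a confidence-deviation metric, sketched under the assumption that the oracle classifier outputs softmax probabilities; the exact CCD definition in the paper may differ:

    # Hedged sketch: compare an oracle concept classifier's confidence on
    # generated images against its confidence on real images of the same
    # concept; larger positive deviation = less concept-faithful generations.
    import torch

    def concept_confidence_deviation(oracle, real_images, generated_images, label):
        """oracle(images) -> [N, num_concepts] softmax probabilities."""
        conf_real = oracle(real_images)[:, label].mean()
        conf_gen = oracle(generated_images)[:, label].mean()
        return (conf_real - conf_gen).item()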
Reasoning about Actions over Visual and Linguistic Modalities: A Survey
Sampat, Shailaja Keyur, Patel, Maitreya, Das, Subhasish, Yang, Yezhou, Baral, Chitta
'Actions' play a vital role in how humans interact with the world and enable them to achieve desired goals. As a result, most common sense (CS) knowledge for humans revolves around actions. While 'Reasoning about Actions & Change' (RAC) has been widely studied in the Knowledge Representation community, it has recently piqued the interest of researchers working at the intersection of vision and language. As pointed out by [Davis and Marcus, 2015], imagine a guest asks a robot for a glass of wine; if the robot sees that the glass is broken or has a dead cockroach inside, it should not pour the wine and serve it. Similarly, if a cat runs in front of a house-cleaning robot, the robot should neither run it over nor sweep it up nor put it away on a shelf. Hence, the ability of artificial agents to perform reasoning and integrate CS knowledge about actions is highly desirable.