An Inpainting-Infused Pipeline for Attire and Background Replacement

Perche-Mahlow, Felipe Rodrigues; Felipe-Zanella, André; Cruz-Castañeda, William Alberto; Amadeus, Marcellus

arXiv.org Artificial Intelligence 

Rapid advances in Generative Artificial Intelligence (GenAI) have transformed how we approach complex tasks across modalities such as text, audio, video, and image generation. GenAI, as a broad category, excels at creating synthetic data that closely mimics real-world phenomena, which makes it useful in a wide range of creative applications. In text generation, models like OpenAI's GPT (Generative Pre-trained Transformer) [OpenAI, 2023] are changing how society writes. Trained on massive corpora of text, these models show an impressive ability to understand context, generate coherent paragraphs, and complete sentences consistently [Roumeliotis and Tselikas, 2023]. Their ability to produce fluent, relevant text has led to established applications in natural language processing, content creation, and automated writing [Huang and Tan, 2023]. Audio generation models, exemplified by Tacotron [Wang et al., 2017] and WaveNet [Oord et al., 2016], have significantly advanced our ability to synthesize realistic speech. These models use deep neural networks to capture the intricacies of human speech, producing natural-sounding voices and musical compositions with nuanced variations in tone, pitch, and rhythm [Ning et al., 2019]. Image generation, the focus of our discussion, has seen the evolution of models such as DALL-E [Betker et al., 2023; Ramesh et al., 2021], MidJourney [mid, 2022], and Stable Diffusion [Rombach et al., 2022], which can generate diverse and intricate images from textual prompts.