KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models
Navard, Pouyan, Monsefi, Amin Karimi, Zhou, Mengxi, Chao, Wei-Lun, Yilmaz, Alper, Ramnath, Rajiv
Recent advances in diffusion models have significantly improved text-to-image (T2I) generation, but they often struggle to balance fine-grained precision with high-level control. Methods like ControlNet and T2I-Adapter excel at following sketches by seasoned artists but tend to be overly rigid, replicating unintentional flaws in sketches from novice users. Meanwhile, coarse-grained methods, such as sketch-based abstraction frameworks, offer more accessible input handling but lack the precise control needed for detailed, professional use. To address these limitations, we propose KnobGen, a dual-pathway framework that democratizes sketch-based image generation by seamlessly adapting to varying levels of sketch complexity and user skill. KnobGen uses a Coarse-Grained Controller (CGC) module for high-level semantics and a Fine-Grained Controller (FGC) module for detailed refinement. The relative strength of these two modules can be adjusted through our knob inference mechanism to align with the user's specific needs. These mechanisms ensure that KnobGen can flexibly generate images from both novice sketches and those drawn by seasoned artists. This maintains control over the final output while preserving the natural appearance of the image, as evidenced on the MultiGen-20M dataset and a newly collected sketch dataset.
- North America > United States > Ohio (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
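As a concrete illustration of the knob mechanism the KnobGen abstract describes, the sketch below blends a coarse-grained and a fine-grained conditioning pathway with a single user-set scalar. The module stubs and the linear blending rule are assumptions for illustration, not KnobGen's actual implementation.

```python
# Minimal sketch of a knob-style blend between two conditioning pathways.
# Illustrative only: module stubs and the blending rule are assumptions.
import torch
import torch.nn as nn

class KnobConditioner(nn.Module):
    def __init__(self, cgc: nn.Module, fgc: nn.Module):
        super().__init__()
        self.cgc = cgc  # coarse pathway: high-level semantics
        self.fgc = fgc  # fine pathway: detailed refinement

    def forward(self, sketch: torch.Tensor, knob: float) -> torch.Tensor:
        """knob in [0, 1]: 0 -> trust coarse semantics (novice sketch),
        1 -> follow fine detail strictly (professional sketch)."""
        coarse = self.cgc(sketch)
        fine = self.fgc(sketch)
        return (1.0 - knob) * coarse + knob * fine

# toy usage: both pathways stand in as small conv nets
pathway = lambda: nn.Conv2d(1, 8, kernel_size=3, padding=1)
cond = KnobConditioner(pathway(), pathway())
feat = cond(torch.randn(1, 1, 64, 64), knob=0.3)
print(feat.shape)  # torch.Size([1, 8, 64, 64])
```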
VisioBlend: Sketch and Stroke-Guided Denoising Diffusion Probabilistic Model for Realistic Image Generation
Devmurari, Harshkumar, Kuckian, Gautham, Vishwakarma, Prajjwal, Vartak, Krunali
Generating images from hand-drawn sketches is a crucial and fundamental task in content creation. The translation is challenging due to the infinite possibilities and the diverse expectations of users, and traditional methods are often limited by the availability of training data. Therefore, VisioBlend, a unified diffusion-based framework supporting three-dimensional control over image synthesis from sketches and strokes, is proposed. It enables users to decide the level of faithfulness to the input strokes and sketches. VisioBlend achieves state-of-the-art performance in terms of realism and flexibility, enabling various applications in image synthesis from sketches and strokes. It solves the problem of data availability by synthesizing new data points from hand-drawn sketches and strokes, enriching the dataset and enabling more robust and diverse image synthesis. This work showcases the power of diffusion models in image creation, offering a user-friendly and versatile approach for turning artistic visions into reality.
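One common way to realize the faithfulness control the VisioBlend abstract describes is partial noising: the guide image is noised up to an intermediate timestep and denoising starts from there. The sketch below is a minimal illustration of that idea under a standard DDPM schedule; the start-step rule is an assumption, not VisioBlend's actual procedure.

```python
# Faithfulness control via partial noising (illustrative assumption).
import torch

def noised_start(guide: torch.Tensor, faithfulness: float, num_steps: int = 1000):
    """High faithfulness -> little injected noise (stay close to the guide);
    low faithfulness -> start near pure noise (more freedom)."""
    t0 = int((1.0 - faithfulness) * (num_steps - 1))
    betas = torch.linspace(1e-4, 0.02, num_steps)   # simple DDPM-style schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t0]
    noise = torch.randn_like(guide)
    x_t0 = alpha_bar.sqrt() * guide + (1.0 - alpha_bar).sqrt() * noise
    return x_t0, t0  # denoise from t0 down to 0 with any trained model

x, t = noised_start(torch.randn(1, 3, 64, 64), faithfulness=0.7)
print(x.shape, t)
```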
U-Sketch: An Efficient Approach for Sketch to Image Diffusion Models
Mitsouras, Ilias, Tsonis, Eleftherios, Tzouveli, Paraskevi, Voulodimos, Athanasios
Diffusion models have demonstrated remarkable performance in text-to-image synthesis, producing realistic and high-resolution images that faithfully adhere to the corresponding text prompts. Despite their great success, they still fall behind in sketch-to-image synthesis tasks, where, in addition to text prompts, the spatial layout of the generated images has to closely follow the outlines of certain reference sketches. Employing an MLP latent edge predictor to guide the spatial layout of the synthesized image by predicting edge maps at each denoising step has recently been proposed. Despite yielding promising results, the pixel-wise operation of the MLP does not take the spatial layout into account as a whole, and it demands numerous denoising iterations to produce satisfactory images, leading to time inefficiency. To this end, we introduce U-Sketch, a framework featuring a U-Net type latent edge predictor capable of efficiently capturing both local and global features, as well as spatial correlations between pixels. Moreover, we propose the addition of a sketch simplification network that offers the user the choice of preprocessing and simplifying input sketches for enhanced outputs. The experimental results, corroborated by user feedback, demonstrate that our proposed U-Net latent edge predictor leads to more realistic results that are better aligned with the spatial outlines of the reference sketches, while drastically reducing the number of required denoising steps and, consequently, the overall execution time.
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.75)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
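A minimal sketch of edge-map guidance as the U-Sketch abstract outlines it: a latent edge predictor scores the current latent against the reference sketch's edges, and the gradient of that loss nudges the latent at each denoising step. The stand-in convolutional predictor, the MSE loss, and the guidance scale are illustrative assumptions; the paper's predictor is a U-Net.

```python
# Edge-guided denoising step (hedged sketch; predictor is a stand-in stub).
import torch
import torch.nn.functional as F

def edge_guidance_step(latent, ref_edges, edge_predictor, scale=1.0):
    latent = latent.detach().requires_grad_(True)
    pred_edges = edge_predictor(latent)          # predict edges from latent
    loss = F.mse_loss(pred_edges, ref_edges)     # match the reference outline
    grad = torch.autograd.grad(loss, latent)[0]
    return latent.detach() - scale * grad        # steer latent toward the sketch

# stand-in predictor: a single conv; the real one is a small U-Net
predictor = torch.nn.Conv2d(4, 1, kernel_size=3, padding=1)
z = edge_guidance_step(torch.randn(1, 4, 32, 32),
                       torch.randn(1, 1, 32, 32), predictor, scale=0.5)
print(z.shape)  # torch.Size([1, 4, 32, 32])
```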
Breathing Life Into Sketches Using Text-to-Video Priors
Gal, Rinon, Vinker, Yael, Alaluf, Yuval, Bermano, Amit H., Cohen-Or, Daniel, Shamir, Ariel, Chechik, Gal
A sketch is one of the most intuitive and versatile tools humans use to convey their ideas visually. An animated sketch opens another dimension to the expression of ideas and is widely used by designers for a variety of purposes. Animating sketches is a laborious process, requiring extensive experience and professional design skills. In this work, we present a method that automatically adds motion to a single-subject sketch (hence, "breathing life into it"), merely by providing a text prompt indicating the desired motion. The output is a short animation provided in vector representation, which can be easily edited. Our method does not require extensive training, but instead leverages the motion prior of a large pretrained text-to-video diffusion model using a score-distillation loss to guide the placement of strokes. To promote natural and smooth motion and to better preserve the sketch's appearance, we model the learned motion through two components. The first governs small local deformations and the second controls global affine transformations. Surprisingly, we find that even models that struggle to generate sketch videos on their own can still serve as a useful backbone for animating abstract representations.
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- Asia > Indonesia > Sulawesi (0.04)
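The two-component motion model in the abstract above can be pictured as a small learned displacement field composed with a per-frame affine transform over the sketch's stroke control points. The sketch below shows that decomposition only; parameter shapes are assumptions, and the actual method optimizes these parameters with a score-distillation loss from a pretrained text-to-video model.

```python
# Local-deformation + global-affine motion decomposition (illustrative shapes).
import torch

def animate_points(points, local_offsets, affine):
    """points: (F, N, 2) stroke control points per frame
    local_offsets: (F, N, 2) learned small deformations
    affine: (F, 2, 3) learned per-frame affine (rotation/scale/translation)."""
    ones = torch.ones(*points.shape[:2], 1)
    homog = torch.cat([points + local_offsets, ones], dim=-1)  # (F, N, 3)
    return torch.einsum('fij,fnj->fni', affine, homog)         # (F, N, 2)

F_, N = 8, 16
base = torch.randn(1, N, 2).repeat(F_, 1, 1)        # static sketch points
local = 0.01 * torch.randn(F_, N, 2, requires_grad=True)
aff = torch.eye(2, 3).repeat(F_, 1, 1).requires_grad_(True)
frames = animate_points(base, local, aff)           # would feed an SDS loss
print(frames.shape)  # torch.Size([8, 16, 2])
```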
SENS: Sketch-based Implicit Neural Shape Modeling
Binninger, Alexandre, Hertz, Amir, Sorkine-Hornung, Olga, Cohen-Or, Daniel, Giryes, Raja
We present SENS, a novel method for generating and editing 3D models from hand-drawn sketches, including those of an abstract nature. Our method allows users to quickly and easily sketch a shape, and then maps the sketch into the latent space of a part-aware neural implicit shape architecture. SENS analyzes the sketch and encodes its parts into a ViT patch encoding, then feeds them into a transformer decoder that converts them into shape embeddings suitable for editing 3D neural implicit shapes. SENS not only provides intuitive sketch-based generation and editing, but also excels in capturing the intent of the user's sketch to generate a variety of novel and expressive 3D shapes, even from abstract sketches. We demonstrate the effectiveness of our model compared to the state of the art using objective metric evaluation criteria and a decisive user study, both indicating strong performance on sketches with a medium level of abstraction. Furthermore, we showcase its intuitive sketch-based shape editing capabilities.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- North America > United States > New York > New York County > New York City (0.04)
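A minimal sketch of the encode-then-decode pipeline the SENS abstract describes: sketch patches pass through a ViT-style encoder, and a transformer decoder turns learned part queries into shape embeddings. All sizes, the query mechanism, and the use of PyTorch's stock transformer layers are illustrative assumptions, not the SENS architecture.

```python
# ViT-style patch encoder + transformer decoder -> part embeddings (assumed sizes).
import torch
import torch.nn as nn

class SketchToShapeEmbeddings(nn.Module):
    def __init__(self, d=256, num_parts=8, patch=16):
        super().__init__()
        self.patchify = nn.Conv2d(1, d, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        dec_layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.part_queries = nn.Parameter(torch.randn(num_parts, d))

    def forward(self, sketch):
        tokens = self.patchify(sketch).flatten(2).transpose(1, 2)  # (B, P, d)
        memory = self.encoder(tokens)
        queries = self.part_queries.expand(sketch.size(0), -1, -1)
        return self.decoder(queries, memory)  # (B, num_parts, d) embeddings

model = SketchToShapeEmbeddings()
emb = model(torch.randn(2, 1, 128, 128))
print(emb.shape)  # torch.Size([2, 8, 256])
```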
DiffSketching: Sketch Control Image Synthesis with Diffusion Models
Wang, Qiang, Kong, Di, Lin, Fengyin, Qi, Yonggang
Creative sketching is a universal way of visual expression, but translating an abstract sketch into a realistic image is very challenging. Traditionally, a deep learning model for sketch-to-image synthesis needs to overcome distorted input sketches that lack visual detail, and requires large-scale sketch-image datasets. We study this task using diffusion models. Our model matches sketches through cross-domain constraints and uses a classifier to guide the image synthesis more accurately. Extensive experiments confirm that our method is not only faithful to the user's input sketches, but also maintains the diversity and imagination of the synthesized results. Our model beats GAN-based methods in terms of generation quality and human evaluation, and does not rely on massive sketch-image datasets. Additionally, we present applications of our method in image editing and interpolation.
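The classifier guidance mentioned in the abstract is commonly implemented by shifting the predicted sampling mean along the gradient of the classifier's log-probability (as in Dhariwal and Nichol's formulation). The sketch below shows that generic mechanism with a stub classifier; it is not DiffSketching's exact formulation and omits its cross-domain constraints.

```python
# Generic classifier-guided mean shift (illustrative stand-in, stub classifier).
import torch
import torch.nn.functional as F

def classifier_guided_mean(mean, variance, x_t, classifier, target, scale=2.0):
    x_in = x_t.detach().requires_grad_(True)
    logits = classifier(x_in)                          # class scores for x_t
    idx = torch.arange(len(target))
    log_prob = F.log_softmax(logits, dim=-1)[idx, target]
    grad = torch.autograd.grad(log_prob.sum(), x_in)[0]
    return mean + scale * variance * grad              # shifted sampling mean

clf = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.randn(2, 3, 32, 32)
mu = classifier_guided_mean(torch.zeros_like(x), 0.1, x,
                            classifier=clf, target=torch.tensor([3, 7]))
print(mu.shape)  # torch.Size([2, 3, 32, 32])
```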
DiffFaceSketch: High-Fidelity Face Image Synthesis with Sketch-Guided Latent Diffusion Model
Peng, Yichen, Zhao, Chunqi, Xie, Haoran, Fukusato, Tsukasa, Miyata, Kazunori
Synthesizing face images from monochrome sketches is one of the most fundamental tasks in the field of image-to-image translation. However, it is still challenging to (1) make models learn high-dimensional face features such as geometry and color, and (2) take into account the characteristics of input sketches. Existing methods often use sketches as indirect (or auxiliary) inputs to guide the models, resulting in the loss of sketch features or the alteration of geometry information. In this paper, we introduce a Sketch-Guided Latent Diffusion Model (SGLDM), an LDM-based network architecture trained on a paired sketch-face dataset. We apply a Multi-Auto-Encoder (AE) to encode input sketches of different face regions from pixel space into a feature map in latent space, which enables us to reduce the dimension of the sketch input while preserving the geometry-related information of local face details. We build a sketch-face paired dataset based on an existing method that extracts edge maps from images. We then introduce Stochastic Region Abstraction (SRA), an approach that augments our dataset to improve the robustness of SGLDM to sketch inputs with arbitrary levels of abstraction. The evaluation study shows that SGLDM can synthesize high-quality face images with different expressions, facial accessories, and hairstyles from sketches at various abstraction levels.
- North America > United States > Washington > King County > Seattle (0.05)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.05)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
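A minimal sketch of the Multi-Auto-Encoder idea from the abstract: separate lightweight encoders compress sketch crops of different face regions, and their codes are written back into a single latent feature map that preserves spatial layout. The region boxes, sizes, and encoder design here are assumptions for illustration.

```python
# Region-wise sketch encoders composed into one latent map (assumed layout).
import torch
import torch.nn as nn

REGIONS = {"eyes": (0, 0, 32, 64), "nose": (24, 16, 56, 48),
           "mouth": (48, 16, 80, 48)}  # (top, left, bottom, right), assumed

class MultiAESketchEncoder(nn.Module):
    def __init__(self, ch=4):
        super().__init__()
        # one lightweight encoder per region, 4x spatial downsampling
        self.encoders = nn.ModuleDict({
            name: nn.Conv2d(1, ch, kernel_size=4, stride=4) for name in REGIONS
        })
        self.ch = ch

    def forward(self, sketch):  # sketch: (B, 1, 96, 96)
        latent = torch.zeros(sketch.size(0), self.ch, 24, 24)  # downsampled canvas
        for name, (t, l, b, r) in REGIONS.items():
            crop = sketch[:, :, t:b, l:r]
            latent[:, :, t // 4:b // 4, l // 4:r // 4] = self.encoders[name](crop)
        return latent  # geometry-preserving sketch condition for the LDM

enc = MultiAESketchEncoder()
z = enc(torch.randn(2, 1, 96, 96))
print(z.shape)  # torch.Size([2, 4, 24, 24])
```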
MaskSketch: Unpaired Structure-guided Masked Image Generation
Bashkirova, Dina, Lezama, Jose, Sohn, Kihyuk, Saenko, Kate, Essa, Irfan
Recent conditional image generation methods produce images of remarkable diversity, fidelity and realism. However, the majority of these methods allow conditioning only on labels or text prompts, which limits their level of control over the generation result. In this paper, we introduce MaskSketch, an image generation method that allows spatial conditioning of the generation result using a guiding sketch as an extra conditioning signal during sampling. MaskSketch utilizes a pre-trained masked generative transformer, requiring no model training or paired supervision, and works with input sketches of different levels of abstraction. We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image, such as scene layout and object shape, and we propose a novel sampling method based on this observation to enable structure-guided generation. Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure. Evaluated on standard benchmark datasets, MaskSketch outperforms state-of-the-art methods for sketch-to-image translation, as well as unpaired image-to-image translation approaches.
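The structural signal the MaskSketch abstract points to can be pictured as a distance between self-attention maps: candidates whose attention maps sit closest to the sketch's are structurally most similar. The sketch below shows that comparison in isolation, as a simplified stand-in for the paper's actual sampling procedure.

```python
# Structure distance between self-attention maps (simplified stand-in).
import torch

def structure_distance(attn_a, attn_b):
    """attn_*: (L, H, N, N) self-attention maps over L layers, H heads,
    N tokens. Lower distance = more similar spatial structure."""
    return (attn_a - attn_b).abs().mean()

def pick_most_structured(sketch_attn, candidate_attns):
    dists = torch.stack([structure_distance(sketch_attn, c)
                         for c in candidate_attns])
    return int(dists.argmin())  # index of the best-matching candidate

L, H, N = 2, 4, 16
sketch_attn = torch.softmax(torch.randn(L, H, N, N), dim=-1)
cands = [torch.softmax(torch.randn(L, H, N, N), dim=-1) for _ in range(5)]
print(pick_most_structured(sketch_attn, cands))
```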
TreeSketchNet: From Sketch To 3D Tree Parameters Generation
Manfredi, Gilda, Capece, Nicola, Erra, Ugo, Gruosso, Monica
3D modelling of non-linear objects from stylized sketches is a challenge even for experts in Computer Graphics (CG). Extrapolating object parameters from a stylized sketch is a very complex and cumbersome task. In the present study, we propose a broker system that mediates between the modeler and the 3D modelling software and can transform a stylized sketch of a tree into a complete 3D model. The input sketches do not need to be accurate or detailed; they only need to represent a rudimentary outline of the tree that the modeler wishes to 3D-model. Our approach is based on a well-defined, convolution-based Deep Neural Network (DNN) architecture, which we call TreeSketchNet (TSN), able to generate Weber-Penn parameters that can be interpreted by the modelling software to generate a 3D model of a tree starting from a simple sketch. The training dataset consists of Synthetically-Generated (SG) sketches that are associated with Weber-Penn parameters generated by a dedicated Blender modelling software add-on. The accuracy of the proposed method is demonstrated by testing the TSN with both synthetic and hand-made sketches. Finally, we provide a qualitative analysis of our results by evaluating the coherence of the predicted parameters with several distinguishing features.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
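At a high level, the abstract describes a convolutional network regressing a sketch to a vector of Weber-Penn parameters that the modelling software then consumes. The minimal regressor below illustrates that mapping; the layer sizes and parameter count are assumptions, and neither the real TSN nor the Blender add-on is reproduced here.

```python
# Sketch -> Weber-Penn parameter vector, as a minimal CNN regressor (assumed sizes).
import torch
import torch.nn as nn

class SketchToTreeParams(nn.Module):
    def __init__(self, num_params=24):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, num_params)  # Weber-Penn parameter vector

    def forward(self, sketch):
        return self.head(self.features(sketch))

net = SketchToTreeParams()
params = net(torch.randn(1, 1, 256, 256))
print(params.shape)  # torch.Size([1, 24]) -> handed to the modelling software
```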
A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
Sangkloy, Patsorn, Jitkrittum, Wittawat, Yang, Diyi, Hays, James
We address the problem of retrieving images with both a sketch and a text query. We present TASK-former (Text And SKetch transformer), an end-to-end trainable model for image retrieval using a text description and a sketch as input. We argue that both input modalities complement each other in a manner that cannot be achieved easily by either one alone. TASK-former follows the late-fusion dual-encoder approach, similar to CLIP, which allows efficient and scalable retrieval since the retrieval set can be indexed independently of the queries. We empirically demonstrate that using an input sketch (even a poorly drawn one) in addition to text considerably increases retrieval recall compared to traditional text-based image retrieval. To evaluate our approach, we collect 5,000 hand-drawn sketches for images in the test set of the COCO dataset. The collected sketches are available at https://janesjanes.github.io/tsbir/.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > Taiwan > Taiwan Province > Taipei (0.04)
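A minimal sketch of late-fusion retrieval in the spirit of the TASK-former abstract: text and sketch embeddings are fused into one query and scored against a pre-indexed gallery of image embeddings, which is what makes retrieval efficient and scalable. The additive fusion rule and the random stand-in embeddings are assumptions, not TASK-former's architecture.

```python
# Late-fusion dual-encoder retrieval (fusion rule and embeddings are stand-ins).
import torch
import torch.nn.functional as F

def retrieve(text_emb, sketch_emb, gallery):
    """text_emb, sketch_emb: (d,); gallery: (M, d) precomputed image embeddings."""
    query = F.normalize(text_emb + sketch_emb, dim=-1)   # late fusion of modalities
    gallery = F.normalize(gallery, dim=-1)
    scores = gallery @ query                             # cosine similarity
    return scores.argsort(descending=True)               # ranked image indices

d, M = 128, 1000
ranking = retrieve(torch.randn(d), torch.randn(d), torch.randn(M, d))
print(ranking[:5])  # top-5 retrieved images
```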