Collaborating Authors

 Susladkar, Onkar


D2Styler: Advancing Arbitrary Style Transfer with Discrete Diffusion Methods

arXiv.org Artificial Intelligence

In image processing, one of the most challenging tasks is to render an image's semantic meaning using a variety of artistic approaches. Existing techniques for arbitrary style transfer (AST) frequently experience mode collapse, over-stylization, or under-stylization due to a disparity between the style and content images. We propose a novel framework called D$^2$Styler (Discrete Diffusion Styler) that leverages the discrete representational capability of VQ-GANs and the advantages of discrete diffusion, including stable training and avoidance of mode collapse. Our method uses Adaptive Instance Normalization (AdaIN) features as a context guide for the reverse diffusion process. This makes it easy to move features from the style image to the content image without bias. The proposed method substantially enhances the visual quality of style-transferred images, allowing the combination of content and style in a visually appealing manner. We take style images from the WikiArt dataset and content images from the COCO dataset. Experimental results demonstrate that D$^2$Styler produces high-quality style-transferred images and outperforms twelve existing methods on nearly all metrics. The qualitative results and ablation studies provide further insights into the efficacy of our technique. The code is available at https://github.com/Onkarsus13/D2Styler.
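For readers unfamiliar with AdaIN, the operation named in the abstract can be sketched as follows. This is the standard formulation, AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y), with statistics taken per channel over the spatial dimensions. It is a minimal NumPy illustration only: the function name `adain`, the `(C, H, W)` layout, and the `eps` value are assumptions, and the abstract does not say how D$^2$Styler feeds these features into the reverse diffusion process.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Re-normalize content features to carry the style's channel statistics.

    content, style: feature maps of shape (C, H, W); the spatial sizes
    may differ, but the channel count must match (assumed layout).
    """
    # Per-channel mean and standard deviation over the spatial axes.
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps  # eps avoids division by zero
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    # Whiten the content statistics, then apply the style statistics.
    return s_std * (content - c_mean) / c_std + s_mean
```

The output keeps the spatial layout of the content features while its per-channel mean and standard deviation match those of the style features.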


Ethical Framework for Responsible Foundational Models in Medical Imaging

arXiv.org Artificial Intelligence

Foundational models (FMs) have tremendous potential to revolutionize medical imaging. However, their deployment in real-world clinical settings demands extensive ethical considerations. This paper aims to highlight the ethical concerns related to FMs and propose a framework to guide their responsible development and implementation within medicine. We meticulously examine ethical issues such as privacy of patient data, bias mitigation, algorithmic transparency, explainability and accountability. The proposed framework is designed to prioritize patient welfare, mitigate potential risks, and foster trust in AI-assisted healthcare.


MOVES: Movable and Moving LiDAR Scene Segmentation in Label-Free settings using Static Reconstruction

arXiv.org Artificial Intelligence

Accurate static structure reconstruction and segmentation of non-stationary objects are of vital importance for autonomous navigation applications. These applications assume that a LiDAR scan consists only of static structures. In the real world, however, LiDAR scans also contain non-stationary dynamic structures: moving and movable objects. Current solutions use segmentation information to isolate and remove moving structures from the LiDAR scan. This strategy fails in several important use cases where segmentation information is not available. In such scenarios, moving objects and objects with high uncertainty in their motion, i.e., movable objects, may escape detection. This violates the above assumption. We present MOVES, a novel GAN-based adversarial model that segments out moving as well as movable objects in the absence of segmentation information. We achieve this by accurately transforming a dynamic LiDAR scan to its corresponding static scan. This is obtained by replacing dynamic objects and corresponding occlusions with the static structures that were occluded by dynamic objects. We leverage corresponding static-dynamic LiDAR pairs.
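One way to picture how a reconstructed static scan can yield a segmentation without labels is to compare it with the original dynamic scan and flag the points that changed. The sketch below is an illustrative assumption, not the procedure described by the MOVES authors: the function name `dynamic_mask`, the range-image representation (one depth value in metres per beam), and the `threshold` value are all hypothetical.

```python
import numpy as np

def dynamic_mask(dynamic_scan, static_scan, threshold=0.5):
    """Flag points whose range differs between the two scans.

    dynamic_scan, static_scan: range images of identical shape, in metres
    (assumed representation). threshold is a hypothetical tolerance.
    """
    # A large range difference means the reconstruction replaced this
    # point, so it likely belonged to a moving or movable object.
    diff = np.abs(dynamic_scan - static_scan)
    return diff > threshold

# Toy example: a wall at 10 m, with an object at 4 m covering four beams.
static = np.full((4, 6), 10.0)
dynamic = static.copy()
dynamic[1:3, 2:4] = 4.0
mask = dynamic_mask(dynamic, static)
```

In this toy case the mask is true only for the four beams that hit the object.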


Towards Scene-Text to Scene-Text Translation

arXiv.org Artificial Intelligence

In this work, we study the task of "visually" translating scene text from a source language (e.g., English) to a target language (e.g., Chinese). Visual translation involves not just the recognition and translation of scene text but also the generation of the translated image that preserves visual features of the text, such as font, size, and background. There are several challenges associated with this task, such as interpolating font to unseen characters and preserving text size and the background. To address these, we introduce VTNet, a novel conditional diffusion-based method. To train VTNet, we create a synthetic cross-lingual dataset of 600K samples of scene text images in six popular languages, including English, Hindi, Tamil, Chinese, Bengali, and German. We evaluate the performance of VTNet through extensive experiments and comparisons to related methods. Our model also surpasses the previous state-of-the-art results on the conventional scene-text editing benchmarks. Further, we present rigorous qualitative studies to understand the strengths and shortcomings of our model. Results show that our approach generalizes well to unseen words and fonts. We firmly believe our work can benefit real-world applications, such as text translation using a phone camera and translating educational materials. Code and data will be made publicly available.