Gupta, Kamal
Measuring Style Similarity in Diffusion Models
Somepalli, Gowthami, Gupta, Anubhav, Gupta, Kamal, Palta, Shramay, Goldblum, Micah, Geiping, Jonas, Shrivastava, Abhinav, Goldstein, Tom
Generative models are now widely used by graphic designers and artists. Prior works have shown that these models remember and often replicate content from their training data during generation. Hence, as their use proliferates, it has become important, before a generated image is used for professional purposes, to perform a database search and determine whether the image's properties are attributable to specific training data. Existing tools for this purpose focus on retrieving images with similar semantic content, whereas many artists are concerned with style replication in text-to-image models. We present a framework for understanding and extracting style descriptors from images. Our framework comprises a new dataset curated using the insight that style is a subjective property of an image that captures complex yet meaningful interactions of factors such as colors, textures, and shapes, together with a method for extracting style descriptors that can attribute the style of a generated image to the images used in the training dataset of a text-to-image model. We showcase promising results on various style retrieval tasks, and quantitatively and qualitatively analyze style attribution and matching in the Stable Diffusion model. Code and artifacts are available at https://github.com/learn2phoenix/CSD.
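As a rough illustration of the retrieval setting described above, the sketch below searches a gallery of precomputed style descriptors for the nearest matches to a generated image's descriptor using cosine similarity. The descriptor extractor itself (e.g., the released CSD model) is assumed to exist and is not reproduced here; the array sizes and random inputs are placeholders.

```python
# Minimal sketch of style-based retrieval over precomputed style descriptors.
# The style encoder is assumed to exist; only the similarity search is shown.
import numpy as np

def retrieve_by_style(query_desc: np.ndarray, gallery_descs: np.ndarray, k: int = 5):
    """Return indices of the k gallery images whose style descriptors are
    closest (by cosine similarity) to the query descriptor."""
    q = query_desc / np.linalg.norm(query_desc)
    g = gallery_descs / np.linalg.norm(gallery_descs, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity to every gallery image
    top_k = np.argsort(-sims)[:k]     # highest similarity first
    return top_k, sims[top_k]

# Toy usage with random stand-in descriptors (real ones would come from the
# style encoder applied to generated and training images).
rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 512))   # 1000 training images, 512-d descriptors
query = rng.normal(size=512)             # descriptor of a generated image
idx, scores = retrieve_by_style(query, gallery)
print(idx, scores)
```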
Toward Scalable Visual Servoing Using Deep Reinforcement Learning and Optimal Control
Asayesh, Salar, Darani, Hossein Sheikhi, Chen, Mo, Mehrandezh, Mehran, Gupta, Kamal
Classical pixel-based Visual Servoing (VS) approaches offer high accuracy but suffer from a limited convergence area due to the nonlinearity of the underlying optimization. Modern deep learning-based VS methods overcome traditional vision issues but lack scalability, as they must be trained on a limited set of scenes. This paper proposes a hybrid VS strategy that combines Deep Reinforcement Learning (DRL) with optimal control to enhance both the convergence area and scalability. The DRL component of our approach handles representation and policy learning separately, which improves scalability, generalizability, and learning efficiency and eases domain adaptation, while the optimal-control component ensures high end-point accuracy. Our method achieves high convergence rates and minimal end-positioning errors on a 7-DOF manipulator and scales across more than 1000 distinct scenes. We further demonstrate its capacity to generalize to previously unseen datasets and illustrate its real-world applicability, highlighting its adaptability through single-shot domain transfer learning in environments with noise and occlusions. Real-robot experiments can be found at https://sites.google.com/view/vsls.
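The sketch below illustrates one generic way such a hybrid scheme can be wired together: a learned policy drives the system when the visual error is large, and a model-based controller takes over near the goal for fine positioning. The switching rule and the placeholder policy and controller are assumptions for illustration, not the paper's architecture.

```python
# Illustrative sketch of a hybrid visual-servoing loop that hands control from a
# learned (DRL) policy to a model-based controller near the goal. The switching
# threshold, the stand-in policy, and the stand-in controller are placeholders.
import numpy as np

def drl_policy(feature_error: np.ndarray) -> np.ndarray:
    # Stand-in for a trained policy network: move roughly against the error.
    return -0.5 * feature_error

def optimal_controller(feature_error: np.ndarray) -> np.ndarray:
    # Stand-in for the model-based stage used for fine positioning.
    return -1.0 * feature_error

def hybrid_step(feature_error: np.ndarray, switch_threshold: float = 0.05) -> np.ndarray:
    """Use the DRL policy when far from the target (large convergence area),
    then switch to the model-based controller for high end-point accuracy."""
    if np.linalg.norm(feature_error) > switch_threshold:
        return drl_policy(feature_error)
    return optimal_controller(feature_error)

# Toy rollout: drive a feature error toward zero with first-order integration.
err = np.array([0.4, -0.3, 0.2])
for t in range(50):
    err = err + 0.1 * hybrid_step(err)
print(err)  # should be close to zero
```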
SHACIRA: Scalable HAsh-grid Compression for Implicit Neural Representations
Girish, Sharath, Shrivastava, Abhinav, Gupta, Kamal
Implicit Neural Representations (INRs), or neural fields, have emerged as a popular framework for encoding multimedia signals such as images and radiance fields while retaining high quality. Recently, the learnable feature grids proposed by Instant-NGP have allowed significant speed-ups in both training and sampling of INRs by replacing a large neural network with a multi-resolution look-up table of feature vectors and a much smaller neural network. However, these feature grids come at the expense of large memory consumption, which can be a bottleneck for storage and streaming applications. In this work, we propose SHACIRA, a simple yet effective, task-agnostic framework for compressing such feature grids with no additional post-hoc pruning or quantization stages. We reparameterize feature grids with quantized latent weights and apply entropy regularization in the latent space to achieve high levels of compression across various domains. Quantitative and qualitative results on diverse datasets consisting of images, videos, and radiance fields show that our approach outperforms existing INR approaches without the need for any large datasets or domain-specific heuristics. Our project page is available at http://shacira.github.io.
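The PyTorch sketch below illustrates the two ingredients named above, straight-through quantization of latent grid weights and an entropy-style regularizer on the quantized values, on a toy signal. The simple Laplacian-style rate proxy and the tiny decoder are stand-ins, not the paper's entropy model or architecture.

```python
# Minimal sketch: quantized latent feature grid + entropy-style regularizer.
import torch
import torch.nn as nn

class QuantizedLatentGrid(nn.Module):
    def __init__(self, num_entries: int, dim: int):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_entries, dim) * 0.1)

    def forward(self):
        # Straight-through rounding: the forward pass uses integer values,
        # gradients flow through as if no rounding happened.
        q = torch.round(self.latents)
        return self.latents + (q - self.latents).detach()

def entropy_regularizer(q: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    # Crude differentiable proxy for code length: under a zero-mean Laplacian
    # prior, expected bits grow with |q| / scale.
    return (q.abs() / scale + 1.0).log2().mean()

grid = QuantizedLatentGrid(num_entries=4096, dim=16)
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 3))
opt = torch.optim.Adam(list(grid.parameters()) + list(decoder.parameters()), lr=1e-3)

target = torch.rand(4096, 3)              # stand-in signal values to fit
for step in range(100):
    q = grid()
    recon = decoder(q)
    loss = nn.functional.mse_loss(recon, target) + 1e-3 * entropy_regularizer(q)
    opt.zero_grad()
    loss.backward()
    opt.step()
```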
Teaching Matters: Investigating the Role of Supervision in Vision Transformers
Walmer, Matthew, Suri, Saksham, Gupta, Kamal, Shrivastava, Abhinav
Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, their behavior under different learning paradigms is not well explored. We compare ViTs trained with different methods of supervision and show that they learn a diverse range of behaviors in terms of their attention, representations, and downstream performance. We also discover ViT behaviors that are consistent across supervision methods, including the emergence of Offset Local Attention Heads: self-attention heads that attend to a token adjacent to the current token at a fixed directional offset, a phenomenon that, to the best of our knowledge, has not been highlighted in any prior work. Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method. We find that contrastive self-supervised methods learn features that are competitive with explicitly supervised features and can even be superior for part-level tasks. We also find that the representations of reconstruction-based models show non-trivial similarity to those of contrastive self-supervised models. The project website (https://www.cs.umd.edu/~sakshams/vit_analysis) and code (https://www.github.com/mwalmer-umd/vit_analysis) are publicly available.
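The sketch below shows one simple way to probe for such offset local attention heads: given a head's attention matrix over a patch grid, measure the average attention each token places on its neighbor at a fixed spatial offset. The grid size and the random attention tensor are illustrative placeholders for attention maps extracted from a real ViT.

```python
# Sketch: score how strongly a head attends to a fixed-offset neighbor patch.
import numpy as np

def offset_attention_score(attn: np.ndarray, grid: int, dy: int, dx: int) -> float:
    """attn: (N, N) attention matrix over N = grid*grid patch tokens.
    Returns the mean attention weight from each token to its (dy, dx) neighbor."""
    scores = []
    for y in range(grid):
        for x in range(grid):
            ny, nx = y + dy, x + dx
            if 0 <= ny < grid and 0 <= nx < grid:
                scores.append(attn[y * grid + x, ny * grid + nx])
    return float(np.mean(scores))

# Toy example: random attention over a 14x14 patch grid (CLS token omitted).
grid = 14
rng = np.random.default_rng(0)
attn = rng.random((grid * grid, grid * grid))
attn /= attn.sum(axis=1, keepdims=True)   # rows sum to 1, like softmax output

for dy, dx in [(0, 1), (0, -1), (1, 0), (-1, 0)]:
    print((dy, dx), offset_attention_score(attn, grid, dy, dx))
```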
ASIC: Aligning Sparse in-the-wild Image Collections
Gupta, Kamal, Jampani, Varun, Esteves, Carlos, Shrivastava, Abhinav, Makadia, Ameesh, Snavely, Noah, Kar, Abhishek
The above is also true for an image of a "never-before-seen" object (as opposed to a common object category such as cars), where humans demonstrate surprisingly robust generalization despite lacking object- or category-specific priors [6]. These correspondences in turn inform downstream inferences about the object such as shape, affordances, and more. In this work, we tackle this problem of "low-shot dense correspondence", i.e., given only a small in-the-wild image collection (10-30 images) of an object or object category, we recover dense and consistent correspondences across the entire collection. Most prior works assume either ground-truth keypoint annotations or a large dataset of images of a single object category. However, neither of the above assumptions holds true for the long tail of the objects present in the world. We present a self-supervised technique that directly optimizes on a sparse collection of images of a particular object/object category to obtain consistent dense correspondences across the collection. We use pairwise nearest neighbors obtained from deep features of a pre-trained vision transformer (ViT) model as noisy and sparse keypoint matches and make them dense.
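The sketch below illustrates the sparse-matching step described above: a mutual nearest-neighbor test between patch features of two images. Random arrays stand in for features that would normally come from a pre-trained ViT, and the subsequent densification step is not shown.

```python
# Sketch of noisy sparse matches via mutual nearest neighbors of patch features.
import numpy as np

def mutual_nearest_neighbors(feat_a: np.ndarray, feat_b: np.ndarray):
    """feat_a: (Na, D), feat_b: (Nb, D) L2-normalized patch features.
    Returns (i, j) pairs of patches that are each other's nearest neighbor."""
    sim = feat_a @ feat_b.T                     # cosine similarities
    nn_ab = sim.argmax(axis=1)                  # best match in B for each A patch
    nn_ba = sim.argmax(axis=0)                  # best match in A for each B patch
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

# Toy usage with random stand-in features (196 patches, 384-d each).
rng = np.random.default_rng(0)
fa = rng.normal(size=(196, 384)); fa /= np.linalg.norm(fa, axis=1, keepdims=True)
fb = rng.normal(size=(196, 384)); fb /= np.linalg.norm(fb, axis=1, keepdims=True)
matches = mutual_nearest_neighbors(fa, fb)
print(len(matches), matches[:5])
```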
Robot to Human Object Handover using Vision and Joint Torque Sensor Modalities
Mohandes, Mohammadhadi, Moradi, Behnam, Gupta, Kamal, Mehrandezh, Mehran
We present a robot-to-human object handover algorithm and implement it on a 7-DOF arm equipped with a 3-finger mechanical hand. The system performs a fully autonomous and robust object handover to a human receiver in real time. Our algorithm relies on two complementary sensor modalities for feedback: joint torque sensors on the arm and an eye-in-hand RGB-D camera. Our approach is entirely implicit, i.e., there is no explicit communication between the robot and the human receiver. Information from the two sensor modalities is used as input to their respective deep neural networks: the torque-sensor network detects the human receiver's "intention" (pull, hold, or bump), while the vision network detects whether the receiver's fingers have wrapped around the object. The networks' outputs are then fused, and based on the fused result the robot decides whether or not to release the object. Despite substantive challenges in sensor-feedback synchronization and in object and human-hand detection, our system achieves robust robot-to-human handover with 98% accuracy in our preliminary real experiments with human receivers.
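The sketch below illustrates the fusion step in simplified form: release only when the torque modality reports a confident "pull" and the vision modality confirms that the receiver's fingers are around the object. The class names, thresholds, and decision rule are assumptions for illustration, not the paper's exact logic.

```python
# Illustrative fusion of torque-network intention and vision-network grasp cues.
from dataclasses import dataclass

@dataclass
class Perception:
    intention_probs: dict          # e.g. {"pull": 0.8, "hold": 0.15, "bump": 0.05}
    fingers_wrapped_prob: float    # vision network: P(fingers wrap the object)

def should_release(p: Perception, intent_thresh: float = 0.7,
                   grasp_thresh: float = 0.8) -> bool:
    """Release only when the torque modality sees a confident 'pull' AND the
    vision modality confirms the receiver's grasp."""
    intent, conf = max(p.intention_probs.items(), key=lambda kv: kv[1])
    return (intent == "pull" and conf >= intent_thresh
            and p.fingers_wrapped_prob >= grasp_thresh)

print(should_release(Perception({"pull": 0.9, "hold": 0.08, "bump": 0.02}, 0.95)))  # True
print(should_release(Perception({"bump": 0.85, "hold": 0.1, "pull": 0.05}, 0.95)))  # False
```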