HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances
Narasimhaswamy, Supreeth, Bhattacharya, Uttaran, Chen, Xiang, Dasgupta, Ishita, Mitra, Saayan, Hoai, Minh
Text-to-image generative models can generate high-quality humans, but realism is lost when generating hands. Common artifacts include irregular hand poses and shapes, incorrect numbers of fingers, and physically implausible finger orientations. To generate images with realistic hands, we propose a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings into the generative process. HanDiffuser consists of two components: a Text-to-Hand-Params diffusion model to generate SMPL-Body and MANO-Hand parameters from input text prompts, and a Text-Guided Hand-Params-to-Image diffusion model to synthesize images by conditioning on the prompts and the hand parameters generated by the previous component. We incorporate multiple aspects of hand representation, including 3D shapes and joint-level finger positions, orientations, and articulations, for robust learning and reliable performance during inference. We conduct extensive quantitative and qualitative experiments and perform user studies to demonstrate the efficacy of our method in generating images with high-quality hands.
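To make the two-stage design concrete, the following is a minimal sketch of how hand parameters generated from text can condition a second image-denoising network. All module names, tensor sizes, and the single-step setup are illustrative assumptions, not the paper's actual SMPL/MANO or diffusion implementation.

```python
# Minimal sketch of HanDiffuser's two-stage conditioning; names and
# sizes are assumptions, and real diffusion sampling takes many steps.
import torch
import torch.nn as nn

TEXT_DIM, PARAM_DIM, HAND_EMB_DIM = 768, 61, 256  # assumed sizes

class TextToHandParams(nn.Module):
    """Stage 1: denoise SMPL-Body/MANO-Hand parameters given text."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PARAM_DIM + TEXT_DIM, 512), nn.SiLU(),
            nn.Linear(512, PARAM_DIM))

    def forward(self, noisy_params, text_emb):
        return self.net(torch.cat([noisy_params, text_emb], dim=-1))

class HandParamsToImage(nn.Module):
    """Stage 2: denoise image latents conditioned on text + hand params."""
    def __init__(self):
        super().__init__()
        self.hand_encoder = nn.Linear(PARAM_DIM, HAND_EMB_DIM)
        self.net = nn.Sequential(
            nn.Linear(64 + TEXT_DIM + HAND_EMB_DIM, 512), nn.SiLU(),
            nn.Linear(512, 64))

    def forward(self, noisy_latent, text_emb, hand_params):
        hand_emb = self.hand_encoder(hand_params)  # inject hand embedding
        cond = torch.cat([noisy_latent, text_emb, hand_emb], dim=-1)
        return self.net(cond)

# One toy denoising step of each stage.
text_emb = torch.randn(1, TEXT_DIM)
hand_params = TextToHandParams()(torch.randn(1, PARAM_DIM), text_emb)
latent = HandParamsToImage()(torch.randn(1, 64), text_emb, hand_params)
```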
Count What You Want: Exemplar Identification and Few-shot Counting of Human Actions in the Wild
Huang, Yifeng, Nguyen, Duc Duy, Nguyen, Lam, Pham, Cuong, Hoai, Minh
This paper addresses the task of counting human actions of interest using sensor data from wearable devices. We propose a novel exemplar-based framework, allowing users to provide exemplars of the actions they want to count by vocalizing the predefined sounds "one", "two", and "three". Our method first localizes the temporal positions of these utterances in the audio sequence. These positions serve as the basis for identifying exemplars representing the action class of interest. A similarity map is then computed between the exemplars and the entire sensor data sequence, which is further fed into a density estimation module to generate a sequence of estimated density values. Summing these density values provides the final count. To develop and evaluate our approach, we introduce a diverse and realistic dataset consisting of real-world data from 37 subjects and 50 action categories, encompassing both sensor and audio data. Experiments on this dataset demonstrate the viability of the proposed method in counting instances of actions from new classes and subjects that were not part of the training data. On average, the discrepancy between the predicted count and the ground truth value is 7.47, significantly lower than the errors of the frequency-based and transformer-based methods. Our project, code, and dataset can be found at https://github.com/cvlab-stonybrook/ExRAC.
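The counting pipeline can be illustrated with a short sketch: exemplar features are compared against the whole sensor sequence to form a similarity map, which a small density head converts into per-timestep densities whose sum is the count. The feature dimensions and the density head below are assumptions, not the released ExRAC code.

```python
# Minimal sketch of exemplar-based action counting; all sizes assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

T, K, D = 500, 3, 64             # timesteps, exemplars, feature dim
seq_feats = torch.randn(T, D)    # wearable-sensor features per timestep
exemplars = torch.randn(K, D)    # features at the "one"/"two"/"three" marks

# Similarity map between each exemplar and the whole sequence (K x T).
sim_map = F.cosine_similarity(
    exemplars.unsqueeze(1), seq_feats.unsqueeze(0), dim=-1)

# Density estimation module: 1-D conv over time, one density per step.
density_head = nn.Sequential(
    nn.Conv1d(K, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(32, 1, kernel_size=5, padding=2), nn.Softplus())

density = density_head(sim_map.unsqueeze(0))  # (1, 1, T)
count = density.sum()                          # final predicted count
print(float(count))
```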
HyperCUT: Video Sequence from a Single Blurry Image using Unsupervised Ordering
Pham, Bang-Dang, Tran, Phong, Tran, Anh, Pham, Cuong, Nguyen, Rang, Hoai, Minh
We consider the challenging task of training models for image-to-video deblurring, which aims to recover a sequence of sharp images corresponding to a given blurry input image. A critical issue in training an image-to-video model is the ambiguity of the frame ordering, since both the forward and backward sequences are plausible solutions. This paper proposes an effective self-supervised ordering scheme that allows training high-quality image-to-video deblurring models. Unlike previous methods that rely on order-invariant losses, we assign an explicit order to each video sequence, thus avoiding the order-ambiguity issue. Specifically, we map each video sequence to a vector in a latent high-dimensional space such that there exists a hyperplane separating the vector of every sequence from that of its reversed counterpart. The side on which a vector falls defines the order of the corresponding sequence. Finally, we propose a real-image dataset for the image-to-video deblurring problem that covers a variety of popular domains, including face, hand, and street. Extensive experimental results confirm the effectiveness of our method. Code and data are available at https://github.com/VinAIResearch/HyperCUT.git
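A minimal sketch of the ordering idea follows: a sequence and its reversal are embedded, and a hyperplane is trained to place them on opposite sides, so the sign of the hyperplane output gives an explicit frame order. The GRU encoder and margin loss are illustrative stand-ins for the paper's actual mapping and objective.

```python
# Sketch of hyperplane-based sequence ordering; encoder and loss assumed.
import torch
import torch.nn as nn

class SeqEncoder(nn.Module):
    def __init__(self, frame_dim=128, emb_dim=32):
        super().__init__()
        self.rnn = nn.GRU(frame_dim, emb_dim, batch_first=True)

    def forward(self, frames):                 # (B, T, frame_dim)
        _, h = self.rnn(frames)
        return h[-1]                           # (B, emb_dim)

encoder = SeqEncoder()
hyperplane = nn.Linear(32, 1)                  # w.x + b defines the side

frames = torch.randn(4, 7, 128)                # batch of frame sequences
fwd = hyperplane(encoder(frames))              # forward order
bwd = hyperplane(encoder(torch.flip(frames, dims=[1])))  # reversed order

# Margin loss: forward strictly positive, backward strictly negative,
# so sign(w.f(x) + b) yields an unambiguous order label for training
# image-to-video deblurring without order-invariant losses.
margin = 0.1
loss = torch.relu(margin - fwd).mean() + torch.relu(margin + bwd).mean()
loss.backward()
```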
Predicting Human Attention using Computational Attention
Yang, Zhibo, Mondal, Sounak, Ahn, Seoyoung, Zelinsky, Gregory, Hoai, Minh, Samaras, Dimitris
Most models of visual attention are aimed at predicting either top-down or bottom-up control, as studied using different visual search and free-viewing tasks. We propose the Human Attention Transformer (HAT), a single model predicting both forms of attention control. HAT is the new state-of-the-art (SOTA) in predicting the scanpath of fixations made during target-present and target-absent search, and matches or exceeds SOTA in predicting taskless free-viewing fixation scanpaths. HAT achieves this by using a novel transformer-based architecture and a simplified foveated retina that collectively create a spatio-temporal awareness akin to the dynamic visual working memory of humans. Unlike previous methods that rely on a coarse grid of fixation cells and suffer information loss from fixation discretization, HAT features a dense-prediction architecture and outputs a dense heatmap for each fixation, avoiding discretization altogether. HAT sets a new standard in computational attention that emphasizes both effectiveness and generality. HAT's demonstrated scope and applicability will likely inspire the development of new attention models that can better predict human behavior in various attention-demanding scenarios.
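The sketch below illustrates the autoregressive, dense-heatmap style of prediction described above: a working memory is updated after each fixation and modulates image features to produce a full-resolution heatmap, from which the next fixation is read off without a fixation grid. The backbone, memory cell, and head are simplified placeholders, not HAT's transformer architecture.

```python
# Sketch of dense per-fixation heatmap prediction; modules are assumed
# stand-ins for HAT's transformer backbone and memory.
import torch
import torch.nn as nn

H = W = 32                                     # heatmap resolution
feat = torch.randn(1, 64, H, W)                # image feature map (assumed)

class FixationHead(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.mem = nn.GRUCell(2, dim)          # working memory over fixations
        self.to_map = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, feat, fix_xy, state):
        state = self.mem(fix_xy, state)        # update spatio-temporal memory
        mod = feat * state.view(1, -1, 1, 1)   # modulate features by memory
        return self.to_map(mod), state         # dense heatmap, new state

head = FixationHead()
state = torch.zeros(1, 64)
fix = torch.tensor([[0.5, 0.5]])               # start at image center
scanpath = []
for _ in range(5):                             # predict 5 fixations
    heat, state = head(feat, fix, state)       # (1, 1, H, W), no grid cells
    idx = heat.flatten().argmax()
    y, x = divmod(int(idx), W)
    fix = torch.tensor([[x / W, y / H]])       # continuous, not discretized
    scanpath.append((x, y))
print(scanpath)
```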
Text-Derived Knowledge Helps Vision: A Simple Cross-modal Distillation for Video-based Action Anticipation
Ghosh, Sayontan, Aggarwal, Tanvi, Hoai, Minh, Balasubramanian, Niranjan
Anticipating future actions in a video is useful for many autonomous and assistive technologies. Most prior action anticipation work treats this as a vision modality problem, where the models learn the task information primarily from the video features in the action anticipation datasets. However, knowledge about action sequences can also be obtained from external textual data. In this work, we show how knowledge in pretrained language models can be adapted and distilled into vision-based action anticipation models. We show that a simple distillation technique can achieve effective knowledge transfer and provide consistent gains on a strong vision model (Anticipative Vision Transformer) for two action anticipation datasets (3.5% relative gain on EGTEA-GAZE+ and 7.2% relative gain on EPIC-KITCHEN 55), giving a new state-of-the-art for the video action anticipation task.
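A minimal sketch of such a distillation objective follows, assuming a frozen text teacher that scores the same action vocabulary as the vision student; the placeholder teacher logits, temperature, and mixing weight are assumptions rather than the paper's exact setup.

```python
# Sketch of cross-modal (text-to-vision) distillation for anticipation.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ACTIONS = 100
student = nn.Linear(512, NUM_ACTIONS)          # vision model head (assumed)
video_feat = torch.randn(8, 512)               # batch of clip features
labels = torch.randint(0, NUM_ACTIONS, (8,))

with torch.no_grad():                          # frozen text teacher
    teacher_logits = torch.randn(8, NUM_ACTIONS)  # placeholder for LM scores

tau, alpha = 2.0, 0.5                          # temperature, mixing weight
s_logits = student(video_feat)
kd = F.kl_div(F.log_softmax(s_logits / tau, dim=-1),
              F.softmax(teacher_logits / tau, dim=-1),
              reduction="batchmean") * tau * tau
ce = F.cross_entropy(s_logits, labels)
loss = alpha * kd + (1 - alpha) * ce           # distillation + supervision
loss.backward()
```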
Progressive Semantic Segmentation
Huynh, Chuong, Tran, Anh, Luu, Khoa, Hoai, Minh
The objective of this work is to segment high-resolution images without overloading GPU memory or losing the fine details in the output segmentation map. The memory constraint means that we must either downsample the big image or divide it into local patches for separate processing. However, the former approach loses the fine details, while the latter can be ambiguous due to the lack of a global picture. In this work, we present MagNet, a multi-scale framework that resolves local ambiguity by looking at the image at multiple magnification levels. MagNet has multiple processing stages, where each stage corresponds to a magnification level, and the output of one stage is fed into the next for coarse-to-fine information propagation. Each stage analyzes the image at a higher resolution than the previous one, recovering details lost to the earlier lossy downsampling, and the segmentation output is progressively refined through the processing stages. Experiments on three high-resolution datasets of urban views, aerial scenes, and medical images show that MagNet consistently outperforms state-of-the-art methods by a significant margin.
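The coarse-to-fine loop can be sketched as follows: each stage resizes the image to a higher magnification, upsamples the previous stage's prediction, and refines it. For brevity this sketch refines whole images with a one-layer placeholder network, whereas MagNet processes local patches with full segmentation backbones.

```python
# Sketch of progressive, multi-magnification refinement; nets assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 19
seg_net = nn.Conv2d(3 + NUM_CLASSES, NUM_CLASSES, kernel_size=3, padding=1)

image = torch.randn(1, 3, 1024, 1024)          # high-resolution input
scales = [256, 512, 1024]                       # magnification levels
logits = torch.zeros(1, NUM_CLASSES, scales[0], scales[0])

for s in scales:
    img_s = F.interpolate(image, size=(s, s), mode="bilinear",
                          align_corners=False)
    prev = F.interpolate(logits, size=(s, s), mode="bilinear",
                         align_corners=False)   # coarse prediction upsampled
    logits = seg_net(torch.cat([img_s, prev], dim=1))  # refine at this scale

print(logits.shape)  # final map at the finest magnification
```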
Explore Image Deblurring via Blur Kernel Space
Tran, Phong, Tran, Anh, Phung, Quynh, Hoai, Minh
This paper introduces a method to encode the blur operators of an arbitrary dataset of sharp-blur image pairs into a blur kernel space. Assuming the encoded kernel space is close enough to in-the-wild blur operators, we propose an alternating optimization algorithm for blind image deblurring: it approximates an unseen blur operator by a kernel in the encoded space and searches for the corresponding sharp image. Unlike recent deep-learning-based methods, our system can handle unseen blur kernels, while avoiding the complicated handcrafted priors on the blur operator often found in classical methods. By design, the encoded kernel space is fully differentiable and can thus be easily adopted in deep neural network models. Moreover, our method can be used for blur synthesis by transferring existing blur operators from a given dataset into a new domain. Finally, we provide experimental results to confirm the effectiveness of the proposed method.
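The alternating optimization can be sketched as below, with a toy differentiable kernel decoder standing in for the learned kernel space: one step fits the latent kernel code with the image fixed, the next updates the sharp image with the kernel fixed. The decoder form, learning rates, and step count are assumptions.

```python
# Sketch of alternating optimization for blind deblurring in an encoded
# kernel space; the decoder is a toy stand-in for the learned one.
import torch
import torch.nn.functional as F

def apply_blur(x, k_code):
    """Decode a latent code into a blur kernel and convolve (assumed form)."""
    kernel = torch.softmax(k_code, dim=-1).view(1, 1, 5, 5)  # normalized
    return F.conv2d(x, kernel.repeat(3, 1, 1, 1), padding=2, groups=3)

blurry = torch.randn(1, 3, 64, 64)             # observed blurry image
sharp = blurry.clone().requires_grad_(True)    # unknown sharp image
k_code = torch.zeros(25, requires_grad=True)   # latent kernel code

opt_x = torch.optim.Adam([sharp], lr=1e-2)
opt_k = torch.optim.Adam([k_code], lr=1e-2)

for step in range(200):
    # Step 1: fix the image, fit the kernel code to explain the blur.
    loss_k = F.mse_loss(apply_blur(sharp.detach(), k_code), blurry)
    opt_k.zero_grad(); loss_k.backward(); opt_k.step()
    # Step 2: fix the kernel, update the sharp image estimate.
    loss_x = F.mse_loss(apply_blur(sharp, k_code.detach()), blurry)
    opt_x.zero_grad(); loss_x.backward(); opt_x.step()
```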
Visual Understanding of Multiple Attributes Learning Model of X-Ray Scattering Images
Huang, Xinyi, Jamonnak, Suphanut, Zhao, Ye, Wang, Boyu, Hoai, Minh, Yager, Kevin, Xu, Wei
X-ray scattering is widely used in biomedical, material, and physical applications, where structural patterns in the scattering images are analyzed [21]. X-ray equipment can generate up to 1 million images per day, which imposes a heavy burden on subsequent image analysis. A variety of image analysis methods have been applied to x-ray scattering data. Recently, deep learning models have been employed to classify and annotate multiple image attributes from experimental or synthetic images, and were shown to outperform previously published methods [18, 4]. As with most deep learning paradigms, these methods are not easily understood by material, physical, and biomedical scientists. The lack of proper explanations and the absence of control over the decisions make the models less trustworthy. While considerable effort has been made to make deep learning interpretable and controllable by humans [3], the existing techniques are not specifically designed for scientific image classification models of x-ray scattering images, which require extra consideration of questions such as: How do the learning models perform for a diverse set of overlapping attributes with high variation?
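For context, the kind of model under analysis can be sketched as a multi-label classifier with one sigmoid output per attribute, so overlapping attributes are predicted independently; the architecture and attribute count below are illustrative only.

```python
# Sketch of a multi-attribute x-ray scattering classifier; sizes assumed.
import torch
import torch.nn as nn

NUM_ATTRIBUTES = 17                            # illustrative attribute count
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, NUM_ATTRIBUTES))

xray = torch.randn(4, 1, 256, 256)             # batch of scattering images
probs = torch.sigmoid(model(xray))             # independent per-attribute
present = probs > 0.5                          # multi-label decision
```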