
Collaborating Authors

 Kim, Taehoon


Advancing ALS Applications with Large-Scale Pre-training: Dataset Development and Downstream Assessment

arXiv.org Artificial Intelligence

The pre-training and fine-tuning paradigm has revolutionized satellite remote sensing applications. However, this approach remains largely underexplored for airborne laser scanning (ALS), an important technology for applications such as forest management and urban planning. In this study, we address this gap by constructing a large-scale ALS point cloud dataset and evaluating its impact on downstream applications. Our dataset comprises ALS point clouds collected across the contiguous United States, provided by the United States Geological Survey's 3D Elevation Program. To ensure efficient data collection while capturing diverse land cover and terrain types, we introduce a geospatial sampling method that selects point cloud tiles based on land cover maps and digital elevation models. As a baseline self-supervised learning model, we adopt BEV-MAE, a state-of-the-art masked autoencoder for 3D outdoor point clouds, and pre-train it on the constructed dataset. The pre-trained models are subsequently fine-tuned for downstream tasks, including tree species classification, terrain scene recognition, and point cloud semantic segmentation. Our results show that the pre-trained models significantly outperform their scratch counterparts across all downstream tasks, demonstrating the transferability of the representations learned from the proposed dataset. Furthermore, we observe that scaling the dataset using our geospatial sampling method consistently enhances performance, whereas pre-training on datasets constructed with random sampling fails to achieve similar improvements. These findings highlight the utility of the constructed dataset and the effectiveness of our sampling strategy in the pre-training and fine-tuning paradigm. The source code and pre-trained models will be made publicly available at https://github.com/martianxiu/ALS_pretraining.
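
The geospatial sampling method is described only at a high level above; below is a minimal Python sketch of what tile selection stratified by land cover and elevation could look like. The tile records, field names, bin size, and per-stratum quota are illustrative assumptions, not the authors' exact procedure.

```python
import random
from collections import defaultdict

def stratified_tile_sample(tiles, tiles_per_stratum=5, elev_bin_size=250.0, seed=0):
    """Sample ALS tiles so that diverse land-cover / terrain combinations are covered.

    `tiles` is assumed to be a list of dicts such as
    {"id": ..., "land_cover": "forest", "mean_elevation": 812.0},
    derived from a land cover map and a digital elevation model.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in tiles:
        elev_bin = int(t["mean_elevation"] // elev_bin_size)   # coarse terrain bin
        strata[(t["land_cover"], elev_bin)].append(t)

    selected = []
    for members in strata.values():
        rng.shuffle(members)
        selected.extend(members[:tiles_per_stratum])            # quota per stratum
    return selected

# Example: balanced coverage instead of purely random tile selection.
tiles = [{"id": i, "land_cover": lc, "mean_elevation": e}
         for i, (lc, e) in enumerate([("forest", 300), ("urban", 50),
                                      ("forest", 1200), ("cropland", 200)] * 3)]
print(len(stratified_tile_sample(tiles, tiles_per_stratum=2)))
```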


Exploiting Style Latent Flows for Generalizing Deepfake Video Detection

arXiv.org Artificial Intelligence

This paper presents a new approach for the detection of fake videos, based on the analysis of style latent vectors and the abnormal behavior of their temporal changes in generated videos. We observe that generated facial videos exhibit distinctive temporal patterns in the changes of their style latent vectors, patterns that inevitably arise when synthesizing temporally stable videos with varied facial expressions and geometric transformations. Our framework utilizes the StyleGRU module, trained by contrastive learning, to represent the dynamic properties of style latent vectors. Additionally, we introduce a style attention module that integrates StyleGRU-generated features with content-based features, enabling the detection of visual and temporal artifacts. We demonstrate our approach across various benchmark scenarios in deepfake detection, showing its superiority in cross-dataset and cross-manipulation settings. Through further analysis, we also validate the importance of using temporal changes of style latent vectors to improve the generalization of deepfake video detection.
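
As a rough illustration of the described architecture, the following PyTorch sketch encodes frame-to-frame differences of style latent vectors with a GRU and fuses the result with content features through attention. All dimensions, head counts, and module names are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class StyleGRUSketch(nn.Module):
    """Minimal sketch: encode temporal changes of per-frame style latents with a GRU,
    then fuse the result with content features via attention. Dimensions are assumptions."""
    def __init__(self, style_dim=512, hidden_dim=256, content_dim=256):
        super().__init__()
        self.gru = nn.GRU(style_dim, hidden_dim, batch_first=True)
        self.style_proj = nn.Linear(hidden_dim, content_dim)
        self.attn = nn.MultiheadAttention(content_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(content_dim, 2)  # real / fake

    def forward(self, style_latents, content_feats):
        # style_latents: (B, T, style_dim), content_feats: (B, N, content_dim)
        deltas = style_latents[:, 1:] - style_latents[:, :-1]   # temporal changes
        _, h = self.gru(deltas)                                  # (1, B, hidden_dim)
        style_feat = self.style_proj(h[-1]).unsqueeze(1)         # (B, 1, content_dim)
        fused, _ = self.attn(style_feat, content_feats, content_feats)
        return self.head(fused.squeeze(1))

model = StyleGRUSketch()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 49, 256))
print(logits.shape)  # torch.Size([2, 2])
```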


D3T: Distinctive Dual-Domain Teacher Zigzagging Across RGB-Thermal Gap for Domain-Adaptive Object Detection

arXiv.org Artificial Intelligence

Domain adaptation for object detection typically entails transferring knowledge from one visible domain to another visible domain. However, there are limited studies on adapting from the visible to the thermal domain, because the gap between the visible and thermal domains is much larger than expected, and traditional domain adaptation cannot successfully facilitate learning in this situation. To overcome this challenge, we propose a Distinctive Dual-Domain Teacher (D3T) framework that employs distinct training paradigms for each domain. Specifically, we segregate the source and target training sets to build dual teachers and successively apply an exponential moving average from the student model to the individual teacher of each domain. The framework further incorporates a zigzag learning method between the dual teachers, facilitating a gradual transition from the visible to the thermal domain during training. We validate the superiority of our method through newly designed experimental protocols with well-known thermal datasets, i.e., FLIR and KAIST. Source code is available at https://github.com/EdwardDo69/D3T.
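
The core mechanics mentioned above, exponential-moving-average teacher updates and a zigzag alternation between domains, could look roughly like the following sketch; the momentum value, schedule period, and domain ordering are assumptions.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Exponential moving average of student weights into a teacher model."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def zigzag_schedule(epoch, period=2):
    """Alternate which domain drives training: even phases use the visible (RGB) teacher,
    odd phases the thermal teacher. The period and ordering are illustrative assumptions."""
    return "rgb" if (epoch // period) % 2 == 0 else "thermal"

# Training-loop skeleton (detectors, pseudo-labeling, and losses omitted):
# for epoch in range(num_epochs):
#     domain = zigzag_schedule(epoch)
#     teacher = rgb_teacher if domain == "rgb" else thermal_teacher
#     ...generate pseudo-labels with `teacher`, update `student`...
#     ema_update(teacher, student)
```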


Cross-Class Feature Augmentation for Class Incremental Learning

arXiv.org Artificial Intelligence

Recent deep learning techniques have shown remarkable progress in various computer vision tasks including image classification (He et al. 2016; Hu, Shen, and Sun 2018), object detection (Liu et al. 2016; Redmon et al. 2016; Zhu et al. 2021c), semantic segmentation (Chen et al. 2017; Long, Shelhamer, and Darrell 2015; Noh, Hong, and Han 2015), and many others. Behind this success is an implicit assumption that the whole dataset with a predefined set of classes should be given in a batch. However, this assumption is unlikely to hold in real-world scenarios, which change dynamically over time. This limits the applicability to real-world problems because deep neural networks trained under changing data distributions often suffer from catastrophic forgetting, meaning that the models lose the ability to maintain previously learned knowledge.

By leveraging the representations learned in the past, we aim to augment the features at each incremental stage to address data deficiency in the classes belonging to old tasks. To this end, inspired by adversarial attacks, we adjust the feature representations of training examples to resemble representations from specific target classes that are different from their original classes. These perturbed features allow a new classifier to maintain the decision boundaries for the classes learned up to the previous stages. Note that this is a novel perspective, different from conventional adversarial attack methods (Carlini and Wagner 2017; Goodfellow, Shlens, and Szegedy 2017; Madry et al. 2018; Moosavi-Dezfooli, Fawzi, and Frossard 2016; Zhao, Dua, and Singh 2018), which focus on deceiving models. One may consider generating additional features for each class using the exemplars with the same class labels.
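
The feature perturbation idea reads as a targeted adversarial step in feature space; a minimal sketch under that reading is given below. The step size, number of steps, and toy classifier are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def cross_class_feature_perturb(feature, classifier, target_class, steps=5, step_size=0.1):
    """Sketch: nudge a stored feature toward a different (target) class so that it can
    stand in for old-class data at later incremental stages. `classifier` is assumed to be
    the classifier from a previous stage; step counts and sizes are illustrative."""
    feat = feature.clone().detach().requires_grad_(True)
    for _ in range(steps):
        logits = classifier(feat)
        loss = F.cross_entropy(logits, target_class)      # targeted: lower loss w.r.t. target
        grad, = torch.autograd.grad(loss, feat)
        feat = (feat - step_size * grad).detach().requires_grad_(True)
    return feat.detach()

# Example with a toy linear classifier over 64-d features and 10 classes.
clf = torch.nn.Linear(64, 10)
aug = cross_class_feature_perturb(torch.randn(4, 64), clf, torch.tensor([3, 3, 7, 7]))
print(aug.shape)  # torch.Size([4, 64])
```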


Randomized Adversarial Style Perturbations for Domain Generalization

arXiv.org Artificial Intelligence

We propose a novel domain generalization technique, referred to as Randomized Adversarial Style Perturbation (RASP), which is motivated by the observation that the characteristics of each domain are captured by the feature statistics corresponding to style. The proposed algorithm perturbs the style of a feature in an adversarial direction toward a randomly selected class, and trains the model not to be misled by the unexpected styles observed in unseen target domains. While RASP is effective in handling domain shifts, its naive integration into the training procedure might degrade the capability of learning knowledge from source domains because it places no restriction on the perturbations of representations. This challenge is alleviated by Normalized Feature Mixup (NFM), which facilitates the learning of the original features while achieving robustness to perturbed representations via their mixup during training. We evaluate the proposed algorithm through extensive experiments on various benchmarks and show that our approach improves domain generalization performance, especially on large-scale benchmarks.
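
A minimal sketch of an adversarial style perturbation in the spirit of RASP is shown below: the channel-wise mean and standard deviation of a feature map are nudged toward a randomly drawn class. The helper name, step size, and targeted-loss formulation are assumptions rather than the paper's exact method.

```python
import torch
import torch.nn.functional as F

def adversarial_style_perturb(feat, model_tail, labels, num_classes, eps=0.01):
    """Sketch of a RASP-like perturbation: move the channel-wise mean/std (the 'style')
    of a feature map in an adversarial direction toward a randomly selected class.
    `model_tail` maps the re-stylized feature map to logits; eps is an assumption."""
    mu = feat.mean(dim=(2, 3), keepdim=True)
    sigma = feat.std(dim=(2, 3), keepdim=True) + 1e-6
    normalized = (feat - mu) / sigma

    mu_p = mu.clone().detach().requires_grad_(True)
    sigma_p = sigma.clone().detach().requires_grad_(True)
    rand_cls = torch.randint(0, num_classes, labels.shape, device=feat.device)

    logits = model_tail(normalized * sigma_p + mu_p)
    loss = F.cross_entropy(logits, rand_cls)
    g_mu, g_sigma = torch.autograd.grad(loss, (mu_p, sigma_p))
    # Step the style statistics toward the random class (decrease its targeted loss).
    mu_adv = (mu_p - eps * g_mu.sign()).detach()
    sigma_adv = (sigma_p - eps * g_sigma.sign()).detach()
    return normalized.detach() * sigma_adv + mu_adv

# Example: perturb a (B, C, H, W) feature map before a toy classifier head.
tail = torch.nn.Sequential(torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(32, 7))
out = adversarial_style_perturb(torch.randn(4, 32, 8, 8), tail, torch.zeros(4, dtype=torch.long), num_classes=7)
print(out.shape)  # torch.Size([4, 32, 8, 8])
```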


Large-Scale Bidirectional Training for Zero-Shot Image Captioning

arXiv.org Artificial Intelligence

When trained on large-scale datasets, image captioning models can understand the content of images from a general domain but often fail to generate accurate, detailed captions. To improve performance, pretraining-and-finetuning has been a key strategy for image captioning. However, we find that large-scale bidirectional training between image and text enables zero-shot image captioning. In this paper, we introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning. We also propose a new evaluation benchmark which comprises high-quality datasets and an extensive set of metrics to properly evaluate zero-shot captioning accuracy and societal bias. We additionally provide an efficient finetuning approach for keyword extraction. We show that careful selection of the large-scale training set and model architecture is the key to achieving zero-shot image captioning.


RAMiT: Reciprocal Attention Mixing Transformer for Lightweight Image Restoration

arXiv.org Artificial Intelligence

Although many recent works have made advancements in the image restoration (IR) field, they often suffer from an excessive number of parameters. Another issue is that most Transformer-based IR methods focus only on either local or global features, leading to limited receptive fields or deficient parameter issues. To address these problems, we propose a lightweight IR network, Reciprocal Attention Mixing Transformer (RAMiT). It employs our proposed dimensional reciprocal attention mixing Transformer (D-RAMiT) blocks, which compute bi-dimensional (spatial and channel) self-attentions in parallel with different numbers of multi-heads. The bi-dimensional attentions help each other to complement their counterpart's drawbacks and are then mixed. Additionally, we introduce a hierarchical reciprocal attention mixing (H-RAMi) layer that compensates for pixel-level information losses and utilizes semantic information while maintaining an efficient hierarchical structure. Furthermore, we revisit and modify MobileNet V1 and V2 to attach efficient convolutions to our proposed components. The experimental results demonstrate that RAMiT achieves state-of-the-art performance on multiple lightweight IR tasks, including super-resolution, color denoising, grayscale denoising, low-light enhancement, and deraining. Codes are available at https://github.com/rami0205/RAMiT.
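
The parallel spatial/channel ("bi-dimensional") self-attention could be sketched as follows; the mixing rule, head counts, and dimensions are illustrative assumptions rather than the RAMiT configuration.

```python
import torch
import torch.nn as nn

class BiDimAttentionSketch(nn.Module):
    """Sketch of parallel spatial and channel self-attention whose outputs are mixed.
    Head counts and dimensions are illustrative, not the paper's settings."""
    def __init__(self, dim=64, heads=2):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.qkv = nn.Linear(dim, dim * 3)       # projections for the channel branch
        self.mix = nn.Linear(dim * 2, dim)       # mix the two branches

    def forward(self, x):
        # x: (B, N, C) tokens
        s_out, _ = self.spatial_attn(x, x, x)                                     # across positions
        q, k, v = self.qkv(x).chunk(3, dim=-1)                                    # (B, N, C) each
        attn = torch.softmax(q.transpose(1, 2) @ k / q.shape[1] ** 0.5, dim=-1)   # (B, C, C)
        c_out = (attn @ v.transpose(1, 2)).transpose(1, 2)                        # across channels
        return self.mix(torch.cat([s_out, c_out], dim=-1))

block = BiDimAttentionSketch()
print(block(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```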


StatAssist & GradBoost: A Study on Optimal INT8 Quantization-aware Training from Scratch

arXiv.org Machine Learning

This paper studies training with quantization-aware training (QAT) from scratch, aiming at lossless conversion to lower-bit precision, especially INT8 quantization. Due to its training instability, QAT has required full-precision (FP) pre-trained weights for fine-tuning, and its performance is bound to that of the original FP model with floating-point computations. Here, we propose critical but straightforward optimization methods that enable training from scratch: floating-point statistic assisting (StatAssist) and stochastic-gradient boosting (GradBoost). We discovered that scratch QAT achieves comparable, and often better, performance than its floating-point counterpart without any help from a pre-trained model, especially when the model becomes complicated. We also show that our method can even train with the minimax generation loss, which is very unstable and hence difficult to handle with QAT fine-tuning. Through extensive experiments, we show that our method successfully enables QAT to train various deep models from scratch: classification, object detection, semantic segmentation, and style transfer, with comparable or often better performance than their FP baselines.
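
One way to read "floating-point statistic assisting" is seeding fake-quantization ranges from statistics gathered in a brief full-precision calibration pass before training from scratch; the sketch below illustrates that reading with a straight-through estimator. The class name and range rule are assumptions, not the paper's exact method.

```python
import torch
import torch.nn as nn

class FakeQuantSketch(nn.Module):
    """Sketch of INT8 fake quantization whose range is seeded from floating-point
    statistics collected during a short calibration pass (a StatAssist-like idea)."""
    def __init__(self, num_bits=8):
        super().__init__()
        self.qmax = 2 ** (num_bits - 1) - 1
        self.register_buffer("scale", torch.tensor(1.0))
        self.calibrated = False

    @torch.no_grad()
    def calibrate(self, x):
        # Seed the quantization scale from the observed FP activation range.
        self.scale.copy_(x.abs().max().clamp(min=1e-8) / self.qmax)
        self.calibrated = True

    def forward(self, x):
        if not self.calibrated:
            self.calibrate(x)
        q = torch.clamp(torch.round(x / self.scale), -self.qmax - 1, self.qmax) * self.scale
        return x + (q - x).detach()   # straight-through estimator

fq = FakeQuantSketch()
x = torch.randn(4, 16, requires_grad=True)
fq(x).sum().backward()
print(x.grad.shape)  # gradients flow through the STE: torch.Size([4, 16])
```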


Transfer Learning via Unsupervised Task Discovery for Visual Question Answering

arXiv.org Machine Learning

We study how to leverage off-the-shelf visual and linguistic data to cope with out-of-vocabulary answers in the visual question answering task. Existing large-scale visual datasets with annotations such as image class labels, bounding boxes, and region descriptions are good sources for learning rich and diverse visual concepts. However, it is not straightforward how these visual concepts can be captured and transferred to visual question answering models, due to the missing link between question-dependent answering models and visual data without questions. We tackle this problem in two steps: 1) learning a task-conditional visual classifier, which is capable of solving diverse question-specific visual recognition tasks, based on unsupervised task discovery, and 2) transferring the task-conditional visual classifier to visual question answering models. Specifically, we employ linguistic knowledge sources such as a structured lexical database (e.g., WordNet) and visual descriptions for unsupervised task discovery, and transfer the learned task-conditional visual classifier as an answering unit in a visual question answering model. We empirically show that the proposed algorithm generalizes successfully to out-of-vocabulary answers using the knowledge transferred from the visual dataset.
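
A task-conditional visual classifier of the kind described could, in its simplest form, condition an answer classifier on a task embedding; the sketch below uses concatenation purely for illustration, and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TaskConditionalClassifier(nn.Module):
    """Sketch: a visual classifier conditioned on a task embedding (e.g., a WordNet-derived
    task such as "color" or "animal"), so one network handles many recognition tasks.
    Conditioning by concatenation is an assumption for illustration."""
    def __init__(self, visual_dim=512, task_dim=64, num_tasks=20, num_answers=1000):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, task_dim)
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim + task_dim, 512), nn.ReLU(),
            nn.Linear(512, num_answers),
        )

    def forward(self, visual_feat, task_id):
        cond = torch.cat([visual_feat, self.task_embed(task_id)], dim=-1)
        return self.mlp(cond)

clf = TaskConditionalClassifier()
logits = clf(torch.randn(3, 512), torch.tensor([0, 5, 5]))
print(logits.shape)  # torch.Size([3, 1000])
```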


Quantifying Generalization in Reinforcement Learning

arXiv.org Machine Learning

Generalizing between tasks remains difficult for state-of-the-art deep reinforcement learning (RL) algorithms. Although trained agents can solve complex tasks, they struggle to transfer their experience to new environments. Agents that have mastered ten levels in a video game often fail catastrophically when first encountering the eleventh. Humans can seamlessly generalize across such similar tasks, but this ability is largely absent in RL agents. In short, agents become overly specialized to the environments encountered during training. That RL agents are prone to overfitting is widely appreciated, yet the most common RL benchmarks still encourage training and evaluating on the same set of environments. We believe there is a need for more metrics that evaluate generalization by explicitly separating training and test environments. In the same spirit as the Sonic Benchmark (Nichol et al., 2018), we seek to better quantify an agent's ability to generalize.