STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft

Lenzen, Nicholas, Raut, Amogh, Melnik, Andrew

arXiv.org Artificial Intelligence

Recently, the STEVE-1 approach has been introduced as a method for training generative agents to follow instructions in the form of latent CLIP embeddings. In this work, we present a methodology for extending the control modalities by learning a mapping from new input modalities to the latent goal space of the agent. We apply our approach to the challenging Minecraft domain and extend the goal conditioning to include the audio modality. The resulting audio-conditioned agent performs at a level comparable to the original text-conditioned and visual-conditioned agents. Specifically, we create an Audio-Video CLIP foundation model for Minecraft and an audio prior network that together map audio samples to the latent goal space of the STEVE-1 policy. Additionally, we highlight the trade-offs that arise when conditioning on different modalities. Our training code, evaluation code, and Audio-Video CLIP foundation model for Minecraft are made open source to help foster further research into multi-modal generalist sequential decision-making agents.
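
Below is a minimal sketch of the kind of mapping the abstract describes: a small prior network that takes an Audio-Video CLIP audio embedding and produces a goal embedding for the frozen STEVE-1 policy. The embedding dimensions, layer sizes, and module names are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch: a prior network maps an Audio-Video CLIP audio
# embedding into the latent goal space consumed by a frozen STEVE-1 policy.
# All dimensions and names are assumptions, not the authors' released code.
import torch
import torch.nn as nn

class AudioPrior(nn.Module):
    def __init__(self, audio_dim=512, goal_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, goal_dim),
        )

    def forward(self, audio_embedding):
        # Audio embedding in, goal embedding out; the policy stays frozen.
        return self.net(audio_embedding)

prior = AudioPrior()
audio_embedding = torch.randn(1, 512)  # stand-in for a real CLIP audio embedding
goal = prior(audio_embedding)          # latent goal conditioning the policy
print(goal.shape)                      # torch.Size([1, 512])
```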


MetaAug: Meta-Data Augmentation for Post-Training Quantization

Pham, Cuong, Dung, Hoang Anh, Nguyen, Cuong C., Le, Trung, Phung, Dinh, Carneiro, Gustavo, Do, Thanh-Toan

arXiv.org Artificial Intelligence

Post-Training Quantization (PTQ) has received significant attention because it requires only a small set of calibration data to quantize a full-precision model, which is more practical for real-world applications in which full access to a large training set is unavailable. However, it often leads to overfitting on the small calibration dataset. Several methods have been proposed to address this issue, yet they still rely only on the calibration set for quantization and do not validate the quantized model, due to the lack of a validation set. In this work, we propose a novel meta-learning based approach to enhance the performance of post-training quantization. Specifically, to mitigate overfitting, instead of training the quantized model on the original calibration set alone without any validation during the learning process, as in previous PTQ works, our approach both trains and validates the quantized model using two different sets of images. In particular, we jointly optimize a transformation network and a quantized model through bi-level optimization. The transformation network modifies the original calibration data, and the modified data are used as the training set to learn the quantized model, with the objective that the quantized model achieves good performance on the original calibration data. Extensive experiments on the widely used ImageNet dataset with different neural network architectures demonstrate that our approach outperforms state-of-the-art PTQ methods. Code is available at this https URL.
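
The bi-level scheme can be sketched roughly as follows: an inner, differentiable gradient step fits the model on the transformed calibration data, and an outer step updates the transformation network so that the adapted model performs well on the original data. Real PTQ quantizers and the paper's exact losses are omitted; the toy networks, shapes, and single inner step below are assumptions for illustration.

```python
# Toy bi-level sketch (assumed shapes/losses): the inner step adapts the
# model on *transformed* calibration data; the outer step updates the
# transformation network so the adapted model fits the *original* data.
import torch
import torch.nn as nn

transform_net = nn.Sequential(nn.Linear(16, 16), nn.Tanh())  # augments calibration data
model = nn.Linear(16, 10)               # stand-in for the quantized model
outer_opt = torch.optim.Adam(transform_net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
inner_lr = 0.1

x = torch.randn(32, 16)                 # toy calibration batch
y = torch.randint(0, 10, (32,))

for step in range(100):
    # Inner step: one differentiable gradient step on the transformed data.
    train_loss = loss_fn(model(transform_net(x)), y)
    grads = torch.autograd.grad(train_loss, list(model.parameters()),
                                create_graph=True)
    w, b = [p - inner_lr * g for p, g in zip(model.parameters(), grads)]

    # Outer step: validate the adapted weights on the original data and
    # backpropagate through the inner step into the transformation network.
    val_loss = loss_fn(x @ w.t() + b, y)
    outer_opt.zero_grad()
    val_loss.backward()
    outer_opt.step()
```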


Explainable unsupervised multi-modal image registration using deep networks

Wang, Chengjia, Papanastasiou, Giorgos

arXiv.org Artificial Intelligence

Clinical decision making from magnetic resonance imaging (MRI) combines complementary information from multiple MRI sequences (defined as 'modalities'). MRI image registration aims to geometrically 'pair' diagnoses from different modalities, time points, and slices. Both intra- and inter-modality MRI registration are essential components in clinical MRI settings. Further, an MRI image processing pipeline that can address both affine and non-rigid registration is critical, as both types of deformation may occur in real MRI data scenarios. Unlike image classification, explainability is not commonly addressed in image registration deep learning (DL) methods, as it is challenging to interpret model-data behaviours against transformation fields. To properly address this, we incorporate Grad-CAM-based explainability frameworks into each major component of our unsupervised multi-modal and multi-organ image registration DL methodology. We previously demonstrated that we were able to reach superior performance (against the current standard SyN method). In this work, we show that our DL model becomes fully explainable, setting the framework to generalise our approach to further medical imaging data.
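
As a rough illustration of attaching Grad-CAM to a registration component, here is a generic sketch: hooks capture the activations and gradients of one convolutional layer, and the map is the gradient-weighted sum of activations. The toy two-channel (moving/fixed) input and the scalar score are assumptions, not the paper's pipeline.

```python
# Generic Grad-CAM probe (assumed toy setup): hooks capture activations and
# gradients of one conv layer; the map is the gradient-weighted activation sum.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(2, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU())
target_layer = net[2]
acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

# Moving and fixed images stacked on the channel axis (an assumption).
pair = torch.randn(1, 2, 64, 64, requires_grad=True)
score = net(pair).mean()     # stand-in for a similarity/registration score
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)  # channel-wise pooled grads
cam = torch.relu((weights * acts["v"]).sum(dim=1))   # gradient-weighted map
print(cam.shape)  # torch.Size([1, 64, 64])
```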


What to Learn: Features, Image Transformations, or Both?

Chen, Yuxuan, Xu, Binbin, Dümbgen, Frederike, Barfoot, Timothy D.

arXiv.org Artificial Intelligence

Long-term visual localization is an essential problem in robotics and computer vision, but it remains challenging due to the environmental appearance changes caused by lighting and seasons. While many existing works have attempted to solve it by directly learning invariant sparse keypoints and descriptors to match scenes, these approaches still struggle with adverse appearance changes. Recent developments in image transformations, such as neural style transfer, have emerged as an alternative for addressing such appearance gaps. In this work, we propose to combine an image transformation network and a feature-learning network to improve long-term localization performance. Given night-to-day image pairs, the image transformation network transforms the night images into day-like conditions prior to feature matching; the feature network learns to detect keypoint locations with their associated descriptor values, which can be passed to a classical pose estimator to compute the relative poses. We conducted various experiments to examine the effectiveness of combining style transfer with feature learning, as well as the corresponding training strategy, showing that such a combination greatly improves long-term localization performance.
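
Schematically, the two-stage pipeline might look like the following sketch, where a transformation network produces a day-like image and a feature network outputs a keypoint-score channel plus dense descriptors; both toy networks and all shapes are invented for illustration.

```python
# Schematic two-stage pipeline (all networks/shapes are toy assumptions):
# night image -> day-like image -> keypoint scores + dense descriptors.
import torch
import torch.nn as nn

transform_net = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Sigmoid())
feature_net = nn.Conv2d(3, 65, 3, padding=1)  # 1 score channel + 64-d descriptors

night = torch.randn(1, 3, 128, 128)
day_like = transform_net(night)        # stage 1: appearance transfer
out = feature_net(day_like)            # stage 2: joint detection/description
scores, desc = out[:, :1], out[:, 1:]

# Take the strongest responses as candidate keypoints; their matches would
# feed a classical pose estimator (e.g., RANSAC) downstream.
topk = scores.flatten().topk(100).indices
print(topk.shape)  # 100 candidate keypoint locations
```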


Planning with Learned Dynamic Model for Unsupervised Point Cloud Registration

Jiang, Haobo, Xie, Jin, Qian, Jianjun, Yang, Jian

arXiv.org Artificial Intelligence

Point cloud registration is a fundamental problem in 3D computer vision. In this paper, we cast point cloud registration as a planning problem in reinforcement learning, which seeks the transformation between the source and target point clouds through trial and error. By modeling the point cloud registration process as a Markov decision process (MDP), we develop a latent dynamic model of point clouds consisting of a transformation network and an evaluation network. The transformation network aims to predict the new transformed feature of the point cloud after a rigid transformation (i.e., an action) is performed on it, while the evaluation network aims to predict the alignment precision between the transformed source point cloud and the target point cloud as the reward signal. Once the dynamic model of the point cloud is trained, we employ the cross-entropy method (CEM) to iteratively update the planning policy by maximizing the rewards in the point cloud registration process. Thus, the optimal policy, i.e., the transformation between the source and target point clouds, can be obtained by gradually narrowing the search space of the transformation. Experimental results on the ModelNet40 and 7-Scenes benchmark datasets demonstrate that our method yields good registration performance in an unsupervised manner.
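
A compact CEM planning loop in the spirit of the abstract might look like this sketch: candidate rigid transformations (simplified here to 6-D pose vectors) are scored by a learned evaluation network, and the sampling distribution is refit on the elites, narrowing the search. The reward network is a toy stand-in.

```python
# Compact CEM loop (toy evaluator, assumed 6-D pose parameterization):
# sample candidate transformations, score them, refit on the elites.
import torch
import torch.nn as nn

reward_net = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 1))
mean, std = torch.zeros(6), torch.ones(6)

for it in range(10):
    candidates = mean + std * torch.randn(64, 6)      # 64 pose candidates
    with torch.no_grad():
        rewards = reward_net(candidates).squeeze(-1)  # predicted alignment quality
    elites = candidates[rewards.topk(8).indices]      # keep the best 8
    mean, std = elites.mean(0), elites.std(0) + 1e-6  # narrow the search space

print(mean)  # final transformation estimate (translation + rotation params)
```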


Sinogram Denoise Based on Generative Adversarial Networks

Chrysostomou, Charalambos

arXiv.org Artificial Intelligence

A novel method for sinogram denoising based on Generative Adversarial Networks (GANs) in the field of SPECT imaging is presented. Projection data from software phantoms were used to train the proposed model. To evaluate the efficacy of the method, Shepp-Logan-based phantoms with various levels of added noise were used. The resulting denoised sinograms are reconstructed using Ordered Subset Expectation Maximization (OSEM) and compared to the reconstructions of the original noised sinograms. As the results show, the proposed method significantly denoises the sinograms and markedly improves the reconstructions. Finally, to demonstrate the efficacy and capability of the proposed method, results from real-world DAT-SPECT sinograms are presented.
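
A bare-bones version of such a GAN training step, with a generator mapping noisy sinograms to denoised ones and a discriminator judging realism, could be sketched as follows; the network sizes, the 64x64 single-channel sinogram shape, and the plain BCE-plus-L1 losses are assumptions, not the paper's exact setup.

```python
# Bare-bones GAN step for sinogram denoising (all sizes/losses assumed):
# G maps noisy sinograms to denoised ones, D judges realism.
import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 1, 3, padding=1))
D = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                  nn.Flatten(), nn.Linear(16 * 32 * 32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

clean = torch.rand(4, 1, 64, 64)               # phantom projection data
noisy = clean + 0.1 * torch.randn_like(clean)  # simulated acquisition noise

# Discriminator step: real = clean sinograms, fake = denoised outputs.
fake = G(noisy).detach()
d_loss = bce(D(clean), torch.ones(4, 1)) + bce(D(fake), torch.zeros(4, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool the discriminator and stay close to the clean target.
fake = G(noisy)
g_loss = bce(D(fake), torch.ones(4, 1)) + F.l1_loss(fake, clean)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```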


Fast and Restricted Style Transfer

#artificialintelligence

In their seminal work, "Image Style Transfer Using Convolutional Neural Networks," Gatys et al. [R1] demonstrate the efficacy of CNNs in separating and re-combining image content and style to create composite artistic images. Using features extracted from intermediate layers of a pre-trained CNN, they define separate content and style loss functions and pose the style transfer task as an optimization problem: start from a random image and update its pixel values such that the individual loss functions are minimized. For more details, please refer to this article. One obvious caveat of this approach is that it is slow.
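
For concreteness, here is a condensed sketch of that optimization: freeze a pre-trained CNN, define content and style losses on intermediate features (style via Gram matrices), and update the pixels of an initially random image. The layer indices and loss weights below are common conventions, not the exact settings of Gatys et al.

```python
# Condensed pixel-optimization loop (layer indices and weights are common
# conventions, not the paper's exact settings); assumes batch size 1.
import torch
import torch.nn.functional as F
import torchvision.models as models

vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def features(x, layers=(3, 8, 17, 26)):
    out = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layers:
            out.append(x)
    return out

def gram(f):                         # style lives in feature correlations
    _, c, h, w = f.shape
    f = f.view(c, h * w)
    return f @ f.t() / (c * h * w)

content_img = torch.rand(1, 3, 128, 128)  # stand-ins for real images
style_img = torch.rand(1, 3, 128, 128)
x = torch.rand(1, 3, 128, 128, requires_grad=True)  # start from random pixels
opt = torch.optim.Adam([x], lr=0.05)

c_target = [f.detach() for f in features(content_img)]
s_target = [gram(f).detach() for f in features(style_img)]

for step in range(200):              # update pixels, not network weights
    feats = features(x)
    content_loss = F.mse_loss(feats[2], c_target[2])
    style_loss = sum(F.mse_loss(gram(f), t) for f, t in zip(feats, s_target))
    loss = content_loss + 1e3 * style_loss
    opt.zero_grad(); loss.backward(); opt.step()
```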


On the Transformation of Latent Space in Autoencoders

Cha, Jaehoon, Kim, Kyeong Soo, Lee, Sanghyuk

arXiv.org Machine Learning

Noting the importance of the latent variables in inference and learning, we propose a novel framework for autoencoders based on the homeomorphic transformation of latent variables --- which could reduce the distance between vectors in the transformed space, while preserving the topological properties of the original space --- and investigate the effect of the transformation in both learning generative models and denoising corrupted data. The results of our experiments show that the proposed model can work as both a generative model and a denoising model with improved performance due to the transformation compared to conventional variational and denoising autoencoders.
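
As a toy illustration of the idea, the simplest homeomorphic transformation that contracts latent distances is a scaling map z -> alpha*z with 0 < alpha < 1, which is invertible and topology-preserving; the sketch below uses it purely for illustration and is not the paper's transformation.

```python
# Toy illustration (assumed transform): a scaling map on the latent code is
# invertible, shrinks pairwise distances, and preserves topology.
import torch
import torch.nn as nn
import torch.nn.functional as F

alpha = 0.5  # contraction factor, 0 < alpha < 1
enc = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
dec = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(8, 784)       # stand-in for flattened images
z = enc(x)
z_t = alpha * z              # homeomorphic transform of the latent code
recon = dec(z_t)             # the decoder learns on the transformed codes
loss = F.mse_loss(recon, x)
print(loss.item(), torch.allclose(z, z_t / alpha))  # inverse recovers z
```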


Autoencoder Based Architecture For Fast & Real Time Audio Style Transfer

Ramani, Dhruv, Karmakar, Samarjit, Panda, Anirban, Ahmed, Asad, Tangri, Pratham

arXiv.org Machine Learning

Recently, there has been great interest in the field of audio style transfer, where a stylized audio signal is generated by imposing the style of a reference audio on the content of a target audio. We improve on current approaches, which use neural networks to extract the content and the style of the audio signal, and propose a new autoencoder-based architecture for the task. This network generates a stylized audio signal for a given content audio in a single forward pass. The proposed architecture proves advantageous in both the quality of the audio produced and the time taken to train the network. The network is evaluated on speech signals to confirm the validity of our proposal.
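
A schematic single-forward-pass architecture matching the abstract's description might look like the sketch below: a content encoder, a global style embedding from the reference audio, and a decoder that emits the stylized signal. Operating on raw 1-D waveforms and the specific layer sizes are simplifying assumptions.

```python
# Schematic single-pass audio style transfer (raw-waveform treatment and all
# layer sizes are simplifying assumptions, not the paper's architecture).
import torch
import torch.nn as nn

class AudioStyleAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.content_enc = nn.Conv1d(1, 32, 15, stride=4, padding=7)
        self.style_enc = nn.Sequential(
            nn.Conv1d(1, 32, 15, stride=4, padding=7),
            nn.AdaptiveAvgPool1d(1))               # one global style vector
        self.dec = nn.ConvTranspose1d(64, 1, 16, stride=4, padding=6)

    def forward(self, content_audio, style_audio):
        c = torch.relu(self.content_enc(content_audio))
        s = self.style_enc(style_audio).expand(-1, -1, c.shape[-1])
        return self.dec(torch.cat([c, s], dim=1))  # single forward pass

model = AudioStyleAE()
content = torch.randn(1, 1, 16000)  # one second of audio at 16 kHz
style = torch.randn(1, 1, 16000)
print(model(content, style).shape)  # torch.Size([1, 1, 16000])
```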


Audiovisual speaker conversion: jointly and simultaneously transforming facial expression and acoustic characteristics

Fang, Fuming, Wang, Xin, Yamagishi, Junichi, Echizen, Isao

arXiv.org Machine Learning

An audiovisual speaker conversion method is presented for simultaneously transforming the facial expressions and voice of a source speaker into those of a target speaker. Transforming the facial and acoustic features together makes it possible for the converted voice and facial expressions to be highly correlated and for the generated target speaker to appear and sound natural. The method uses three neural networks: a conversion network that fuses and transforms the facial and acoustic features, a waveform generation network that produces the waveform from the converted facial and acoustic features, and an image reconstruction network that outputs an RGB facial image, also from both converted features. The results of experiments using an emotional audiovisual database showed that the proposed method achieved significantly higher naturalness compared with one that separately transformed the acoustic and facial features.
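
Structurally, the three-network design can be sketched as below; all feature dimensions and the per-frame treatment are invented for clarity, and the stand-in modules are far simpler than the actual conversion, vocoder, and image networks.

```python
# Structural sketch of the three networks (all dimensions invented; real
# conversion, vocoder, and image models are far more elaborate).
import torch
import torch.nn as nn

conversion = nn.Sequential(nn.Linear(128 + 80, 256), nn.ReLU(),
                           nn.Linear(256, 128 + 80))      # fuse + transform
waveform_gen = nn.Linear(128 + 80, 200)                   # vocoder stand-in
image_recon = nn.Linear(128 + 80, 3 * 64 * 64)            # RGB face stand-in

face_feat = torch.randn(1, 128)   # source facial-expression features
audio_feat = torch.randn(1, 80)   # source acoustic features (e.g., one mel frame)

converted = conversion(torch.cat([face_feat, audio_feat], dim=1))
wave_frame = waveform_gen(converted)             # waveform from converted features
rgb = image_recon(converted).view(1, 3, 64, 64)  # facial image from the same features
print(wave_frame.shape, rgb.shape)
```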