Rethinking Video Tokenization: A Conditioned Diffusion-based Approach

arXiv.org Artificial Intelligence

Existing video tokenizers typically use the traditional Variational Autoencoder (VAE) architecture for video compression and reconstruction. However, to achieve good performance, their training often relies on complex multi-stage tricks that go beyond the basic reconstruction loss and KL regularization. Among these tricks, the most challenging is the precise tuning of adversarial training with additional Generative Adversarial Networks (GANs) in the final stage, which can hinder stable convergence. In contrast to GANs, diffusion models offer more stable training and can generate higher-quality results. Inspired by these advantages, we propose CDT, a novel Conditioned Diffusion-based video Tokenizer that replaces the GAN-based decoder with a conditional causal diffusion model. The encoder compresses spatio-temporal information into compact latents, while the decoder reconstructs videos through a reverse diffusion process conditioned on these latents. During inference, we incorporate a feature cache mechanism to generate videos of arbitrary length while maintaining temporal continuity, and adopt a sampling acceleration technique to enhance efficiency. CDT is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Extensive experiments demonstrate that it achieves state-of-the-art performance in video reconstruction tasks with just single-step sampling. Even a scaled-down version of CDT (3× inference speedup) still performs comparably with top baselines. Moreover, the latent video generation model trained with CDT also exhibits superior performance. The source code and pretrained weights will be released shortly.


ZO-DARTS++: An Efficient and Size-Variable Zeroth-Order Neural Architecture Search Algorithm

arXiv.org Artificial Intelligence

Differentiable Neural Architecture Search (NAS) provides a promising avenue for automating the complex design of deep learning (DL) models. However, current differentiable NAS methods often face constraints in efficiency, operation selection, and adaptability under varying resource limitations. We introduce ZO-DARTS++, a novel NAS method that effectively balances performance and resource constraints. By integrating a zeroth-order approximation for efficient gradient handling, employing a sparsemax function with temperature annealing for clearer and more interpretable architecture distributions, and adopting a size-variable search scheme for generating compact yet accurate architectures, ZO-DARTS++ establishes a new balance between model complexity and performance. In extensive tests on medical imaging datasets, ZO-DARTS++ improves the average accuracy by up to 1.8% over standard DARTS-based methods and shortens search time by approximately 38.6%. Additionally, its resource-constrained variants can reduce the number of parameters by more than 35% while maintaining competitive accuracy levels. Thus, ZO-DARTS++ offers a versatile and efficient framework for generating high-quality, resource-aware DL models suitable for real-world medical applications.
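To illustrate why a sparsemax with temperature annealing yields clearer architecture distributions, here is a minimal sketch of the standard sparsemax projection applied to temperature-scaled scores. It is not the ZO-DARTS++ implementation: the function name, the example scores, and the temperature values are assumptions chosen for illustration.

```python
def sparsemax(scores, temperature=1.0):
    """Project scores / temperature onto the probability simplex (sparsemax).

    Illustrative helper, not taken from the paper. Unlike softmax, the
    result can contain exact zeros, so weak operations drop out entirely.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    z = [s / temperature for s in scores]
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, value in enumerate(z_sorted, start=1):
        cumulative += value
        # Index k belongs to the support while 1 + k * z_(k) > sum of top-k scores.
        # The condition holds for a prefix of k values, so the last update wins.
        if 1.0 + k * value > cumulative:
            tau = (cumulative - 1.0) / k
    # Subtract the threshold and clip at zero.
    return [max(v - tau, 0.0) for v in z]
```

Calling `sparsemax([1.0, 0.8, 0.1], temperature=t)` for decreasing `t` (for example 10, 1, 0.1) moves the output from a dense distribution toward a one-hot choice, which is the effect annealing is meant to produce.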


On the Fly Adaptation of Behavior Tree-Based Policies through Reinforcement Learning

arXiv.org Artificial Intelligence

With the rising demand for flexible manufacturing, robots are increasingly expected to operate in dynamic environments where local disturbances--such as slight offsets or size differences in workpieces--are common. We propose to address the problem of adapting robot behaviors to these task variations with a sample-efficient hierarchical reinforcement learning approach that adapts Behavior Tree (BT)-based policies. We maintain the core BT properties as an interpretable, modular framework for structuring reactive behaviors, but extend their use beyond static tasks by inherently accommodating local task variations. To show the efficiency and effectiveness of our approach, we conduct experiments both in simulation and on a 7-DoF Franka Emika Panda manipulator, which adapts to different obstacle avoidance and pivoting tasks.


Generation of Optimized Solidity Code for Machine Learning Models using LLMs

arXiv.org Artificial Intelligence

While a plethora of machine learning (ML) models are currently available, along with their implementations on disparate platforms, there is hardly any verifiable ML code that can be executed on public blockchains. We propose a novel approach named LMST that uses Large Language Models (LLMs) to convert the inference path of an ML model, together with its weights trained off-chain, into Solidity code. Extensive prompt engineering is done to optimize gas cost beyond mere correctness of the produced code, while taking into consideration the capabilities and limitations of the Ethereum Virtual Machine. We have also developed a proof-of-concept decentralized application that uses the generated code to verify the accuracy claims of the underlying ML model. An extensive set of experiments demonstrates the feasibility of deploying ML models on blockchains through automated code translation using LLMs.


FSDP: Fast and Safe Data-Driven Overtaking Trajectory Planning for Head-to-Head Autonomous Racing Competitions

arXiv.org Artificial Intelligence

Generating overtaking trajectories in autonomous racing is a challenging task, as the trajectory must satisfy the vehicle's dynamics and ensure safety and real-time performance while running on resource-constrained hardware. This work proposes the Fast and Safe Data-Driven Planner (FSDP) to address this challenge. Sparse Gaussian predictions are introduced to improve both the computational efficiency and accuracy of opponent predictions. Furthermore, the proposed approach employs a bi-level quadratic programming framework to generate an overtaking trajectory leveraging the opponent predictions. The first level uses polynomial fitting to generate a rough trajectory, from which reference states and control inputs are derived for the second level. The second level formulates a model predictive control optimization problem in the Frenet frame, generating a trajectory that satisfies both kinematic feasibility and safety. Experimental results on the F1TENTH platform show that our method outperforms the state of the art: compared to the Predictive Spliner method, it achieves an 8.93% higher overtaking success rate, supports the highest opponent speed, produces a smoother ego trajectory, and reduces computational time by 74.04%. The code is available at: https://github.com/ZJU-DDRX/FSDP.
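Since the second level is formulated in the Frenet frame, a short sketch of the coordinate change may help. The code below converts a Cartesian point into arc length s and signed lateral offset d relative to a polyline reference. It is a generic illustration, not code from the FSDP repository: the function name, the polyline representation of the reference line, and the left-positive sign convention are all assumptions.

```python
import math

def cartesian_to_frenet(point, reference):
    """Return (s, d) for a point relative to a polyline reference line.

    s is the arc length of the closest point on the polyline; d is the
    lateral offset, positive to the left of the direction of travel.
    Hypothetical helper for illustration only.
    """
    px, py = point
    best = None  # (squared distance, s, d) of the closest segment so far
    s_start = 0.0
    for (x0, y0), (x1, y1) in zip(reference, reference[1:]):
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # skip duplicated waypoints
        # Projection parameter along the segment, clamped to [0, 1].
        t = ((px - x0) * dx + (py - y0) * dy) / (length * length)
        t = min(1.0, max(0.0, t))
        cx, cy = x0 + t * dx, y0 + t * dy
        dist_sq = (px - cx) ** 2 + (py - cy) ** 2
        if best is None or dist_sq < best[0]:
            # The sign of the 2D cross product tells left (+) from right (-).
            cross = dx * (py - y0) - dy * (px - x0)
            d = math.copysign(math.sqrt(dist_sq), cross)
            best = (dist_sq, s_start + t * length, d)
        s_start += length
    if best is None:
        raise ValueError("reference needs at least one non-degenerate segment")
    return best[1], best[2]
```

In these coordinates, progress along the track and deviation from the reference line are separate variables, which is what makes constraints such as track bounds and lateral clearance from an opponent simple to express.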


A Survey of Sim-to-Real Methods in RL: Progress, Prospects and Challenges with Foundation Models

arXiv.org Artificial Intelligence

Deep Reinforcement Learning (RL) has been explored and verified to be effective in solving decision-making tasks in various domains, such as robotics, transportation, and recommender systems. It learns from interaction with environments and updates the policy using the collected experience. However, due to limited real-world data and the unbearable consequences of taking detrimental actions, the learning of RL policies is mainly restricted to simulators. This practice guarantees safety in learning but introduces an inevitable sim-to-real gap at deployment, causing degraded performance and risks in execution. There have been attempts to solve sim-to-real problems in different domains with various techniques, especially now that emerging techniques such as large foundation or language models have shed new light on the sim-to-real problem. This survey paper, to the best of our knowledge, presents the first taxonomy that formally frames sim-to-real techniques in terms of the key elements of the Markov Decision Process (State, Action, Transition, and Reward). Based on this framework, we cover comprehensive literature from the classic to the most advanced methods, including sim-to-real techniques empowered by foundation models, and we also discuss the specifics that deserve attention in different domains of sim-to-real problems. We then summarize the formal evaluation process of sim-to-real performance with accessible code or benchmarks. Challenges and opportunities are also presented to encourage future exploration of this direction. We are actively maintaining a repository that includes the most up-to-date sim-to-real research work to help domain researchers.


Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding

arXiv.org Artificial Intelligence

Traditional text-to-speech (TTS) methods primarily focus on establishing a mapping between phonemes and mel-spectrograms. However, during the phoneme encoding stage, there is often a lack of real mel-spectrogram auxiliary information, which results in the encoding process lacking true semantic understanding. At the same time, traditional TTS systems often struggle to balance the inference speed of the model with the quality of the synthesized speech. Methods that generate high-quality synthesized speech tend to have slower inference speeds, while faster inference methods often sacrifice speech quality. In this paper, I propose Clip-TTS, a TTS method based on the Clip architecture. This method uses the Clip framework to establish a connection between text content and real mel-spectrograms during the text encoding stage, enabling the text encoder to directly learn the true semantics of the global context, thereby ensuring the quality of the synthesized speech. In terms of model architecture, I adopt the basic structure of Transformer, which allows Clip-TTS to achieve fast inference speeds. Experimental results show that on the LJSpeech and Baker datasets, the speech generated by Clip-TTS achieves state-of-the-art MOS scores, and it also performs excellently on multi-emotion datasets. Audio samples are available at: https://ltydd1314.github.io/.
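The connection between text content and mel-spectrograms is described as following the Clip framework, whose usual training signal is a symmetric contrastive loss over paired embeddings. The sketch below shows that generic loss in plain Python. It is an assumption about the general shape of such an objective, not the Clip-TTS code: the function names, the temperature value, and the use of non-zero list-of-floats embeddings are all illustrative.

```python
import math

def _normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_contrastive_loss(text_emb, mel_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (text, mel) embedding pairs.

    Row i of text_emb is assumed to be paired with row i of mel_emb.
    """
    text = [_normalize(v) for v in text_emb]
    mel = [_normalize(v) for v in mel_emb]
    # Cosine similarity between every text and every mel embedding.
    logits = [[sum(a * b for a, b in zip(t, m)) / temperature for m in mel]
              for t in text]
    logits_transposed = [list(column) for column in zip(*logits)]
    # Average the text-to-mel and mel-to-text directions.
    return 0.5 * (_cross_entropy_diagonal(logits)
                  + _cross_entropy_diagonal(logits_transposed))
```

The loss is small when each text embedding is most similar to its own mel-spectrogram embedding and large when pairs are mismatched, which pushes the text encoder toward representations that carry acoustic information.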


EPR-GAIL: An EPR-Enhanced Hierarchical Imitation Learning Framework to Simulate Complex User Consumption Behaviors

arXiv.org Artificial Intelligence

User consumption behavior data, which records individuals' online spending history at various types of stores, has been widely used in applications such as store recommendation, site selection, and sales forecasting. However, its value is limited by deficiencies in data comprehensiveness and by changes in application scenarios. Thus, generating high-quality sequential consumption data by simulating complex user consumption behaviors is of great importance to real-world applications. The two branches of existing sequence generation methods are both limited in quality. Model-based methods with simplified assumptions fail to model the complex decision process of user consumption, while data-driven methods that emulate real-world data are prone to noise, unobserved behaviors, and a dynamic decision space. In this work, we propose to enhance the fidelity and trustworthiness of the data-driven Generative Adversarial Imitation Learning (GAIL) method by blending it with the Exploration and Preferential Return (EPR) model. The core idea of our EPR-GAIL framework is to model user consumption behaviors as a complex EPR decision process, which consists of purchase, exploration, and preference decisions. Specifically, we design the hierarchical policy function in the generator as a realization of the EPR decision process and employ the probability distributions of the EPR model to guide the reward function in the discriminator. Extensive experiments on two real-world datasets of user consumption behaviors on an online platform demonstrate that the EPR-GAIL framework outperforms the best state-of-the-art baseline by over 19% in terms of data fidelity. Furthermore, the generated consumption behavior data can improve the performance of sales prediction and location recommendation by up to 35.29% and 11.19%, respectively, validating its advantage for practical applications.
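For readers unfamiliar with the EPR model, the sketch below simulates its classic form: at each step the user explores a new store with probability rho * S^(-gamma), where S is the number of distinct stores visited so far, and otherwise returns to a known store with probability proportional to past visits. This is only the basic model, not the EPR-GAIL framework, and it omits the purchase decision. The function name, the integer store ids, and the default values of rho and gamma are illustrative assumptions.

```python
import random

def simulate_epr(steps, rho=0.6, gamma=0.21, seed=0):
    """Simulate a store-visit sequence under the classic EPR model.

    Hypothetical helper; parameter defaults are illustrative only.
    Returns the visit sequence and a dict of visit counts per store id.
    """
    rng = random.Random(seed)
    visits = {0: 1}   # store id -> visit count; the user starts at store 0
    sequence = [0]
    for _ in range(steps):
        distinct = len(visits)
        # The chance of exploring shrinks as more distinct stores are known.
        p_explore = rho * distinct ** (-gamma)
        if rng.random() < p_explore:
            store = distinct  # exploration: a store never visited before
            visits[store] = 1
        else:
            # Preferential return: weight known stores by past visit counts.
            stores = list(visits)
            weights = [visits[s] for s in stores]
            store = rng.choices(stores, weights=weights, k=1)[0]
            visits[store] += 1
        sequence.append(store)
    return sequence, visits
```

Running `simulate_epr(200)` produces a sequence in which a few stores accumulate most of the visits while new stores appear at a slowing rate, the qualitative pattern that the exploration and preference decisions are meant to capture.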


Single Domain Generalization with Adversarial Memory

arXiv.org Artificial Intelligence

Domain Generalization (DG) aims to train models that can generalize to unseen testing domains by leveraging data from multiple training domains. However, traditional DG methods rely on the availability of multiple diverse training domains, limiting their applicability in data-constrained scenarios. Single Domain Generalization (SDG) addresses the more realistic and challenging setting by restricting the training data to a single domain distribution. The main challenges in SDG stem from the limited diversity of training data and the inaccessibility of unseen testing data distributions. To tackle these challenges, we propose a single domain generalization method that leverages an adversarial memory bank to augment training features. Our memory-based feature augmentation network maps both training and testing features into an invariant subspace spanned by diverse memory features, implicitly aligning the training and testing domains in the projected space. To maintain a diverse and representative feature memory bank, we introduce an adversarial feature generation method that creates features extending beyond the training domain distribution. Experimental results demonstrate that our approach achieves state-of-the-art performance on standard single domain generalization benchmarks.


TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems

arXiv.org Artificial Intelligence

This paper focuses on multimodal alignment within the realm of Artificial Intelligence, particularly in the text and image modalities. The semantic gap between the textual and visual modalities creates a discrepancy that limits the effectiveness of multimodal fusion. Therefore, we introduce the Text-Image Joint Embedding Predictive Architecture (TI-JEPA), an innovative pre-training strategy that leverages an energy-based model (EBM) framework to capture complex cross-modal relationships. TI-JEPA draws on the flexibility of EBMs in self-supervised learning to improve the compatibility between textual and visual elements. Through extensive experiments across multiple benchmarks, we demonstrate that TI-JEPA achieves state-of-the-art performance on the multimodal sentiment analysis task (and potentially on a wide range of multimodal tasks, such as Visual Question Answering), outperforming existing pre-training methodologies. Our findings highlight the potential of energy-based frameworks for advancing multimodal fusion and suggest significant improvements for downstream applications.