
Reconstructive Visual Instruction Tuning

arXiv.org Artificial Intelligence

This paper introduces reconstructive visual instruction tuning (ROSS), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, ROSS prompts LMMs to supervise visual outputs by reconstructing input images. By doing so, it capitalizes on the inherent richness and detail present within the input images themselves, which are often lost in pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, ROSS employs a denoising objective to reconstruct latent representations of input images rather than directly regressing exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing their fine-grained comprehension capabilities and reducing hallucinations. Empirically, ROSS consistently brings significant improvements across different visual encoders and language models. In comparison with state-of-the-art alternatives that rely on extrinsic assistance by aggregating multiple visual experts, ROSS delivers competitive performance with a single SigLIP visual encoder, demonstrating the efficacy of our vision-centric supervision tailored for visual outputs.
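
As a rough illustration of the objective described above, the sketch below combines the usual next-token loss on text with a denoising reconstruction loss on latent image representations. It is not the authors' released code: `lmm`, `vision_tokenizer`, `denoiser`, and the simple noising scheme are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def ross_style_loss(lmm, vision_tokenizer, denoiser, images, input_ids, labels,
                    recon_weight=0.5):
    # Conventional visual instruction tuning: cross-entropy on text tokens only.
    out = lmm(images=images, input_ids=input_ids, output_hidden_states=True)
    text_loss = F.cross_entropy(
        out.logits.view(-1, out.logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # Vision-centric supervision: denoise latent representations of the input
    # image (e.g. from a frozen visual tokenizer) instead of regressing raw RGB,
    # which the abstract motivates by the heavy spatial redundancy of pixels.
    with torch.no_grad():
        latents = vision_tokenizer(images)          # (B, N, D) clean targets (assumed shape)
    t = torch.rand(latents.size(0), device=latents.device)   # per-sample noise level
    noise = torch.randn_like(latents)
    noisy_latents = latents + t.view(-1, 1, 1) * noise        # simple noising scheme (assumption)

    # Assume the first N hidden states correspond to the visual token positions.
    visual_hidden = out.hidden_states[-1][:, : latents.size(1)]
    pred_noise = denoiser(noisy_latents, visual_hidden, t)
    recon_loss = F.mse_loss(pred_noise, noise)

    return text_loss + recon_weight * recon_loss
```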


SubjectDrive: Scaling Generative Data in Autonomous Driving via Subject Control

arXiv.org Artificial Intelligence

Autonomous driving progress relies on large-scale annotated datasets. In this work, we explore the potential of generative models to produce vast quantities of freely labeled data for autonomous driving applications and present SubjectDrive, the first model proven to scale generative data production in a way that can continuously improve autonomous driving applications. We investigate the impact of scaling up the quantity of generative data on the performance of downstream perception models and find that enhancing data diversity plays a crucial role in effectively scaling generative data production. Therefore, we have developed a novel model equipped with a subject control mechanism, which allows the generative model to leverage diverse external data sources to produce varied and useful data. Extensive evaluations confirm SubjectDrive's efficacy in generating scalable autonomous driving training data, marking a significant step toward revolutionizing data production methods in this field.
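
To make the subject-control idea concrete, here is a minimal, purely illustrative sketch: externally sourced subject exemplars are encoded and injected as conditions alongside the annotated scene layout. The `subject_encoder` and `video_generator` interfaces are assumptions, not SubjectDrive's actual API.

```python
import torch

def generate_with_subject_control(video_generator, subject_encoder,
                                  scene_layout, subject_images):
    # Encode externally sourced subject exemplars (e.g. vehicles, pedestrians)
    # into embeddings that steer generation toward those subjects.
    subject_tokens = torch.stack([subject_encoder(img) for img in subject_images])

    # Generate a driving clip conditioned jointly on the annotated scene layout
    # and the injected subject embeddings, so the same annotation can be rendered
    # with diverse appearances drawn from external data.
    return video_generator(layout=scene_layout, subject_tokens=subject_tokens)
```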


Simulating Nighttime Visible Satellite Imagery of Tropical Cyclones Using Conditional Generative Adversarial Networks

arXiv.org Artificial Intelligence

Visible (VIS) satellite imagery has various important applications in meteorology, including the monitoring of Tropical Cyclones (TCs). However, it is unavailable at night because of the lack of sunlight. This study presents a Conditional Generative Adversarial Network (CGAN) model that generates highly accurate nighttime visible reflectance using infrared (IR) bands and sunlight direction parameters as input. The model was trained and validated using daytime target-area observations from the Advanced Himawari Imager (AHI). This study also presents the first nighttime model validation using the Day/Night Band (DNB) of the Visible Infrared Imaging Radiometer Suite (VIIRS). The daytime statistical results for the Structural Similarity Index Measure (SSIM), Peak Signal-to-Noise Ratio (PSNR), Root Mean Square Error (RMSE), Correlation Coefficient (CC), and Bias are 0.885, 28.3, 0.0428, 0.984, and -0.0016, respectively, surpassing the performance of models in previous studies. The nighttime statistical results for SSIM, PSNR, RMSE, and CC are 0.821, 24.4, 0.0643, and 0.969, respectively; these are slightly degraded by the parallax between the satellites. We also performed full-disk validation, which shows that the model can be readily applied to tropical-ocean regions of the Northern Hemisphere without TCs. This model contributes to the nighttime monitoring of meteorological phenomena by providing accurate AI-generated visible imagery with adjustable virtual sunlight directions.
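
For reference, the reported image-quality metrics can be computed as in the sketch below, using standard definitions; SSIM comes from scikit-image, and the assumed reflectance range of [0, 1] is our assumption rather than something stated in the abstract.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def reflectance_metrics(pred, obs, data_range=1.0):
    """Compare generated and observed VIS reflectance images (2-D arrays)."""
    pred = pred.astype(np.float64)
    obs = obs.astype(np.float64)

    rmse = np.sqrt(np.mean((pred - obs) ** 2))           # root mean square error
    psnr = 20.0 * np.log10(data_range / rmse)            # peak signal-to-noise ratio
    cc = np.corrcoef(pred.ravel(), obs.ravel())[0, 1]    # Pearson correlation coefficient
    bias = np.mean(pred - obs)                           # mean signed difference
    s = ssim(obs, pred, data_range=data_range)           # structural similarity

    return {"SSIM": s, "PSNR": psnr, "RMSE": rmse, "CC": cc, "Bias": bias}
```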


ADriver-I: A General World Model for Autonomous Driving

arXiv.org Artificial Intelligence

Typically, autonomous driving adopts a modular design, which divides the full stack into perception, prediction, planning, and control parts. Though interpretable, such a modular design tends to introduce a substantial amount of redundancy. Recently, multimodal large language models (MLLMs) and diffusion techniques have demonstrated superior comprehension and generation abilities. In this paper, we first introduce the concept of the interleaved vision-action pair, which unifies the format of visual features and control signals. Based on these vision-action pairs, we construct a general world model for autonomous driving built on an MLLM and a diffusion model, termed ADriver-I. It takes vision-action pairs as input and autoregressively predicts the control signal of the current frame. The generated control signal, together with the historical vision-action pairs, is then used as a condition to predict the future frame. With the predicted next frame, ADriver-I performs further control-signal prediction. This process can be repeated indefinitely, so ADriver-I achieves autonomous driving in the world created by itself. Extensive experiments are conducted on nuScenes and our large-scale private datasets. ADriver-I shows impressive performance compared to several constructed baselines. We hope ADriver-I can provide new insights for future autonomous driving and embodied intelligence.
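
The interleaved rollout described above can be summarized by the following sketch; the `mllm` and `frame_diffuser` objects are hypothetical stand-ins rather than the actual ADriver-I implementation.

```python
def rollout(mllm, frame_diffuser, history, num_steps):
    """history: list of (frame, action) interleaved vision-action pairs."""
    for _ in range(num_steps):
        # 1. Autoregressively predict the control signal for the current frame
        #    from the vision-action history.
        action = mllm.predict_action(history)

        # 2. Condition on the history plus the new control signal to synthesize
        #    the next frame, i.e. the world the model continues to drive in.
        next_frame = frame_diffuser.generate(history, action)

        # 3. Append the new vision-action pair and repeat.
        history.append((next_frame, action))
    return history
```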


Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

arXiv.org Artificial Intelligence

Given a piece of speech and its transcript, text-based speech editing aims to generate speech that can be seamlessly inserted into the given speech by editing the transcript. Existing methods adopt a two-stage approach: synthesize the input text using a generic text-to-speech (TTS) engine and then transform the voice into the desired voice using voice conversion (VC). A major limitation of this framework is that VC is itself a challenging problem that usually requires a moderate amount of parallel training data to work satisfactorily. In this paper, we propose a one-stage context-aware framework that generates natural and coherent target speech without any training data of the target speaker. In particular, we manage to perform accurate zero-shot duration prediction for the inserted text. The predicted duration is used to regulate both the text embedding and the speech embedding. Then, based on the aligned cross-modality input, we directly generate the mel-spectrogram of the edited speech with a transformer-based decoder. Subjective listening tests show that, despite the lack of training data for the target speaker, our method achieves satisfactory results. It outperforms a recent zero-shot TTS engine by a large margin.
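
A minimal sketch of the one-stage pipeline described above is given below: predicted per-phoneme durations expand the text embedding of the inserted span to frame resolution, the context speech embedding supplies speaker and prosody cues, and a decoder emits the mel-spectrogram of the edited region. All module names (`duration_predictor`, `text_encoder`, `speech_encoder`, `mel_decoder`) are hypothetical stand-ins rather than the paper's implementation.

```python
import torch

def synthesize_insertion(duration_predictor, text_encoder, speech_encoder,
                         mel_decoder, inserted_phonemes, context_mel):
    text_emb = text_encoder(inserted_phonemes)    # (T_text, D) embeddings of inserted text
    speech_emb = speech_encoder(context_mel)      # (T_ctx, D) context / zero-shot speaker cues

    # Zero-shot duration prediction for the inserted text, conditioned on context.
    durations = duration_predictor(text_emb, speech_emb)   # (T_text,) integer frame counts

    # Length-regulate: repeat each text embedding by its predicted duration so the
    # text stream is aligned with the frame-level speech stream.
    frame_emb = torch.repeat_interleave(text_emb, durations, dim=0)

    # Decode the mel-spectrogram of the edited segment from the aligned
    # cross-modality input with a transformer-based decoder.
    return mel_decoder(torch.cat([speech_emb, frame_emb], dim=0))
```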