Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach
Deng, Shijian, Zhao, Wentian, Li, Yu-Jhe, Wan, Kun, Miranda, Daniel, Kale, Ajinkya, Tian, Yapeng
Self-improvement in multimodal large language models (MLLMs) is crucial for enhancing their reliability and robustness. However, current methods often rely heavily on MLLMs themselves as judges, leading to high computational costs and potential pitfalls such as reward hacking and model collapse. This paper introduces a novel, model-level, judge-free self-improvement framework. Our approach employs a controlled feedback mechanism while eliminating the need for MLLMs in the verification loop. We generate preference learning pairs using a controllable hallucination mechanism and optimize data quality by leveraging lightweight, contrastive language-image encoders to evaluate the pairs and reverse them when necessary. Evaluations on public benchmarks and our newly introduced IC dataset, designed to challenge hallucination control, demonstrate that our model outperforms conventional techniques, achieving superior precision and recall with significantly lower computational demands. This method offers an efficient pathway to scalable self-improvement in MLLMs, balancing performance gains with reduced resource requirements.
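The pair-ordering step described above can be sketched concretely. Below is a minimal illustration, assuming an off-the-shelf CLIP model from the Hugging Face transformers library as the lightweight contrastive language-image encoder; the function name order_preference_pair and the checkpoint choice are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: score a (preferred, rejected) caption pair against an
# image with a lightweight contrastive encoder (CLIP) and reverse the pair
# when the "rejected" caption actually aligns better with the image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def order_preference_pair(image: Image.Image, preferred: str, rejected: str):
    """Return (chosen, rejected), swapping the pair if CLIP disagrees."""
    inputs = processor(text=[preferred, rejected], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        # Similarity of the single image to each of the two candidate texts.
        logits = model(**inputs).logits_per_image[0]
    if logits[0] >= logits[1]:
        return preferred, rejected
    return rejected, preferred  # reverse: the encoder rates the "rejected" text higher
```

Because the encoder only scores pairs rather than generating judgments, this check runs at a small fraction of the cost of an MLLM-as-judge verification loop.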
3D-Aware Encoding for Style-based Neural Radiance Fields
Li, Yu-Jhe, Xu, Tao, Wu, Bichen, Zheng, Ningyuan, Dai, Xiaoliang, Pumarola, Albert, Zhang, Peizhao, Vajda, Peter, Kitani, Kris
We tackle the task of NeRF inversion for style-based neural radiance fields (e.g., StyleNeRF). In this task, we aim to learn an inversion function that projects an input image to the latent space of a NeRF generator and then synthesizes novel views of the original image from the latent code. Compared with GAN inversion for 2D generative models, NeRF inversion needs to not only 1) preserve the identity of the input image, but also 2) ensure 3D consistency in generated novel views. This requires the latent code obtained from a single-view image to be invariant across multiple views. To address this new challenge, we propose a two-stage encoder for style-based NeRF inversion. In the first stage, we introduce a base encoder that converts the input image to a latent code. To ensure the latent code is view-invariant and able to synthesize 3D-consistent novel-view images, we train the base encoder with identity contrastive learning. In the second stage, to better preserve the identity of the input image, we introduce a refining encoder that refines the latent code and adds finer details to the output image. Notably, the novelty of this model lies in the design of its first-stage encoder, which produces the closest latent code on the latent manifold, so that the refinement in the second stage stays close to the NeRF manifold. Through extensive experiments, we demonstrate that our proposed two-stage encoder outperforms existing inversion encoders, both qualitatively and quantitatively, in image reconstruction and novel-view rendering.
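As a rough illustration of identity contrastive learning over latent codes, the sketch below applies a standard InfoNCE objective to codes encoded from two views of the same identities, pulling same-identity codes together and pushing different identities apart. This is a plausible formulation under our own assumptions, not the paper's exact loss.

```python
# Minimal sketch (not the paper's code) of an identity contrastive loss that
# encourages latent codes from different views of the same subject to match.
# `z_a` and `z_b` hold codes from two random views of the same N identities.
import torch
import torch.nn.functional as F

def identity_contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    z_a = F.normalize(z_a, dim=-1)          # (N, D) codes from view 1
    z_b = F.normalize(z_b, dim=-1)          # (N, D) codes from view 2, same identities
    logits = z_a @ z_b.t() / temperature    # pairwise cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Same-identity pairs sit on the diagonal; InfoNCE pulls them together
    # and pushes apart codes belonging to different identities.
    return F.cross_entropy(logits, targets)
```

Driving this loss to zero makes the base encoder's output depend on identity rather than viewpoint, which is exactly the view-invariance the task requires.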
Deep Reinforcement Learning for Playing 2.5D Fighting Games
Li, Yu-Jhe, Chang, Hsin-Yu, Lin, Yu-Jing, Wu, Po-Wei, Wang, Yu-Chiang Frank
Deep reinforcement learning has shown success in game playing. However, 2.5D fighting games remain challenging due to ambiguity in visual appearance, such as the height or depth of the characters. Moreover, actions in such games typically follow particular sequential orders, which also makes the network design difficult. Building on the Asynchronous Advantage Actor-Critic (A3C) network, we create an OpenAI-Gym-like gaming environment for the game Little Fighter 2 (LF2) and present a novel A3C+ network for learning RL agents. The introduced model includes a Recurrent Info network, which processes game-related info features with recurrent layers to track combo skills for fighting. In the experiments, we consider LF2 under different settings and successfully demonstrate the use of our proposed model for learning 2.5D fighting games.
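As a rough sketch of how a recurrent branch over game info features can sit alongside the visual stream in an actor-critic network, consider the following. The architecture details (layer sizes, a single LSTM over the info features) are illustrative assumptions, not the paper's exact A3C+ design.

```python
# Illustrative actor-critic with a recurrent branch over game "info" features,
# so sequential combo inputs can be tracked alongside visual observations.
import torch
import torch.nn as nn

class A3CPlus(nn.Module):
    def __init__(self, n_actions: int, info_dim: int = 16, hidden: int = 128):
        super().__init__()
        self.conv = nn.Sequential(                  # visual branch over stacked frames
            nn.Conv2d(4, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(), nn.Flatten())
        self.info_rnn = nn.LSTM(info_dim, hidden, batch_first=True)
        self.fc = nn.Linear(32 * 9 * 9 + hidden, hidden)
        self.policy = nn.Linear(hidden, n_actions)  # actor head
        self.value = nn.Linear(hidden, 1)           # critic head

    def forward(self, frames, info_seq):
        v = self.conv(frames)                 # frames: (B, 4, 84, 84) -> (B, 2592)
        _, (h, _) = self.info_rnn(info_seq)   # info_seq: (B, T, info_dim)
        x = torch.relu(self.fc(torch.cat([v, h[-1]], dim=-1)))
        return self.policy(x), self.value(x)
```

The key design point is that the LSTM's hidden state summarizes the recent history of info features, letting the policy condition on where the agent is within a multi-step combo rather than on a single frame.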
Deep Learning for Malicious Flow Detection
Chen, Yun-Chun, Li, Yu-Jhe, Tseng, Aragorn, Lin, Tsungnan
Cyber security has become a pressing issue in recent years, and identifying potential malware is a challenging task. To tackle this challenge, we adopt deep learning approaches and perform flow detection on real data. However, real data often exhibits an imbalanced class distribution, which leads to gradient dilution. When training a neural network, this problem not only biases the model toward the majority class but also prevents it from learning the minority classes. In this paper, we propose an end-to-end trainable Tree-Shaped Deep Neural Network (TSDNN) that classifies the data in a layer-wise manner. To better learn from the minority classes, we propose a Quantity-Dependent Backpropagation (QDBP) algorithm that incorporates knowledge of the disparity between classes. We evaluate our method on an imbalanced data set. Experimental results demonstrate that our approach outperforms state-of-the-art methods and confirm that the proposed method overcomes the difficulty of imbalanced learning. We also conduct a partial-flow experiment, which shows the feasibility of real-time detection, and a zero-shot learning experiment, which demonstrates the generalization capability of deep learning in cyber security.
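The intuition behind quantity-dependent weighting can be shown with a short sketch that scales each class's gradient contribution inversely with its sample count, countering gradient dilution from the majority class. This illustrates the idea, not the paper's exact QDBP algorithm.

```python
# Sketch: per-class loss weights inversely proportional to class frequency,
# so minority-class examples contribute comparably strong gradients.
import torch
import torch.nn.functional as F

def quantity_weighted_loss(logits: torch.Tensor, targets: torch.Tensor,
                           class_counts: torch.Tensor) -> torch.Tensor:
    counts = class_counts.float()
    weights = counts.sum() / (counts.numel() * counts)  # rare classes get large weights
    return F.cross_entropy(logits, targets, weight=weights)

# Example: a 3-class flow-detection problem where class 0 dominates the data.
logits = torch.randn(8, 3)
targets = torch.randint(0, 3, (8,))
loss = quantity_weighted_loss(logits, targets, torch.tensor([9000, 500, 100]))
```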