Gao, Zhi
MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge
Du, Yuntao, Jiang, Kailin, Gao, Zhi, Shi, Chenrui, Zheng, Zilong, Qi, Siyuan, Li, Qing
Knowledge editing techniques have emerged as essential tools for updating the factual knowledge of large language models (LLMs) and large multimodal models (LMMs), allowing them to correct outdated or inaccurate information without retraining from scratch. However, existing benchmarks for multimodal knowledge editing primarily focus on entity-level knowledge represented as simple triplets, which fail to capture the complexity of real-world multimodal information. To address this issue, we introduce MMKE-Bench, a comprehensive MultiModal Knowledge Editing Benchmark designed to evaluate the ability of LMMs to edit diverse visual knowledge in real-world scenarios. MMKE-Bench addresses these limitations by incorporating three types of editing tasks: visual entity editing, visual semantic editing, and user-specific editing. In addition, MMKE-Bench uses free-form natural language to represent and edit knowledge, offering a more flexible and effective format. The benchmark consists of 2,940 pieces of knowledge and 8,363 images across 33 broad categories, with evaluation questions automatically generated and human-verified. We assess five state-of-the-art knowledge editing methods on three prominent LMMs, revealing that no method excels across all criteria, and that visual and user-specific edits are particularly challenging. MMKE-Bench sets a new standard for evaluating the robustness of multimodal knowledge editing techniques, driving progress in this rapidly evolving field.
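To illustrate what free-form knowledge editing data of this kind might look like, here is a minimal, hypothetical record; the field names, values, and question types are assumptions for illustration only, not MMKE-Bench's actual schema.

```python
# Hypothetical example of a free-form multimodal knowledge edit record.
# Field names and values are illustrative; they are not taken from MMKE-Bench.
edit_record = {
    "edit_type": "visual_semantic_editing",        # one of the three task types
    "image": "images/referee_gesture_001.jpg",      # image the knowledge is grounded in
    "original_knowledge": "The referee's raised arm signals a free kick.",
    "edited_knowledge": "The referee's raised arm signals an indirect free kick.",
    # Evaluation probes: the edit itself, rephrasings of it, and unrelated
    # knowledge that must remain unchanged after editing.
    "reliability_question": "What does the referee's raised arm signal?",
    "generality_questions": ["What call is indicated by the raised arm gesture?"],
    "locality_questions": ["What color card is shown for a serious foul?"],
}

for q in edit_record["generality_questions"]:
    print("Generality probe:", q)
```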
Large-Scale Riemannian Meta-Optimization via Subspace Adaptation
Yu, Peilin, Wu, Yuwei, Gao, Zhi, Fan, Xiaomeng, Jia, Yunde
Riemannian meta-optimization provides a promising approach to solving non-linear constrained optimization problems, training neural networks as optimizers that perform optimization on Riemannian manifolds. However, existing Riemannian meta-optimization methods incur huge memory footprints in large-scale optimization settings, as the learned optimizer can only adapt gradients of a fixed size and thus cannot be shared across Riemannian parameters of different sizes. In this paper, we propose an efficient Riemannian meta-optimization method that significantly reduces the memory burden of large-scale optimization via a subspace adaptation scheme. Our method trains neural networks to individually adapt the row and column subspaces of Riemannian gradients, instead of directly adapting the full gradient matrices as in existing Riemannian meta-optimization methods. As a result, our learned optimizer can be shared across Riemannian parameters with different sizes. Our method reduces model memory consumption by six orders of magnitude when optimizing a mainstream deep neural network with orthogonality constraints (e.g., ResNet50). Experiments on multiple Riemannian tasks show that our method not only reduces memory consumption but also improves the performance of Riemannian meta-optimization.
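A minimal sketch of the size-sharing idea follows, under simplifying assumptions: the row and column subspaces are taken from a thin SVD of the gradient matrix, and for brevity only the spectrum coupling them is transformed by a tiny shared element-wise network. The paper's actual method trains networks that adapt the subspaces themselves, which this sketch does not reproduce; the point shown is that a size-agnostic learned component can serve gradient matrices of any shape.

```python
import numpy as np

def elementwise_mlp(x, w1, b1, w2, b2):
    """Tiny size-agnostic network applied to each scalar independently,
    so the same weights work for inputs of any length."""
    h = np.tanh(np.outer(x, w1) + b1)   # (k, hidden)
    return x * (1.0 + h @ w2 + b2)      # gated residual scaling

def adapt_gradient(G, params):
    """Adapt a gradient matrix G (m x n) through its row/column subspaces.
    A thin SVD exposes the column subspace U, row subspace V, and spectrum s;
    only the (size-independent) spectrum is transformed in this sketch."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    s_adapted = elementwise_mlp(s, *params)
    return U @ np.diag(s_adapted) @ Vt

rng = np.random.default_rng(0)
hidden = 8
params = (rng.normal(size=hidden), rng.normal(size=hidden),
          rng.normal(size=hidden), 0.0)

# The same learned parameters adapt gradients of very different sizes.
for shape in [(64, 16), (2048, 512)]:
    G = rng.normal(size=shape)
    print(shape, adapt_gradient(G, params).shape)
```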
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
Gao, Zhi, Zhang, Bofei, Li, Pengxiang, Ma, Xiaojian, Yuan, Tao, Fan, Yue, Wu, Yuwei, Jia, Yunde, Zhu, Song-Chun, Li, Qing
[Figure 1: Our agent chooses more precise tools based on the given files and intermediate observations, e.g., counting the children in a photo with a face-detection tool before computing a total PS5 cost, and retrieving the January 2024 release price of the RTX 4070 SUPER ($599) instead of guessing.]

The advancement of large language models (LLMs) prompts the development of multi-modal agents, which are used as controllers to call external tools, providing a feasible way to solve practical tasks. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To preserve data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories, followed by query-file and trajectory verifiers. Based on this data synthesis pipeline, we collect the MM-Traj dataset, which contains 20K tasks with tool-usage trajectories. We then develop the T3-Agent via Trajectory Tuning on VLMs for Tool usage using MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently achieves improvements on two popular VLMs, including MiniCPM-V-8.5B.

Integrating external tools to solve diverse multi-modal tasks is a promising research direction towards multi-modal agents (Surís et al., 2023; Gupta & Kembhavi, 2023; Gao et al., 2024; Yuan et al., 2024; Zhong et al., 2023). Existing agents usually use a large language model (LLM) as the controller, which generates plans via prompt engineering to call tools, achieving impressive performance in multiple domains such as image editing (Wu et al., 2023), robotic manipulation (ichter et al., 2023), question answering (Shen et al., 2024), video understanding (Fan et al., 2024), and desktop apps (Trivedi et al., 2024). Despite their success, agents driven by prompt engineering exhibit limited reasoning abilities for tool usage when tackling practical tasks, as shown in Figure 1.
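For illustration, a synthesized tool-usage trajectory of the kind described above might be stored as follows; the field and tool names are hypothetical, not the actual MM-Traj schema, and the final arithmetic simply reproduces the PS5 example from Figure 1.

```python
# Hypothetical structure of one synthesized tool-usage trajectory
# (field and tool names are illustrative; not the actual MM-Traj schema).
trajectory = {
    "query": "I want to buy a PS5 for each child in the photo.",
    "files": ["photo_children.jpg"],
    "steps": [
        {"thought": "Count the children with a face-detection tool.",
         "tool": "facedetection", "args": {"image": "photo_children.jpg"},
         "observation": "4 bounding boxes"},
        {"thought": "Look up the current PS5 price.",
         "tool": "web_search", "args": {"query": "PS5 price"},
         "observation": "$479.99"},
    ],
}

# Final reasoning step from the Figure 1 example: 4 consoles at $479.99 each.
num_children, unit_price = 4, 479.99
total = num_children * unit_price
print(f"Total cost: ${total:.2f}")   # Total cost: $1919.96
```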
SyreaNet: A Physically Guided Underwater Image Enhancement Framework Integrating Synthetic and Real Images
Wen, Junjie, Cui, Jinqiang, Zhao, Zhenjun, Yan, Ruixin, Gao, Zhi, Dou, Lihua, Chen, Ben M.
Underwater image enhancement (UIE) is vital for high-level underwater vision tasks. Although learning-based UIE methods have made remarkable progress in recent years, it is still challenging for them to deal consistently with various underwater conditions, which can be attributed to two factors: 1) the simplified atmospheric image formation model used in UIE can introduce severe errors; 2) a network trained solely on synthetic images may have difficulty generalizing to real underwater images. In this work, we, for the first time, propose SyreaNet, a framework for UIE that integrates both synthetic and real data under the guidance of the revised underwater image formation model and novel domain adaptation (DA) strategies. First, an underwater image synthesis module based on the revised model is proposed. Then, a physically guided disentangled network is designed to predict clear images by combining both synthetic and real underwater images. The intra- and inter-domain gaps are bridged by fully exchanging domain knowledge. Extensive experiments demonstrate the superiority of our framework over other state-of-the-art (SOTA) learning-based UIE methods, both qualitatively and quantitatively. The code and dataset are publicly available at https://github.com/RockWenJJ/SyreaNet.git.
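As a sketch of what "revised underwater image formation model" refers to here, the following assumes an Akkaynak-Treibitz-style formulation in which the direct signal and the backscatter use different attenuation coefficients, unlike the simplified atmospheric model; the coefficient values and scene are illustrative, not the paper's calibrated parameters.

```python
import numpy as np

def synthesize_underwater(J, depth, beta_d, beta_b, B_inf):
    """Revised underwater image formation model, per color channel c:
        I_c = J_c * exp(-beta_d_c * z) + B_inf_c * (1 - exp(-beta_b_c * z))
    with separate attenuation coefficients for the direct and backscatter terms.
    J: clear image (H, W, 3) in [0, 1]; depth: per-pixel range z (H, W) in meters.
    """
    z = depth[..., None]                      # (H, W, 1), broadcast over channels
    direct = J * np.exp(-beta_d * z)
    backscatter = B_inf * (1.0 - np.exp(-beta_b * z))
    return np.clip(direct + backscatter, 0.0, 1.0)

# Illustrative (not measured) water parameters: red attenuates fastest.
beta_d = np.array([0.40, 0.12, 0.10])   # direct-signal attenuation per channel
beta_b = np.array([0.30, 0.15, 0.12])   # backscatter attenuation per channel
B_inf = np.array([0.05, 0.35, 0.45])    # veiling light (background) color

J = np.random.default_rng(0).uniform(size=(64, 64, 3))
depth = np.full((64, 64), 5.0)          # flat 5 m scene for the example
I = synthesize_underwater(J, depth, beta_d, beta_b, B_inf)
print(I.shape, float(I.min()), float(I.max()))
```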
Meta-causal Learning for Single Domain Generalization
Chen, Jin, Gao, Zhi, Wu, Xinxiao, Luo, Jiebo
Single domain generalization aims to learn a model from a single training domain (source domain) and apply it to multiple unseen test domains (target domains). Existing methods focus on expanding the distribution of the training domain to cover the target domains, but without estimating the domain shift between the source and target domains. In this paper, we propose a new learning paradigm, namely simulate-analyze-reduce, which first simulates the domain shift by building an auxiliary domain as the target domain, then learns to analyze the causes of domain shift, and finally learns to reduce the domain shift for model adaptation. Under this paradigm, we propose a meta-causal learning method to learn meta-knowledge, that is, how to infer the causes of domain shift between the auxiliary and source domains during training. We use the meta-knowledge to analyze the shift between the target and source domains during testing. Specifically, we perform multiple transformations on source data to generate the auxiliary domain, perform counterfactual inference to learn to discover the causal factors of the shift between the auxiliary and source domains, and incorporate the inferred causality into factor-aware domain alignments. Extensive experiments on several benchmarks of image classification show the effectiveness of our method.
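A toy sketch of the simulate-analyze-reduce idea is given below, under heavy simplifications: candidate causal factors are represented by simple image transforms, the domain-shift measure is a crude statistic, and the counterfactual scoring just asks how much of the shift each factor would explain if it had acted on the source domain. None of this is the paper's actual transformation set, shift measure, or inference network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate "causal factors" of domain shift, simulated as simple transforms
# (the transform set, shift measure, and scoring are illustrative assumptions).
def brightness(x, s=0.4): return np.clip(x + s, 0, 1)
def contrast(x, s=0.5):   return np.clip((x - 0.5) * (1 + s) + 0.5, 0, 1)
def noise(x, s=0.1):      return np.clip(x + rng.normal(0, s, x.shape), 0, 1)

FACTORS = {"brightness": brightness, "contrast": contrast, "noise": noise}

def shift(a, b):
    """Toy domain-shift measure based on first- and second-order statistics."""
    return float(abs(a.mean() - b.mean()) + abs(a.std() - b.std()))

def infer_causes(source, target, factors):
    """Counterfactual scoring: how much of the source-to-target shift would be
    explained if a given factor had acted on the source domain?"""
    base = shift(source, target)
    return {name: base - shift(f(source), target) for name, f in factors.items()}

source = rng.uniform(size=(32, 32))
target = brightness(source)   # auxiliary domain built from a known transform
print(infer_causes(source, target, FACTORS))   # brightness gets the highest score
```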
TJ-FlyingFish: Design and Implementation of an Aerial-Aquatic Quadrotor with Tiltable Propulsion Units
Liu, Xuchen, Dou, Minghao, Huang, Dongyue, Wang, Biao, Cui, Jinqiang, Ren, Qinyuan, Dou, Lihua, Gao, Zhi, Chen, Jie, Chen, Ben M.
Aerial-aquatic vehicles are capable of moving in the two most dominant fluids, which makes them promising for a wide range of applications. We propose a prototype with dedicated designs for propulsion and thruster configuration to cope with the vast differences between the fluid properties of water and air. For propulsion, the operating range is switched between the two media by a dual-speed propulsion unit, providing sufficient thrust while ensuring output efficiency. For the thruster configuration, thrust vectoring is realized by rotating the propulsion unit around the mount arm, which enhances underwater maneuverability. This paper presents a quadrotor prototype of this concept, together with its design details and practical realization.
Generating Multivariate Load States Using a Conditional Variational Autoencoder
Wang, Chenguang, Sharifnia, Ensieh, Gao, Zhi, Tindemans, Simon H., Palensky, Peter
For planning of power systems and for the calibration of operational tools, it is essential to analyse system performance in a large range of representative scenarios. When the available historical data is limited, generative models are a promising solution, but modelling high-dimensional dependencies is challenging. In this paper, a multivariate load state generating model based on a conditional variational autoencoder (CVAE) neural network is proposed. Going beyond common CVAE implementations, the model includes stochastic variation of output samples under given latent vectors and co-optimizes the parameters for this output variability. It is shown that this improves the statistical properties of the generated data. The quality of generated multivariate loads is evaluated using univariate and multivariate performance metrics. A generation adequacy case study on the European network is used to illustrate the model's ability to generate realistic tail distributions. The experiments demonstrate that the proposed generator outperforms other data generating mechanisms.
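A minimal sketch of the modelling choice highlighted in the abstract follows: a conditional VAE whose decoder outputs both a mean and a log-variance, so the output variability is co-optimized through a Gaussian likelihood term alongside the usual KL regularizer. Layer sizes, conditioning variables, and the toy data are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Minimal conditional VAE whose decoder outputs a mean *and* a log-variance,
    so stochastic output variability is learned jointly (sketch only)."""
    def __init__(self, x_dim, c_dim, z_dim=8, h_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.ReLU())
        self.enc_mu, self.enc_logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h_dim), nn.ReLU())
        self.dec_mu, self.dec_logvar = nn.Linear(h_dim, x_dim), nn.Linear(h_dim, x_dim)

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        d = self.dec(torch.cat([z, c], dim=-1))
        return self.dec_mu(d), self.dec_logvar(d), mu, logvar

def loss_fn(x, x_mu, x_logvar, mu, logvar):
    # Gaussian negative log-likelihood: penalizes both error and output variance.
    nll = 0.5 * (x_logvar + (x - x_mu) ** 2 / x_logvar.exp()).sum(-1).mean()
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(-1).mean()
    return nll + kl

# Toy usage: a 24-dimensional "load profile" conditioned on a 4-dim context vector.
model = CVAE(x_dim=24, c_dim=4)
x, c = torch.randn(16, 24), torch.randn(16, 4)
print(loss_fn(x, *model(x, c)).item())
```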
DynaVIG: Monocular Vision/INS/GNSS Integrated Navigation and Object Tracking for AGV in Dynamic Scenes
Jin, Ronghe, Wang, Yan, Gao, Zhi, Niu, Xiaoji, Hsu, Li-Ta, Liu, Jingnan
Visual-Inertial Odometry (VIO) usually suffers from drift over long runs, and its accuracy is easily affected by dynamic objects. We propose DynaVIG, a navigation and object tracking system based on the integration of monocular vision, an Inertial Navigation System (INS), and a Global Navigation Satellite System (GNSS). Our system aims to provide accurate global estimates of the navigation states and object poses for automated ground vehicles (AGVs) in dynamic scenes. Due to the scale ambiguity of the object, a prior height model is proposed to initialize the object pose, and the scale is continuously estimated with the aid of GNSS and INS. To precisely track objects with complex motion, we establish an accurate dynamics model according to the object's motion state. The multi-sensor observations are then optimized in a unified framework. Experiments on the KITTI dataset demonstrate that multi-sensor fusion effectively improves the accuracy of navigation and object tracking compared to state-of-the-art methods. In addition, the proposed system achieves accurate estimation of objects that change speed or direction.
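To make the role of a prior height model concrete, the following sketch uses the standard pinhole relation to initialize an object's depth from a class-level height prior and its detected bounding-box height; the numbers are illustrative and this is one common way to resolve monocular scale ambiguity, not necessarily the paper's exact formulation.

```python
def init_object_depth(focal_px, prior_height_m, bbox_height_px):
    """Pinhole relation: an object of real height H at depth Z projects to
    h = f * H / Z pixels, so Z ~= f * H / h. A class-level height prior thus
    gives an initial metric depth from a single monocular detection."""
    return focal_px * prior_height_m / bbox_height_px

# Illustrative numbers (not from the paper): a car (~1.5 m tall) whose bounding
# box is 90 px tall in a camera with a 720 px focal length.
depth = init_object_depth(focal_px=720.0, prior_height_m=1.5, bbox_height_px=90.0)
print(f"Initial object depth: {depth:.1f} m")   # 12.0 m
```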
Superevents: Towards Native Semantic Segmentation for Event-based Cameras
Low, Weng Fei, Sonthalia, Ankit, Gao, Zhi, van Schaik, André, Ramesh, Bharath
Most successful computer vision models transform low-level features, such as Gabor filter responses, into richer representations of intermediate or mid-level complexity for downstream visual tasks. These mid-level representations have not been explored for event cameras, although they are especially relevant to the visually sparse and often disjoint spatial information in the event stream. By making use of locally consistent intermediate representations, termed superevents, numerous visual tasks ranging from semantic segmentation and visual tracking to depth estimation stand to benefit. In essence, superevents are perceptually consistent local units that delineate parts of an object in a scene. Inspired by recent deep learning architectures, we present a novel method that employs lifetime augmentation to obtain an event stream representation that is fed to a fully convolutional network to extract superevents. Our qualitative and quantitative experimental results on several sequences of a benchmark dataset highlight the significant potential for event-based downstream applications.
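The sketch below illustrates the general idea behind a lifetime-augmented event representation: each event is kept active for an estimated lifetime, so a frame rendered at a query time is denser than the instantaneous event stream before being fed to a segmentation network. The constant lifetime and random events here are assumptions for illustration; the paper's lifetime estimation is not reproduced.

```python
import numpy as np

def lifetime_frame(events, t_query, lifetime, shape):
    """Render events into a frame at time t_query, treating each event as active
    for `lifetime` seconds after it fires (a crude stand-in for per-event
    lifetime estimation). events: array of rows (x, y, t, polarity)."""
    frame = np.zeros(shape, dtype=np.float32)
    for x, y, t, p in events:
        if t <= t_query <= t + lifetime:
            frame[int(y), int(x)] = 1.0 if p > 0 else -1.0
    return frame

rng = np.random.default_rng(0)
H, W = 64, 64
events = np.column_stack([
    rng.integers(0, W, 200), rng.integers(0, H, 200),
    rng.uniform(0.0, 1.0, 200), rng.choice([-1, 1], 200),
])
frame = lifetime_frame(events, t_query=0.5, lifetime=0.05, shape=(H, W))
print("active pixels:", int(np.abs(frame).sum()))
```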