 Lin, Hongbin


PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models

arXiv.org Artificial Intelligence

3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements. However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets. Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps. To this end, we introduce PiSA-Engine (Point-Self-Augmented-Engine), a new framework for generating instruction point-language datasets enriched with 3D spatial semantics. We observe that existing 3D MLLMs offer a comprehensive understanding of point clouds for annotation, while 2D MLLMs excel at cross-validation by providing complementary information. By integrating holistic 2D and 3D insights from off-the-shelf MLLMs, PiSA-Engine enables a continuous cycle of high-quality data generation. We select PointLLM as the baseline and adopt this co-evolution training framework to develop an enhanced 3D MLLM, termed PointLLM-PiSA. Additionally, we identify limitations in previous 3D benchmarks, which often feature coarse language captions and insufficient category diversity, resulting in inaccurate evaluations. To address this gap, we further introduce PiSA-Bench, a comprehensive 3D benchmark covering six key aspects with detailed and diverse labels. Experimental results demonstrate PointLLM-PiSA's state-of-the-art performance in zero-shot 3D object captioning and generative classification on our PiSA-Bench, where it reaches scores of 46.45% (+8.33%) and 63.75% (+16.25%), respectively. We will release the code, datasets, and benchmark.
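
A minimal sketch of what such a self-augmentation cycle could look like in Python; the annotator, renderer, and validator callables are hypothetical stand-ins for the off-the-shelf 3D/2D MLLMs described in the abstract, not the authors' implementation:

```python
# Hypothetical sketch of a PiSA-style self-augmentation cycle (not the paper's code).
from typing import Callable, List, Tuple

def pisa_cycle(
    point_clouds: List[object],
    annotate_3d: Callable[[object], str],            # 3D MLLM: point cloud -> caption
    render_views: Callable[[object], List[object]],  # point cloud -> 2D renderings
    validate_2d: Callable[[object, str], float],     # 2D MLLM: (image, caption) -> score
    accept_threshold: float = 0.8,                   # illustrative acceptance cutoff
) -> List[Tuple[object, str]]:
    """One cycle: the 3D model annotates, the 2D model cross-validates."""
    accepted = []
    for pc in point_clouds:
        caption = annotate_3d(pc)                    # candidate 3D-grounded caption
        views = render_views(pc)                     # complementary 2D evidence
        score = sum(validate_2d(v, caption) for v in views) / max(len(views), 1)
        if score >= accept_threshold:                # keep only cross-validated pairs
            accepted.append((pc, caption))
    return accepted  # feed back into instruction tuning, then repeat the cycle
```

Retraining on the accepted pairs and rerunning the cycle is what makes the data generation continuous.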


Towards Multi-dimensional Explanation Alignment for Medical Classification

arXiv.org Artificial Intelligence

The lack of interpretability in medical image analysis has significant ethical and legal implications. Existing interpretable methods in this domain face several challenges, including dependency on specific models, difficulty of understanding and visualization, and issues of efficiency. To address these limitations, we propose a novel framework called Med-MICN (Medical Multi-dimensional Interpretable Concept Network). Med-MICN aligns explanations across multiple dimensions, including neural-symbolic reasoning, concept semantics, and saliency maps, offering advantages over current interpretable methods. Its strengths include high prediction accuracy, interpretability across multiple dimensions, and automation through an end-to-end concept labeling process that reduces the human annotation effort required when working with new datasets. To demonstrate the effectiveness and interpretability of Med-MICN, we apply it to four benchmark datasets and compare it with baselines. The results clearly demonstrate the superior performance and interpretability of our Med-MICN.
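
As a rough illustration of prediction-through-concepts (one of the dimensions Med-MICN aligns), here is a minimal concept-bottleneck classifier in PyTorch; the class and parameter names are invented for this sketch, which simplifies away the paper's neural-symbolic and saliency components:

```python
# Minimal concept-bottleneck sketch (an illustration of the general idea,
# not Med-MICN itself).
import torch
import torch.nn as nn

class ConceptBottleneckClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, n_concepts: int, n_classes: int):
        super().__init__()
        self.backbone = backbone                             # any feature extractor
        self.to_concepts = nn.Linear(feat_dim, n_concepts)   # concept scores
        self.to_classes = nn.Linear(n_concepts, n_classes)   # decision from concepts only

    def forward(self, x):
        feats = self.backbone(x)
        concepts = torch.sigmoid(self.to_concepts(feats))    # each unit = one named concept
        logits = self.to_classes(concepts)                   # prediction depends only on concepts
        return logits, concepts                              # concepts expose "why" with "what"

# Toy usage: inspect `concepts` (and the weights of `to_classes`) to read off
# which concepts drove a given prediction.
model = ConceptBottleneckClassifier(nn.Flatten(), feat_dim=28 * 28, n_concepts=16, n_classes=2)
logits, concepts = model(torch.randn(4, 1, 28, 28))
```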


End-to-End Learning of Deep Visuomotor Policy for Needle Picking

arXiv.org Artificial Intelligence

Needle picking is a challenging manipulation task in robot-assisted surgery due to the small, slender shape of needles, their variation in shape and size, and the demand for millimeter-level control. Prior works, which rely heavily on needle priors (e.g., geometric models), are hard to scale to unseen needle variations. In this paper, we present the first end-to-end learning method to train a deep visuomotor policy for needle picking. Concretely, we propose DreamerfD, which maximally leverages demonstrations to improve the learning efficiency of a state-of-the-art model-based reinforcement learning method, DreamerV2. Because the Variational Auto-Encoder (VAE) in DreamerV2 is difficult to scale to high-resolution images, we propose Dynamic Spotlight Adaptation to represent control-related visual signals in a low-resolution image space. We also propose Virtual Clutch to reduce the performance degradation caused by the significant error between prior and posterior encoded states at the beginning of a rollout. We conducted extensive experiments in simulation to evaluate the performance, robustness, in-domain variation adaptation, and effectiveness of the individual components of our method. Our method, trained with 8k demonstration timesteps and 140k online policy timesteps, achieves a remarkable success rate of 80%. Furthermore, it generalizes to unseen in-domain variations, including needle variations and image disturbance, highlighting its robustness and versatility. Code and videos are available at https://sites.google.com/view/DreamerfD.
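
Two of these ideas lend themselves to a short, hedged sketch: biasing learning updates toward demonstration data, and a "virtual clutch" that withholds actions while the latent state estimate warms up. The function names, buffer layout, and ratios below are illustrative assumptions, not the paper's code:

```python
# Hedged sketch of two ideas from the abstract (illustrative names and values).
import random

def sample_mixed_batch(demo_buffer, online_buffer, demo_ratio=0.25, batch_size=32):
    """Bias world-model/policy updates toward demonstrations by mixing buffers.
    Assumes online_buffer holds at least batch_size transitions."""
    n_demo = int(batch_size * demo_ratio)
    batch = random.sample(demo_buffer, min(n_demo, len(demo_buffer)))
    batch += random.sample(online_buffer, batch_size - len(batch))
    return batch

def act_with_virtual_clutch(policy_action, neutral_action, step, clutch_steps=10):
    """Output a neutral (no-op) action early in a rollout, while prior and
    posterior latent estimates are still likely to disagree."""
    return neutral_action if step < clutch_steps else policy_action
```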


SSIM-Variation-Based Complexity Optimization for Versatile Video Coding

arXiv.org Artificial Intelligence

Versatile Video Coding (VVC) achieves better overall performance than High Efficiency Video Coding (HEVC). The Quadtree with Nested Multi-Type Tree (QTMT) coding block structure substantially enhances video coding quality in VVC, but this coding gain comes at the cost of greater coding complexity. This letter therefore proposes a Fast Decision Scheme based on Structural Similarity Index Metric Variation (FDS-SSIMV) to address the problem. First, the Structural Similarity Index Metric Variation (SSIMV) characteristic among the sub coding units of a split mode is illustrated. Next, SSIMV measurement strategies are designed for the different split modes. Then, the desired split modes are selected according to their SSIMV values. Experimental results show that the proposed method achieves 64.74% average encoding Time Saving (TS) with a 2.79% Bjøntegaard Delta Bit Rate (BDBR), outperforming the benchmarks.
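
To make the SSIMV idea concrete, here is a hedged NumPy sketch that computes a single-window SSIM per sub coding unit of a candidate split mode and takes the variance across sub-CUs; the split layout, reference block, and constants are illustrative assumptions, not the letter's exact measure:

```python
# Illustrative SSIM-variation computation over the sub-CUs of one split mode.
import numpy as np

def ssim_global(x, y, data_range=255.0):
    """Single-window SSIM between two equal-size blocks (standard formula)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def ssim_variation(block, ref, split):
    """Variance of per-sub-CU SSIM for a candidate split mode.
    `split` yields (row-slice, col-slice) pairs describing the sub-CUs."""
    scores = [ssim_global(block[r, c], ref[r, c]) for r, c in split]
    return float(np.var(scores))

# Example: horizontal binary split of a 32x32 CU (toy data).
h, w = 32, 32
split_bt_h = [(slice(0, h // 2), slice(0, w)), (slice(h // 2, h), slice(0, w))]
cu = np.random.rand(h, w) * 255
ref = cu + np.random.randn(h, w)   # stand-in for a reconstructed/predicted block
print(ssim_variation(cu, ref, split_bt_h))  # low variation -> homogeneous sub-CUs
```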


Fast, Robust, and Versatile Event Detection through HMM Belief State Gradient Measures

arXiv.org Artificial Intelligence

Event detection is a critical feature in data-driven systems, as it assists with the identification of nominal and anomalous behavior. It is increasingly relevant in robotics as robots operate with greater autonomy in increasingly unstructured environments. In this work, we present an accurate, robust, fast, and versatile measure for skill and anomaly identification. A theoretical proof establishes the link between the derivative of the log-likelihood of the HMM filtered belief state and the latest emission probabilities. The key insight is this inverse relationship, which allows gradient analysis of the filtered log-likelihood to be used for skill and anomaly identification. Our measure showed better performance across all metrics than related state-of-the-art works. The result is broadly applicable to domains that use HMMs for event detection.
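
The filtering quantity behind such a measure is standard: with a scaled forward recursion, the increment of the cumulative log-likelihood at time t equals log p(z_t | z_{1:t-1}), which is driven directly by the latest emission probabilities. A small NumPy sketch (toy matrices and an illustrative threshold, not the paper's implementation) of flagging anomalies from that increment:

```python
# Forward filtering with scaling; per-step log-likelihood increments serve as a
# gradient-like event-detection signal (toy example, illustrative threshold).
import numpy as np

def filtered_loglik_increments(pi, A, B, obs):
    """pi: (S,) initial distribution; A: (S,S) transition matrix;
    B: (S,O) emission matrix; obs: sequence of observation indices.
    Returns log c_t = log p(z_t | z_{1:t-1}) for each step."""
    belief = pi * B[:, obs[0]]
    c = belief.sum()
    belief /= c
    incs = [np.log(c)]
    for z in obs[1:]:
        belief = (A.T @ belief) * B[:, z]   # predict, then weight by emission
        c = belief.sum()                    # c = p(z_t | z_{1:t-1})
        belief /= c                         # rescale the filtered belief state
        incs.append(np.log(c))
    return np.array(incs)

pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
obs = [0, 0, 2, 2, 1]
incs = filtered_loglik_increments(pi, A, B, obs)
anomalies = incs < np.log(0.05)   # sharp drops in the increment flag anomalies
print(incs, anomalies)
```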