Collaborating Authors

 Huang, Wenjun


PacketCLIP: Multi-Modal Embedding of Network Traffic and Language for Cybersecurity Reasoning

arXiv.org Artificial Intelligence

Traffic classification is vital for cybersecurity, yet encrypted traffic poses significant challenges. We present PacketCLIP, a multi-modal framework combining packet data with natural language semantics through contrastive pretraining and hierarchical Graph Neural Network (GNN) reasoning. PacketCLIP integrates semantic reasoning with efficient classification, enabling robust detection of anomalies in encrypted network flows. By aligning textual descriptions with packet behaviors, it offers enhanced interpretability, scalability, and practical applicability across diverse security scenarios. PacketCLIP achieves a 95% mean AUC, outperforms baselines by 11.6%, and reduces model size by 92%, making it ideal for real-time anomaly detection. By bridging advanced machine learning techniques and practical cybersecurity needs, PacketCLIP provides a foundation for scalable, efficient, and interpretable solutions to tackle encrypted traffic classification and network intrusion detection challenges in resource-constrained environments.
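
A minimal sketch of the CLIP-style contrastive pretraining idea described above, aligning packet embeddings with text embeddings via a symmetric InfoNCE loss. The encoder widths, projection heads, and random inputs are illustrative assumptions, not PacketCLIP's actual implementation, and the hierarchical GNN reasoning stage is not shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PacketTextAligner(nn.Module):
    def __init__(self, packet_dim=256, text_dim=512, embed_dim=128):
        super().__init__()
        # Placeholder encoders: real packet/text encoders would be far richer.
        self.packet_proj = nn.Sequential(nn.Linear(packet_dim, embed_dim), nn.ReLU(),
                                         nn.Linear(embed_dim, embed_dim))
        self.text_proj = nn.Sequential(nn.Linear(text_dim, embed_dim), nn.ReLU(),
                                       nn.Linear(embed_dim, embed_dim))
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, packet_feats, text_feats):
        p = F.normalize(self.packet_proj(packet_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = self.logit_scale.exp() * p @ t.t()        # pairwise similarities
        labels = torch.arange(p.size(0), device=p.device)  # matched pairs lie on the diagonal
        # Symmetric InfoNCE over packets->text and text->packets.
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Example: a batch of 8 (packet, description) pairs with random features.
model = PacketTextAligner()
loss = model(torch.randn(8, 256), torch.randn(8, 512))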


Hyperdimensional Intelligent Sensing for Efficient Real-Time Audio Processing on Extreme Edge

arXiv.org Artificial Intelligence

The escalating challenges of managing vast sensor-generated data, particularly in audio applications, necessitate innovative solutions. Current systems face significant computational and storage demands, especially in real-time applications like gunshot detection systems (GSDS), and the proliferation of edge sensors exacerbates these issues. This paper proposes a groundbreaking approach with a near-sensor model tailored for intelligent audio-sensing frameworks. Utilizing a Fast Fourier Transform (FFT) module, convolutional neural network (CNN) layers, and HyperDimensional Computing (HDC), our model excels in low-energy operation, rapid inference, and online learning. It is highly adaptable for efficient ASIC design implementation, offering superior energy efficiency compared to conventional embedded CPUs or GPUs, and is compatible with the trend of shrinking microphone sensor sizes. Comprehensive evaluations at both the software and hardware levels underscore the model's efficacy. Software assessments through detailed ROC curve analysis revealed a delicate balance between energy conservation and quality loss, achieving up to 82.1% energy savings with only 1.39% quality loss. Hardware evaluations highlight the model's commendable energy efficiency when implemented via ASIC design, especially with the Google Edge TPU, showcasing its superiority over prevalent embedded CPUs and GPUs.
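
The following is an illustrative sketch of the FFT-to-hyperdimensional-encoding pipeline hinted at above: spectra are projected into high-dimensional bipolar vectors, bundled into class prototypes, and classified by similarity. The dimensionality, random-projection encoder, and omission of the CNN layers are assumptions made for brevity, not the paper's exact design.

import numpy as np

rng = np.random.default_rng(0)
D = 4096               # hypervector dimensionality (assumed)
frame_len = 512        # audio frame length in samples (assumed)

projection = rng.standard_normal((frame_len // 2 + 1, D))

def encode(frame):
    spectrum = np.abs(np.fft.rfft(frame))      # FFT feature extraction
    return np.sign(spectrum @ projection)      # project and bipolarize to {-1, +1}

def train_prototypes(frames, labels, num_classes=2):
    protos = np.zeros((num_classes, D))
    for frame, label in zip(frames, labels):
        protos[label] += encode(frame)         # bundle class examples by addition
    return protos

def classify(frame, protos):
    sims = protos @ encode(frame)              # similarity to each class prototype
    return int(np.argmax(sims))

# Toy usage: twenty synthetic frames split across two classes.
frames = [rng.standard_normal(frame_len) for _ in range(20)]
labels = [i % 2 for i in range(20)]
protos = train_prototypes(frames, labels)
pred = classify(frames[0], protos)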


Tell Me What to Track: Infusing Robust Language Guidance for Enhanced Referring Multi-Object Tracking

arXiv.org Artificial Intelligence

Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to localize an arbitrary number of targets based on a language expression and continuously track them in a video. This intricate task involves reasoning over multi-modal data and precise target localization with temporal association. However, prior studies overlook the imbalanced data distribution between newborn targets and existing targets that arises from the nature of the task. In addition, they only indirectly fuse multi-modal features, struggling to deliver clear guidance for newborn target detection. To solve the above issues, we introduce a collaborative matching strategy to alleviate the impact of the imbalance, boosting the ability to detect newborn targets while maintaining tracking performance. In the encoder, we integrate and enhance cross-modal and multi-scale fusion, overcoming the bottlenecks of previous work, where only limited multi-modal information is shared and exchanged between feature maps. In the decoder, we also develop a referring-infused adaptation that provides explicit referring guidance through the query tokens. The experiments showcase the superior performance of our model (+3.42%) compared to prior works, demonstrating the effectiveness of our designs.
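
A minimal sketch of encoder-side cross-modal fusion of the kind discussed above: visual feature tokens attend to the encoded referring expression via cross-attention, injecting language guidance into the feature maps. The single-layer design, dimensions, and random inputs are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # Visual tokens query the referring expression; the residual keeps the
        # original visual content while mixing in language guidance.
        fused, _ = self.attn(query=visual_tokens, key=text_tokens, value=text_tokens)
        return self.norm(visual_tokens + fused)

fusion = CrossModalFusion()
visual = torch.randn(2, 900, 256)   # e.g., flattened feature-map tokens
text = torch.randn(2, 16, 256)      # e.g., encoded referring expression
out = fusion(visual, text)          # (2, 900, 256)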


Multi-cam Multi-map Visual Inertial Localization: System, Validation and Dataset

arXiv.org Artificial Intelligence

Map-based localization is crucial for the autonomous movement of robots as it provides real-time positional feedback. However, existing VINS and SLAM systems cannot be directly integrated into the robot's control loop. Although VINS offers high-frequency position estimates, it suffers from drift in long-term operation. Meanwhile, the drift-free trajectory output by SLAM is post-processed with loop correction, which is non-causal: in practical control, it is impossible to update the current pose with future information. Furthermore, existing SLAM evaluation systems measure accuracy after aligning the entire trajectory, which overlooks the transformation error between the odometry start frame and the ground truth frame. To address these issues, we propose a multi-cam multi-map visual inertial localization system, which provides real-time, causal and drift-free position feedback to the robot control loop. Additionally, we analyze the error composition of map-based localization systems and propose a set of evaluation metrics suitable for measuring causal localization performance. To validate our system, we design a multi-camera IMU hardware setup and collect a long-term, challenging campus dataset. Experimental results demonstrate the higher real-time localization accuracy of the proposed system. To foster community development, both the system and the dataset have been made open source at https://github.com/zoeylove/Multi-cam-Multi-map-VILO/tree/main.
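
The toy sketch below illustrates the evaluation point made above: a causal error metric compares estimated and ground-truth positions in the frames they are actually reported in, so a constant offset between the odometry start frame and the ground-truth frame is penalized, whereas the usual post-hoc alignment hides it. The position-only representation and mean-offset alignment are simplifying assumptions, not the paper's metric.

import numpy as np

def causal_position_error(est_xyz, gt_xyz):
    # No alignment: any start-frame offset counts as localization error.
    return np.linalg.norm(est_xyz - gt_xyz, axis=1)

def aligned_position_error(est_xyz, gt_xyz):
    # Conventional evaluation (translation-only alignment for brevity):
    # the best-fit offset is removed first, masking the frame offset.
    offset = gt_xyz.mean(axis=0) - est_xyz.mean(axis=0)
    return np.linalg.norm((est_xyz + offset) - gt_xyz, axis=1)

rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(0, 0.1, (100, 3)), axis=0)
est = gt + np.array([0.5, 0.0, 0.0])                  # constant start-frame offset
print(causal_position_error(est, gt).mean())           # ~0.5: offset is penalized
print(aligned_position_error(est, gt).mean())          # ~0.0: offset is hidden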


EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment

arXiv.org Artificial Intelligence

Mamba-based architectures have been shown to be a promising new direction for deep learning models owing to their competitive performance and sub-quadratic deployment speed. However, current Mamba multi-modal large language models (MLLMs) are insufficient at extracting visual features, leading to imbalanced cross-modal alignment between visual and textual latents, negatively impacting performance on multi-modal tasks. In this work, we propose Empowering Multimodal Mamba with Structural and Hierarchical Alignment (EMMA), which enables the MLLM to extract fine-grained visual information. Specifically, we propose a pixel-wise alignment module to autoregressively optimize the learning and processing of spatial image-level features along with textual tokens, enabling structural alignment at the image level. In addition, to prevent the degradation of visual information during the cross-modal alignment process, we propose a multi-scale feature fusion (MFF) module to combine multi-scale visual features from intermediate layers, enabling hierarchical alignment at the feature level. Extensive experiments are conducted across a variety of multi-modal benchmarks. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference. Due to better cross-modal alignment, our model exhibits lower degrees of hallucination and enhanced sensitivity to visual details, which manifests in superior performance across diverse multi-modal benchmarks.

Recently, there has been a notable increase in the development of domain-general AI agents Kalla et al. (2023); Zhao et al. (2023a) which can simultaneously solve a diverse range of tasks and exhibit superior performance. Among them, multi-modal large language models (MLLMs) Achiam et al. (2023); Team et al. (2023); Liu et al. (2024a) have emerged as a promising direction due to their effectiveness in visual perception and logical reasoning. MLLMs usually consist of an image encoder that converts images to visual tokens and a strong large language model (LLM) backbone to process the visual and textual tokens concurrently. This integration of visual and textual information not only enhances the understanding of visual content but also provides a more comprehensive context for language understanding and generation.
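
A minimal sketch of a multi-scale feature fusion (MFF) style module as described above: intermediate visual features at different resolutions are projected to a common width, resized to a shared spatial size, and summed. The channel widths, 1x1 projections, and bilinear resizing are illustrative assumptions rather than EMMA's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    def __init__(self, in_channels=(192, 384, 768), out_channels=768):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps from shallow to deep layers.
        target_size = feats[-1].shape[-2:]
        fused = sum(F.interpolate(p(f), size=target_size, mode="bilinear", align_corners=False)
                    for p, f in zip(self.proj, feats))
        return fused   # (B, out_channels, H, W): fine and coarse detail combined

mff = MultiScaleFusion()
feats = [torch.randn(1, 192, 32, 32), torch.randn(1, 384, 16, 16), torch.randn(1, 768, 8, 8)]
out = mff(feats)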


A Plug-in Tiny AI Module for Intelligent and Selective Sensor Data Transmission

arXiv.org Artificial Intelligence

Applications in the Internet of Things (IoT) utilize machine learning to analyze sensor-generated data. However, a major challenge lies in the lack of targeted intelligence in current sensing systems, leading to vast data generation and increased computational and communication costs. To address this challenge, we propose a novel sensing module that equips sensing frameworks with intelligent data transmission capabilities by integrating a highly efficient machine learning model placed near the sensor. This model provides prompt feedback for the sensing system to transmit only valuable data while discarding irrelevant information by regulating the frequency of data transmission. The near-sensor model is quantized and optimized for real-time sensor control. To enhance the framework's performance, the training process is customized and a "lazy" sensor deactivation strategy utilizing temporal information is introduced. The suggested method is orthogonal to other IoT frameworks and can be considered a plug-in for selective data transmission. The framework is implemented, encompassing both software and hardware components. The experiments demonstrate that the framework utilizing the suggested module achieves over 85% system efficiency in terms of energy consumption and storage, with negligible impact on performance. This methodology has the potential to significantly reduce data output from sensors, benefiting a wide range of IoT applications.
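
A sketch of the selective-transmission loop described above: a small near-sensor model scores each window, and a "lazy" deactivation counter keeps the transmission path active for a few windows after the last interesting event before throttling output. The energy-based scoring stub, threshold, and counter length are illustrative assumptions; the paper's near-sensor model is a quantized learned classifier.

import numpy as np

rng = np.random.default_rng(1)

def tiny_model_score(window):
    # Stand-in for the quantized near-sensor model: here just signal energy.
    return float(np.mean(window ** 2))

def selective_transmit(windows, threshold=0.5, lazy_windows=3):
    transmitted, countdown = [], 0
    for i, window in enumerate(windows):
        if tiny_model_score(window) > threshold:
            countdown = lazy_windows          # refresh the "stay awake" counter
        if countdown > 0:
            transmitted.append(i)             # forward this window downstream
            countdown -= 1
        # otherwise the window is dropped, saving bandwidth and storage
    return transmitted

# Toy stream: mostly background noise with a burst of activity in the middle.
windows = [rng.normal(0, 0.2, 256) for _ in range(50)]
for i in range(20, 25):
    windows[i] += rng.normal(0, 2.0, 256)
kept = selective_transmit(windows)
print(f"transmitted {len(kept)} / {len(windows)} windows")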


HyperSense: Accelerating Hyper-Dimensional Computing for Intelligent Sensor Data Processing

arXiv.org Artificial Intelligence

We introduce HyperSense, a co-designed hardware and software system that efficiently controls the data generation rate of Analog-to-Digital Converter (ADC) modules based on object presence predictions in sensor data. Addressing challenges posed by escalating sensor quantities and data rates, HyperSense reduces redundant digital data using energy-efficient low-precision ADCs, diminishing machine learning system costs. Leveraging neurally-inspired HyperDimensional Computing (HDC), HyperSense analyzes real-time raw low-precision sensor data, offering advantages in noise tolerance, memory-centric computation, and real-time learning. Our proposed HyperSense model combines high-performance software for object detection with real-time hardware prediction, introducing the novel concept of Intelligent Sensor Control. Comprehensive software and hardware evaluations demonstrate our solution's superior performance, evidenced by the highest Area Under the Curve (AUC) and sharpest Receiver Operating Characteristic (ROC) curve among lightweight models. Hardware-wise, our FPGA-based domain-specific accelerator tailored for HyperSense achieves a 5.6x speedup compared to YOLOv4 on NVIDIA Jetson Orin while showing up to 92.1% energy savings compared to the conventional system. These results underscore HyperSense's effectiveness and efficiency, positioning it as a promising solution for intelligent sensing and real-time data processing across diverse applications.
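
The sketch below illustrates the intelligent-sensor-control loop described above: a presence score computed on low-precision samples drives the ADC data generation rate, so full-rate digitization happens only when something of interest is likely present. The energy-based presence scorer (HDC-based in the paper), quantization depth, threshold, and rate levels are all illustrative assumptions.

import numpy as np

def quantize(samples, bits=4):
    # Emulate a low-precision ADC reading used for the cheap presence check.
    levels = 2 ** bits
    return np.round((samples + 1.0) / 2.0 * (levels - 1)) / (levels - 1) * 2.0 - 1.0

def presence_score(low_precision_samples):
    # Stand-in for the HDC presence predictor: normalized signal energy.
    return float(np.clip(np.mean(low_precision_samples ** 2) * 10.0, 0.0, 1.0))

def choose_adc_rate(score, low_rate_hz=1_000, high_rate_hz=16_000, threshold=0.3):
    # Simple two-level policy; the paper's accelerator makes this decision in hardware.
    return high_rate_hz if score > threshold else low_rate_hz

rng = np.random.default_rng(2)
quiet = rng.normal(0.0, 0.05, 256)
active = rng.normal(0.0, 0.5, 256)
print(choose_adc_rate(presence_score(quantize(quiet))))    # stays at the low rate
print(choose_adc_rate(presence_score(quantize(active))))   # switches to the high rate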


Practical Probabilistic Model-based Deep Reinforcement Learning by Integrating Dropout Uncertainty and Trajectory Sampling

arXiv.org Artificial Intelligence

This paper addresses the prediction stability, prediction accuracy and control capability of current probabilistic model-based reinforcement learning (MBRL) built on neural networks. A novel approach, dropout-based probabilistic ensembles with trajectory sampling (DPETS), is proposed, in which the system uncertainty is stably predicted by combining Monte-Carlo dropout and trajectory sampling in one framework. Its loss function is designed to correct the fitting error of neural networks for more accurate prediction of probabilistic models. The state propagation in its policy is extended to filter the aleatoric uncertainty for superior control capability. Evaluated on several Mujoco benchmark control tasks under additional disturbances and one practical robot arm manipulation task, DPETS outperforms related MBRL approaches in both average return and convergence speed while achieving superior performance to well-known model-free baselines with significantly better sample efficiency. The open source code of DPETS is available at https://github.com/mrjun123/DPETS.
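
A minimal sketch of combining Monte-Carlo dropout with trajectory sampling for a probabilistic dynamics model, in the spirit of the approach above: dropout stays active at prediction time, so each propagated particle sees a different sampled network. The network size, dropout rate, and the absence of the corrected loss and uncertainty filtering are assumptions made for brevity.

import torch
import torch.nn as nn

class DropoutDynamics(nn.Module):
    def __init__(self, state_dim=4, action_dim=1, hidden=128, p=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, state_dim))

    def forward(self, state, action):
        return state + self.net(torch.cat([state, action], dim=-1))  # predict state delta

def rollout_particles(model, state, actions, num_particles=20):
    # Keep dropout active so each particle samples a different network,
    # capturing epistemic uncertainty along the imagined trajectory.
    model.train()
    particles = state.repeat(num_particles, 1)
    trajectory = [particles]
    with torch.no_grad():
        for action in actions:                       # actions: list of (action_dim,) tensors
            particles = model(particles, action.repeat(num_particles, 1))
            trajectory.append(particles)
    return torch.stack(trajectory)                   # (horizon+1, particles, state_dim)

model = DropoutDynamics()
traj = rollout_particles(model, torch.zeros(1, 4), [torch.zeros(1) for _ in range(10)])
mean, std = traj.mean(dim=1), traj.std(dim=1)        # per-step predictive mean and spread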


DAMS-LIO: A Degeneration-Aware and Modular Sensor-Fusion LiDAR-inertial Odometry

arXiv.org Artificial Intelligence

As robots are deployed in increasingly complex environments such as underground mines and planetary surfaces, multi-sensor fusion has gained more and more attention as a promising solution to state estimation in such scenes. The fusion scheme is a central component of these methods. In this paper, a light-weight iEKF-based LiDAR-inertial odometry system is presented, which utilizes a degeneration-aware and modular sensor-fusion pipeline that takes both LiDAR points and the relative pose from another odometry as measurements in the update process only when degeneration is detected. Both Cramer-Rao Lower Bound (CRLB) analysis and simulation tests are used to demonstrate the higher accuracy of our method compared to methods using a single observation. Furthermore, the proposed system is evaluated on perceptually challenging datasets against various state-of-the-art sensor-fusion methods. The results show that the proposed system achieves real-time performance and high estimation accuracy despite the challenging environments and poor observations.
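
A sketch of a degeneration check in the spirit of the pipeline above: the LiDAR registration's approximate information matrix (a Gauss-Newton style J^T J) is inspected, and the extra odometry measurement is added to the update only when the weakest-constrained direction falls below a threshold. The eigenvalue criterion, threshold, and synthetic Jacobians are illustrative assumptions, not the paper's exact detection rule.

import numpy as np

def is_degenerate(jacobian, eig_threshold=100.0):
    info = jacobian.T @ jacobian                 # approximate information matrix
    smallest = np.linalg.eigvalsh(info)[0]       # weakest-constrained direction
    return smallest < eig_threshold

def select_measurements(lidar_jacobian):
    measurements = ["lidar_points"]
    if is_degenerate(lidar_jacobian):
        # e.g., a long featureless tunnel: fuse the relative pose from another
        # odometry source into the iEKF update instead of relying on LiDAR alone.
        measurements.append("relative_pose_from_other_odometry")
    return measurements

rng = np.random.default_rng(0)
well_constrained = rng.standard_normal((500, 6))   # geometry constrains all 6 DoF
degenerate = well_constrained.copy()
degenerate[:, 0] *= 1e-3                            # one direction barely observed
print(select_measurements(well_constrained))        # ['lidar_points']
print(select_measurements(degenerate))              # ['lidar_points', 'relative_pose_from_other_odometry']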


Video-Based Person Re-Identification via Self Paced Weighting

AAAI Conferences

Person re-identification (re-id) is a fundamental technique to associate various person images, captured by different surveillance cameras, with the same person. Compared to single-image-based person re-id methods, video-based person re-id has attracted widespread attention because extra space-time information and more appearance cues can be used to greatly improve matching performance. However, most existing video-based person re-id methods treat all video frames equally, ignoring their quality discrepancy caused by object occlusion and motion, which is a common phenomenon in real surveillance scenarios. Based on this finding, we propose a novel video-based person re-id method via self-paced weighting (SPW). Firstly, we propose a self-paced outlier detection method to evaluate the noise degree of video sub-sequences. Thereafter, a weighted multi-pair distance metric learning approach is adopted to measure the distance between two person image sequences. Experimental results on two public datasets demonstrate the superiority of the proposed method over current state-of-the-art work.
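
A minimal sketch of the weighted sequence-distance idea above: frames judged noisy (e.g., occluded) are down-weighted before the set-to-set distance between two sequences is computed. The quality scoring by distance to the sequence mean and the exponential weighting are illustrative assumptions, not the paper's self-paced formulation or learned metric.

import numpy as np

def frame_weights(seq_feats, gamma=1.0):
    # Frames far from the sequence mean are treated as likely outliers.
    center = seq_feats.mean(axis=0)
    residual = np.linalg.norm(seq_feats - center, axis=1)
    return np.exp(-gamma * residual)

def weighted_sequence_distance(seq_a, seq_b):
    wa, wb = frame_weights(seq_a), frame_weights(seq_b)
    # Weighted average of pairwise frame distances between the two sequences.
    pairwise = np.linalg.norm(seq_a[:, None, :] - seq_b[None, :, :], axis=2)
    weights = wa[:, None] * wb[None, :]
    return float((weights * pairwise).sum() / weights.sum())

rng = np.random.default_rng(0)
seq_a = rng.normal(0.0, 0.1, (12, 64))
seq_a[3] += 5.0                                 # a heavily occluded/noisy frame, down-weighted
seq_b = rng.normal(0.0, 0.1, (10, 64))
print(weighted_sequence_distance(seq_a, seq_b))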