Plotting

 Yu, Wei


Learning Beamforming Codebooks for Active Sensing with Reconfigurable Intelligent Surface

arXiv.org Artificial Intelligence

--This paper explores the design of beamforming codebooks for the base station (BS) and for the reconfigurable intelligent surfaces (RISs) in an active sensing scheme for uplink localization, in which the mobile user transmits a sequence of pilots to the BS through reflection at the RISs, and the BS and the RISs are adaptively configured by carefully choosing BS beamforming codeword and RIS codewords from their respective codebooks in a sequential manner to progressively focus onto the user . Most existing codebook designs for RIS are not tailored for active sensing, by which we mean the choice of the next codeword should depend on the measurements made so far, and the sequence of codewords should dynamically focus reflection toward the user . Moreover, most existing codeword selection methods rely on exhaustive search in beam training to identify the codeword with the highest signal-to-noise ratio (SNR), thus incurring substantial pilot overhead as the size of the codebook scales. This paper proposes a learning-based approach for codebook construction and for codeword selection for active sensing. The proposed learning approach aims to locate a target in the service area by recursively selecting a sequence of BS beamforming codewords and RIS codewords from the respective codebooks as more measurements become available without exhaustive beam training. The codebook design and the codeword selection fuse key ideas from the vector quantized variational autoencoder (VQ-V AE) and the long short-term memory (LSTM) network to learn respectively the discrete function space of the codebook and the temporal dependencies between measurements. The device is typically placed in the reflecting path between the transceivers, with its configuration wirelessly controlled by the transceivers via a control link. Manuscript submitted to IEEE Transactions on Wireless Communications on September 6, 2024, revised on January 12, 2025, accepted on March 5, 2025. Wei Y u is with The Edward S. Rogers Sr. This work is supported by the Natural Sciences and Engineering Research Council of Canada via the Canada Research Chairs program. The materials in this paper have been accepted in part at the IEEE Workshop on Signal Processing Advances in Wireless Communications (SP A WC), Lucca, Italy, September 2024 [1]. Codebook-based limited control link rate protocol can substantially reduce the control overhead [7], [8]. With the RIS codebook stored at the controller and at the RIS, the controller only needs to send the codeword index in order to configure the RIS.


On Self-Adaptive Perception Loss Function for Sequential Lossy Compression

arXiv.org Artificial Intelligence

We consider causal, low-latency, sequential lossy compression, with mean squared-error (MSE) as the distortion loss, and a perception loss function (PLF) to enhance the realism of reconstructions. As the main contribution, we propose and analyze a new PLF that considers the joint distribution between the current source frame and the previous reconstructions. We establish the theoretical rate-distortion-perception function for first-order Markov sources and analyze the Gaussian model in detail. From a qualitative perspective, the proposed metric can simultaneously avoid the error-permanence phenomenon and also better exploit the temporal correlation between high-quality reconstructions. The proposed metric is referred to as self-adaptive perception loss function (PLF-SA), as its behavior adapts to the quality of reconstructed frames. We provide a detailed comparison of the proposed perception loss function with previous approaches through both information theoretic analysis as well as experiments involving moving MNIST and UVG datasets.


What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph

arXiv.org Artificial Intelligence

Recent Multimodal Large Language Models(MLLMs) often use a large number of visual tokens to compensate their visual shortcoming, leading to excessive computation and obvious visual redundancy. In this paper, we investigate what kind of visual tokens are needed for MLLMs, and reveal that both foreground and background tokens are critical for MLLMs given the varying difficulties of examples. Based on this observation, we propose a graph-based method towards training-free visual token pruning, termed G-Prune.In particular, G-Prune regards visual tokens as nodes, and construct their connections based on their semantic similarities. Afterwards, the information flow is propagated via weighted links, and the most important tokens after iterations are kept for MLLMs, which can be front or background.To validate G-Prune, we apply it to a recent MLLM called LLaVA-NeXT, and conduct extensive experiments on a set of benchmarks.The experiment results show that G-Prune can greatly reduce computation overhead while retaining high performance on both coarse- and fine-grained tasks. For instance, G-Prune can reduce 63.57\% FLOPs of LLaVA-NeXT on VQA2.0 and TextVQA with only 0.95\% and 2.34\% accuracy drops, respectively.


Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models

arXiv.org Artificial Intelligence

Adapter-based parameter-efficient transfer learning has achieved exciting results in vision-language models. Traditional adapter methods often require training or fine-tuning, facing challenges such as insufficient samples or resource limitations. While some methods overcome the need for training by leveraging image modality cache and retrieval, they overlook the text modality's importance and cross-modal cues for the efficient adaptation of parameters in visual-language models. This work introduces a cross-modal parameter-efficient approach named XMAdapter. XMAdapter establishes cache models for both text and image modalities. It then leverages retrieval through visual-language bimodal information to gather clues for inference. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion, decoupling different modal similarities to assess their respective contributions. Additionally, it explores hard samples based on differences in cross-modal affinity and enhances model performance through adaptive adjustment of sample learning intensity. Extensive experimental results on benchmark datasets demonstrate that XMAdapter outperforms previous adapter-based methods significantly regarding accuracy, generalization, and efficiency.


Transfer Learning with Reconstruction Loss

arXiv.org Machine Learning

In most applications of utilizing neural networks for mathematical optimization, a dedicated model is trained for each specific optimization objective. However, in many scenarios, several distinct yet correlated objectives or tasks often need to be optimized on the same set of problem inputs. Instead of independently training a different neural network for each problem separately, it would be more efficient to exploit the correlations between these objectives and to train multiple neural network models with shared model parameters and feature representations. To achieve this, this paper first establishes the concept of common information: the shared knowledge required for solving the correlated tasks, then proposes a novel approach for model training by adding into the model an additional reconstruction stage associated with a new reconstruction loss. This loss is for reconstructing the common information starting from a selected hidden layer in the model. The proposed approach encourages the learned features to be general and transferable, and therefore can be readily used for efficient transfer learning. For numerical simulations, three applications are studied: transfer learning on classifying MNIST handwritten digits, the device-to-device wireless network power allocation, and the multiple-input-single-output network downlink beamforming and localization. Simulation results suggest that the proposed approach is highly efficient in data and model complexity, is resilient to over-fitting, and has competitive performances.


Soft-Prompting with Graph-of-Thought for Multi-modal Representation Learning

arXiv.org Artificial Intelligence

The chain-of-thought technique has been received well in multi-modal tasks. It is a step-by-step linear reasoning process that adjusts the length of the chain to improve the performance of generated prompts. However, human thought processes are predominantly non-linear, as they encompass multiple aspects simultaneously and employ dynamic adjustment and updating mechanisms. Therefore, we propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning. The proposed AGoT models the human thought process not only as a chain but also models each step as a reasoning aggregation graph to cope with the overlooked multiple aspects of thinking in single-step reasoning. This turns the entire reasoning process into prompt aggregation and prompt flow operations. Experiments show that our multi-modal model enhanced with AGoT soft-prompting achieves good results in several tasks such as text-image retrieval, visual question answering, and image recognition. In addition, we demonstrate that it has good domain generalization performance due to better reasoning.


Theoretical Modeling and Bio-inspired Trajectory Optimization of A Multiple-locomotion Origami Robot

arXiv.org Artificial Intelligence

Recent research on mobile robots has focused on increasing their adaptability to unpredictable and unstructured environments using soft materials and structures. However, the determination of key design parameters and control over these compliant robots are predominantly iterated through experiments, lacking a solid theoretical foundation. To improve their efficiency, this paper aims to provide mathematics modeling over two locomotion, crawling and swimming. Specifically, a dynamic model is first devised to reveal the influence of the contact surfaces' frictional coefficients on displacements in different motion phases. Besides, a swimming kinematics model is provided using coordinate transformation, based on which, we further develop an algorithm that systematically plans human-like swimming gaits, with maximum thrust obtained. The proposed algorithm is highly generalizable and has the potential to be applied in other soft robots with multiple joints. Simulation experiments have been conducted to illustrate the effectiveness of the proposed modeling.


Rate-Distortion-Perception Tradeoff Based on the Conditional-Distribution Perception Measure

arXiv.org Artificial Intelligence

We study the rate-distortion-perception (RDP) tradeoff for a memoryless source model in the asymptotic limit of large block-lengths. Our perception measure is based on a divergence between the distributions of the source and reconstruction sequences conditioned on the encoder output, which was first proposed in [1], [2]. We consider the case when there is no shared randomness between the encoder and the decoder. For the case of discrete memoryless sources we derive a single-letter characterization of the RDP function, thus settling a problem that remains open for the marginal metric introduced in Blau and Michaeli [3] (with no shared randomness). Our achievability scheme is based on lossy source coding with a posterior reference map proposed in [4]. For the case of continuous valued sources under squared error distortion measure and squared quadratic Wasserstein perception measure we also derive a single-letter characterization and show that a noise-adding mechanism at the decoder suffices to achieve the optimal representation. For the case of zero perception loss, we show that our characterization interestingly coincides with the results for the marginal metric derived in [5], [6] and again demonstrate that zero perception loss can be achieved with a $3$-dB penalty in the minimum distortion. Finally we specialize our results to the case of Gaussian sources. We derive the RDP function for vector Gaussian sources and propose a waterfilling type solution. We also partially characterize the RDP function for a mixture of vector Gaussians.


Localization with Reconfigurable Intelligent Surface: An Active Sensing Approach

arXiv.org Artificial Intelligence

This paper addresses an uplink localization problem in which a base station (BS) aims to locate a remote user with the help of reconfigurable intelligent surfaces (RISs). We propose a strategy in which the user transmits pilots sequentially and the BS adaptively adjusts the sensing vectors, including the BS beamforming vector and multiple RIS reflection coefficients based on the observations already made, to eventually produce an estimated user position. This is a challenging active sensing problem for which finding an optimal solution involves searching through a complicated functional space whose dimension increases with the number of measurements. We show that the long short-term memory (LSTM) network can be used to exploit the latent temporal correlation between measurements to automatically construct scalable state vectors. Subsequently, the state vector is mapped to the sensing vectors for the next time frame via a deep neural network (DNN). A final DNN is used to map the state vector to the estimated user position. Numerical result illustrates the advantage of the active sensing design as compared to non-active sensing methods. The proposed solution produces interpretable results and is generalizable in the number of sensing stages. Remarkably, we show that a network with one BS and multiple RISs can outperform a comparable setting with multiple BSs.


Levenshtein Distance Embedding with Poisson Regression for DNA Storage

arXiv.org Artificial Intelligence

Efficient computation or approximation of Levenshtein distance, a widely-used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, the Poisson regression is introduced by assuming the Levenshtein distance between sequences of fixed length following a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log likelihood of the chi-squared distribution and offers advancements in removing the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.