Goto

Collaborating Authors

 vlp


G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training

Neural Information Processing Systems

Medical imaging tasks require an understanding of subtle and localized visual features due to the inherently detailed and area-specific nature of pathological patterns, which are crucial for clinical diagnosis. Although recent advances in medical vision-language pre-training (VLP) enable models to learn clinically relevant visual features by leveraging both medical images and their associated radiology reports, current medical VLP methods primarily focus on aligning images with entire reports. This focus hinders the learning of dense (pixel-level) visual features and is suboptimal for dense prediction tasks (e.g., medical image segmentation).To address this challenge, we propose a novel medical VLP framework, named Global to Dense level representation learning (G2D)


Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training

Neural Information Processing Systems

Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves for downstream vision-language tasks in a fine-tuning fashion. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN, and then aligns images and text with a Transformer. Visual relationship between visual contents plays an important role in image understanding and is the basic for inter-modal alignment learning. However, CNNs have limitations in visual relation learning due to local receptive field's weakness in modeling long-range dependencies. Thus the two objectives of learning visual relation and inter-modal alignment are encapsulated in the same Transformer network. Such design might restrict the inter-modal alignment learning in the Transformer by ignoring the specialized characteristic of each objective.


Synthesizing Visual Concepts as Vision-Language Programs

arXiv.org Artificial Intelligence

Vision-Language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning tasks, leading to inconsistent or illogical outputs. Neuro-symbolic methods promise to address this by inducing interpretable logical rules, though they exploit rigid, domain-specific perception modules. We propose Vision-Language Programs (VLP), which combine the perceptual flexibility of VLMs with systematic reasoning of program synthesis. Rather than embedding reasoning inside the VLM, VLP leverages the model to produce structured visual descriptions that are compiled into neuro-symbolic programs. The resulting programs execute directly on images, remain consistent with task constraints, and provide human-interpretable explanations that enable easy shortcut mitigation. Experiments on synthetic and real-world datasets demonstrate that VLPs outperform direct and structured prompting, particularly on tasks requiring complex logical reasoning.


Performance Evaluation of an Integrated System for Visible Light Communication and Positioning Using an Event Camera

arXiv.org Artificial Intelligence

Event cameras, featuring high temporal resolution and high dynamic range, offer visual sensing capabilities comparable to conventional image sensors while capturing fast-moving objects and handling scenes with extreme lighting contrasts such as tunnel exits. Leveraging these properties, this study proposes a novel self-localization system that integrates visible light communication (VLC) and visible light positioning (VLP) within a single event camera. The system enables a vehicle to estimate its position even in GPS-denied environments, such as tunnels, by using VLC to obtain coordinate information from LED transmitters and VLP to estimate the distance to each transmitter. Multiple LEDs are installed on the transmitter side, each assigned a unique pilot sequence based on Walsh-Hadamard codes. The event camera identifies individual LEDs within its field of view by correlating the received signal with these codes, allowing clear separation and recognition of each light source. This mechanism enables simultaneous high-capacity MISO (multi-input single-output) communication through VLC and precise distance estimation via phase-only correlation (POC) between multiple LED pairs. To the best of our knowledge, this is the first vehicle-mounted system to achieve simultaneous VLC and VLP functionalities using a single event camera. Field experiments were conducted by mounting the system on a vehicle traveling at 30 km/h (8.3 m/s). The results demonstrated robust real-world performance, with a root mean square error (RMSE) of distance estimation within 0.75 m for ranges up to 100 m and a bit error rate (BER) below 0.01 across the same range.


The Security Threat of Compressed Projectors in Large Vision-Language Models

arXiv.org Artificial Intelligence

The choice of a suitable visual language projector (VLP) is critical to the successful training of large visual language models (LVLMs). Mainstream VLPs can be broadly categorized into compressed and uncompressed projectors, and each offers distinct advantages in performance and computational efficiency. However, their security implications have not been thoroughly examined. Our comprehensive evaluation reveals significant differences in their security profiles: compressed projectors exhibit substantial vulnerabilities, allowing adversaries to successfully compromise LVLMs even with minimal knowledge of structure information. In stark contrast, uncompressed projectors demonstrate robust security properties and do not introduce additional vulnerabilities. These findings provide critical guidance for researchers in selecting optimal VLPs that enhance the security and reliability of visual language models. The code is available at https://github.com/btzyd/TCP.


G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training

Neural Information Processing Systems

Medical imaging tasks require an understanding of subtle and localized visual features due to the inherently detailed and area-specific nature of pathological patterns, which are crucial for clinical diagnosis. Although recent advances in medical vision-language pre-training (VLP) enable models to learn clinically relevant visual features by leveraging both medical images and their associated radiology reports, current medical VLP methods primarily focus on aligning images with entire reports. This focus hinders the learning of dense (pixel-level) visual features and is suboptimal for dense prediction tasks (e.g., medical image segmentation).To address this challenge, we propose a novel medical VLP framework, named Global to Dense level representation learning (G2D), which aims to learn global and dense visual features simultaneously using only image-text pairs without extra annotations. In particular, G2D designs a Pseudo Segmentation (PS) task, which enables the model to learn dense visual features during VLP. Notably, generating PS masks can be performed on the fly during VLP, which does not incur extra trainable parameters.


VLP: Vision-Language Preference Learning for Embodied Manipulation

arXiv.org Artificial Intelligence

Reward engineering is one of the key challenges in Reinforcement Learning (RL). Preference-based RL effectively addresses this issue by learning from human feedback. However, it is both time-consuming and expensive to collect human preference labels. In this paper, we propose a novel \textbf{V}ision-\textbf{L}anguage \textbf{P}reference learning framework, named \textbf{VLP}, which learns a vision-language preference model to provide preference feedback for embodied manipulation tasks. To achieve this, we define three types of language-conditioned preferences and construct a vision-language preference dataset, which contains versatile implicit preference orders without human annotations. The preference model learns to extract language-related features, and then serves as a preference annotator in various downstream tasks. The policy can be learned according to the annotated preferences via reward learning or direct policy optimization. Extensive empirical results on simulated embodied manipulation tasks demonstrate that our method provides accurate preferences and generalizes to unseen tasks and unseen language instructions, outperforming the baselines by a large margin.


Stochastic Learning of Non-Conjugate Variational Posterior for Image Classification

arXiv.org Machine Learning

Large scale Bayesian nonparametrics (BNP) learner such as stochastic variational inference (SVI) can handle datasets with large class number and large training size at fractional cost. Like its predecessor, SVI rely on the assumption of conjugate variational posterior to approximate the true posterior. A more challenging problem is to consider large scale learning on non-conjugate posterior. Recent works in this direction are mostly associated with using Monte Carlo methods for approximating the learner. However, these works are usually demonstrated on non-BNP related task and less complex models such as logistic regression, due to higher computational complexity. In order to overcome the issue faced by SVI, we develop a novel approach based on the recently proposed variational maximization-maximization (VMM) learner to allow large scale learning on non-conjugate posterior. Unlike SVI, our VMM learner does not require closed-form expression for the variational posterior expectatations. Our only requirement is that the variational posterior is differentiable. In order to ensure convergence in stochastic settings, SVI rely on decaying step-sizes to slow its learning. Inspired by SVI and Adam, we propose the novel use of decaying step-sizes on both gradient and ascent direction in our VMM to significantly improve its learning. We show that our proposed methods is compatible with ResNet features when applied to large class number datasets such as MIT67 and SUN397. Finally, we compare our proposed learner with several recent works such as deep clustering algorithms and showed we were able to produce on par or outperform the state-of-the-art methods in terms of clustering measures.


MG-3D: Multi-Grained Knowledge-Enhanced 3D Medical Vision-Language Pre-training

arXiv.org Artificial Intelligence

3D medical image analysis is pivotal in numerous clinical applications. However, the scarcity of labeled data and limited generalization capabilities hinder the advancement of AI-empowered models. Radiology reports are easily accessible and can serve as weakly-supervised signals. However, large-scale vision-language pre-training (VLP) remains underexplored in 3D medical image analysis. Specifically, the insufficient investigation into multi-grained radiology semantics and their correlations across patients leads to underutilization of large-scale volume-report data. Considering intra-patient cross-modal semantic consistency and inter-patient semantic correlations, we propose a multi-task VLP method, MG-3D, pre-trained on large-scale data (47.1K), addressing the challenges by the following two aspects: 1) Establishing the correspondence between volume semantics and multi-grained medical knowledge of each patient with cross-modal global alignment and complementary modality-guided local reconstruction, ensuring intra-patient features of different modalities cohesively represent the same semantic content; 2) Correlating inter-patient visual semantics based on fine-grained report correlations across patients, and keeping sensitivity to global individual differences via contrastive learning, enhancing the discriminative feature representation. Furthermore, we delve into the scaling law to explore potential performance improvements. Comprehensive evaluations across nine uni- and cross-modal clinical tasks are carried out to assess model efficacy. Extensive experiments on both internal and external datasets demonstrate the superior transferability, scalability, and generalization of MG-3D, showcasing its potential in advancing feature representation for 3D medical image analysis. Code will be available: https://github.com/Xuefeng-Ni/MG-3D.


Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training

Neural Information Processing Systems

Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves for downstream vision-language tasks in a fine-tuning fashion. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN, and then aligns images and text with a Transformer. Visual relationship between visual contents plays an important role in image understanding and is the basic for inter-modal alignment learning. However, CNNs have limitations in visual relation learning due to local receptive field's weakness in modeling long-range dependencies. Thus the two objectives of learning visual relation and inter-modal alignment are encapsulated in the same Transformer network. Such design might restrict the inter-modal alignment learning in the Transformer by ignoring the specialized characteristic of each objective.