Goto

Collaborating Authors

 semantically


Sharing Key Semantics in Transformer Makes Efficient Image Restoration

Neural Information Processing Systems

Image Restoration (IR), a classic low-level vision task, has witnessed significant advancements through deep models that effectively model global information. Notably, the emergence of Vision Transformers (ViTs) has further propelled these advancements. When computing, the self-attention mechanism, a cornerstone of ViTs, tends to encompass all global cues, even those from semantically unrelated objects or regions. This inclusivity introduces computational inefficiencies, particularly noticeable with high input resolution, as it requires processing irrelevant information, thereby impeding efficiency. Additionally, for IR, it is commonly noted that small segments of a degraded image, particularly those closely aligned semantically, provide particularly relevant information to aid in the restoration process, as they contribute essential contextual cues crucial for accurate reconstruction. To address these challenges, we propose boosting IR's performance by sharing the key semantics via Transformer for IR (i.e., SemanIR) in this paper.



SharingKeySemanticsinTransformerMakes EfficientImageRestoration

Neural Information Processing Systems

Image restoration (IR) stands as a fundamental task within low-level computer vision, aiming to enhance the quality of images affected by numerous factors, including noise, blur,lowresolution, compression artifacts, mosaic patterns, adverse weather conditions, and other forms of distortion. This capability holds broad utility across various domains, facilitating information recovery in medical imaging, surveillance, and satellite imagery. Furthermore, it bolsters downstream vision tasks like object detection, recognition, and tracking [74, 60].


Bridge the Points: Graph-based Few-shot Segment Anything Semantically

Neural Information Processing Systems

The recent advancements in large-scale pre-training techniques have significantly enhanced the capabilities of vision foundation models, notably the Segment Anything Model (SAM), which can generate precise masks based on point and box prompts. Recent studies extend SAM to Few-shot Semantic Segmentation (FSS), focusing on prompt generation for SAM-based automatic semantic segmentation. However, these methods struggle with selecting suitable prompts, require specific hyperparameter settings for different scenarios, and experience prolonged one-shot inference times due to the overuse of SAM, resulting in low efficiency and limited automation ability. To address these issues, we propose a simple yet effective approach based on graph analysis. In particular, a Positive-Negative Alignment module dynamically selects the point prompts for generating masks, especially uncovering the potential of the background context as the negative reference.


What You See is What You Read? Improving Text-Image Alignment Evaluation

Neural Information Processing Systems

Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.


Annotation-Free Class-Incremental Learning

arXiv.org Artificial Intelligence

Despite significant progress in continual learning ranging from architectural novelty to clever strategies for mitigating catastrophic forgetting; most existing methods rest on a strong but unrealistic assumption: "the availability of labeled data throughout the learning process". In real-world scenarios, however, data often arrives sequentially and without annotations, rendering conventional approaches impractical. In this work, we revisit the fundamental assumptions of continual learning and ask: Can current systems adapt when labels are absent and tasks emerge incrementally over time? T o this end, we introduce Annotation-Free Class-Incremental Learning (AF-CIL), a more realistic and challenging paradigm where unlabeled data arrives continuously, and the learner must incrementally acquire new classes without any supervision. T o enable effective learning under AF-CIL, we propose CrossW orld-CL, a Cross Domain W orld Guided Continual Learning framework that incorporates external world knowledge as a stable auxiliary source. The method retrieves semantically related ImageNet classes for each downstream category, maps downstream and ImageNet features through a cross-domain alignment strategy and finally introduce a novel replay strategy. This design lets the model uncover semantic structure without annotations while keeping earlier knowledge intact.


SYNAPSE: Synergizing an Adapter and Finetuning for High-Fidelity EEG Synthesis from a CLIP-Aligned Encoder

arXiv.org Artificial Intelligence

Recent progress in diffusion-based generative models has enabled high-quality image synthesis conditioned on diverse modalities. Extending such models to brain signals could deepen our understanding of human perception and mental representations. However,electroencephalography (EEG) presents major challenges for image generation due to high noise, low spatial resolution, and strong inter-subject variability. Existing approaches,such as DreamDiffusion, BrainVis, and GWIT, primarily adapt EEG features to pre-trained Stable Diffusion models using complex alignment or classification pipelines, often resulting in large parameter counts and limited interpretability. We introduce SYNAPSE, a two-stage framework that bridges EEG signal representation learning and high-fidelity image synthesis. In Stage1, a CLIP-aligned EEG autoencoder learns a semantically structured latent representation by combining signal reconstruction and cross-modal alignment objectives. In Stage2, the pretrained encoder is frozen and integrated with a lightweight adaptation of Stable Diffusion, enabling efficient conditioning on EEG features with minimal trainable parameters. Our method achieves a semantically coherent latent space and state-of-the-art perceptual fidelity on the CVPR40 dataset, outperforming prior EEG-to-image models in both reconstruction efficiency and image quality. Quantitative and qualitative analyses demonstrate that SYNAPSE generalizes effectively across subjects, preserving visual semantics even when class-level agreement is reduced. These results suggest that reconstructing what the brain perceives, rather than what it classifies, is key to faithful EEG-based image generation.


Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding

arXiv.org Artificial Intelligence

Pre-training has proven effective for learning transferable features in sign language understanding (SLU) tasks. Recently, skeleton-based methods have gained increasing attention because they can robustly handle variations in subjects and backgrounds without being affected by appearance or environmental factors. Current SLU methods continue to face three key limitations: 1) weak semantic grounding, as models often capture low-level motion patterns from skeletal data but struggle to relate them to linguistic meaning; 2) imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them for broader context; and 3) inefficient cross-modal learning, as constructing semantically aligned representations across modalities remains difficult. To address these, we propose Sigma, a unified skeleton-based SLU framework featuring: 1) a sign-aware early fusion mechanism that facilitates deep interaction between visual and textual modalities, enriching visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximises agreements across different levels of paired features from different modalities, effectively capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching and language modelling to promote semantic consistency and generalisation. Sigma achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation on multiple benchmarks spanning different sign and spoken languages, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.


Modeling Turn-Taking with Semantically Informed Gestures

arXiv.org Artificial Intelligence

In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.


LLM Fingerprinting via Semantically Conditioned Watermarks

arXiv.org Artificial Intelligence

Most LLM fingerprinting methods teach the model to respond to a few fixed queries with predefined atypical responses (keys). This memorization often does not survive common deployment steps such as finetuning or quantization, and such keys can be easily detected and filtered from LLM responses, ultimately breaking the fingerprint. To overcome these limitations we introduce LLM fingerprinting via semantically conditioned watermarks, replacing fixed query sets with a broad semantic domain, and replacing brittle atypical keys with a statistical watermarking signal diffused throughout each response. After teaching the model to watermark its responses only to prompts from a predetermined domain e.g., French language, the model owner can use queries from that domain to reliably detect the fingerprint and verify ownership. As we confirm in our thorough experimental evaluation, our fingerprint is both stealthy and robust to all common deployment scenarios.