semantic guidance
SEGA: Instructing Text-to-Image Models using Semantic Guidance
Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions.
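As a rough illustration of steering along a semantic direction, the sketch below combines three noise estimates element-wise: an unconditional one, a text-conditioned one, and one conditioned on an extra concept. The abstract gives no formula, so the function name, the parameter names, and the plain additive form are assumptions. The published method may weight or mask the concept term differently.

```python
def guided_noise(eps_uncond, eps_text, eps_concept,
                 guidance_scale=7.5, concept_scale=5.0):
    """Hypothetical sketch: classifier-free guidance plus one semantic direction.

    Each argument is a flat list of floats standing in for a noise estimate.
    The additive concept term is an assumption, not the paper's exact rule.
    """
    combined = []
    for u, t, c in zip(eps_uncond, eps_text, eps_concept):
        text_direction = t - u        # direction toward the text prompt
        concept_direction = c - u     # direction toward the extra concept
        combined.append(u
                        + guidance_scale * text_direction
                        + concept_scale * concept_direction)
    return combined
```

Setting `concept_scale` to zero reduces this to ordinary classifier-free guidance, and a negative value pushes the estimate away from the concept.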
COVLM-RL: Critical Object-Oriented Reasoning for Autonomous Driving Using VLM-Guided Reinforcement Learning
Li, Lin, Cai, Yuxin, Fang, Jianwu, Xue, Jianru, Lv, Chen
End-to-end autonomous driving frameworks face persistent challenges in generalization, training efficiency, and interpretability. While recent methods leverage Vision-Language Models (VLMs) through supervised learning on large-scale datasets to improve reasoning, they often lack robustness in novel scenarios. Conversely, reinforcement learning (RL)-based approaches enhance adaptability but remain data-inefficient and lack transparent decision-making. To address these limitations, we propose COVLM-RL, a novel end-to-end driving framework that integrates Critical Object-oriented (CO) reasoning with VLM-guided RL. Specifically, we design a Chain-of-Thought (CoT) prompting strategy that enables the VLM to reason over critical traffic elements and generate high-level semantic decisions, effectively transforming multi-view visual inputs into structured semantic decision priors. These priors reduce the input dimensionality and inject task-relevant knowledge into the RL loop, accelerating training and improving policy interpretability. However, bridging high-level semantic guidance with continuous low-level control remains non-trivial. To this end, we introduce a consistency loss that encourages alignment between the VLM's semantic plans and the RL agent's control outputs, enhancing interpretability and training stability. Experiments conducted in the CARLA simulator demonstrate that COVLM-RL significantly improves the success rate by 30% in trained driving environments and by 50% in previously unseen environments, highlighting its strong generalization capability.
- Asia > Singapore (0.05)
- Asia > China > Shaanxi Province > Xi'an (0.04)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- Transportation > Ground > Road (1.00)
- Information Technology > Robotics & Automation (0.63)
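One way to picture the consistency loss mentioned in the COVLM-RL abstract is as a cross-entropy term between the VLM's chosen high-level decision and a distribution the RL agent places over the same decisions. This is only a sketch under assumptions: the abstract does not define the loss, and the discrete decision set, the logits interface, and the function names below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(policy_decision_logits, vlm_decision_index):
    """Hypothetical loss: negative log-probability that the policy assigns
    to the decision the VLM selected (for example 0=stop, 1=slow, 2=go).
    The actual loss in the paper may take a different form.
    """
    probs = softmax(policy_decision_logits)
    return -math.log(probs[vlm_decision_index])
```

The loss shrinks as the policy puts more probability on the VLM's decision, which is the alignment behaviour the abstract describes.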
SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion
Zhang, Xiaoyang, Li, Jinjiang, Fan, Guodong, Ju, Yakun, Fan, Linwei, Liu, Jun, Kot, Alex C.
Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model's coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse.
- Europe > United Kingdom > England > Leicestershire > Leicester (0.04)
- Asia > China > Shandong Province > Yantai (0.04)
- Asia > China > Fujian Province > Fuzhou (0.04)
GUIDES: Guidance Using Instructor-Distilled Embeddings for Pre-trained Robot Policy Enhancement
Gao, Minquan, Li, Xinyi, Yan, Qing, Sun, Xiaojian, Zhang, Xiaopan, Huang, Chien-Ming, Li, Jiachen
Pre-trained robot policies serve as the foundation of many validated robotic systems, which encapsulate extensive embodied knowledge. However, they often lack the semantic awareness characteristic of foundation models, and replacing them entirely is impractical in many situations due to high costs and the loss of accumulated knowledge. To address this gap, we introduce GUIDES, a lightweight framework that augments pre-trained policies with semantic guidance from foundation models without requiring architectural redesign. GUIDES employs a fine-tuned vision-language model (Instructor) to generate contextual instructions, which are encoded by an auxiliary module into guidance embeddings. These embeddings are injected into the policy's latent space, allowing the legacy model to adapt to this new semantic input through brief, targeted fine-tuning. For inference-time robustness, a large language model-based Reflector monitors the Instructor's confidence and, when confidence is low, initiates a reasoning loop that analyzes execution history, retrieves relevant examples, and augments the VLM's context to refine subsequent actions. Extensive validation in the RoboCasa simulation environment across diverse policy architectures shows consistent and substantial improvements in task success rates. Real-world deployment on a UR5 robot further demonstrates that GUIDES enhances motion precision for critical sub-tasks such as grasping. Overall, GUIDES offers a practical and resource-efficient pathway to upgrade, rather than replace, validated robot policies.
- North America > United States > California > Riverside County > Riverside (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Direct Semantic Communication Between Large Language Models via Vector Translation
Yang, Fu-Chun, Eshraghian, Jason
When two Large Language Models (LLMs) debate an answer, critique each other's chain of thought, or sequentially refine a shared draft of text, they speak through plain tokens. Every round forces each model to flatten rich geometry into text, operate on that, then rebuild meaning. Ultimately, computational resources are wasted, and limited information bandwidth can erase nuance. Specialised LLMs thus operate in isolation, communicating only through text interfaces that constrain information transfer and add overhead. Encoding semantics into tokens and re-decoding them discards much of the latent structure that models use internally, blurring complex relationships in the process. Yet each LLM carries a distinct internal representation space shaped by architecture, training objective, and data. Those spaces differ enough that raw vectors are not interchangeable, prompting the question: Can semantic information encoded in one model's vector space be translated so another model can use it directly? We demonstrate this is possible by learning bidirectional vector translations that create a latent bridge between models. Injecting these translated vectors directly into a target model's pipeline lets the pair share meaning without serialising to tokens, enabling chains, ensembles, and parallel collaborations to run at latent speed, bypassing text-based limitations.
- Research Report > New Finding (0.69)
- Research Report > Experimental Study (0.69)
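The idea of a bidirectional vector translation can be sketched on toy data by fitting one map in each direction between paired vectors from two spaces. A plain linear map trained by gradient descent is used here purely as a stand-in, since the abstract does not say what form the learned translators take. All names and the toy vectors are assumptions.

```python
def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def fit_linear_translator(src_vecs, dst_vecs, lr=0.1, steps=500):
    """Fit W minimising mean squared error of W @ src against dst.

    Hypothetical simplification: a single linear map, full-batch
    gradient descent, starting from a zero matrix.
    """
    d_out, d_in = len(dst_vecs[0]), len(src_vecs[0])
    weights = [[0.0] * d_in for _ in range(d_out)]
    n = len(src_vecs)
    for _ in range(steps):
        grad = [[0.0] * d_in for _ in range(d_out)]
        for x, y in zip(src_vecs, dst_vecs):
            pred = matvec(weights, x)
            for i in range(d_out):
                err = pred[i] - y[i]
                for j in range(d_in):
                    grad[i][j] += 2.0 * err * x[j] / n
        for i in range(d_out):
            for j in range(d_in):
                weights[i][j] -= lr * grad[i][j]
    return weights


# Toy paired vectors: model B's space is a rotated, rescaled copy of model A's.
space_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
space_b = [[0.0, -1.0], [2.0, 0.0], [2.0, -1.0]]

a_to_b = fit_linear_translator(space_a, space_b)  # forward translation
b_to_a = fit_linear_translator(space_b, space_a)  # backward translation
round_trip = matvec(b_to_a, matvec(a_to_b, [3.0, 4.0]))
```

On this toy data the round trip through both maps recovers the original vector to within numerical tolerance. Real model spaces are high-dimensional and are unlikely to be related by an exact linear map.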
GenDexHand: Generative Simulation for Dexterous Hands
Chen, Feng, Xu, Zhuxiu, Chu, Tianzhe, Zhou, Xunzhe, Sun, Li, Wu, Zewen, Gao, Shenghua, Li, Zhongyu, Yang, Yanchao, Ma, Yi
Data scarcity remains a fundamental bottleneck for embodied intelligence. Existing approaches use large language models (LLMs) to automate gripper-based simulation generation, but they transfer poorly to dexterous manipulation, which demands more specialized environment design. Meanwhile, dexterous manipulation tasks are inherently more difficult due to their higher degrees of freedom. Massively generating feasible and trainable dexterous hand tasks remains an open challenge. To this end, we present GenDexHand, a generative simulation pipeline that autonomously produces diverse robotic tasks and environments for dexterous manipulation. GenDexHand introduces a closed-loop refinement process that adjusts object placements and scales based on vision-language model (VLM) feedback, substantially improving the average quality of generated environments. Each task is further decomposed into sub-tasks to enable sequential reinforcement learning, reducing training time and increasing success rates. Our work provides a viable path toward scalable training of diverse dexterous hand behaviors in embodied intelligence by offering a simulation-based solution to synthetic data generation. Our website: https://winniechen2002.github.io/GenDexHand/.
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > California > Santa Barbara County > Santa Barbara (0.04)
- Media (0.48)
- Information Technology (0.46)
TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting
Guo, Zhongbin, Wang, Yuhao, Jian, Ping, Li, Chengzhi, Chen, Xinyue, Yang, Zhen, E, Ertai
Temporal Change Description (TCD) and Future Satellite Image Forecasting (FSIF) are critical, yet historically disjointed tasks in Satellite Image Time Series (SITS) analysis. Both are fundamentally limited by the common challenge of modeling long-range temporal dynamics. To improve performance on both tasks simultaneously by strengthening long-range temporal understanding, we introduce TAMMs, the first unified framework designed to jointly perform TCD and FSIF within a single MLLM-diffusion architecture. TAMMs introduces two key innovations: Temporal Adaptation Modules (TAM) enhance a frozen MLLM's ability to comprehend long-range dynamics, and a Semantic-Fused Control Injection (SFCI) mechanism translates this change understanding into fine-grained generative control. This synergistic design allows the understanding from the TCD task to directly inform and improve the consistency of the FSIF task. Extensive experiments demonstrate that TAMMs significantly outperforms state-of-the-art specialist baselines on both tasks.
- North America > United States (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Energy (0.48)
- Transportation (0.46)
- Asia > China (0.14)
- North America > United States > California (0.14)
SAM-PTx: Text-Guided Fine-Tuning of SAM with Parameter-Efficient, Parallel-Text Adapters
The Segment Anything Model (SAM) has demonstrated impressive generalization in prompt-based segmentation. Yet, the potential of semantic text prompts remains underexplored compared to traditional spatial prompts like points and boxes. This paper introduces SAM-PTx, a parameter-efficient approach for adapting SAM using frozen CLIP-derived text embeddings as class-level semantic guidance. Specifically, we propose a lightweight adapter design called Parallel-Text that injects text embeddings into SAM's image encoder, enabling semantics-guided segmentation while keeping most of the original architecture frozen. Our adapter modifies only the MLP-parallel branch of each transformer block, preserving the attention pathway for spatial reasoning. Through supervised experiments and ablations on the COD10K dataset as well as low-data subsets of COCO and ADE20K, we show that incorporating fixed text embeddings as input improves segmentation performance over purely spatial prompt baselines. To our knowledge, this is the first work to use text prompts for segmentation on the COD10K dataset. These results suggest that integrating semantic conditioning into SAM's architecture offers a practical and scalable path for efficient adaptation with minimal computational complexity.
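The block structure described in the SAM-PTx abstract, where the attention pathway is untouched and text embeddings enter through a branch parallel to the MLP, might be sketched as below. The residual layout and every function name are assumptions for illustration. The abstract does not specify how the adapter combines token features with the text embedding.

```python
def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def parallel_text_block(tokens, text_embedding, attention, mlp, adapter):
    """Hypothetical transformer block with a text adapter parallel to the MLP.

    tokens: list of token vectors.
    attention: frozen callable mapping all tokens to per-token updates.
    mlp: frozen callable mapping one token to its update.
    adapter: trainable callable taking (token, text_embedding).
    """
    # Attention pathway is left as-is (frozen, no text injected here).
    attended = [add(tok, upd) for tok, upd in zip(tokens, attention(tokens))]
    # MLP output and the adapter output are summed into the same residual.
    return [add(add(tok, mlp(tok)), adapter(tok, text_embedding))
            for tok in attended]


# Toy stand-ins so the sketch runs end to end.
def zero_attention(tokens):
    return [[0.0] * len(tok) for tok in tokens]


def zero_mlp(token):
    return [0.0] * len(token)


def text_adapter(token, text_embedding):
    return list(text_embedding)


block_output = parallel_text_block([[1.0, 2.0]], [0.5, 0.5],
                                   zero_attention, zero_mlp, text_adapter)
```

If the adapter returns zeros, the block behaves like the frozen original, which is one reason a parallel branch is a convenient place to add trainable parameters.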