fang
Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation
Vision-Language Navigation in Continuous Environments (VLN-CE) is a difficult problem for autonomous agents: they must combine natural language instructions with visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and the lack of a robust decision-making framework. We introduce the Hierarchical Semantic-Augmented Navigation (HSAN) framework, which addresses VLN-CE through three complementary components. First, HSAN constructs a dynamic hierarchical semantic scene graph, leveraging vision-language models to capture multi-level environmental representations (objects, regions, and zones) that support nuanced spatial reasoning. Second, it employs an optimal transport-based topological planner, grounded in Kantorovich duality, to select long-term goals by balancing semantic relevance and spatial accessibility with theoretical guarantees of optimality. Third, a graph-aware reinforcement learning policy handles low-level control, navigating to subgoals while avoiding obstacles. By integrating spectral graph theory, optimal transport, and multi-modal learning, HSAN addresses the shortcomings of the static maps and heuristic planners prevalent in prior work. Extensive experiments on multiple challenging VLN-CE datasets demonstrate that HSAN achieves state-of-the-art performance, with significant improvements in navigation success and generalization to unseen environments.
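As a rough illustration of how optimal transport can trade off semantic relevance against spatial accessibility, the sketch below matches instruction landmarks to candidate goal nodes. It is not HSAN's planner: it swaps in entropy-regularised optimal transport solved with Sinkhorn iterations, and the cost blend, the weight `alpha`, the toy numbers, and every function name are assumptions made for illustration.

```python
import math

def blended_cost(semantic_dissim, travel_dist, alpha=0.5):
    """Mix semantic dissimilarity and travel distance (both assumed in [0, 1])."""
    return [[alpha * s + (1 - alpha) * d for s, d in zip(s_row, d_row)]
            for s_row, d_row in zip(semantic_dissim, travel_dist)]

def sinkhorn(cost, a, b, reg=0.5, iters=200):
    """Entropy-regularised OT plan between marginals a (rows) and b (columns)."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy problem: 2 instruction landmarks, 3 candidate goal nodes (made-up values).
semantic = [[0.1, 0.9, 0.8], [0.7, 0.2, 0.9]]
distance = [[0.2, 0.5, 0.9], [0.6, 0.3, 0.9]]
cost = blended_cost(semantic, distance)
plan = sinkhorn(cost, a=[0.5, 0.5], b=[1 / 3, 1 / 3, 1 / 3])

# For each landmark, pick the node that receives the most transported mass.
best_nodes = [max(range(3), key=lambda j: row[j]) for row in plan]
print(best_nodes)
```

In this toy setup each landmark sends most of its mass to the node that is both semantically close and cheap to reach. A larger `reg` spreads the mass more evenly, and a smaller one approaches the unregularised transport plan.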
Hippocampal-like Sequential Editing for Continual Knowledge Updates in Large Language Models
Large language models (LLMs) are now pivotal in real-world applications. Model editing has emerged as a promising paradigm for efficiently modifying LLMs without full retraining. However, current editing approaches face significant limitations due to parameter drift, which stems from inconsistencies between newly edited knowledge and the model's existing knowledge. In sequential editing scenarios, cumulative drift progressively leads to model collapse, characterized by degraded general capabilities and a failure to balance the acquisition of new knowledge against catastrophic forgetting of existing knowledge. Drawing inspiration from the hippocampal trisynaptic circuit for continual memorizing and forgetting, we propose a Hippocampal-like Sequential Editing (HSE) framework that combines unlearning of obsolete knowledge, separation of domain-specific knowledge updates, and replay of edited knowledge. Specifically, the HSE framework has three core mechanisms: (1) machine unlearning selectively erases outdated knowledge to facilitate the integration of new information, (2) Fisher Information Matrix-guided parameter updates prevent cross-domain knowledge interference, and (3) parameter replay consolidates long-term editing memory through lightweight, global replay of editing data in a parametric form. Theoretical analysis demonstrates that HSE achieves smaller generalization error bounds, more stable convergence, and higher computational efficiency.
How Chinese short dramas became AI content machines
The viral short dramas are increasingly being created entirely with AI, with hundreds of new shows spun up each day. In a dimly lit bedroom, a frightened young woman is thrown onto a bed by a tall, muscular man. He grabs her hand, and flame-like vines crawl across her body, fusing with her flesh. A dragon-shaped tattoo appears across her chest. "Two months," the man says. "Give me an heir, or I will eat you."
Exploring Fixed Point in Image Editing: Theoretical Support and Convergence Optimization
In image editing, Denoising Diffusion Implicit Models (DDIM) inversion has become a widely adopted method, used in many image editing approaches. The core concept of DDIM inversion stems from the deterministic sampling technique of DDIM, which allows the DDIM process to be viewed as a reversible Ordinary Differential Equation (ODE) process. This enables the prediction of the corresponding noise from a reference image, so that the image restored from this noise remains consistent with the reference image. Image editing exploits this property by modifying the cross-attention between text and images to edit specific objects while preserving the remaining regions. However, in DDIM inversion, using the $t-1$ time step to approximate the noise prediction at time step $t$ introduces errors between the restored image and the reference image.
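The approximation error described above can be seen in a scalar toy example. The sketch below is an illustration under stated assumptions, not the paper's algorithm: `eps_model` is a hypothetical stand-in for a trained noise predictor, `a_t` and `a_prev` are assumed cumulative noise-schedule values at steps $t$ and $t-1$, and the fixed-point loop is one plausible reading of the fixed-point view named in the title. Exact inversion needs the noise predicted at the unknown $x_t$. The naive step evaluates it at $x_{t-1}$ instead, and the loop re-evaluates it at the current estimate of $x_t$ until the estimate stops changing.

```python
import math

def eps_model(x, t):
    """Hypothetical noise predictor; a real one would be a trained network."""
    return math.tanh(x) + 0.1 * t

def ddim_step(x_t, t, a_t, a_prev):
    """Deterministic DDIM sampling step from x_t to x_{t-1}."""
    e = eps_model(x_t, t)
    x0 = (x_t - math.sqrt(1 - a_t) * e) / math.sqrt(a_t)
    return math.sqrt(a_prev) * x0 + math.sqrt(1 - a_prev) * e

def invert_step(x_prev, t, a_t, a_prev, iters=0):
    """Invert one DDIM step; iters > 0 refines the result by fixed-point iteration."""
    scale = math.sqrt(a_t / a_prev)
    coeff = math.sqrt(1 - a_t) - scale * math.sqrt(1 - a_prev)
    # Naive inversion: noise is predicted from x_{t-1} at step t-1.
    x_t = scale * x_prev + coeff * eps_model(x_prev, t - 1)
    for _ in range(iters):
        # Fixed-point update: predict the noise at the current estimate of x_t.
        x_t = scale * x_prev + coeff * eps_model(x_t, t)
    return x_t

x_prev, t, a_t, a_prev = 0.8, 5, 0.5, 0.6
naive = invert_step(x_prev, t, a_t, a_prev, iters=0)
refined = invert_step(x_prev, t, a_t, a_prev, iters=30)

# Reconstruction error: invert one step, then sample back with DDIM.
err_naive = abs(ddim_step(naive, t, a_t, a_prev) - x_prev)
err_refined = abs(ddim_step(refined, t, a_t, a_prev) - x_prev)
print(err_naive, err_refined)
```

The iteration converges here because the update is a contraction for this toy predictor and schedule. Whether that holds for a real diffusion model depends on the model and the step size.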
0d18ab3b5fabfa6fe47c62e711af02f0-Supplemental-Conference.pdf
In practice, a stronger 3D object detection method such as our DeepInteraction model is expected to reduce the potential accidents of self-driving cars. This improves the safety and reliability of autonomous driving. A.3 Limitations: None of the multi-modal fusion components in our DeepInteraction has a preference for any per-modal representation. We will explore how to generate initial queries from both modalities (i.e., LiDAR's bird's-eye view and the camera's front view). The bounding boxes of ground truth and predictions are shown in blue and green, respectively.
You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection
Can Transformer perform $2\mathrm{D}$ object- and region-level recognition from a pure sequence-to-sequence perspective with minimal knowledge about the $2\mathrm{D}$ spatial structure? To answer this question, we present You Only Look at One Sequence (YOLOS), a series of object detection models based on the vanilla Vision Transformer with the fewest possible modifications, region priors, as well as inductive biases of the target task. We find that YOLOS pre-trained on the mid-sized ImageNet-$1k$ dataset only can already achieve quite competitive performance on the challenging COCO object detection benchmark, e.g., YOLOS-Base directly adopted from BERT-Base architecture can obtain $42.0$ box AP on COCO val. We also discuss the impacts as well as limitations of current pre-train schemes and model scaling strategies for Transformer in vision through YOLOS. Code and pre-trained models are available at https://github.com/hustvl/YOLOS.
How snake bites really work
Vipers can strike within 100 milliseconds of launching at their prey. A venomous snake bite is not something you ever want to encounter on a hiking or camping trip. For the brave scientists who study snakes, aka herpetologists, the mechanics behind the reptiles' fast fangs are more fascinating than fear-inducing. Snakes must move incredibly quickly to sink their fangs into prey before the victim flinches.
When blood hits clothes, physics takes over
Creating mock crime scene evidence can help forensic scientists better read the stories left behind by gruesome bloodstains. To decode some of these bloody stories, all a team from North Carolina State University needed was a combination of high-speed cameras, cotton fabrics, and a bit of pig's blood. Forensic science is a relatively new concept, historically speaking. There are multiple major moments in its development, but the field of study can largely be traced back 115 years to a man named Edmond Locard.
Robust Deep Reinforcement Learning in Robotics via Adaptive Gradient-Masked Adversarial Attacks
Zhang, Zongyuan, Duan, Tianyang, Lin, Zheng, Huang, Dong, Fang, Zihan, Sun, Zekai, Xiong, Ling, Liang, Hongbin, Cui, Heming, Cui, Yong, Gao, Yue
Deep reinforcement learning (DRL) has emerged as a promising approach for robotic control, but its real-world deployment remains challenging due to its vulnerability to environmental perturbations. Existing white-box adversarial attack methods, adapted from supervised learning, fail to effectively target DRL agents because they overlook temporal dynamics and indiscriminately perturb all state dimensions, limiting their impact on long-term rewards. To address these challenges, we propose the Adaptive Gradient-Masked Reinforcement (AGMR) Attack, a white-box attack method that combines DRL with a gradient-based soft masking mechanism to dynamically identify critical state dimensions and optimize adversarial policies. AGMR selectively allocates perturbations to the most impactful state features and incorporates a dynamic adjustment mechanism to balance exploration and exploitation during training. Extensive experiments demonstrate that AGMR outperforms state-of-the-art adversarial attack methods in degrading the performance of the victim agent and enhances the victim agent's robustness through adversarial defense mechanisms.
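To make the idea of allocating perturbations to the most impactful state dimensions concrete, here is a minimal sketch of gradient-based soft masking. It is a simplification and not the AGMR method: AGMR trains its masking with DRL, whereas this sketch derives a mask directly from gradient magnitudes. The function names, the `temperature` and `epsilon` parameters, and the example gradient are all assumptions for illustration.

```python
import math

def soft_mask(grad, temperature=0.1):
    """Weight each state dimension by its gradient magnitude.

    Weights lie in (0, 1]; the dimension with the largest magnitude gets 1.
    """
    magnitudes = [abs(g) for g in grad]
    top = max(magnitudes)
    return [math.exp((m - top) / temperature) for m in magnitudes]

def masked_perturbation(state, grad, epsilon=0.05, temperature=0.1):
    """Shift the state against the gradient of the victim's value estimate.

    Each dimension moves by at most epsilon, scaled by its soft-mask weight,
    so low-impact dimensions are left almost untouched.
    """
    mask = soft_mask(grad, temperature)
    return [s - epsilon * w * math.copysign(1.0, g)
            for s, w, g in zip(state, mask, grad)]

# Assumed gradient of the victim's value with respect to a 3-dimensional state.
state = [0.0, 0.0, 0.0]
grad = [0.01, -2.0, 0.3]
adversarial_state = masked_perturbation(state, grad)
print(adversarial_state)
```

A lower `temperature` concentrates the perturbation budget on fewer dimensions, while a higher one approaches a uniform sign-gradient perturbation across all dimensions.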