"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments
Zhang, Ziyi, Sun, Zhen, Zhang, Zongmin, Peng, Zifan, Zhao, Yuemeng, Wang, Zichun, Luo, Zeren, Zuo, Ruiting, He, Xinlei
The visually impaired population faces significant challenges in daily activities. While prior works employ vision-language models for assistance, most focus on static content and cannot address real-time perception needs in complex environments. Recent VideoLLMs enable real-time vision and speech interaction, offering promising potential for assistive tasks. In this work, we conduct the first study evaluating their effectiveness in supporting the daily life of visually impaired individuals. We first conduct a user survey with visually impaired participants to design VisAssistDaily, a benchmark for daily-life evaluation. Using VisAssistDaily, we evaluate popular VideoLLMs and find that GPT-4o achieves the highest task success rate. We further conduct a user study that reveals concerns about hazard perception. To address this, we propose SafeVid, an environment-awareness dataset, and fine-tune VITA-1.5, improving risk-recognition accuracy from 25.00% to 76.00%. We hope this work provides valuable insights and inspiration for future research in this field.
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Europe > Italy > Lombardy > Milan (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (8 more...)
- Research Report (1.00)
- Questionnaire & Opinion Survey (1.00)
- Information Technology > Security & Privacy (1.00)
- Transportation > Infrastructure & Services (0.93)
- Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (0.69)
- Transportation > Ground > Road (0.46)
Towards 6G Native-AI Edge Networks: A Semantic-Aware and Agentic Intelligence Paradigm
Feng, Chenyuan, Zhang, Anbang, Min, Geyong, Huang, Yongming, Quek, Tony Q. S., You, Xiaohu
The evolution toward sixth-generation wireless systems positions intelligence as a native network capability, fundamentally transforming the design of radio access networks (RANs). Within this vision, semantic-native communication (SemCom) and agentic intelligence are expected to play central roles. SemCom departs from bit-level fidelity and instead emphasizes task-oriented meaning exchange, enabling compact semantic exchange and introducing new performance measures such as semantic fidelity and task success rate. Agentic intelligence endows distributed RAN entities with goal-driven autonomy, reasoning, planning, and multi-agent collaboration, increasingly supported by foundation models and knowledge graphs. In this work, we first introduce the conceptual foundations of SemCom and agentic networking, and discuss why existing AI-driven O-RAN solutions remain largely bit-centric and task-siloed. We then present a unified taxonomy that organizes recent research along three axes: i) semantic abstraction level (symbol/feature/intent/knowledge), ii) agent autonomy and coordination granularity (single-, multi-, and hierarchical-agent), and iii) RAN control placement across PHY/MAC, near-real-time RIC, and non-real-time RIC. Based on this taxonomy, we systematically introduce enabling technologies including task-oriented semantic encoders/decoders, multi-agent reinforcement learning, foundation-model-assisted RAN agents, and knowledge-graph-based reasoning for cross-layer awareness. Representative 6G use cases, such as immersive XR, vehicular V2X, and industrial digital twins, are analyzed to illustrate the semantic-agentic convergence in practice. Finally, we identify open challenges in semantic representation standardization, scalable trustworthy agent coordination, O-RAN interoperability, and energy-efficient AI deployment, and outline research directions toward operational semantic-agentic AI-RAN.
- Asia > Singapore (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Europe > United Kingdom > England > Devon > Exeter (0.04)
- (2 more...)
- Research Report (0.50)
- Overview (0.46)
- Information Technology (0.67)
- Telecommunications (0.48)
- Information Technology > Communications > Networks (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.46)
AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI
Rana, Manik, Man, Calissa, Msiiwa, Anotida Expected, Paine, Jeffrey, Zhu, Kevin, Dev, Sunishchal, Sharma, Vasu, R, Ahan M
Goal changes are a defining feature of real-world multi-turn interactions, yet current agent benchmarks primarily evaluate static objectives or one-shot tool use. We introduce AgentChangeBench, a benchmark explicitly designed to measure how tool-augmented language model agents adapt to mid-dialogue goal shifts across three enterprise domains. Our framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency. AgentChangeBench comprises 2,835 task sequences and five user personas, each designed to trigger realistic shift points in ongoing workflows. Using this setup, we evaluate several frontier models and uncover sharp contrasts obscured by traditional $\text{pass}@k$ scores: for example, GPT-4o reaches $92.2\%$ recovery on airline booking shifts while Gemini collapses to $48.6\%$, and retail tasks show near-perfect parameter validity yet redundancy rates above $80\%$, revealing major inefficiencies. These findings demonstrate that high raw accuracy does not imply robustness under dynamic goals, and that explicit measurement of recovery time and redundancy is essential. AgentChangeBench establishes a reproducible testbed for diagnosing and improving agent resilience in realistic enterprise settings.
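The four metrics can be sketched as simple functions over a logged dialogue trace. The definitions below (exact-repeat redundancy, turns-to-first-success recovery) are plausible readings of the metric names, not the benchmark's published formulas; the `Turn` record and the toy trace are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    goal_id: int                                    # which user goal this turn serves
    tool_calls: list = field(default_factory=list)  # (tool_name, args_dict) pairs
    success: bool = False                           # did the turn advance its goal?

def task_success_rate(turns):
    """TSR: fraction of turns that succeeded."""
    return sum(t.success for t in turns) / len(turns)

def tool_call_redundancy_rate(turns):
    """TCRR: fraction of tool calls that exactly repeat an earlier call."""
    seen, redundant, total = set(), 0, 0
    for t in turns:
        for name, args in t.tool_calls:
            total += 1
            key = (name, tuple(sorted(args.items())))
            if key in seen:
                redundant += 1
            seen.add(key)
    return redundant / total if total else 0.0

def goal_shift_recovery_times(turns):
    """GSRT: for each goal shift, turns elapsed until the first subsequent success."""
    times = []
    for i in range(1, len(turns)):
        if turns[i].goal_id != turns[i - 1].goal_id:  # a mid-dialogue goal shift
            t = next((j - i for j in range(i, len(turns)) if turns[j].success), None)
            times.append(t)
    return times

# Toy 4-turn trace: the user shifts goals at turn 2; the agent recovers one
# turn later, re-issuing one identical (redundant) tool call along the way.
trace = [
    Turn(0, [("search_flights", {"to": "SFO"})], success=True),
    Turn(0, [("book_flight", {"id": 7})], success=True),
    Turn(1, [("search_flights", {"to": "SFO"})], success=False),  # redundant repeat
    Turn(1, [("search_hotels", {"city": "SF"})], success=True),
]
print(task_success_rate(trace), tool_call_redundancy_rate(trace),
      goal_shift_recovery_times(trace))
```

On this trace TSR is 0.75, one of four tool calls is a repeat, and the single goal shift is recovered in one turn.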
- Consumer Products & Services > Travel (1.00)
- Transportation > Passenger (0.93)
- Transportation > Air (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.91)
47a3893cc405396a5c30d91320572d6d-AuthorFeedback.pdf
We find that although Masking is very expensive, it does not perform well. We will include a more detailed discussion. S(X) is the network output, i.e., S. Upon your suggestion, we generated a new dataset with the advised setting. We will include this dataset, along with more datasets generated using HMMs and state models, in the benchmark. We will consider replacing GradientSHAP with SHAP in our final draft.
Temporal Score Rescaling for Temperature Sampling in Diffusion and Flow Models
Xu, Yanbo, Wu, Yu, Park, Sungjae, Zhou, Zhizhuo, Tulsiani, Shubham
Stanford University

Figure 1: Temporal Score Rescaling (TSR) provides a mechanism to steer the sampling diversity of diffusion and flow models at inference. Top-left: probability density evolution when sampling a 1D Gaussian mixture with DDPM, and the effects of TSR, which can control the sampling process to yield sharper or flatter distributions. Top-right, bottom: TSR can be applied to any pre-trained diffusion or flow model, improving performance across diverse domains such as pose prediction, depth estimation, and image generation.

We present a mechanism to steer the sampling diversity of denoising diffusion and flow matching models, allowing users to sample from a sharper or broader distribution than the training distribution. We build on the observation that these models leverage (learned) score functions of noisy data distributions for sampling and show that rescaling these allows one to effectively control a 'local' sampling temperature. Notably, this approach does not require any finetuning or alterations to the training strategy; it can be applied to any off-the-shelf model and is compatible with both deterministic and stochastic samplers. We first validate our framework on toy 2D data, and then demonstrate its application for diffusion models trained across five disparate tasks: image generation, pose estimation, depth prediction, robot manipulation, and protein design. We find that across these tasks, our approach allows sampling from sharper (or flatter) distributions, yielding performance gains; e.g., depth prediction models benefit from sampling more likely depth estimates, whereas image generation models perform better when sampling a slightly flatter distribution. Score-based generative models, such as denoising diffusion (Ho et al., 2020) and flow matching (Lipman et al., 2023; Liu et al., 2023b), have become ubiquitous across AI applications.
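The core idea, rescaling a learned score to control a local sampling temperature, can be illustrated on a toy 1D Gaussian. A minimal sketch using unadjusted Langevin dynamics as a stand-in for the paper's DDPM/flow samplers: multiplying the score of N(0, 1) by k > 1 targets the sharper distribution N(0, 1/k).

```python
import numpy as np

def langevin_sample(score, x0, n_steps=2000, step=1e-2, rng=None):
    """Unadjusted Langevin dynamics: x <- x + step*score(x) + sqrt(2*step)*noise."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * score(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x

def rescaled_score(score, k):
    """TSR-style rescaling: scaling the score by k>1 sharpens the sampled distribution."""
    return lambda x: k * score(x)

# Score of a standard 1D Gaussian N(0, 1): d/dx log p(x) = -x
base_score = lambda x: -x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(5000)
plain = langevin_sample(base_score, x0, rng=rng)
sharp = langevin_sample(rescaled_score(base_score, 4.0), x0, rng=rng)

# Scaling by k=4 targets N(0, 1/4): empirical variance shrinks from ~1 to ~0.25
print(round(plain.var(), 2), round(sharp.var(), 2))
```

The same one-line change applies to any sampler that consumes a score or velocity field, which is why no finetuning is needed; k < 1 flattens the distribution instead.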
PALADIN: Self-Correcting Language Model Agents to Cure Tool-Failure Cases
Vuddanti, Sri Vatsa, Shah, Aarav, Chittiprolu, Satwik Kumar, Song, Tony, Dev, Sunishchal, Zhu, Kevin, Chaudhary, Maheep
Tool-augmented language agents frequently fail in real-world deployment due to tool malfunctions (timeouts, API exceptions, or inconsistent outputs), triggering cascading reasoning errors and task abandonment. Existing agent training pipelines optimize only for success trajectories, failing to expose models to the tool failures that dominate real-world usage. We propose PALADIN, a generalizable framework for equipping language agents with robust failure-recovery capabilities. PALADIN trains on 50,000+ recovery-annotated trajectories constructed via systematic failure injection and expert demonstrations on an enhanced ToolBench dataset. Training uses LoRA-based fine-tuning to retain base capabilities while injecting recovery competence. At inference, PALADIN detects execution-time errors and retrieves the most similar case from a curated bank of 55+ failure exemplars aligned with ToolScan's taxonomy, then executes the corresponding recovery action. This approach generalizes to novel failures beyond the training distribution, retaining 95.2% recovery performance on unseen tool APIs. Evaluation across PaladinEval and ToolReflectEval demonstrates consistent improvements in Recovery Rate (RR), Task Success Rate (TSR), Catastrophic Success Rate (CSR), and Efficiency Score (ES). PALADIN improves RR from 32.76% to 89.68% (+57% relative) over ToolBench and outperforms the strongest baseline, CRITIC (76.34%), by +13.3%. Against vanilla agents, PALADIN achieves 89.86% RR (+66% relative improvement from 23.75%). These results establish PALADIN as an effective method for building fault-tolerant agents capable of robust recovery in real-world tool environments.
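The inference-time retrieval step, matching an execution-time error against a bank of failure exemplars, can be sketched with a simple token-overlap similarity. The exemplar bank, signatures, and Jaccard matcher below are illustrative stand-ins, not PALADIN's actual taxonomy or retriever.

```python
# Hypothetical exemplar bank; the real system uses 55+ exemplars aligned
# with ToolScan's taxonomy and a learned similarity, neither shown here.
FAILURE_BANK = [
    {"signature": "timeout while calling tool",
     "recovery": "retry with exponential backoff"},
    {"signature": "api returned http 401 unauthorized",
     "recovery": "refresh credentials and retry"},
    {"signature": "tool output schema mismatch",
     "recovery": "re-validate arguments and re-invoke"},
]

def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def retrieve_recovery(error_message):
    """Pick the exemplar whose failure signature best matches the observed error."""
    best = max(FAILURE_BANK, key=lambda ex: jaccard(ex["signature"], error_message))
    return best["recovery"]

print(retrieve_recovery("HTTP 401 Unauthorized from payments API"))
```

A nearest-exemplar lookup like this is what lets the recovery behavior generalize to error messages never seen during training.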
DSADF: Thinking Fast and Slow for Decision Making
Dou, Zhihao, Cui, Dongfei, Yan, Jun, Wang, Weida, Chen, Benteng, Wang, Haoming, Xie, Zeke, Zhang, Shufei
Although Reinforcement Learning (RL) agents are effective in well-defined environments, they often struggle to generalize their learned policies to dynamic settings due to their reliance on trial-and-error interactions. Recent work has explored applying Large Language Models (LLMs) or Vision Language Models (VLMs) to boost the generalization of RL agents through policy optimization guidance or prior knowledge. However, these approaches often lack seamless coordination between the RL agent and the foundation model, leading to unreasonable decision-making in unfamiliar environments and efficiency bottlenecks. How to make full use of the inferential capabilities of foundation models and the rapid response capabilities of RL agents, and to enhance the interaction between the two to form a dual system, remains an open scientific question. To address this problem, we draw inspiration from Kahneman's theory of fast thinking (System 1) and slow thinking (System 2), demonstrating that balancing intuition and deep reasoning can achieve nimble decision-making in a complex world. In this study, we propose a Dual-System Adaptive Decision Framework (DSADF), integrating two complementary modules: System 1, comprising an RL agent and a memory space for fast and intuitive decision making, and System 2, driven by a VLM for deep and analytical reasoning. DSADF facilitates efficient and adaptive decision-making by combining the strengths of both systems. The empirical study in the video game environments Crafter and Housekeep demonstrates the effectiveness of our proposed method, showing significant improvements in decision abilities for both unseen and known tasks.
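A dual-system dispatch of this kind can be sketched as a small routing function. The confidence threshold, memory cache, and stub policies below are assumptions for illustration, not DSADF's published algorithm.

```python
def dsadf_decide(obs, memory, rl_policy, vlm_plan, confidence_threshold=0.8):
    """Dual-system dispatch sketch (assumed logic, not the paper's exact rule):
    answer from memory if the situation is familiar, use the fast RL policy
    (System 1) when it is confident, otherwise defer to the slow VLM (System 2)
    and cache its decision for future fast access."""
    if obs in memory:                    # familiar: reuse a cached System-2 decision
        return memory[obs]
    action, confidence = rl_policy(obs)  # System 1: fast, intuitive
    if confidence >= confidence_threshold:
        return action
    action = vlm_plan(obs)               # System 2: slow, analytical reasoning
    memory[obs] = action
    return action

# Stub policies standing in for the RL agent and the VLM.
rl_policy = lambda obs: ("collect_wood", 0.95) if obs == "forest" else ("noop", 0.1)
vlm_plan = lambda obs: "plan_for:" + obs
memory = {}

print(dsadf_decide("forest", memory, rl_policy, vlm_plan))        # System 1 handles it
print(dsadf_decide("locked chest", memory, rl_policy, vlm_plan))  # falls back to System 2
print(dsadf_decide("locked chest", memory, rl_policy, vlm_plan))  # now served from memory
```

The memory cache is what turns a one-off System-2 deliberation into a reusable System-1 reflex, which is the coordination gap the framework targets.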
- Workflow (1.00)
- Research Report > New Finding (1.00)
- Materials > Metals & Mining (1.00)
- Leisure & Entertainment > Games > Computer Games (0.86)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
E2E Parking Dataset: An Open Benchmark for End-to-End Autonomous Parking
Gao, Kejia, Zhou, Liguo, Liu, Mingjun, Knoll, Alois
While traditional autonomous driving methods with multi-stage pipelines suffer from lengthy processes, error accumulation, and maintenance difficulties, the end-to-end method is designed to map the data of multiple sensors directly into motion control commands, with high flexibility, efficiency, and generalization. Therefore, the end-to-end model has shown great potential in autonomous driving. Due to the low-speed, low-risk, and low-complexity characteristics of autonomous parking scenarios, end-to-end methods can be applied to autonomous parking systems earlier. While prior work introduced a vision-based parking model and a pipeline for data generation, training, and closed-loop testing, the dataset itself was not released. To bridge this gap, we create large end-to-end autonomous parking datasets in CARLA based on the prior work 'E2E Parking'. Keyboard control is replaced by a handle controller to improve usability, efficiency, and operational precision. During the iterative process of dataset generation, we evaluate the effect of different factors on the parking performance of the controlled vehicle, including diverse scenes generated by multiple random seeds, the position of the roadside object's shadow dependent on the weather setting, dataset size, initial learning rate, and training epochs. We recommend generating at least 2 scenes for each parking slot with different random seeds, where 8 trajectories with different initial positions are collected for each scene. Weather settings should be modified so the dataset includes scenes with shadows projected on the target slot. Experiments demonstrate that an initial learning rate of 7.5 × 10 performs well. After several iterations, we are able to open-source a high-quality dataset for end-to-end autonomous parking.
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Transportation (0.88)
- Information Technology (0.55)
- Automobiles & Trucks (0.55)
Language-Conditioned Open-Vocabulary Mobile Manipulation with Pretrained Models
Tan, Shen, Zhou, Dong, Shao, Xiangyu, Wang, Junqiao, Sun, Guanghui
Open-vocabulary mobile manipulation (OVMM) that involves the handling of novel and unseen objects across different workspaces remains a significant challenge for real-world robotic applications. In this paper, we propose a novel Language-conditioned Open-Vocabulary Mobile Manipulation framework, named LOVMM, incorporating a large language model (LLM) and a vision-language model (VLM) to tackle various mobile manipulation tasks in household environments specified by free-form natural language instructions (e.g., "toss the food boxes on the office room desk to the trash bin in the corner", and "pack the bottles from the bed to the box in the guestroom"). Extensive experiments simulated in complex household environments show strong zero-shot generalization and multi-task learning abilities of LOVMM. Moreover, our approach can also generalize to multiple tabletop manipulation tasks and achieve better success rates compared to other state-of-the-art methods.

1 Introduction

As one of the key capabilities for robotic home assistance, open-vocabulary mobile manipulation (OVMM), which leverages vision cameras to navigate in the environment and execute human-like actions to manipulate unseen objects, has attracted wide attention. It is crucial for addressing real-world challenges such as object sorting and rearrangement [Zeng et al., 2022], [Gan et al., 2022], household cleanup [Yan et al., 2021], [Wu et al., 2023], and human assistance [Yenamandra et al., 2023], [Stone et al., 2023]. Traditionally, robotic manipulation relies on vision-based methods that use explicit, object-centric representations, including poses, categories, and instance segmentations for perception [Pan et al., 2023], [Geng et al., 2023a], [Xie et al., 2020]. Recently, end-to-end models that learn from expert demonstrations have emerged as promising alternatives [Zeng et al., 2021], [Seita et al., 2021], [Geng et al., 2023b]. By leveraging visual observations without any explicit object information, these models are able to extract more generalizable representations across different tasks and zero-shot adapt to unseen scenarios. Yet, such methods are limited by the insufficient information provided by single-modal data, or they may require goal images as instructions to adapt to new situations.
- Research Report (1.00)
- Instructional Material > Course Syllabus & Notes (0.68)