error tolerance
While recognizing actions, LMMs struggle to detect core interaction events
Harari, Daniel, Sidorov, Michael, David, Liel, Shterental, Chen, Gebreselasie, Abrham Kahsay, Khan, Muhammad Haris
Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first-of-its-kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release'). We asked two LMMs (Qwen-2.5VL and GPT-4o) to locate these events in short videos, each with a single event. The results show that although the models can reliably name the target objects, identify the action and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends and cannot localize the event within the scene. Our findings suggest that in struggling to pinpoint the moment and location of physical contact that defines the interaction, the models lack the perceptual grounding required for deeper understanding of dynamic scenes.
vAttention: Verified Sparse Attention
Desai, Aditya, Agrawal, Kumar Krishna, Yang, Shuo, Cuadron, Alejandro, Schroeder, Luis Gaspar, Zaharia, Matei, Gonzalez, Joseph E., Stoica, Ion
State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform. Building on this insight and leveraging the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(ε, δ)$ guarantees on approximation accuracy (thus, verified). These guarantees make vAttention a compelling step toward practical, reliable deployment of sparse attention at scale. By unifying top-$k$ and sampling, vAttention outperforms both individually, delivering a superior quality-efficiency trade-off. Our experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama-3.1-8B-Inst and Deepseek-R1-Distill-Llama-8B on RULER-HARD), and effectively bridges the gap between full and sparse attention (e.g., across datasets, it matches full model quality with up to 20x sparsity). We also demonstrate that it can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with up to 32K token generations). Code is open-sourced at https://github.com/xAlg-ai/sparse-attention-hub.
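The complementarity between top-$k$ and random sampling described in this abstract can be illustrated with a minimal sketch. This is not the vAttention algorithm: the function name `hybrid_sparse_attention` and the parameters `k` and `m` are hypothetical, the sample size is fixed by hand rather than derived from an $(ε, δ)$ target, and the scores are computed densely for clarity, which a real sparse method would avoid. The sketch sums the `k` highest-scoring tokens exactly and estimates the contribution of the remaining tokens from a uniform sample, rescaled by the inverse sampling fraction.

```python
import numpy as np

def hybrid_sparse_attention(q, K, V, k, m, rng):
    """Approximate softmax(K q / sqrt(d)) @ V for a single query vector q.

    Illustrative assumption, not the paper's method: the k largest scores are
    summed exactly, and the remaining tokens are estimated from m uniformly
    sampled indices, rescaled by (number of remaining tokens / sample size).
    """
    n, d = K.shape
    if k + m <= 0:
        raise ValueError("need k + m >= 1")
    # Dense scores are used here only to keep the sketch short.
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())  # shift for numerical stability

    k = min(k, n)
    top = np.argsort(scores)[n - k:]             # indices of the k largest scores
    rest = np.setdiff1d(np.arange(n), top)       # all other token indices

    # Exact contribution of the heavy-hitter tokens.
    numerator = weights[top] @ V[top]
    denominator = weights[top].sum()

    # Sampled estimate of the residual numerator and denominator.
    if rest.size > 0 and m > 0:
        m_eff = min(m, rest.size)
        sample = rng.choice(rest, size=m_eff, replace=False)
        scale = rest.size / m_eff
        numerator = numerator + scale * (weights[sample] @ V[sample])
        denominator = denominator + scale * weights[sample].sum()

    return numerator / denominator
```

When `k + m` covers every token, the estimate coincides with full attention; with smaller budgets, the top-`k` term captures peaked score distributions and the sampled term covers flat ones.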
Empart: Interactive Convex Decomposition for Converting Meshes to Parts
Vu, Brandon, Ganguly, Shameek, Joshi, Pushkar
Simplifying complex 3D meshes is a crucial step in robotics applications to enable efficient motion planning and physics simulation. Common methods, such as approximate convex decomposition, represent a mesh as a collection of simple parts, which are computationally inexpensive to simulate. However, existing approaches apply a uniform error tolerance across the entire mesh, which can result in a sub-optimal trade-off between accuracy and performance. For instance, a robot grasping an object needs high-fidelity geometry in the vicinity of the contact surfaces but can tolerate a coarser simplification elsewhere. A uniform tolerance can lead to excessive detail in non-critical areas or insufficient detail where it's needed most. To address this limitation, we introduce Empart, an interactive tool that allows users to specify different simplification tolerances for selected regions of a mesh. Our method leverages existing convex decomposition algorithms as a sub-routine but uses a novel, parallelized framework to handle region-specific constraints efficiently. Empart provides a user-friendly interface with visual feedback on approximation error and simulation performance, enabling designers to iteratively refine their decomposition. We demonstrate that our approach significantly reduces the number of convex parts compared to a state-of-the-art method (V-HACD) at a fixed error threshold, leading to substantial speedups in simulation performance. For a robotic pick-and-place task, Empart-generated collision meshes reduced the overall simulation time by 69% compared to a uniform decomposition, highlighting the value of interactive, region-specific simplification for performant robotics applications.
Video QoE Metrics from Encrypted Traffic: Application-agnostic Methodology
Berger, Tamir, Sterenson, Jonathan, Birman, Raz, Hadar, Ofer
Instant Messaging-Based Video Call Applications (IMVCAs) and Video Conferencing Applications (VCAs) have become integral to modern communication. Ensuring a high Quality of Experience (QoE) for users in this context is critical for network operators, as network conditions significantly impact user QoE. However, network operators lack access to end-device QoE metrics due to encrypted traffic. Existing solutions estimate QoE metrics from encrypted traffic traversing the network, with the most advanced approaches leveraging machine learning models. Subsequently, the need for ground truth QoE metrics for training and validation poses a challenge, as not all video applications provide these metrics. To address this challenge, we propose an application-agnostic approach for objective QoE estimation from encrypted traffic. Independent of the video application, we obtained key video QoE metrics, enabling broad applicability to various proprietary IMVCAs and VCAs. To validate our solution, we created a diverse dataset from WhatsApp video sessions under various network conditions, comprising 25,680 seconds of traffic data and QoE metrics. Our evaluation shows high performance across the entire dataset, with 85.2% accuracy for FPS predictions within an error margin of two FPS, and 90.2% accuracy for PIQE-based quality rating classification.
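The "accuracy within an error margin" figure quoted in this abstract can be read as the fraction of predictions whose absolute error does not exceed the margin. The sketch below shows that reading; the function name `accuracy_within_margin` is hypothetical, and treating the margin as inclusive is an assumption, since the abstract does not say how a prediction exactly two FPS off is counted.

```python
def accuracy_within_margin(predicted, actual, margin=2.0):
    """Fraction of predictions with |predicted - actual| <= margin.

    Assumption: the margin is inclusive; the abstract does not specify this.
    """
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(abs(p - a) <= margin for p, a in zip(predicted, actual))
    return hits / len(predicted)


# Made-up FPS values for illustration only; absolute errors are 1, 3, 0, 2.
example = accuracy_within_margin([30, 24, 15, 10], [29, 27, 15, 12])
```

In this toy example three of the four predictions fall within two FPS, so `example` is 0.75.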
Towards Statistical Factuality Guarantee for Large Vision-Language Models
Li, Zhuohang, Yan, Chao, Jackson, Nicholas J., Cui, Wendi, Li, Bo, Zhang, Jiaxin, Malin, Bradley A.
Advancements in Large Vision-Language Models (LVLMs) have demonstrated promising performance in a variety of vision-language tasks involving image-conditioned free-form text generation. However, growing concerns about hallucinations in LVLMs, where the generated text is inconsistent with the visual context, are becoming a major impediment to deploying these models in applications that demand guaranteed reliability. In this paper, we introduce a framework to address this challenge, ConfLVLM, which is grounded on conformal prediction to achieve finite-sample distribution-free statistical guarantees on the factuality of LVLM output. This framework treats an LVLM as a hypothesis generator, where each generated text detail (or claim) is considered an individual hypothesis. It then applies a statistical hypothesis testing procedure to verify each claim using efficient heuristic uncertainty measures to filter out unreliable claims before returning any responses to users. We conduct extensive experiments covering three representative application domains, including general scene understanding, medical radiology report generation, and document understanding. Remarkably, ConfLVLM reduces the error rate of claims generated by LLaVa-1.5 for scene descriptions from 87.8\% to 10.0\% by filtering out erroneous claims with a 95.3\% true positive rate. Our results further demonstrate that ConfLVLM is highly flexible, and can be applied to any black-box LVLMs paired with any uncertainty measure for any image-conditioned free-form text generation task while providing a rigorous guarantee on controlling the risk of hallucination.
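To make the idea of conformal claim filtering concrete, the following is a minimal sketch of a generic split-conformal procedure. It is an assumption made for illustration, not ConfLVLM's actual testing procedure: each calibration response contributes the highest confidence score assigned to any of its incorrect claims, and the threshold is a finite-sample-corrected quantile of those values. The names `calibrate_threshold` and `filter_claims`, and the use of a single scalar confidence per claim, are hypothetical.

```python
import math

def calibrate_threshold(calibration, alpha):
    """Pick a confidence threshold from labeled calibration responses.

    calibration: list of responses, each a list of (confidence, is_correct).
    Under exchangeability, keeping only claims with confidence above the
    returned threshold leaves a new response free of incorrect claims with
    probability at least 1 - alpha (generic split-conformal argument).
    """
    # Nonconformity of a response: highest confidence given to a wrong claim.
    scores = sorted(
        max((conf for conf, ok in response if not ok), default=float("-inf"))
        for response in calibration
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample correction
    if rank > n:
        return float("inf")  # too few calibration responses: reject all claims
    return scores[rank - 1]

def filter_claims(claims, threshold):
    """Return the texts of claims whose confidence exceeds the threshold."""
    return [text for text, conf in claims if conf > threshold]
```

A smaller `alpha` yields a higher threshold and therefore fewer retained claims, which is the trade-off between the factuality guarantee and the informativeness of the response.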