camera angle
Appendix
A.1 Details of Dimension Design
We argue that multi-dimensional evaluation is essential to visual caption evaluation and is more comprehensive than previous work. How, then, should the dimensions be chosen? We refer to existing VQA benchmarks [62, 63, 64, 65] and visual generation benchmarks [31, 32, 33]. VQA benchmarks usually design various types of questions to support multi-dimensional evaluation and analysis of MLLMs. For instance, MMBench [64] defines 20 ability dimensions, including attribute recognition, attribute comparison, action recognition, spatial relationship, physical property, OCR, object localization, image style, image scene, identity reasoning, etc. MVBench [64] covers 20 challenging video tasks including action, object, position, count, scene, pose, attribute, character, cognition, etc. Because questions can be designed flexibly, VQA benchmarks can naturally be built with comprehensive dimensions.
Unlike the VQA task, the visual caption task does not require specific questions; instead, it inspects the alignment of visual and textual information. Visual generation is the inverse task of visual captioning, as it requires models to generate specific visual content based on detailed textual descriptions. GenEval [31] designs 6 different tasks to evaluate text-to-image alignment, including single object, two object, counting, colors, position, and attribute binding. VBench [32] comprises 16 dimensions, including subject consistency, background consistency, object class, human action, color, spatial relationship, scene, style, etc. We follow the dimensions explored in these benchmarks to design proper dimensions for visual captioning.
Finally, we design 6 views, covering object, global, text, camera, temporal, and knowledge. The object-related view includes object category, object color, object number, and spatial relation; the global-related view includes scene and style; the text-related view evaluates the OCR capability of captions; the camera-related view covers camera angle and camera movement; the temporal-related view contains action and event; and the knowledge-related view evaluates the knowledge of MLLMs, i.e., character identification. We believe these dimensions contribute to comprehensive visual caption benchmarking.
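The six views and their dimensions listed above can be written down as a small lookup table. The following is only an illustrative sketch: the names `VIEWS`, `view_of`, and `count_dimensions` are hypothetical and do not come from the benchmark's code.

```python
# Views and dimensions exactly as enumerated in the appendix text.
# The data structure and function names are illustrative assumptions.
VIEWS = {
    "object": ["object category", "object color", "object number", "spatial relation"],
    "global": ["scene", "style"],
    "text": ["OCR"],
    "camera": ["camera angle", "camera movement"],
    "temporal": ["action", "event"],
    "knowledge": ["character identification"],
}


def view_of(dimension):
    """Return the view that a given dimension belongs to."""
    for view, dimensions in VIEWS.items():
        if dimension in dimensions:
            return view
    raise KeyError(f"unknown dimension: {dimension}")


def count_dimensions():
    """Count the dimensions across all views."""
    return sum(len(dimensions) for dimensions in VIEWS.values())
```

With this table, `view_of("camera angle")` returns `"camera"`, and the enumeration in the text adds up to 12 dimensions across the 6 views.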
5f2809607f692d79a01c05c43d702883-Paper-Datasets_and_Benchmarks_Track.pdf
Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess [...]. [...] benchmarks attempt to address [...] centric evaluation, they remain limited [...] and incomplete visual element coverage.
A Deep Learning-Based CCTV System for Automatic Smoking Detection in Fire Exit Zones
Sadat, Sami, Hossain, Mohammad Irtiza, Sifat, Junaid Ahmed, Rafi, Suhail Haque, Alvi, Md. Waseq Alauddin, Rhaman, Md. Khalilur
This research proposes a deep learning-based real-time smoking detection system for CCTV surveillance of fire exit areas, motivated by their critical safety requirements. The dataset contains 8,124 images from 20 different scenarios, along with images from 2,708 raw samples showing low-light areas. We evaluated three advanced object detection models, YOLOv8, YOLOv11, and YOLOv12, and then developed a custom model derived from YOLOv8 with added structures for demanding surveillance contexts. The proposed model outperformed the other evaluated models, reaching a recall of 78.90% and mAP@50 of 83.70%, and delivered the best object identification and detection results across different environments. Inference performance was evaluated on multiple edge devices using multithreaded operations. The Jetson Xavier NX processed data at the fastest real-time rate of 52-97 ms, which established its suitability for time-sensitive operations. The study establishes that the proposed system delivers a fair and adjustable platform for monitoring public safety processes while enabling automatic regulatory compliance checks.
Visio-Verbal Teleimpedance Interface: Enabling Semi-Autonomous Control of Physical Interaction via Eye Tracking and Speech
Jekel, Henk H. A., Rosales, Alejandro Díaz, Peternel, Luka
The paper presents a visio-verbal teleimpedance interface for commanding 3D stiffness ellipsoids to the remote robot with a combination of the operator's gaze and verbal interaction. The gaze is detected by an eye-tracker, allowing the system to understand the context in terms of what the operator is currently looking at in the scene. Along with verbal interaction, a Visual Language Model (VLM) processes this information, enabling the operator to communicate their intended action or provide corrections. Based on these inputs, the interface can then generate appropriate stiffness matrices for different physical interaction actions. To validate the proposed visio-verbal teleimpedance interface, we conducted a series of experiments on a setup including a Force Dimension Sigma.7 haptic device to control the motion of the remote Kuka LBR iiwa robotic arm. The human operator's gaze is tracked by Tobii Pro Glasses 2, while human verbal commands are processed by a VLM using GPT-4o. The first experiment explored the optimal prompt configuration for the interface. The second and third experiments demonstrated different functionalities of the interface on a slide-in-the-groove task.
Terrain-Aware Adaptation for Two-Dimensional UAV Path Planners
Karakontis, Kostas, Petsanis, Thanos, Kapoutsis, Athanasios Ch., Kapoutsis, Pavlos Ch., Kosmatopoulos, Elias B.
Multi-UAV Coverage Path Planning (mCPP) algorithms in popular commercial software typically treat a Region of Interest (RoI) only as a 2D plane, ignoring important 3D structure characteristics. This leads to incomplete 3D reconstructions, especially around occluded or vertical surfaces. In this paper, we propose a modular algorithm that can extend commercial two-dimensional path planners to facilitate terrain-aware planning by adjusting altitude and camera orientations. To demonstrate it, we extend the well-known DARP (Divide Areas for Optimal Multi-Robot Coverage Path Planning) algorithm and produce DARP-3D. Compared to baseline, our approach consistently captures improved 3D reconstructions, particularly in areas with significant vertical features. An open-source implementation of the algorithm is available here: https://github.com/konskara/T
DJI Air 3S review: LiDAR and improved image quality make for a nearly faultless drone
DJI just announced the dual-camera Air 3S drone and there's some all-new cutting-edge tech hiding in the nose. A LiDAR sensor is there to provide extra crash protection at night, a time that's often dangerous for drones. The Air 3S also has a new main camera with a larger sensor better suited for capturing video in low light. And it now comes with the company's ActiveTrack 360, which it first introduced in the Mini 4 Pro, allowing the device to zoom all around your subject while tracking and filming them. There are a bunch of other little improvements, from storage to the new panoramic photo mode, all at the same $1,099 price as the Air 3 was at launch.
Optimizing Parking Space Classification: Distilling Ensembles into Lightweight Classifiers
Alves, Paulo Luza, Hochuli, André, de Oliveira, Luiz Eduardo, de Almeida, Paulo Lisboa
When deploying large-scale machine learning models for smart city applications, such as image-based parking lot monitoring, data often must be sent to a central server to perform classification tasks. This is challenging for the city's infrastructure, where image-based applications require transmitting large volumes of data, necessitating complex network and hardware infrastructures to process the data. To address this issue in image-based parking space classification, we propose creating a robust ensemble of classifiers to serve as Teacher models. These Teacher models are distilled into lightweight and specialized Student models that can be deployed directly on edge devices. The knowledge is distilled to the Student models through pseudo-labeled samples generated by the Teacher model, which are utilized to fine-tune the Student models on the target scenario. Our results show that the Student models, with 26 times fewer parameters than the Teacher models, achieved an average accuracy of 96.6% on the target test datasets, surpassing the Teacher models, which attained an average accuracy of 95.3%.
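The pseudo-labeling step described above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how the ensemble combines its members, so majority voting is assumed here, and the teachers and the threshold-based student are hypothetical stand-ins for real image classifiers.

```python
from collections import Counter


def ensemble_pseudo_labels(teachers, samples):
    """Label each sample with the majority vote of the teacher models.

    Majority voting is an assumption; the abstract only says that the
    Teacher ensemble generates pseudo-labels.
    """
    labels = []
    for x in samples:
        votes = Counter(teacher(x) for teacher in teachers)
        labels.append(votes.most_common(1)[0][0])
    return labels


def fit_threshold_student(samples, pseudo_labels):
    """Toy student: pick the threshold that best reproduces the pseudo-labels."""
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(samples)):
        correct = sum(
            (x >= threshold) == bool(y) for x, y in zip(samples, pseudo_labels)
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


# Hypothetical teachers: each maps a scalar "occupancy score" to 0 or 1.
teachers = [
    lambda x: int(x > 0.4),
    lambda x: int(x > 0.5),
    lambda x: int(x > 0.6),
]
target_samples = [0.1, 0.45, 0.55, 0.9]  # unlabeled data from the target scenario

pseudo_labels = ensemble_pseudo_labels(teachers, target_samples)
student_threshold = fit_threshold_student(target_samples, pseudo_labels)
```

The student never sees ground-truth labels for the target scenario; it is fitted only to what the teachers predict, which is the property that allows it to be adapted on unlabeled data.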
Online Distribution Shift Detection via Recency Prediction
Luo, Rachel, Sinha, Rohan, Sun, Yixiao, Hindy, Ali, Zhao, Shengjia, Savarese, Silvio, Schmerling, Edward, Pavone, Marco
When deploying modern machine learning-enabled robotic systems in high-stakes applications, detecting distribution shift is critical. However, most existing methods for detecting distribution shift are not well-suited to robotics settings, where data often arrives in a streaming fashion and may be very high-dimensional. In this work, we present an online method for detecting distribution shift with guarantees on the false positive rate - i.e., when there is no distribution shift, our system is very unlikely (with probability $< \epsilon$) to falsely issue an alert; any alerts that are issued should therefore be heeded. Our method is specifically designed for efficient detection even with high dimensional data, and it empirically achieves up to 11x faster detection on realistic robotics settings compared to prior work while maintaining a low false negative rate in practice (whenever there is a distribution shift in our experiments, our method indeed emits an alert). We demonstrate our approach in both simulation and hardware for a visual servoing task, and show that our method indeed issues an alert before a failure occurs.
A View Independent Classification Framework for Yoga Postures
Chasmai, Mustafa, Das, Nirjhar, Bhardwaj, Aman, Garg, Rahul
Yoga is a globally acclaimed and widely recommended practice for healthy living. Maintaining correct posture while performing a Yogasana is of utmost importance. In this work, we employ transfer learning from Human Pose Estimation models to extract 136 key-points spread all over the body and train a Random Forest classifier, which is used for estimation of the Yogasanas. The results are evaluated on an extensive in-house yoga video database of 51 subjects recorded from 4 different camera angles. We propose a 3-step scheme for evaluating the generalizability of a Yoga classifier by testing it on 1) unseen frames, 2) unseen subjects, and 3) unseen camera angles. We argue that for most applications, validation accuracies on unseen subjects and unseen camera angles would be most important. We empirically analyze, over three public datasets, the advantage of transfer learning and the possibilities of target leakage. We further demonstrate that the classification accuracies critically depend on the cross-validation method employed and can often be misleading. To promote further research, we have made the key-points dataset and code publicly available.
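The three evaluation settings can be sketched as different ways of splitting the same set of frames. The record layout (`subject`, `camera`, `frame` fields) and the function names below are assumptions made for illustration and are not taken from the released code.

```python
import random


def frame_split(records, test_fraction=0.2, seed=0):
    """Unseen frames: random frame-level split.

    The same subject and camera angle can appear on both sides,
    which is where target leakage may inflate accuracy.
    """
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def holdout_split(records, key, held_out):
    """Unseen subjects or camera angles: hold out whole groups by `key`."""
    train = [r for r in records if r[key] not in held_out]
    test = [r for r in records if r[key] in held_out]
    return train, test


# Hypothetical toy data: 4 subjects x 4 camera angles x 5 frames.
records = [
    {"subject": s, "camera": c, "frame": f}
    for s in range(4)
    for c in range(4)
    for f in range(5)
]

train_frames, test_frames = frame_split(records)
train_subj, test_subj = holdout_split(records, "subject", {3})
train_cam, test_cam = holdout_split(records, "camera", {0})
```

Holding out whole subjects or whole camera angles guarantees that no frame from the held-out group reaches the training set, so the second and third settings measure generalization rather than memorization of near-duplicate frames.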
DeepDarts: Modeling Keypoints as Objects for Automatic Scorekeeping in Darts using a Single Camera
McNally, William, Walters, Pascale, Vats, Kanav, Wong, Alexander, McPhee, John
Existing multi-camera solutions for automatic scorekeeping in steel-tip darts are very expensive and thus inaccessible to most players. Motivated to develop a more accessible low-cost solution, we present a new approach to keypoint detection and apply it to predict dart scores from a single image taken from any camera angle. This problem involves detecting multiple keypoints that may be of the same class and positioned in close proximity to one another. The widely adopted framework for regressing keypoints using heatmaps is not well-suited for this task. To address this issue, we instead propose to model keypoints as objects. We develop a deep convolutional neural network around this idea and use it to predict dart locations and dartboard calibration points within an overall pipeline for automatic dart scoring, which we call DeepDarts. Additionally, we propose several task-specific data augmentation strategies to improve the generalization of our method. As a proof of concept, two datasets comprising 16k images originating from two different dartboard setups were manually collected and annotated to evaluate the system. In the primary dataset containing 15k images captured from a face-on view of the dartboard using a smartphone, DeepDarts predicted the total score correctly in 94.7% of the test images. In a second more challenging dataset containing limited training data (830 images) and various camera angles, we utilize transfer learning and extensive data augmentation to achieve a test accuracy of 84.0%. Because DeepDarts relies only on single images, it has the potential to be deployed on edge devices, giving anyone with a smartphone access to an automatic dart scoring system for steel-tip darts. The code and datasets are available.