Ambiguous Images With Human Judgments for Robust Visual Event Classification
Contemporary vision benchmarks predominantly consider tasks on which humans can achieve near-perfect performance. However, humans are frequently presented with visual data that they cannot classify with 100% certainty, and models trained on standard vision benchmarks achieve low performance when evaluated on this data. To address this issue, we introduce a procedure for creating datasets of ambiguous images and use it to produce SQUID-E (Squidy), a collection of noisy images extracted from videos. All images are annotated with ground truth values and a test set is annotated with human uncertainty judgments. We use this dataset to characterize human uncertainty in vision tasks and evaluate existing visual event classification models. Experimental results suggest that existing vision models are not sufficiently equipped to provide meaningful outputs for ambiguous images and that datasets of this nature can be used to assess and improve such models through model training and direct evaluation of model calibration. These findings motivate large-scale ambiguous dataset creation and further research focusing on noisy visual data.
- North America > United States > Virginia (0.04)
- North America > United States > Indiana (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Afghanistan > Parwan Province > Charikar (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Europe > Spain > Andalusia > Granada Province > Granada (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
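The abstract above mentions "direct evaluation of model calibration" on ambiguous images. As a hedged illustration (not the authors' code), the standard expected calibration error (ECE) metric bins predictions by confidence and compares each bin's mean confidence to its empirical accuracy; the arrays below are hypothetical:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence and accumulate the
    weighted gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight bin by its share of samples
    return ece

# A degenerate perfectly calibrated case: full confidence, always right.
print(expected_calibration_error([1.0, 1.0], [1, 1]))  # → 0.0
```

On an ambiguous test set, the same binning can be run against human uncertainty judgments instead of 0/1 correctness to ask whether the model's confidence tracks human confidence.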
RL-MoE: An Image-Based Privacy Preserving Approach In Intelligent Transportation System
Rezaei, Abdolazim, Sookhak, Mehdi, Haghparast, Mahboobeh
Department of Computer Science, Texas A&M University-Corpus Christi, USA

The proliferation of AI-powered cameras in Intelligent Transportation Systems (ITS) creates a severe conflict between the need for rich visual data and the right to privacy. Existing privacy-preserving methods, such as blurring or encryption, are often insufficient because they create an undesirable trade-off: either privacy is compromised by advanced reconstruction attacks, or data utility is critically degraded. To resolve this challenge, we propose RL-MoE, a novel framework that transforms sensitive visual data into privacy-preserving textual descriptions, eliminating the need for direct image transmission. RL-MoE uniquely combines a Mixture-of-Experts (MoE) architecture for nuanced, multi-aspect scene decomposition with a Reinforcement Learning (RL) agent that optimizes the generated text for the dual objectives of semantic accuracy and privacy preservation. Extensive experiments demonstrate that RL-MoE provides superior privacy protection, reducing the success rate of replay attacks to just 9.4% on the CFP-FP dataset, while simultaneously generating richer textual content than baseline methods. Our work provides a practical and scalable solution for building trustworthy AI systems in privacy-sensitive domains, paving the way for more secure smart city and autonomous vehicle networks.

The growing integration of artificial intelligence (AI) and Internet of Things (IoT) technologies in intelligent transportation systems (ITS) has significantly enhanced the capabilities of urban mobility management.
From traffic monitoring and congestion analysis to automated violation detection and smart infrastructure planning, ITS plays a pivotal role in shaping the future of transportation. A key component of these systems is the use of roadside cameras, which continuously capture visual data to enable real-time decision-making and improve road safety.
- Oceania > Australia > Queensland (0.04)
- North America > United States > Maryland (0.04)
- North America > United States > California (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Security & Privacy (1.00)
- Transportation > Ground > Road (0.34)
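The RL agent above optimizes generated text for "dual objectives of semantic accuracy and privacy preservation". The paper does not give its reward function here, so the sketch below is purely illustrative: a toy scalar reward that trades off a semantic-fidelity score against a penalty for detected privacy leaks (e.g., faces or plate numbers surviving in the text); every name and the weighting scheme are assumptions:

```python
def dual_objective_reward(semantic_score, privacy_leaks, alpha=0.7):
    """Toy reward for an RL text-optimization agent (illustrative only):
    semantic_score in [0, 1] measures fidelity to the scene;
    privacy_leaks counts identifiers detected in the generated text;
    alpha weights accuracy against privacy."""
    privacy_score = 1.0 / (1.0 + privacy_leaks)  # 1.0 when nothing leaks
    return alpha * semantic_score + (1.0 - alpha) * privacy_score

# An equally faithful description that leaks two identifiers is
# rewarded less than one that leaks nothing.
assert dual_objective_reward(0.9, 0) > dual_objective_reward(0.9, 2)
```

The design point such a reward captures is that the agent is never asked to maximize privacy alone (which an empty caption would satisfy); utility and privacy are scored jointly.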
Unified Multimodal Understanding via Byte-Pair Visual Encoding
Zhang, Wanpeng, Feng, Yicheng, Luo, Hao, Li, Yijiang, Yue, Zihao, Zheng, Sipeng, Lu, Zongqing
Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens. Unlike conventional approaches that rely on modality-specific encoders, our method directly incorporates structural information into visual tokens, mirroring successful tokenization strategies in text-only language models. We introduce a priority-guided encoding scheme that considers both frequency and spatial consistency, coupled with a multi-stage training procedure based on curriculum-driven data composition. These enhancements enable the transformer model to better capture cross-modal relationships and reason with visual information. Comprehensive experiments demonstrate improved performance across diverse vision-language tasks. By bridging the gap between visual and textual representations, our approach contributes to the advancement of more capable and efficient multimodal foundation models.
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe (0.04)
- Asia > China (0.04)
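The abstract above applies byte-pair encoding to visual tokens with a "priority-guided encoding scheme" combining frequency and spatial consistency. As a hedged sketch (not the paper's algorithm), one BPE merge step over a sequence of discrete visual token ids looks like this, with the spatial-consistency weight left as a pluggable stub:

```python
from collections import Counter

def merge_top_pair(tokens, spatial_weight=None, next_id=1000):
    """One BPE merge step over discrete (visual) token ids.
    Priority = pair frequency x optional spatial-consistency weight;
    the weighting hook is illustrative, not the paper's exact scheme."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens, None
    weight = spatial_weight or (lambda pair: 1.0)
    best = max(pairs, key=lambda p: pairs[p] * weight(p))
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
            merged.append(next_id)  # replace the pair with a composite token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, best

seq, pair = merge_top_pair([7, 3, 5, 7, 3, 9, 7, 3])
print(pair, seq)  # → (7, 3) [1000, 5, 1000, 9, 1000]
```

Iterating this step builds a vocabulary of composite visual tokens, mirroring how text-only BPE builds subwords from frequent character pairs.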
PerfCam: Digital Twinning for Production Lines Using 3D Gaussian Splatting and Vision Models
Khan, Michel Gokan, Guarese, Renan, Johnson, Fabian, Wang, Xi Vincent, Bergman, Anders, Edvinsson, Benjamin, Romero, Mario, Vachier, Jérémy, Kronqvist, Jan
We introduce PerfCam, an open source Proof-of-Concept (PoC) digital twinning framework that combines camera and sensory data with 3D Gaussian Splatting and computer vision models for digital twinning, object tracking, and Key Performance Indicators (KPIs) extraction in industrial production lines. By utilizing 3D reconstruction and Convolutional Neural Networks (CNNs), PerfCam offers a semi-automated approach to object tracking and spatial mapping, enabling digital twins that capture real-time KPIs such as availability, performance, Overall Equipment Effectiveness (OEE), and rate of conveyor belts in the production line. We validate the effectiveness of PerfCam through a practical deployment within realistic test production lines in the pharmaceutical industry and contribute an openly published dataset to support further research and development in the field. The results demonstrate PerfCam's ability to deliver actionable insights through its precise digital twin capabilities, underscoring its value as an effective tool for developing usable digital twins in smart manufacturing environments and extracting operational analytics.
- Europe > Sweden > Stockholm > Stockholm (0.04)
- Asia > Singapore (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- (3 more...)
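PerfCam's digital twins report KPIs including availability, performance, and OEE. The standard OEE definition (Availability x Performance x Quality) can be computed as below; this is the textbook formula, not necessarily PerfCam's exact implementation, and the sample figures are hypothetical:

```python
def oee(run_time, planned_time, ideal_cycle_time, total_count, good_count):
    """Standard Overall Equipment Effectiveness:
    OEE = Availability x Performance x Quality, with times in the
    same unit and ideal_cycle_time = ideal time per unit produced."""
    availability = run_time / planned_time
    performance = (ideal_cycle_time * total_count) / run_time
    quality = good_count / total_count
    return availability * performance * quality

# 420 of 480 planned minutes running, 400 units at an ideal
# 1 min/unit, 380 of them good:
print(round(oee(420, 480, 1.0, 400, 380), 3))  # → 0.792
```

In a deployment like the one described, `run_time` and the unit counts would come from the camera-based object tracking rather than manual logs.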