image frame
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > United Kingdom (0.04)
- Media (1.00)
- Leisure & Entertainment > Sports > Tennis (0.46)
Explainable Deep Anomaly Detection with Sequential Hypothesis Testing for Robotic Sewer Inspection
George, Alex, Shepherd, Will, Tait, Simon, Mihaylova, Lyudmila, Anderson, Sean R.
Sewer pipe faults, such as leaks and blockages, can lead to severe consequences including groundwater contamination, property damage, and service disruption. Traditional inspection methods rely heavily on the manual review of CCTV footage collected by mobile robots, which is inefficient and susceptible to human error. To automate this process, we propose a novel system incorporating explainable deep learning anomaly detection combined with sequential probability ratio testing (SPRT). The anomaly detector processes single image frames, providing interpretable spatial localisation of anomalies, whilst the SPRT introduces temporal evidence aggregation, enhancing robustness against noise over sequences of image frames. Experimental results demonstrate improved anomaly detection performance, highlighting the benefits of the combined spatiotemporal analysis system for reliable and robust sewer inspection.
- Europe > United Kingdom > England > South Yorkshire > Sheffield (0.05)
- Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
Fluoroscopic Shape and Pose Tracking of Catheters with Custom Radiopaque Markers
Lawson, Jared, Chitale, Rohan, Simaan, Nabil
--Safe navigation of steerable and robotic catheters in the cerebral vasculature requires awareness of the catheter's shape and pose. Currently, a significant perception burden is placed on interventionalists to mentally reconstruct and predict catheter motions from biplane fluoroscopy images. Efforts to track these catheters are limited to planar segmentation or bulky sensing instrumentation, which are incompatible with microcatheters used in neurointervention. In this work, a catheter is equipped with custom radiopaque markers arranged to enable simultaneous shape and pose estimation under biplane fluoroscopy. A design measure is proposed to guide the arrangement of these markers to minimize sensitivity to marker tracking uncertainty. Endovascular neurosurgery is a rapidly growing domain which enables treatment of cerebrovascular disease with minimally-invasive approaches. Among the most common endovascular neurointerventions include aneurysm coiling and mechanical thrombectomy (MT), which has become the gold standard for treating strokes caused by large vessel occlusions (L VOs).
- North America > United States > Tennessee > Davidson County > Nashville (0.04)
- North America > United States > Massachusetts > Middlesex County > Waltham (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (3 more...)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
- Health & Medicine > Health Care Equipment & Supplies (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (0.94)
MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos
Zang, Yuan, Tan, Hao, Yoon, Seunghyun, Dernoncourt, Franck, Gu, Jiuxiang, Kafle, Kushal, Sun, Chen, Bui, Trung
We study multi-modal summarization for instructional videos, whose goal is to provide users an efficient way to learn skills in the form of text instructions and key video frames. We observe that existing benchmarks focus on generic semantic-level video summarization, and are not suitable for providing step-by-step executable instructions and illustrations, both of which are crucial for instructional videos. We propose a novel benchmark for user interface (UI) instructional video summarization to fill the gap. We collect a dataset of 2,413 UI instructional videos, which spans over 167 hours. These videos are manually annotated for video segmentation, text summarization, and video summarization, which enable the comprehensive evaluations for concise and executable video summarization. We conduct extensive experiments on our collected MS4UI dataset, which suggest that state-of-the-art multi-modal summarization methods struggle on UI video summarization, and highlight the importance of new methods for UI instructional video summarization.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- (2 more...)
- Education > Educational Technology > Media (1.00)
- Education > Educational Technology > Audio & Video (1.00)
Multi-Modal Framing Analysis of News
Arora, Arnav, Yadav, Srishti, Antoniak, Maria, Belongie, Serge, Augenstein, Isabelle
Automated frame analysis of political communication is a popular task in computational social science that is used to study how authors select aspects of a topic to frame its reception. So far, such studies have been narrow, in that they use a fixed set of pre-defined frames and focus only on the text, ignoring the visual contexts in which those texts appear. Especially for framing in the news, this leaves out valuable information about editorial choices, which include not just the written article but also accompanying photographs. To overcome such limitations, we present a method for conducting multi-modal, multi-label framing analysis at scale using large (vision-)language models. Grounding our work in framing theory, we extract latent meaning embedded in images used to convey a certain point and contrast that to the text by comparing the respective frames used. We also identify highly partisan framing of topics with issue-specific frame analysis found in prior qualitative work. We demonstrate a method for doing scalable integrative framing analysis of both text and image in news, providing a more complete picture for understanding media bias.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Middle East > Israel (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- (26 more...)
- Media > News (1.00)
- Leisure & Entertainment (1.00)
- Law > Criminal Law (1.00)
- (7 more...)
Towards Automated Semantic Interpretability in Reinforcement Learning via Vision-Language Models
Li, Zhaoxin, Xi-Jia, Zhang, Altundas, Batuhan, Chen, Letian, Paleja, Rohan, Gombolay, Matthew
Semantic Interpretability in Reinforcement Learning (RL) enables transparency, accountability, and safer deployment by making the agent's decisions understandable and verifiable. Achieving this, however, requires a feature space composed of human-understandable concepts, which traditionally rely on human specification and fail to generalize to unseen environments. In this work, we introduce Semantically Interpretable Reinforcement Learning with Vision-Language Models Empowered Automation (SILVA), an automated framework that leverages pre-trained vision-language models (VLM) for semantic feature extraction and interpretable tree-based models for policy optimization. SILVA first queries a VLM to identify relevant semantic features for an unseen environment, then extracts these features from the environment. Finally, it trains an Interpretable Control Tree via RL, mapping the extracted features to actions in a transparent and interpretable manner. To address the computational inefficiency of extracting features directly with VLMs, we develop a feature extraction pipeline that generates a dataset for training a lightweight convolutional network, which is subsequently used during RL. By leveraging VLMs to automate tree-based RL, SILVA removes the reliance on human annotation previously required by interpretable models while also overcoming the inability of VLMs alone to generate valid robot policies, enabling semantically interpretable reinforcement learning without human-in-the-loop.
- North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
- North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
A Robust and Efficient Visual-Inertial Initialization with Probabilistic Normal Epipolar Constraint
Mu, Changshi, Feng, Daquan, Zheng, Qi, Zhuang, Yuan
Accurate and robust initialization is essential for Visual-Inertial Odometry (VIO), as poor initialization can severely degrade pose accuracy. During initialization, it is crucial to estimate parameters such as accelerometer bias, gyroscope bias, initial velocity, and gravity, etc. The IMU sensor requires precise estimation of gyroscope bias because gyroscope bias affects rotation, velocity and position. Most existing VIO initialization methods adopt Structure from Motion (SfM) to solve for gyroscope bias. However, SfM is not stable and efficient enough in fast motion or degenerate scenes. To overcome these limitations, we extended the rotation-translation-decoupling framework by adding new uncertainty parameters and optimization modules. First, we adopt a gyroscope bias optimizer that incorporates probabilistic normal epipolar constraints. Second, we fuse IMU and visual measurements to solve for velocity, gravity, and scale efficiently. Finally, we design an additional refinement module that effectively diminishes gravity and scale errors. Extensive initialization tests on the EuRoC dataset show that our method reduces the gyroscope bias and rotation estimation error by an average of 16% and 4% respectively. It also significantly reduces the gravity error, with an average reduction of 29%.
- Asia > China > Hubei Province > Wuhan (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (2 more...)
Tex-ViT: A Generalizable, Robust, Texture-based dual-branch cross-attention deepfake detector
Dagar, Deepak, Vishwakarma, Dinesh Kumar
Deepfakes, which employ Generative Adversarial Networks (GANs) to produce highly realistic facial modification, are widely regarded as the prevailing method. Traditional Convolutional Neural Networks (CNNs) have been able to identify bogus media, but they struggle to perform well on different datasets and are vulnerable to adversarial attacks due to their lack of robustness. Vision transformers have demonstrated potential in the realm of image classification problems, but they require enough training data. Motivated by these limitations, this publication introduces Tex-ViT (Texture-Vision Transformer), which enhances CNN features by combining ResNet (Residual Networks) with a vision transformer. The model combines traditional ResNet features with a texture module that operates in parallel on sections of ResNet before each down-sampling operation. The texture module then serves as an input to the dual branch of the cross-attention vision transformer. It specifically focuses on improving the global texture module, which extracts feature map correlation. Empirical analysis reveals that fake images exhibit smooth textures that do not remain consistent over long distances in manipulations. Experiments were performed on different categories of FaceForensics++ (FF++), such as Deepfakes (DF), Face2Face (f2f), Faceswap (FS), and Neural Texture (NT), together with other types of GAN datasets in cross-domain scenarios. Furthermore, experiments also conducted on FF++, DFDCPreview, and Celeb-DF dataset underwent several post-processing situations, such as blurring, compression, and noise. The model surpassed the most advanced models in terms of generalization, achieving a 98% accuracy in cross-domain scenarios. This demonstrates its ability to learn the shared distinguishing textural characteristics in the manipulated samples. These experiments provide evidence that the proposed model is capable of being applied to various situations and is resistant to many postprocessing procedures. Keywords: Deepfake detector, Texture, Gram matrices, Generalization, Robustness. 1 Introduction With the advancements in technology, especially GANs, it is possible to generate highly realistic content that can easily deceive the naked eye. Deepfake is a current state of the art of visual and audio manipulation. Deepfake is a technology where highly astonishing, realistic, and believable content is created using deep learning technology (Figure 1). Visual deepfakes can be classified into five categories: lip sync, attribute manipulation, full-image synthesis, body re-enactment, and face ap [1]. The application of the deepfake has benefitted the education and entertainment industry in various ways.
- North America > Canada > Quebec > Montreal (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (16 more...)
An Efficient and Explanatory Image and Text Clustering System with Multimodal Autoencoder Architecture
Shi, Tiancheng, Wei, Yuanchen, Kender, John R.
We demonstrate the efficiencies and explanatory abilities of extensions to the common tools of Autoencoders and LLM interpreters, in the novel context of comparing different cultural approaches to the same international news event. We develop a new Convolutional-Recurrent Variational Autoencoder (CRVAE) model that extends the modalities of previous CVAE models, by using fully-connected latent layers to embed in parallel the CNN encodings of video frames, together with the LSTM encodings of their related text derived from audio. We incorporate the model within a larger system that includes frame-caption alignment, latent space vector clustering, and a novel LLM-based cluster interpreter. We measure, tune, and apply this system to the task of summarizing a video into three to five thematic clusters, with each theme described by ten LLM-produced phrases. We apply this system to two news topics, COVID-19 and the Winter Olympics, and five other topics are in progress.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- South America > Argentina (0.04)
- (4 more...)
- Leisure & Entertainment > Sports (1.00)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
- Health & Medicine > Therapeutic Area > Immunology (1.00)