AITopics | image frame

Collaborating Authors

image frame

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Supplemental Material

Neural Information Processing Systems | Aug-17-2025, 07:36:12 GMT

We found channels using YouTube's auto-generated 'topic' pages, corresponding to entries in

data mining, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > United Kingdom (0.04)

Genre: Research Report (0.46)

Industry:

Media (1.00)
Leisure & Entertainment > Sports > Tennis (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Data Science > Data Mining (0.67)

Explainable Deep Anomaly Detection with Sequential Hypothesis Testing for Robotic Sewer Inspection

George, Alex, Shepherd, Will, Tait, Simon, Mihaylova, Lyudmila, Anderson, Sean R.

arXiv.org Artificial Intelligence | Jul-31-2025

Sewer pipe faults, such as leaks and blockages, can lead to severe consequences including groundwater contamination, property damage, and service disruption. Traditional inspection methods rely heavily on the manual review of CCTV footage collected by mobile robots, which is inefficient and susceptible to human error. To automate this process, we propose a novel system incorporating explainable deep learning anomaly detection combined with sequential probability ratio testing (SPRT). The anomaly detector processes single image frames, providing interpretable spatial localisation of anomalies, whilst the SPRT introduces temporal evidence aggregation, enhancing robustness against noise over sequences of image frames. Experimental results demonstrate improved anomaly detection performance, highlighting the benefits of the combined spatiotemporal analysis system for reliable and robust sewer inspection.
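The temporal evidence aggregation described in this abstract can be illustrated with a minimal sketch of Wald's sequential probability ratio test. The abstract does not give the paper's likelihood model, so this sketch assumes each frame yields a binary anomaly flag and that flags follow a Bernoulli distribution under each hypothesis. The function name and the values of p0, p1, alpha and beta are hypothetical, chosen for illustration only.

```python
import math


def sprt_bernoulli(flags, p0=0.1, p1=0.6, alpha=0.01, beta=0.01):
    """Wald's SPRT over a sequence of binary per-frame anomaly flags.

    Assumed model (not taken from the paper):
      H0: a frame is flagged with probability p0 (normal pipe section).
      H1: a frame is flagged with probability p1 (faulty pipe section).
    alpha and beta are the target false-alarm and missed-detection rates.
    Returns (decision, frames_used); decision is "anomaly", "normal",
    or "undecided" if the sequence ends before a threshold is crossed.
    """
    upper = math.log((1 - beta) / alpha)  # accept H1 at or above this value
    lower = math.log(beta / (1 - alpha))  # accept H0 at or below this value
    llr = 0.0  # cumulative log-likelihood ratio log(P(data|H1) / P(data|H0))
    for n, flag in enumerate(flags, start=1):
        if flag:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "anomaly", n
        if llr <= lower:
            return "normal", n
    return "undecided", len(flags)


if __name__ == "__main__":
    print(sprt_bernoulli([1, 1, 1]))          # three consecutive flagged frames
    print(sprt_bernoulli([1] + [0] * 8))      # one noisy flag, then clean frames
```

With these assumed rates, a single spurious flag does not trigger a detection: the test keeps accumulating evidence and settles on "normal" once enough clean frames follow, which is the kind of robustness to per-frame noise the abstract attributes to the SPRT stage.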

artificial intelligence, data mining, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2507.22546

Country: Europe > United Kingdom (0.15)

Genre: Research Report (0.70)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.99)

Fluoroscopic Shape and Pose Tracking of Catheters with Custom Radiopaque Markers

Lawson, Jared, Chitale, Rohan, Simaan, Nabil

arXiv.org Artificial Intelligence | Jun-18-2025

Safe navigation of steerable and robotic catheters in the cerebral vasculature requires awareness of the catheter's shape and pose. Currently, a significant perception burden is placed on interventionalists to mentally reconstruct and predict catheter motions from biplane fluoroscopy images. Efforts to track these catheters are limited to planar segmentation or bulky sensing instrumentation, which are incompatible with microcatheters used in neurointervention. In this work, a catheter is equipped with custom radiopaque markers arranged to enable simultaneous shape and pose estimation under biplane fluoroscopy. A design measure is proposed to guide the arrangement of these markers to minimize sensitivity to marker tracking uncertainty. Endovascular neurosurgery is a rapidly growing domain which enables treatment of cerebrovascular disease with minimally invasive approaches. The most common endovascular neurointerventions include aneurysm coiling and mechanical thrombectomy (MT), the latter of which has become the gold standard for treating strokes caused by large vessel occlusions (LVOs).

artificial intelligence, catheter, fluoroscopy, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/LRA.2025.3581043

2506.09934

Country:

Europe (0.68)
North America > United States (0.46)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Health Care Equipment & Supplies (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.94)

Technology:

Information Technology > Sensing and Signal Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)

MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos

Zang, Yuan, Tan, Hao, Yoon, Seunghyun, Dernoncourt, Franck, Gu, Jiuxiang, Kafle, Kushal, Sun, Chen, Bui, Trung

arXiv.org Artificial Intelligence | Jun-17-2025

We study multi-modal summarization for instructional videos, whose goal is to provide users an efficient way to learn skills in the form of text instructions and key video frames. We observe that existing benchmarks focus on generic semantic-level video summarization, and are not suitable for providing step-by-step executable instructions and illustrations, both of which are crucial for instructional videos. We propose a novel benchmark for user interface (UI) instructional video summarization to fill the gap. We collect a dataset of 2,413 UI instructional videos, which span over 167 hours. These videos are manually annotated for video segmentation, text summarization, and video summarization, which enables comprehensive evaluation of concise and executable video summarization. We conduct extensive experiments on our collected MS4UI dataset, which suggest that state-of-the-art multi-modal summarization methods struggle on UI video summarization, and highlight the importance of new methods for UI instructional video summarization.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2506.12623

Country: Europe > Switzerland (0.28)

Genre: Instructional Material > Course Syllabus & Notes (1.00)

Industry:

Education > Educational Technology > Media (1.00)
Education > Educational Technology > Audio & Video (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Human Computer Interaction (0.85)

Multi-Modal Framing Analysis of News

Arora, Arnav, Yadav, Srishti, Antoniak, Maria, Belongie, Serge, Augenstein, Isabelle

arXiv.org Artificial Intelligence | Apr-3-2025

Automated frame analysis of political communication is a popular task in computational social science that is used to study how authors select aspects of a topic to frame its reception. So far, such studies have been narrow, in that they use a fixed set of pre-defined frames and focus only on the text, ignoring the visual contexts in which those texts appear. Especially for framing in the news, this leaves out valuable information about editorial choices, which include not just the written article but also accompanying photographs. To overcome such limitations, we present a method for conducting multi-modal, multi-label framing analysis at scale using large (vision-)language models. Grounding our work in framing theory, we extract latent meaning embedded in images used to convey a certain point and contrast that to the text by comparing the respective frames used. We also identify highly partisan framing of topics with issue-specific frame analysis found in prior qualitative work. We demonstrate a method for doing scalable integrative framing analysis of both text and image in news, providing a more complete picture for understanding media bias.

computational linguistic, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2503.2096

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > Israel (0.14)
North America > United States > California > San Francisco County > San Francisco (0.04)
(26 more...)

Genre: Research Report (0.82)

Industry:

Media > News (1.00)
Leisure & Entertainment (1.00)
Law > Criminal Law (1.00)
(7 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.87)

Towards Automated Semantic Interpretability in Reinforcement Learning via Vision-Language Models

Li, Zhaoxin, Xi-Jia, Zhang, Altundas, Batuhan, Chen, Letian, Paleja, Rohan, Gombolay, Matthew

arXiv.org Artificial Intelligence | Mar-20-2025

Semantic Interpretability in Reinforcement Learning (RL) enables transparency, accountability, and safer deployment by making the agent's decisions understandable and verifiable. Achieving this, however, requires a feature space composed of human-understandable concepts, which traditionally rely on human specification and fail to generalize to unseen environments. In this work, we introduce Semantically Interpretable Reinforcement Learning with Vision-Language Models Empowered Automation (SILVA), an automated framework that leverages pre-trained vision-language models (VLM) for semantic feature extraction and interpretable tree-based models for policy optimization. SILVA first queries a VLM to identify relevant semantic features for an unseen environment, then extracts these features from the environment. Finally, it trains an Interpretable Control Tree via RL, mapping the extracted features to actions in a transparent and interpretable manner. To address the computational inefficiency of extracting features directly with VLMs, we develop a feature extraction pipeline that generates a dataset for training a lightweight convolutional network, which is subsequently used during RL. By leveraging VLMs to automate tree-based RL, SILVA removes the reliance on human annotation previously required by interpretable models while also overcoming the inability of VLMs alone to generate valid robot policies, enabling semantically interpretable reinforcement learning without human-in-the-loop.

machine learning, natural language, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

2503.16724

Country:

North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

A Robust and Efficient Visual-Inertial Initialization with Probabilistic Normal Epipolar Constraint

Mu, Changshi, Feng, Daquan, Zheng, Qi, Zhuang, Yuan

arXiv.org Artificial Intelligence | Oct-25-2024

Accurate and robust initialization is essential for Visual-Inertial Odometry (VIO), as poor initialization can severely degrade pose accuracy. During initialization, it is crucial to estimate parameters such as accelerometer bias, gyroscope bias, initial velocity, and gravity. The IMU sensor requires precise estimation of gyroscope bias because gyroscope bias affects rotation, velocity, and position. Most existing VIO initialization methods adopt Structure from Motion (SfM) to solve for gyroscope bias. However, SfM is neither stable nor efficient enough in fast motion or degenerate scenes. To overcome these limitations, we extend the rotation-translation-decoupling framework by adding new uncertainty parameters and optimization modules. First, we adopt a gyroscope bias optimizer that incorporates probabilistic normal epipolar constraints. Second, we fuse IMU and visual measurements to solve for velocity, gravity, and scale efficiently. Finally, we design an additional refinement module that effectively diminishes gravity and scale errors. Extensive initialization tests on the EuRoC dataset show that our method reduces the gyroscope bias and rotation estimation errors by an average of 16% and 4%, respectively. It also significantly reduces the gravity error, with an average reduction of 29%.

artificial intelligence, estimation, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2410.19473

Country:

Asia > China > Hubei Province > Wuhan (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning (0.46)

Tex-ViT: A Generalizable, Robust, Texture-based dual-branch cross-attention deepfake detector

Dagar, Deepak, Vishwakarma, Dinesh Kumar

arXiv.org Artificial Intelligence | Aug-29-2024

Deepfakes, which employ Generative Adversarial Networks (GANs) to produce highly realistic facial modification, are widely regarded as the prevailing method. Traditional Convolutional Neural Networks (CNNs) have been able to identify bogus media, but they struggle to perform well on different datasets and are vulnerable to adversarial attacks due to their lack of robustness. Vision transformers have demonstrated potential in the realm of image classification problems, but they require sufficient training data. Motivated by these limitations, this publication introduces Tex-ViT (Texture-Vision Transformer), which enhances CNN features by combining ResNet (Residual Networks) with a vision transformer. The model combines traditional ResNet features with a texture module that operates in parallel on sections of ResNet before each down-sampling operation. The texture module then serves as an input to the dual branch of the cross-attention vision transformer. It specifically focuses on improving the global texture module, which extracts feature map correlation. Empirical analysis reveals that fake images exhibit smooth textures that do not remain consistent over long distances in manipulations. Experiments were performed on different categories of FaceForensics++ (FF++), such as Deepfakes (DF), Face2Face (f2f), Faceswap (FS), and Neural Texture (NT), together with other types of GAN datasets in cross-domain scenarios. Furthermore, experiments were also conducted on the FF++, DFDCPreview, and Celeb-DF datasets under several post-processing conditions, such as blurring, compression, and noise. The model surpassed the most advanced models in terms of generalization, achieving 98% accuracy in cross-domain scenarios. This demonstrates its ability to learn the shared distinguishing textural characteristics in the manipulated samples. These experiments provide evidence that the proposed model can be applied to various situations and is resistant to many post-processing procedures.

Keywords: Deepfake detector, Texture, Gram matrices, Generalization, Robustness.

1 Introduction

With the advancements in technology, especially GANs, it is possible to generate highly realistic content that can easily deceive the naked eye. Deepfakes are the current state of the art in visual and audio manipulation. Deepfake is a technology where highly astonishing, realistic, and believable content is created using deep learning technology (Figure 1). Visual deepfakes can be classified into five categories: lip sync, attribute manipulation, full-image synthesis, body re-enactment, and face swap [1]. The application of deepfakes has benefitted the education and entertainment industries in various ways.
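The abstract lists Gram matrices among its keywords and describes a texture module that extracts feature map correlation. As a rough illustration of that general idea only, the sketch below computes a normalized Gram matrix over a small stack of feature maps in plain Python. It is not the Tex-ViT texture module, whose design is not given in this excerpt; the function name and the normalization by the number of spatial positions are assumptions.

```python
def gram_matrix(feature_maps):
    """Channel-by-channel correlation of flattened feature maps.

    feature_maps: list of C channels, each a list of H rows of W floats.
    Returns a C x C list of lists where entry [i][j] is the dot product of
    flattened channels i and j, divided by H * W (assumed normalization).
    """
    # Flatten each H x W channel into one vector of length H * W.
    flat = [[value for row in channel for value in row] for channel in feature_maps]
    n = len(flat[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in flat]
        for fi in flat
    ]


if __name__ == "__main__":
    # Two hypothetical 2x2 channels with complementary activation patterns.
    channels = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 1.0], [1.0, 0.0]],
    ]
    print(gram_matrix(channels))
```

Off-diagonal entries measure how strongly two channels co-activate across the image regardless of where they fire, which is why Gram matrices are commonly used as a position-independent texture descriptor.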

category, dataset, manipulation, (15 more...)

arXiv.org Artificial Intelligence

2408.16892

Country:

North America > Canada > Quebec > Montreal (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(16 more...)

Genre: Research Report > Experimental Study (0.34)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

An Efficient and Explanatory Image and Text Clustering System with Multimodal Autoencoder Architecture

Shi, Tiancheng, Wei, Yuanchen, Kender, John R.

arXiv.org Artificial Intelligence | Aug-14-2024

We demonstrate the efficiencies and explanatory abilities of extensions to the common tools of Autoencoders and LLM interpreters, in the novel context of comparing different cultural approaches to the same international news event. We develop a new Convolutional-Recurrent Variational Autoencoder (CRVAE) model that extends the modalities of previous CVAE models, by using fully-connected latent layers to embed in parallel the CNN encodings of video frames, together with the LSTM encodings of their related text derived from audio. We incorporate the model within a larger system that includes frame-caption alignment, latent space vector clustering, and a novel LLM-based cluster interpreter. We measure, tune, and apply this system to the task of summarizing a video into three to five thematic clusters, with each theme described by ten LLM-produced phrases. We apply this system to two news topics, COVID-19 and the Winter Olympics, and five other topics are in progress.

efficient and explanatory image, image and text clustering system, video, (13 more...)

arXiv.org Artificial Intelligence

2408.07791

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Los Angeles County > Los Angeles (0.04)
South America > Argentina (0.04)
(4 more...)

Genre: Research Report (0.50)

Industry:

Leisure & Entertainment > Sports (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)

144258c36a5559a6cf9f7d53a527eb57-Supplemental-Datasets_and_Benchmarks.pdf