AITopics | visual feature extractor

Mining the Benefits of Two-stage and One-stage HOI Detection

Neural Information Processing Systems, Feb-9-2026, 21:15:00 GMT

The newly introduced disentangling paradigm outperforms existing methods by a large margin, with a significant relative mAP gain of 9.32% on HICO-Det.

artificial intelligence, incvpr, machine learning, (18 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

Wozniak, Maciej K., Liu, Lianhang, Cai, Yixi, Jensfelt, Patric

arXiv.org Artificial Intelligence, Jul-25-2025

While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors, and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be at https://maxiuw.github.io/prix.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2507.17596

Genre: Research Report (1.00)

Industry:

Automobiles & Trucks (0.74)
Transportation > Ground > Road (0.64)
Information Technology > Robotics & Automation (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection

Lee, Kyungbok, Zhang, You, Duan, Zhiyao

arXiv.org Artificial Intelligence, Jun-20-2024

This paper addresses the challenge of developing a robust audio-visual deepfake detection model. In practical use cases, new generation algorithms are continually emerging, and these algorithms are not encountered during the development of detection methods. This calls for the generalization ability of the method. Additionally, to ensure the credibility of detection methods, it is beneficial for the model to interpret which cues from the video indicate it is fake. Motivated by these considerations, we propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique. We study the generalization problem of audio-visual deepfake detection by creating a new benchmark that extends and re-splits the existing FakeAVCeleb dataset. The benchmark contains four categories of fake videos (Real Audio-Fake Visual, Fake Audio-Fake Visual, Fake Audio-Real Visual, and unsynchronized video). The experimental results show that our approach improves the model's detection of unseen attacks by an average of 7.31% across four test sets, compared to the baseline model. Additionally, our proposed framework offers interpretability, indicating which modality the model identifies as fake.

detection, feature extractor, modality, (12 more...)

arXiv.org Artificial Intelligence

2406.14176

Country:

Asia (0.04)
North America > United States > New York > Monroe County > Rochester (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Interactive Visual Task Learning for Robots

Gu, Weiwei, Sah, Anant, Gopalan, Nakul

arXiv.org Artificial Intelligence, Dec-20-2023

We present a framework for robots to learn novel visual concepts and tasks via in-situ linguistic interactions with human users. Previous approaches have either used large pre-trained visual models to infer novel objects zero-shot, or added novel concepts along with their attributes and representations to a concept hierarchy. We extend the approaches that focus on learning visual concept hierarchies by enabling them to learn novel concepts and solve unseen robotics tasks with them. To enable a visual concept learner to solve robotics tasks one-shot, we developed two distinct techniques. Firstly, we propose a novel approach, Hi-Viscont (HIerarchical VISual CONcept learner for Task), which augments information of a novel concept to its parent nodes within a concept hierarchy. This information propagation allows all concepts in a hierarchy to update as novel concepts are taught in a continual learning setting. Secondly, we represent a visual task as a scene graph with language annotations, allowing us to create novel permutations of a demonstrated task zero-shot in-situ. We present two sets of results. Firstly, we compare Hi-Viscont with the baseline model (FALCON) on visual question answering (VQA) in three domains. While being comparable to the baseline model on leaf-level concepts, Hi-Viscont achieves an improvement of over 9% on non-leaf concepts on average. Secondly, we compare our model's performance against the baseline FALCON model. Our framework achieves a 33% improvement in the success rate metric and a 19% improvement in object-level accuracy compared to the baseline model. With both of these results we demonstrate the ability of our model to learn tasks and concepts in a continual learning setting on the robot.
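The idea of propagating a novel concept's information to its parent nodes can be illustrated with a toy sketch. This is not the Hi-Viscont method itself: the `ConceptNode` class, the `add_concept` helper, and the running-mean update rule are all hypothetical names and choices made for illustration, and the embeddings are made-up numbers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConceptNode:
    """Hypothetical node in a concept hierarchy (illustrative only)."""
    name: str
    embedding: List[float]
    parent: Optional["ConceptNode"] = None
    children: List["ConceptNode"] = field(default_factory=list)
    n_vectors: int = 1  # how many vectors the embedding currently averages


def add_concept(parent: ConceptNode, name: str, embedding: List[float]) -> ConceptNode:
    """Attach a new concept and propagate its embedding to every ancestor.

    Assumption: each ancestor keeps a running mean of its own initial
    embedding and the embeddings of all concepts added beneath it.
    """
    node = ConceptNode(name=name, embedding=list(embedding), parent=parent)
    parent.children.append(node)

    ancestor = parent
    while ancestor is not None:
        k = ancestor.n_vectors
        ancestor.embedding = [
            (old * k + new) / (k + 1)
            for old, new in zip(ancestor.embedding, embedding)
        ]
        ancestor.n_vectors = k + 1
        ancestor = ancestor.parent
    return node


# Toy hierarchy: object -> container -> cup
root = ConceptNode("object", [0.0, 0.0])
container = add_concept(root, "container", [2.0, 0.0])
cup = add_concept(container, "cup", [2.0, 3.0])

print(container.embedding)
print(root.embedding)
```

In this sketch, teaching "cup" changes both "container" and "object", which mirrors the abstract's point that non-leaf concepts are updated as new leaves arrive.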

falcon, hi-viscont, node, (15 more...)

arXiv.org Artificial Intelligence

2312.13219

Country:

North America > United States > California (0.04)
North America > United States > Arizona (0.04)

Genre:

Research Report > Promising Solution (1.00)
Research Report > New Finding (0.70)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.45)

Fairness Indicators for Systematic Assessments of Visual Feature Extractors

Goyal, Priya, Soriano, Adriana Romero, Hazirbas, Caner, Sagun, Levent, Usunier, Nicolas

arXiv.org Artificial Intelligence, Feb-15-2022

Does everyone equally benefit from computer vision systems? Answers to this question become more and more important as computer vision systems are deployed at large scale, and can spark major concerns when they exhibit vast performance discrepancies between people from various demographic and social backgrounds. Systematic diagnosis of fairness, harms, and biases of computer vision systems is an important step towards building socially responsible systems. To initiate an effort towards standardized fairness audits, we propose three fairness indicators, which aim at quantifying harms and biases of visual systems. Our indicators use existing publicly available datasets collected for fairness evaluations, and focus on three main types of harms and bias identified in the literature, namely harmful label associations, disparity in learned representations of social and demographic traits, and biased performance on geographically diverse images from across the world. We define precise experimental protocols applicable to a wide range of computer vision models. These indicators are part of an ever-evolving suite of fairness probes and are not intended to be a substitute for a thorough analysis of the broader impact of the new computer vision technologies. Yet, we believe it is a necessary first step towards (1) facilitating the widespread adoption and mandate of the fairness assessments in computer vision research, and (2) tracking progress towards building socially responsible models. To study the practical effectiveness and broad applicability of our proposed indicators to any visual system, we apply them to off-the-shelf models built using widely adopted model training paradigms, which vary in whether they can predict labels on a given image or only produce embeddings. We also systematically study the effect of data domain and model size.
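As a rough illustration of what "biased performance" across groups can mean in practice, the following minimal sketch computes per-group accuracy and the gap between the best and worst group. It is not one of the paper's three indicators or its experimental protocol; the function names, the group labels, and the toy data are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def per_group_accuracy(
    predictions: Sequence[Hashable],
    labels: Sequence[Hashable],
    groups: Sequence[Hashable],
) -> Dict[Hashable, float]:
    """Accuracy computed separately for each group (e.g. a region)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(group_accuracy: Dict[Hashable, float]) -> float:
    """Difference between the best and worst group accuracy."""
    values = list(group_accuracy.values())
    return max(values) - min(values)


# Toy data: two hypothetical groups, "A" and "B".
predictions = ["cat", "dog", "cat", "dog"]
labels = ["cat", "dog", "dog", "dog"]
groups = ["A", "A", "B", "B"]

group_acc = per_group_accuracy(predictions, labels, groups)
gap = accuracy_gap(group_acc)
print(group_acc, gap)
```

A gap of zero would mean every group is served equally well on this metric; larger gaps flag the kind of discrepancy the abstract is concerned with.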

fairness indicator, systematic assessment, visual feature extractor

arXiv.org Artificial Intelligence

2202.07603

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Vision > Image Understanding (0.40)

Avoiding hashing and encouraging visual semantics in referential emergent language games

Mihai, Daniela, Hare, Jonathon

arXiv.org Machine Learning, Nov-13-2019

There has been an increasing interest in the area of emergent communication between agents which learn to play referential signalling games with realistic images. In this work, we consider the signalling game setting of Havrylov and Titov and investigate the effect of the feature extractor's weights and of the task being solved on the visual semantics learned or captured by the models. We impose various augmentations on the input images and additional tasks in the game with the aim of inducing visual representations which capture conceptual properties of images. Through our set of experiments, we demonstrate that communication systems which capture visual semantics can be learned in a completely self-supervised manner by playing the right types of game.

international conference, rotation, vgg16 relu 7, (14 more...)

arXiv.org Machine Learning

1911.05546

Country:

Europe > France (0.05)
North America > United States (0.04)
North America > Canada (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.69)
Information Technology > Sensing and Signal Processing > Image Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.31)

Learning Differences Between Visual Scanning Patterns Can Disambiguate Bipolar and Unipolar Patients

Chung, Jonathan (University of Toronto) | Eizenman, Moshe (University of Toronto) | Rakita, Uros (University of Toronto) | McIntyre, Roger (University Health Networks, University of Toronto) | Giacobbe, Peter (University Health Networks, University of Toronto)

AAAI Conferences, Feb-8-2018

Bipolar Disorder (BD) and Major Depressive Disorder (MDD) are two common and debilitating mood disorders. Misdiagnosing BD as MDD is relatively common and the introduction of markers to improve diagnostic accuracy early in the course of the illness has been identified as one of the top unmet needs in the field. In this paper, we present novel methods to differentiate between BD and MDD patients. The methods use deep learning techniques to quantify differences between visual scanning patterns of BD and MDD patients. In the methods, visual scanning patterns that are described by ordered sequences of fixations on emotional faces are encoded into a lower dimensional space and are fed into a long short-term memory recurrent neural network (RNN). Fixation sequences are encoded by three different methods: 1) using semantic regions of interest (RoIs) that are manually defined by experts, 2) using semi-automatically defined grids of RoIs, or 3) using a convolutional neural network (CNN) to automatically extract visual features from saliency maps. Using data from 47 patients with MDD and 26 patients with BD, we showed that using semantic RoIs, the RNN improved the performance of a baseline classifier from an AUC of 0.603 to an AUC of 0.878. Similarly, using grid RoIs, the RNN improved the performance of a baseline classifier from an AUC of 0.450 to an AUC of 0.828. The classifier that automatically extracted visual features from saliency maps (a long recurrent convolutional network that is fully data-driven) had an AUC of 0.879. The results of the study suggest that by using RNNs to learn differences between fixation sequences, the diagnosis of individual patients with BD or MDD can be disambiguated with high accuracy. Moreover, by using saliency maps and a CNN to encode the fixation sequences, the method can be fully automated and achieve high accuracy without relying on user expertise and/or manual labelling.
When compared with other markers, the performance of the class of classifiers that was introduced in this paper is better than that of detectors that use differences in neural structures, neural activity or cortical hemodynamics to differentiate between BD and MDD patients. The novel use of RNNs to quantify differences between fixation sequences of patients with mood disorders can be easily generalized to studies of other neuropsychological disorders and to other fields such as psychology and advertising.
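The second encoding option in the abstract, grids of RoIs, can be pictured with a minimal sketch that maps each fixation to the index of the grid cell it falls in. The abstract describes the grids as semi-automatically defined; the uniform 4x4 grid, the image size, and the function name `encode_fixations_to_grid` below are assumptions chosen only to keep the example small.

```python
from typing import List, Tuple


def encode_fixations_to_grid(
    fixations: List[Tuple[float, float]],
    image_width: float,
    image_height: float,
    n_cols: int = 4,
    n_rows: int = 4,
) -> List[int]:
    """Map each (x, y) fixation to the index of its grid cell.

    Cells are numbered row by row, starting from 0 at the top-left.
    Assumption: a uniform grid, which is simpler than the
    semi-automatically defined grids mentioned in the abstract.
    """
    cell_w = image_width / n_cols
    cell_h = image_height / n_rows
    sequence = []
    for x, y in fixations:
        # Clamp so points on the right or bottom edge land in the last cell.
        col = min(max(int(x // cell_w), 0), n_cols - 1)
        row = min(max(int(y // cell_h), 0), n_rows - 1)
        sequence.append(row * n_cols + col)
    return sequence


# Three made-up fixations on a hypothetical 800x600 face image.
fixations = [(10.0, 10.0), (350.0, 120.0), (799.0, 599.0)]
roi_sequence = encode_fixations_to_grid(fixations, 800, 600)
print(roi_sequence)
```

The resulting ordered list of cell indices is the kind of discrete sequence that could then be embedded and passed to a recurrent network; that downstream step is not shown here.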

artificial intelligence, machine learning, sequence, (17 more...)

AAAI Conferences

Thirty-Second AAAI Conference on Artificial Intelligence

Country: North America > Canada > Ontario > Toronto (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.55)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
