AITopics | Saha, Suman

Collaborating Authors

Saha, Suman

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Language-Guided Instance-Aware Domain-Adaptive Panoptic Segmentation

Mansour, Elham Amin, Unal, Ozan, Saha, Suman, Bejar, Benjamin, Van Gool, Luc

arXiv.org Artificial IntelligenceApr-4-2024

The increasing relevance of panoptic segmentation is tied to the advancements in autonomous driving and AR/VR applications. However, the deployment of such models has been limited due to the expensive nature of dense data annotation, giving rise to unsupervised domain adaptation (UDA). A key challenge in panoptic UDA is reducing the domain gap between a labeled source and an unlabeled target domain while harmonizing the subtasks of semantic and instance segmentation to limit catastrophic interference. While considerable progress has been achieved, existing approaches mainly focus on the adaptation of semantic segmentation. In this work, we focus on incorporating instance-level adaptation via a novel instance-aware cross-domain mixing strategy IMix. IMix significantly enhances the panoptic quality by improving instance segmentation performance. Specifically, we propose inserting high-confidence predicted instances from the target domain onto source images, retaining the exhaustiveness of the resulting pseudo-labels while reducing the injected confirmation bias. Nevertheless, such an enhancement comes at the cost of degraded semantic performance, attributed to catastrophic forgetting. To mitigate this issue, we regularize our semantic branch by employing CLIP-based domain alignment (CDA), exploiting the domain-robustness of natural language prompts. Finally, we present an end-to-end model incorporating these two mechanisms called LIDAPS, achieving state-of-the-art results on all popular panoptic UDA benchmarks.

machine learning, natural language, segmentation, (16 more...)

arXiv.org Artificial Intelligence

2404.03799

Genre: Research Report (0.50)

Industry:

Transportation (0.34)
Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Add feedback

Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

Unal, Ozan, Sakaridis, Christos, Saha, Suman, Yu, Fisher, Van Gool, Luc

arXiv.org Artificial IntelligenceSep-8-2023

3D visual grounding is the task of localizing the object in a 3D scene which is referred by a description in natural language. With a wide range of applications ranging from autonomous indoor robotics to AR/VR, the task has recently risen in popularity. A common formulation to tackle 3D visual grounding is grounding-by-detection, where localization is done via bounding boxes. However, for real-life applications that require physical interactions, a bounding box insufficiently describes the geometry of an object. We therefore tackle the problem of dense 3D visual grounding, i.e. referral-based 3D instance segmentation. We propose a dense 3D grounding network ConcreteNet, featuring three novel stand-alone modules which aim to improve grounding performance for challenging repetitive instances, i.e. instances with distractors of the same semantic class. First, we introduce a bottom-up attentive fusion module that aims to disambiguate inter-instance relational cues, next we construct a contrastive training scheme to induce separation in the latent space, and finally we resolve view-dependent utterances via a learned global camera token. ConcreteNet ranks 1st on the challenging ScanRefer online benchmark by a considerable +9.43% accuracy at 50% IoU and has won the ICCV 3rd Workshop on Language for 3D Scenes "3D Object Localization" challenge.

machine learning, natural language, visual grounding, (17 more...)

arXiv.org Artificial Intelligence

2309.04561

Country: Europe > Switzerland > Zürich > Zürich (0.16)

Genre: Research Report (0.82)

Industry: Automobiles & Trucks (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.35)

Add feedback

ROAD: The ROad event Awareness Dataset for Autonomous Driving

Singh, Gurkirt, Akrigg, Stephen, Di Maio, Manuele, Fontana, Valentina, Alitappeh, Reza Javanmard, Saha, Suman, Jeddisaravi, Kossar, Yousefi, Farzad, Culley, Jacob, Nicholson, Tom, Omokeowa, Jordan, Khan, Salman, Grazioso, Stanislao, Bradley, Andrew, Di Gironimo, Giuseppe, Cuzzolin, Fabio

arXiv.org Artificial IntelligenceFeb-25-2021

Humans approach driving in a holistic fashion which entails, in particular, understanding road events and their evolution. Injecting these capabilities in an autonomous vehicle has thus the potential to take situational awareness and decision making closer to human-level performance. To this purpose, we introduce the ROad event Awareness Dataset (ROAD) for Autonomous Driving, to our knowledge the first of its kind. ROAD is designed to test an autonomous vehicle's ability to detect road events, defined as triplets composed by a moving agent, the action(s) it performs and the corresponding scene locations. ROAD comprises 22 videos, originally from the Oxford RobotCar Dataset, annotated with bounding boxes showing the location in the image plane of each road event. We also provide as baseline a new incremental algorithm for online road event awareness, based on inflating RetinaNet along time, which achieves a mean average precision of 16.8% and 6.1% for frame-level and video-level event detection, respectively, at 50% overlap. Though promising, these figures highlight the challenges faced by situation awareness in autonomous driving. Finally, ROAD allows scholars to investigate exciting tasks such as complex (road) activity detection, future road event anticipation and the modelling of sentient road agents in terms of mental states. Dataset can be obtained from https://github.com/gurkirt/road-dataset and baseline code from https://github.com/gurkirt/3D-RetinaNet.

detection, ground transportation, survey article, (20 more...)

arXiv.org Artificial Intelligence

2102.11585

Country:

Europe > Italy (0.14)
Asia > Middle East (0.14)

Genre: Research Report (1.00)

Industry:

Transportation > Ground > Road (1.00)
Information Technology > Robotics & Automation (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Predicting Action Tubes

Singh, Gurkirt, Saha, Suman, Cuzzolin, Fabio

arXiv.org Artificial IntelligenceAug-23-2018

In this work, we present a method to predict an entire `action tube' (a set of temporally linked bounding boxes) in a trimmed video just by observing a smaller subset of it. Predicting where an action is going to take place in the near future is essential to many computer vision based applications such as autonomous driving or surgical robotics. Importantly, it has to be done in real-time and in an online fashion. We propose a Tube Prediction network (TPnet) which jointly predicts the past, present and future bounding boxes along with their action classification scores. At test time TPnet is used in a (temporal) sliding window setting, and its predictions are put into a tube estimation framework to construct/predict the video long action tubes not only for the observed part of the video but also for the unobserved part. Additionally, the proposed action tube predictor helps in completing action tubes for unobserved segments of the video. We quantitatively demonstrate the latter ability, and the fact that TPnet improves state-of-the-art detection performance, on one of the standard action detection benchmarks - J-HMDB-21 dataset.

deep learning, neural network, prediction, (18 more...)

arXiv.org Artificial Intelligence

1808.07712

Country: Europe > Netherlands (0.14)

Genre: Research Report (0.40)

Industry:

Transportation > Ground > Road (0.48)
Information Technology > Robotics & Automation (0.48)
Automobiles & Trucks (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.66)

Add feedback