DARE: AI-based Diver Action Recognition System using Multi-Channel CNNs for AUV Supervision Artificial Intelligence

With the growth of sensing, control, and robotic technologies, autonomous underwater vehicles (AUVs) have become useful assistants to human divers for performing various underwater operations. In current practice, divers are required to carry expensive, bulky, waterproof keyboards or joystick-based controllers to supervise and control AUVs. Therefore, diver action-based supervision is becoming increasingly popular because it is convenient, easier to use, faster, and cost-effective. However, the various environmental, diver, and sensing uncertainties present underwater make it challenging to train a robust and reliable diver action recognition system. In this regard, this paper presents DARE, a diver action recognition system trained on the Cognitive Autonomous Diving Buddy (CADDY) dataset, a rich dataset containing images of different diver gestures and poses in several realistic underwater environments. DARE fuses stereo pairs of camera images using a multi-channel convolutional neural network, supported by a systematically trained tree-topological deep neural network classifier to enhance classification performance. DARE is fast, requiring only a few milliseconds to classify one stereo pair, making it suitable for real-time underwater deployment. DARE is comparatively evaluated against several existing classifier architectures, and the results show that it surpasses all of them for diver action recognition in terms of both overall and per-class accuracies and F1-scores.
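The fusion step described above can be illustrated with a minimal NumPy sketch: a stereo pair is stacked along the channel axis and passed through a single multi-channel convolution. The image sizes, filter values, and the naive convolution itself are illustrative only, not DARE's actual architecture.

```python
import numpy as np

# Hypothetical stereo pair: two 3-channel RGB images of shape (H, W, C)
rng = np.random.default_rng(0)
left = rng.random((32, 32, 3))
right = rng.random((32, 32, 3))

# Early fusion: stack into one 6-channel input, as a multi-channel CNN might
fused = np.concatenate([left, right], axis=-1)   # shape (32, 32, 6)

def conv2d_valid(x, kernel):
    """Naive multi-channel 2D convolution ('valid' padding, one output map)."""
    kh, kw, _ = kernel.shape
    h, w, _ = x.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw, :] * kernel)
    return out

kernel = rng.random((3, 3, 6))   # one 3x3 filter spanning all 6 fused channels
feature_map = conv2d_valid(fused, kernel)
print(fused.shape, feature_map.shape)   # (32, 32, 6) (30, 30)
```

A real implementation would stack many such filters and learn them by backpropagation; the point here is only that channel-wise stacking lets one filter see both views of the stereo pair at once.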

Learning to Anonymize Faces for Privacy Preserving Action Detection Artificial Intelligence

There is increasing concern about computer vision devices invading users' privacy by recording unwanted videos. On the one hand, we want camera systems to recognize important events and assist human daily life by understanding their videos; on the other hand, we want to ensure that they do not intrude on people's privacy. In this paper, we propose a new principled approach for learning a video face anonymizer. We use an adversarial training setting in which two competing systems fight: (1) a video anonymizer that modifies the original video to remove privacy-sensitive information while still trying to maximize spatial action detection performance, and (2) a discriminator that tries to extract privacy-sensitive information from the anonymized videos. The end result is a video anonymizer that performs pixel-level modifications to anonymize each person's face with minimal effect on action detection performance. We experimentally confirm the benefits of our approach compared to conventional hand-crafted anonymization methods, including masking, blurring, and noise addition. Code, demo, and further results can be found on our project page.
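The two-player objective described above can be sketched with plain scalars. The function names, the weighting term `lam`, and the toy loss values are assumptions for illustration; they are not the paper's actual losses.

```python
def anonymizer_objective(action_loss, privacy_loss, lam=1.0):
    """The anonymizer minimizes action-detection loss while trying to
    *maximize* the discriminator's privacy loss (i.e., hide identity),
    hence the minus sign on the adversarial term."""
    return action_loss - lam * privacy_loss

def discriminator_objective(privacy_loss):
    """The discriminator simply minimizes its own loss on recovering
    privacy-sensitive information from anonymized frames."""
    return privacy_loss

# If the discriminator recovers identity easily (low privacy_loss), the
# anonymizer's objective worsens; if it struggles, the objective improves.
strong_disc = anonymizer_objective(action_loss=0.4, privacy_loss=0.2)
weak_disc = anonymizer_objective(action_loss=0.4, privacy_loss=1.5)
print(strong_disc > weak_disc)
```

In practice the two objectives are optimized in alternation (as in GAN training), with each player's parameters updated while the other's are held fixed.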

Deep-Aligned Convolutional Neural Network for Skeleton-based Action Recognition and Segmentation Machine Learning

Convolutional neural networks (CNNs) are deep learning frameworks well known for their notable performance in classification tasks. Hence, many skeleton-based action recognition and segmentation (SBARS) algorithms benefit from them in their designs. However, a shortcoming of such applications is the general lack of spatial relationships between the input features in such data types. Besides, nonuniform temporal scaling is a common issue in skeleton-based data streams, which leads to different input sizes even within one specific action category. In this work, we propose a novel deep-aligned convolutional neural network (DACNN) to tackle the above challenges for the particular problem of SBARS. Our network is designed by introducing a new type of CNN filter that is trained based on its alignments to local subsequences in the inputs. These filters result in efficient predictions as well as learning interpretable patterns in the data. We empirically evaluate our framework on real-world benchmarks, showing that the proposed DACNN algorithm obtains performance competitive with the state of the art while benefiting from a less complicated yet more interpretable model.
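The core idea of an alignment-based filter can be sketched as sliding a short learned template over a skeleton sequence and scoring its best-matching local subsequence. The sequence shape, the Euclidean distance score, and the planted-pattern example are illustrative assumptions, not DACNN's trained filters.

```python
import numpy as np

def best_alignment(sequence, filt):
    """Slide a short filter over a skeleton sequence (frames x joint coords)
    and return the offset and distance of its best-aligned subsequence."""
    L = filt.shape[0]
    dists = np.array([np.linalg.norm(sequence[t:t + L] - filt)
                      for t in range(sequence.shape[0] - L + 1)])
    t_star = int(np.argmin(dists))
    return t_star, float(dists[t_star])

# Toy sequence: 10 frames x 4 joint coordinates, with a planted pattern
rng = np.random.default_rng(0)
seq = rng.normal(size=(10, 4))
filt = seq[3:6].copy()            # a filter exactly matching frames 3..5

offset, dist = best_alignment(seq, filt)
print(offset, dist)   # 3 0.0
```

Because the filter aligns to wherever its pattern occurs, the same filter responds consistently to actions performed at different speeds or start times, which is what makes it robust to nonuniform temporal scaling.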

What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention Artificial Intelligence

Egocentric action anticipation is the task of understanding which objects the camera wearer will interact with in the near future and which actions they will perform. We tackle the problem by proposing an architecture able to anticipate actions at multiple temporal scales using two LSTMs that 1) summarize the past and 2) formulate predictions about the future. The input video is processed through three complementary modalities: appearance (RGB), motion (optical flow), and objects (object-based features). Modality-specific predictions are fused using a novel Modality ATTention (MATT) mechanism that learns to weigh the modalities in an adaptive fashion. Extensive evaluations on two large-scale benchmark datasets show that our method outperforms prior art by up to +7% on the challenging EPIC-KITCHENS dataset, which includes more than 2500 actions, and generalizes to EGTEA Gaze+. Our approach is also shown to generalize to the tasks of early action recognition and action recognition. At the time of submission, our method ranked first on the leaderboard of the EPIC-KITCHENS egocentric action anticipation challenge.
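The attention-based fusion described above amounts to a softmax-weighted sum of per-modality prediction scores. The sketch below is a simplified stand-in for MATT: the score matrix, the attention logits, and the number of actions are all made-up illustrative values (in the real model, the attention logits are predicted per sample by a learned network).

```python
import numpy as np

def matt_fuse(modality_scores, attention_logits):
    """Fuse per-modality scores with softmax attention weights."""
    w = np.exp(attention_logits - np.max(attention_logits))  # stable softmax
    w /= w.sum()
    return w @ modality_scores    # weighted sum over modalities

# Three modalities (RGB, flow, objects) scoring five candidate actions
scores = np.array([[0.1, 0.7, 0.2, 0.0, 0.0],
                   [0.3, 0.4, 0.1, 0.1, 0.1],
                   [0.0, 0.9, 0.1, 0.0, 0.0]])
attention = np.array([0.2, 0.1, 1.5])

fused = matt_fuse(scores, attention)
print(fused, int(np.argmax(fused)))
```

Because the weights are recomputed for every input, the model can lean on optical flow when motion is informative and on object features when it is not, rather than committing to one fixed mixture.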

Recognizing Actions in 3D Using Action-Snippets and Activated Simplices

AAAI Conferences

Pose-based action recognition in 3D is the task of recognizing an action (e.g., walking or running) from a sequence of 3D skeletal poses. This is challenging because of variations due to different ways of performing the same action and inaccuracies in the estimation of the skeletal poses. The training data are usually small, so complex classifiers risk over-fitting. We address this task using action-snippets, which are short sequences of consecutive skeletal poses capturing the temporal relationships between poses in an action. We propose a novel representation for action-snippets, called activated simplices. Each activity is represented by a manifold, which is approximated by an arrangement of activated simplices. A sequence (of action-snippets) is classified by selecting the closest manifold and outputting the corresponding activity. This simple classifier helps avoid over-fitting the data, yet significantly outperforms state-of-the-art methods on standard benchmarks.
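The closest-manifold classification rule can be sketched as follows. Note a deliberate simplification: the paper measures distance to a class's arrangement of activated simplices (a projection), whereas this stand-in crudely approximates each class manifold by a set of prototype snippets and uses nearest-prototype distance. The labels, prototypes, and 2-D "snippets" are toy values.

```python
import numpy as np

def classify_snippet(snippet, class_prototypes):
    """Assign the snippet to the class whose (approximated) manifold is
    closest. Each class manifold is stood in for by prototype snippets;
    the real method instead projects onto activated simplices."""
    dists = {label: min(np.linalg.norm(snippet - p) for p in protos)
             for label, protos in class_prototypes.items()}
    return min(dists, key=dists.get)

protos = {
    "walk": [np.array([0.0, 1.0]), np.array([0.2, 0.9])],
    "run":  [np.array([1.0, 0.0]), np.array([0.9, 0.2])],
}
print(classify_snippet(np.array([0.1, 0.8]), protos))   # walk
```

The appeal of such a rule is exactly what the abstract claims: with no high-capacity decision boundary to fit, there is little to over-fit when training data are scarce.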