Collaborating Authors

 Asteriadis, Stylianos


Unsupervised Interpretable Basis Extraction for Concept-Based Visual Explanations

arXiv.org Artificial Intelligence

An important line of research attempts to explain CNN image classifier predictions and intermediate-layer representations in terms of human-understandable concepts. In this work, we expand on previous works in the literature that use annotated concept datasets to extract interpretable feature-space directions, and propose an unsupervised post-hoc method to extract a disentangling interpretable basis by searching for the rotation of the feature space that explains sparse, one-hot thresholded transformed representations of pixel activations. CNN classifiers can be used in robotics, visual understanding, automatic risk assessment and more; however, to a human expert, CNNs are often black boxes and the reasoning behind their predictions can be unclear. Beyond this early result, more recent rigorous experimentation showed that the linear separability of features corresponding to different semantic concepts increases towards the top layer [6]. The latter has been attributed to the top layer's linearity and to the fact that intermediate layers are enforced to …

Figure 1: Left: In a standard convolution layer with D filters, all the filters work together to transform each input patch into a feature vector of spatial dimensionality 1×1; the dimensionality of the feature space therefore equals the number of filters in the layer, and each spatial element of the transformed representation constitutes a sample in this feature space. Middle: Finding an interpretable basis in this feature space in a supervised way means training a set of linear classifiers (concept detectors), one per interpretable concept, on feature vectors corresponding to image patches containing the concept. In a successfully learned interpretable basis, a single pixel is classified positively by at most one classifier among a group of classifiers trained to detect mutually exclusive concepts. Right: Projecting onto the new basis yields a transformed sparse representation.
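The abstract states the idea but not the implementation; the sketch below is a minimal illustration of the core optimization under stated assumptions. An orthogonal rotation Q is parametrized via the matrix exponential of a skew-symmetric matrix and trained so that thresholded, rotated pixel features become close to one-hot. The threshold tau, the soft-threshold surrogate, and the sparsity penalty are illustrative choices, not the paper's exact formulation.

```python
# Minimal sketch: learn a rotation Q of the feature space so that thresholded,
# rotated pixel features become sparse (close to one-hot). The loss form and
# hyperparameters (tau, temperature 0.1) are assumptions for illustration.
import torch

D = 64                                  # feature dimensionality = number of filters
feats = torch.randn(4096, D)            # pixel feature vectors (N x D), stand-in data

# Parametrize an orthogonal matrix as exp(A - A^T): the exponential of a
# skew-symmetric matrix is always orthogonal, so Q stays a pure rotation.
A = torch.zeros(D, D, requires_grad=True)
opt = torch.optim.Adam([A], lr=1e-3)
tau = 0.5                               # activation threshold (assumed)

for step in range(500):
    Q = torch.matrix_exp(A - A.T)       # orthogonal basis
    z = feats @ Q                       # rotated representations
    g = torch.sigmoid((z - tau) / 0.1)  # differentiable surrogate for thresholding
    # Encourage at most one active component per pixel (one-hot sparsity):
    # penalize activation mass beyond the single strongest component.
    loss = (g.sum(dim=1) - g.max(dim=1).values).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

Because Q is constrained to be orthogonal, the rotation preserves all information in the features while redistributing it so that thresholded activations disentangle into (at most) one concept per pixel, matching the behaviour described in the figure caption above.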


Explaining, Analyzing, and Probing Representations of Self-Supervised Learning Models for Sensor-based Human Activity Recognition

arXiv.org Artificial Intelligence

In recent years, self-supervised learning (SSL) frameworks have been extensively applied to sensor-based Human Activity Recognition (HAR) in order to learn deep representations without data annotations. While SSL frameworks reach performance almost comparable to supervised models, studies on interpreting the representations learnt by SSL models are limited. Nevertheless, modern explainability methods could help to unravel the differences between SSL and supervised representations: how they are learnt, what properties of the input data they preserve, and when SSL can be chosen over supervised training. In this paper, we aim to analyze the deep representations of two recent SSL frameworks, namely SimCLR and VICReg. Specifically, the emphasis is placed on (i) comparing the robustness of supervised and SSL models to corruptions in input data; (ii) explaining the predictions of deep learning models using saliency maps and highlighting which input channels are mostly used for predicting various activities; and (iii) exploring properties encoded in SSL and supervised representations using probing. Extensive experiments on two single-device datasets (MobiAct and UCI-HAR) have shown that self-supervised representations are significantly more robust to noise in unseen data than those of supervised models. In contrast, features learnt by the supervised approaches are more homogeneous across subjects and better encode the nature of activities.
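As a rough illustration of the probing and robustness analyses described above, the sketch below trains a linear probe on frozen encoder features and measures the accuracy drop under Gaussian corruption of unseen test windows. The encoder, data shapes, and noise model are stand-in assumptions; the paper's exact corruptions, probes, and dataset handling (MobiAct, UCI-HAR) may differ.

```python
# Linear probing + robustness check on frozen representations (illustrative).
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in frozen encoder and data; in practice this would be a trained
# SimCLR/VICReg backbone applied to real HAR windows of shape (N, channels, time).
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 128, 64)).eval()
x_tr, x_te = torch.randn(800, 3, 128), torch.randn(200, 3, 128)
y_tr, y_te = torch.randint(0, 6, (800,)), torch.randint(0, 6, (200,))

def probe_accuracy(x_test):
    """Train a linear probe on frozen features; return accuracy on x_test."""
    with torch.no_grad():
        f_tr, f_te = encoder(x_tr).numpy(), encoder(x_test).numpy()
    clf = LogisticRegression(max_iter=1000).fit(f_tr, y_tr.numpy())
    return accuracy_score(y_te.numpy(), clf.predict(f_te))

# Robustness: compare probe accuracy on clean vs. noise-corrupted test windows.
sigma = 0.1  # corruption severity (assumed)
print("clean accuracy:", probe_accuracy(x_te))
print("noisy accuracy:", probe_accuracy(x_te + sigma * torch.randn_like(x_te)))
```

A smaller gap between the clean and noisy accuracies indicates a more corruption-robust representation, which is the property the paper reports in favour of the SSL encoders.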


Contrastive Learning with Cross-Modal Knowledge Mining for Multimodal Human Activity Recognition

arXiv.org Artificial Intelligence

Human Activity Recognition is a field of research where input data can take many forms. Each of the possible input modalities describes human behaviour in a different way, and each has its own strengths and weaknesses. We explore the hypothesis that leveraging multiple modalities can lead to better recognition. Since manual annotation of input data is expensive and time-consuming, the emphasis is placed on self-supervised methods which can learn useful feature representations without any ground-truth labels. We extend a number of recent contrastive self-supervised approaches to the task of Human Activity Recognition, leveraging inertial and skeleton data. Furthermore, we propose a flexible, general-purpose framework for performing multimodal self-supervised learning, named Contrastive Multiview Coding with Cross-Modal Knowledge Mining (CMC-CMKM). This framework exploits modality-specific knowledge in order to mitigate the limitations of typical self-supervised frameworks. Extensive experiments on two widely used datasets demonstrate that the suggested framework significantly outperforms contrastive unimodal and multimodal baselines in different scenarios, including fully supervised fine-tuning, activity retrieval and semi-supervised learning. Furthermore, its performance is competitive even with supervised methods.
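For readers unfamiliar with contrastive multiview coding, the sketch below shows the symmetric cross-modal InfoNCE objective that CMC-style frameworks build on: paired inertial and skeleton embeddings of the same sample act as positives, and all other pairs in the batch act as negatives. The modality encoders and the cross-modal knowledge-mining step proposed in the paper are omitted; the function name and the temperature value are illustrative assumptions.

```python
# Symmetric cross-modal InfoNCE between two modality embeddings (illustrative).
import torch
import torch.nn.functional as F

def cross_modal_nce(z_inertial, z_skeleton, temperature=0.1):
    """Symmetric InfoNCE between two L2-normalized modality embeddings (B x D)."""
    z_a = F.normalize(z_inertial, dim=1)
    z_b = F.normalize(z_skeleton, dim=1)
    logits = z_a @ z_b.T / temperature      # (B, B) scaled cosine similarities
    targets = torch.arange(z_a.size(0))     # positives lie on the diagonal
    # Contrast in both directions: inertial->skeleton and skeleton->inertial.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Example: a batch of 32 paired embeddings from the two modality encoders.
loss = cross_modal_nce(torch.randn(32, 128), torch.randn(32, 128))
```

Minimizing this loss pulls the two modality views of the same activity window together in embedding space while pushing apart mismatched pairs, which is the mechanism that lets each modality compensate for the weaknesses of the other.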