Country
Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
Chen, Shizhe, Zhao, Yida, Jin, Qin, Wu, Qi
Cross-modal retrieval between videos and texts has attracted growing attentions due to the rapid emergence of videos on the web. The current dominant approach for this problem is to learn a joint embedding space to measure cross-modal similarities. However, simple joint embeddings are insufficient to represent complicated visual and textual details, such as scenes, objects, actions and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels. To be specific, the model disentangles texts into hierarchical semantic graph including three levels of events, actions, entities and relationships across levels. Attention-based graph reasoning is utilized to generate hierarchical textual embeddings, which can guide the learning of diverse and hierarchical video representations. The HGR model aggregates matchings from different video-text levels to capture both global and local details. Experimental results on three video-text datasets demonstrate the advantages of our model. Such hierarchical decomposition also enables better generalization across datasets and improves the ability to distinguish fine-grained semantic differences.
Hazard Detection in Supermarkets using Deep Learning on the Edge
Murshed, M. G. Sarwar, Verenich, Edward, Carroll, James J., Khan, Nazar, Hussain, Faraz
Supermarkets need to ensure clean and safe environments for both shoppers and employees. Slips, trips, and falls can result in injuries that have a physical as well as financial cost. Timely detection of hazardous conditions such as spilled liquids or fallen items on supermarket floors can reduce the chances of serious injuries. This paper presents EdgeLite, a novel, lightweight deep learning model for easy deployment and inference on resource-constrained devices. We describe the use of EdgeLite on two edge devices for detecting supermarket floor hazards. On a hazard detection dataset that we developed, EdgeLite, when deployed on edge devices, outperformed six state-of-the-art object detection models in terms of accuracy while having comparable memory usage and inference time.
"Who is Driving around Me?" Unique Vehicle Instance Classification using Deep Neural Features
Oosterhuis, Tim, Schomaker, Lambert
Being aware of other traffic is a prerequisite for self-driving cars to operate in the real world. In this paper, we show how the intrinsic feature maps of an object detection CNN can be used to uniquely identify vehicles from a dash-cam feed. Feature maps of a pretrained `YOLO' network are used to create 700 deep integrated feature signatures (DIFS) from 20 different images of 35 vehicles from a high resolution dataset and 340 signatures from 20 different images of 17 vehicles of a lower resolution tracking benchmark dataset. The YOLO network was trained to classify general object categories, e.g. classify a detected object as a `car' or `truck'. 5-Fold nearest neighbor (1NN) classification was used on DIFS created from feature maps in the middle layers of the network to correctly identify unique vehicles at a rate of 96.7\% for the high resolution data and with a rate of 86.8\% for the lower resolution data. We conclude that a deep neural detection network trained to distinguish between different classes can be successfully used to identify different instances belonging to the same class, through the creation of deep integrated feature signatures (DIFS).
The Utility of Feature Reuse: Transfer Learning in Data-Starved Regimes
Verenich, Edward, Velasquez, Alvaro, Murshed, M. G. Sarwar, Hussain, Faraz
The use of transfer learning with deep neural networks has increasingly become widespread for deploying well-tested computer vision systems to newer domains, especially those with limited datasets. We describe a transfer learning use case for a domain with a data-starved regime, having fewer than 100 labeled target samples. We evaluate the effectiveness of convolutional feature extraction and fine-tuning of overparameterized models with respect to the size of target training data, as well as their generalization performance on data with covariate shift, or out-of-distribution (OOD) data. Our experiments show that both overparameterization and feature reuse contribute to successful application of transfer learning in training image classifiers in data-starved regimes.
Robust Robotic Pouring using Audition and Haptics
Liang, Hongzhuo, Zhou, Chuangchuang, Li, Shuang, Ma, Xiaojian, Hendrich, Norman, Gerkmann, Timo, Sun, Fuchun, Zhang, Jianwei
Robust and accurate estimation of liquid height lies as an essential part of pouring tasks for service robots. However, vision-based methods often fail in occluded conditions while audio-based methods cannot work well in a noisy environment. We instead propose a multimodal pouring network (MP-Net) that is able to robustly predict liquid height by conditioning on both audition and haptics input. MP-Net is trained on a self-collected multimodal pouring dataset. This dataset contains 300 robot pouring recordings with audio and force/torque measurements for three types of target containers. We also augment the audio data by inserting robot noise. We evaluated MP-Net on our collected dataset and a wide variety of robot experiments. Both network training results and robot experiments demonstrate that MP-Net is robust against noise and changes to the task and environment. Moreover, we further combine the predicted height and force data to estimate the shape of the target container.
Towards a predictive spatio-temporal representation of brain data
Azevedo, Tiago, Passamonti, Luca, Liรฒ, Pietro, Toschi, Nicola
The characterisation of the brain as a "connectome", in which the connections are represented by correlational values across timeseries and as summary measures derived from graph theory analyses, has been very popular in the last years. However, although this representation has advanced our understanding of the brain function, it may represent an oversimplified model. This is because the typical fMRI datasets are constituted by complex and highly heterogeneous timeseries that vary across space (i.e., location of brain regions). We compare various modelling techniques from deep learning and geometric deep learning to pave the way for future research in effectively leveraging the rich spatial and temporal domains of typical fMRI datasets, as well as of other similar datasets. As a proof-of-concept, we compare our approaches in the homogeneous and publicly available Human Connectome Project (HCP) dataset on a supervised binary classification task. We hope that our methodological advances relative to previous "connectomic" measures can ultimately be clinically and computationally relevant by leading to a more nuanced understanding of the brain dynamics in health and disease. Such understanding of the brain can fundamentally reduce the constant specialised clinical expertise in order to accurately understand brain variability.
FlexServe: Deployment of PyTorch Models as Flexible REST Endpoints
Verenich, Edward, Velasquez, Alvaro, Murshed, M. G. Sarwar, Hussain, Faraz
The integration of artificial intelligence capabilities into modern software systems is increasingly being simplified through the use of cloud-based machine learning services and representational state transfer architecture design. However, insufficient information regarding underlying model provenance and the lack of control over model evolution serve as an impediment to the more widespread adoption of these services in many operational environments which have strict security requirements. Furthermore, tools such as TensorFlow Serving allow models to be deployed as RESTful endpoints, but require error-prone transformations for PyTorch models as these dynamic computational graphs. This is in contrast to the static computational graphs of TensorFlow. To enable rapid deployments of PyTorch models without intermediate transformations we have developed FlexServe, a simple library to deploy multi-model ensembles with flexible batching.
Voice Separation with an Unknown Number of Multiple Speakers
Nachmani, Eliya, Adi, Yossi, Wolf, Lior
We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously. The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while maintaining the speaker in each output channel fixed. A different model is trained for every number of possible speakers, and a the model with the largest number of speakers is employed to select the actual number of speakers in a given sample. Our method greatly outperforms the current state of the art, which, as we show, is not competitive for more than two speakers.
Why is the Mahalanobis Distance Effective for Anomaly Detection?
The Mahalanobis distance-based confidence score, a recently proposed anomaly detection method for pre-trained neural classifiers, achieves state-of-the-art performance on both out-of-distribution and adversarial example detection. This work analyzes why this method exhibits such strong performance while imposing an implausible assumption; namely, that class conditional distributions of intermediate features have tied covariance. We reveal that the reason for its effectiveness has been misunderstood. Although this method scores the prediction confidence for the original classification task, our analysis suggests that information critical for classification task does not contribute to state-of-the-art performance on anomaly detection. To support this hypothesis, we demonstrate that a simpler confidence score that does not use class information is as effective as the original method in most cases. Moreover, our experiments show that the confidence scores can exhibit different behavior on other frameworks such as metric learning models, and their detection performance is sensitive to model architecture choice. These findings provide insight into the behavior of neural classifiers when provided with anomalous inputs.
Stable behaviour of infinitely wide deep neural networks
Favaro, Stefano, Fortini, Sandra, Peluchetti, Stefano
We consider fully connected feed-forward deep neural networks (NNs) where weights and biases are independent and identically distributed as symmetric centered stable distributions. Then, we show that the infinite wide limit of the NN, under suitable scaling on the weights, is a stochastic process whose finite-dimensional distributions are multivariate stable distributions. The limiting process is referred to as the stable process, and it generalizes the class of Gaussian processes recently obtained as infinite wide limits of NNs (Matthews at al., 2018b). Parameters of the stable process can be computed via an explicit recursion over the layers of the network. Our result contributes to the theory of fully connected feed-forward deep NNs, and it paves the way to expand recent lines of research that rely on Gaussian infinite wide limits.