
Image Understanding

Introducing our image classification pilot


Enrichment plays a fundamental role in Europeana's activities. In our context, enrichment means generating additional metadata from the data provided by our partners, adding extra value to what we receive. We use the combination of original and enriched metadata to index our records, which lets us build functionality that allows people to search and browse our collections and receive recommendations. Achieving automatic enrichment with machine learning algorithms is one of the objectives of the Europeana Strategy 2020-2025, and has triggered projects such as Saint George on a Bike. Europeana's R&D team is exploring how computer vision techniques (systems that can make sense of visual data) can improve the enrichment Europeana carries out.

TinyML ESP32-CAM: Edge Image classification with Edge Impulse


This tutorial covers how to use TinyML with the ESP32-CAM. It describes how to classify images with a machine learning model running directly on the device. To do this, you need to create a model with TensorFlow Lite and then shrink it so it fits within the device's constraints. There are several ways to do this; this tutorial uses Edge Impulse, which simplifies all the steps. We will explore the power of TinyML with the ESP32-CAM to recognize and classify images.
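The "shrinking" step usually means post-training quantization: storing weights as 8-bit integers plus a scale factor instead of 32-bit floats, cutting model size roughly 4x. A minimal sketch of the idea in plain Python (illustrative only, not the Edge Impulse or TensorFlow Lite API):

```python
# Post-training int8 quantization, sketched: map float weights to signed
# integers with a per-tensor scale, and recover approximations at inference.

def quantize(weights, num_bits=8):
    """Map float weights to signed integers with a scale factor."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.99, -0.07]
q, scale = quantize(weights)
restored = dequantize(q, scale)
# Each restored weight is within one quantization step of the original.
assert all(abs(a - b) <= scale for a, b in zip(weights, restored))
```

The accuracy loss from this rounding is usually small, which is why quantized models are the default for microcontroller targets like the ESP32-CAM.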

Transfer Learning and Image Classification with ML.NET


Historically, image classification is the problem that popularized deep neural networks, especially the visual kind: convolutional neural networks (CNNs). We will not go into detail about what CNNs are and how they work. However, we can say that CNNs gained popularity after breaking a record in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) back in 2012. This competition evaluates algorithms for object detection and image classification at large scale. The dataset it provides contains 1,000 image categories and over 1.2 million images.
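At the heart of a CNN is the 2D convolution: a small filter slides over the image, and each output value is the dot product between the filter and the patch beneath it. A minimal plain-Python sketch (valid padding, stride 1; illustrative, not ML.NET's implementation):

```python
# Minimal 2D convolution: the core operation a CNN layer computes.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product of the kernel with the patch at (i, j).
            acc = sum(image[i + di][j + dj] * kernel[di][dj]
                      for di in range(kh) for dj in range(kw))
            row.append(acc)
        out.append(row)
    return out

# A vertical-edge filter responds where pixel values change left to right.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge = [[-1, 1],
        [-1, 1]]
print(conv2d(image, edge))  # → [[0, 2, 0], [0, 2, 0]]
```

Stacking many such filters, with nonlinearities and pooling between them, is what lets CNNs learn edge, texture, and eventually object detectors from data.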

Paying Attention to Activation Maps in Camera Pose Regression Artificial Intelligence

Camera pose regression methods apply a single forward pass to the query image to estimate the camera pose. As such, they offer a fast and lightweight alternative to traditional localization schemes based on image retrieval. Pose regression approaches simultaneously learn two regression tasks, aiming to jointly estimate the camera position and orientation using a single embedding vector computed by a convolutional backbone. We propose an attention-based approach for pose regression, where the convolutional activation maps are used as sequential inputs. Transformers are applied to encode the sequential activation maps as latent vectors, used for camera pose regression. This allows us to pay attention to spatially-varying deep features. Using two Transformer heads, we separately focus on the features for camera position and orientation, based on how informative they are per task. Our proposed approach is shown to compare favorably to contemporary pose regression schemes and achieves state-of-the-art accuracy across multiple outdoor and indoor benchmarks. In particular, to the best of our knowledge, our approach is the only method to attain sub-meter average accuracy across outdoor scenes. We make our code publicly available from here.
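The mechanism that lets a Transformer weight spatial positions of an activation map is scaled dot-product attention. A plain-Python, single-head sketch of that building block (shapes and values are illustrative, not the paper's code):

```python
# Scaled dot-product attention: a query attends over a sequence of key/value
# vectors (here, flattened spatial positions of an activation map) and
# returns a weighted summary vector.
import math

def attention(query, keys, values):
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]        # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights

# Three spatial positions from a flattened activation map (2 features each).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [10.0, 0.0]                  # strongly matches the first feature axis
out, w = attention(query, feats, feats)
assert abs(sum(w) - 1.0) < 1e-9      # softmax weights sum to 1
assert w[0] > w[1]                   # least weight on the orthogonal position
```

Using two separate attention heads, as the paper does for position and orientation, amounts to running this computation twice with independently learned queries.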

Computational Emotion Analysis From Images: Recent Advances and Future Directions Artificial Intelligence

Understanding the information contained in the increasing repository of data is of vital importance to behavior sciences [34], which aim to predict human decision making and enable wide applications, such as mental health evaluation [14], business recommendation [33], opinion mining [54], and entertainment assistance [78]. Analyzing media data on an affective (emotional) level belongs to affective computing, which is defined as "the computing that relates to, arises from, or influences emotions" [38]. The importance of emotions has been emphasized for decades since Minsky introduced the relationship between intelligence and emotion [31]. One famous claim is "The question is not whether intelligent machines can have any emotions, but whether machines can be intelligent without emotions." Based on the types of media data, the research on affective computing can be classified into different categories, such as text [13, 72], image [75], speech [45], music [64], facial expression [24], video [56, 79], physiological signals [2], and multi-modal data [52, 41, 80]. The adage "a picture is worth a thousand words" indicates that images can convey rich semantics.

Machine Vision based Sample-Tube Localization for Mars Sample Return Artificial Intelligence

A potential Mars Sample Return (MSR) architecture is being jointly studied by NASA and ESA. As currently envisioned, the MSR campaign consists of a series of three missions: sample cache, fetch, and return to Earth. In this paper, we focus on the fetch part of the MSR, and more specifically the problem of autonomously detecting and localizing sample tubes deposited on the Martian surface. Towards this end, we study two machine-vision based approaches: first, a geometry-driven approach based on template matching that uses hard-coded filters and a 3D shape model of the tube; and second, a data-driven approach based on convolutional neural networks (CNNs) and learned features. Furthermore, we present a large benchmark dataset of sample-tube images, collected in representative outdoor environments and annotated with ground truth segmentation masks and locations. The dataset was acquired systematically across different terrain, illumination conditions and dust-coverage; and benchmarking was performed to study the feasibility of each approach, their relative strengths and weaknesses, and robustness in the presence of adverse environmental conditions.
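The geometry-driven branch rests on template matching: slide a small template over the image, score every position, and keep the best match. A toy sketch using sum of squared differences as the score (the paper's matcher uses hard-coded filters and a 3D tube model, so this shows only the underlying principle):

```python
# Exhaustive template matching with a sum-of-squared-differences (SSD) score;
# the position with the lowest SSD is the best match.

def match_template(image, template):
    th, tw = len(template), len(template[0])
    best, best_pos = float("inf"), None
    for i in range(len(image) - th + 1):
        for j in range(len(image[0]) - tw + 1):
            ssd = sum((image[i + di][j + dj] - template[di][dj]) ** 2
                      for di in range(th) for dj in range(tw))
            if ssd < best:
                best, best_pos = ssd, (i, j)
    return best_pos, best

image = [[0, 0, 0, 0],
         [0, 9, 8, 0],
         [0, 7, 9, 0],
         [0, 0, 0, 0]]
template = [[9, 8],
            [7, 9]]
print(match_template(image, template))  # → ((1, 1), 0): exact match, zero cost
```

Real matchers normalize the score against local brightness and search over scale and rotation, which is exactly where dust coverage and harsh Martian illumination make the problem hard.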

Relationship-based Neural Baby Talk Artificial Intelligence

Understanding interactions between objects in an image is an important element for generating captions. In this paper, we propose a relationship-based neural baby talk (R-NBT) model to comprehensively investigate several types of pairwise object interactions by encoding each image via three different relationship-based graph attention networks (GATs). We study three main relationships: \textit{spatial relationships} to explore geometric interactions, \textit{semantic relationships} to extract semantic interactions, and \textit{implicit relationships} to capture hidden information that could not be modelled explicitly as above. We construct three relationship graphs with the objects in an image as nodes, and the mutual relationships of pairwise objects as edges. By exploring features of neighbouring regions individually via GATs, we integrate different types of relationships into visual features of each node. Experiments on the COCO dataset show that our proposed R-NBT model outperforms state-of-the-art models trained on COCO in three image caption generation tasks.
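The core GAT operation the abstract describes is simple to sketch: each object node updates its feature as a softmax-weighted sum over itself and its neighbours. This toy version scores node pairs with a dot product rather than a learned attention vector, so it is illustrative only, not the R-NBT model:

```python
# One simplified graph-attention layer over an object-relationship graph:
# nodes aggregate neighbour features, weighted by pairwise similarity.
import math

def gat_layer(features, edges):
    """features: node -> vector; edges: node -> list of neighbour nodes."""
    updated = {}
    for n, nbrs in edges.items():
        cand = [n] + nbrs                       # include a self-loop
        scores = [sum(a * b for a, b in zip(features[n], features[m]))
                  for m in cand]
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]           # softmax over neighbourhood
        dim = len(features[n])
        updated[n] = [sum(wi * features[m][i] for wi, m in zip(w, cand))
                      for i in range(dim)]
    return updated

# Toy scene graph: a man rides a horse in a field (features are made up).
feats = {"man": [1.0, 0.0], "horse": [0.0, 1.0], "field": [0.5, 0.5]}
edges = {"man": ["horse"], "horse": ["man", "field"], "field": ["horse"]}
print(gat_layer(feats, edges)["man"])  # man's feature now mixes in the horse's
```

R-NBT runs three such networks in parallel, one per relationship graph (spatial, semantic, implicit), then fuses the resulting node features for captioning.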

Artificial Intelligence Index


"The technologies necessary for large-scale surveillance are rapidly maturing, with techniques for image classification, face recognition, video analysis, and voice identification all seeing significant progress in 2020." The figure shows the progress in the top-1 accuracy of the ImageNet challenge, a benchmark for image classification.
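Top-1 accuracy, the metric tracked in that figure, simply counts how often the model's single highest-scoring class equals the true label. A minimal sketch:

```python
# Top-1 accuracy: fraction of images where the argmax class is correct.

def top1_accuracy(score_rows, labels):
    """score_rows: per-image class scores; labels: true class indices."""
    correct = sum(
        max(range(len(scores)), key=scores.__getitem__) == label
        for scores, label in zip(score_rows, labels)
    )
    return correct / len(labels)

scores = [[0.1, 0.7, 0.2],   # predicts class 1
          [0.6, 0.3, 0.1],   # predicts class 0
          [0.2, 0.2, 0.6]]   # predicts class 2
print(top1_accuracy(scores, [1, 0, 1]))  # → 0.666... (2 of 3 correct)
```

ImageNet results are also often reported as top-5 accuracy, which counts a prediction as correct if the true label appears among the five highest-scoring classes.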

Self-supervised Pretraining of Visual Features in the Wild Artificial Intelligence

Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods. These results have been achieved in a controlled environment, namely the highly curated ImageNet dataset. However, the premise of self-supervised learning is that it can learn from any random image and from any unbounded dataset. In this work, we explore whether self-supervision lives up to this expectation by training large models on random, uncurated images with no supervision. Our final SElf-supERvised (SEER) model, a RegNetY with 1.3B parameters trained on 1B random images with 512 GPUs, achieves 84.2% top-1 accuracy, surpassing the best self-supervised pretrained model by 1% and confirming that self-supervised learning works in a real-world setting. Interestingly, we also observe that self-supervised models are good few-shot learners, achieving 77.9% top-1 with access to only 10% of ImageNet. Code:
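Contrastive methods in this family (e.g. SimCLR, MoCo) train without labels by pushing two augmented views of the same image (a "positive pair") to be more similar than views of different images. A plain-Python sketch of an NT-Xent-style loss for one anchor; all vectors here are made up, and this is not SEER's training code:

```python
# Contrastive loss for one anchor: the negative log-probability that a
# softmax over cosine similarities picks the positive pair.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    mx = max(logits)
    denom = sum(math.exp(l - mx) for l in logits)
    return -(logits[0] - mx - math.log(denom))   # -log softmax(positive)

anchor    = [1.0, 0.0]
positive  = [0.9, 0.1]                # another view of the same image
negatives = [[0.0, 1.0], [-1.0, 0.0]] # views of different images
good = contrastive_loss(anchor, positive, negatives)
# Swapping the positive for an unrelated view should raise the loss.
assert good < contrastive_loss(anchor, [0.0, 1.0], [positive, [-1.0, 0.0]])
```

Minimizing this objective over huge batches of uncurated images is what lets models like SEER learn useful representations without a single label.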

NFL hopefuls are adding AI video analysis to their arsenal


More than 130 football players have been training under the watchful eye of the athletic performance development company EXOS in Arizona, all in hopes of landing a first-round NFL draft pick. As it turns out, though, the eyes they've been working in front of aren't exclusively human. Intel today said that EXOS's latest batch of NFL hopefuls have been training in front of video cameras that -- with the help of the company's 3D athlete tracking system -- should give players and staff a finer sense of their "body mechanics or trouble spots." "3DAT allows athletes to understand precisely what their body is doing while in motion, so they can precisely target where to make tweaks to get faster or better," said Ashton Eaton, Intel product development engineer and two-time Olympic gold medalist. The beauty of Intel's 3DAT system is that athletes don't need to strap on cumbersome sensors, or worry about precarious placement of gear during drills. Instead, run-of-the-mill video footage is shuttled off to servers packing Intel Xeon Scalable processors loaded with the company's "Deep Learning Boost" AI acceleration capabilities.