"Image understanding (IU) is the research area concerned with the design and experimentation of computer systems that integrate explicit models of a visual problem domain with one or more methods for extracting features from images and one or more methods for matching features with models using a control structure. Given a goal, or a reason for looking at a particular scene, these systems produce descriptions of both the images and the world scenes that the images represent."
– Image Understanding, by J.K. Tsotsos. In Encyclopedia of Artificial Intelligence. Stuart C. Shapiro, editor. 1987. New York: John Wiley & Sons.
We present a new problem: co-clustering images and visual features. The problem involves a set of non-object images in addition to the object images and features to be co-clustered. Co-clustering is performed so as to maximise the discrimination of object images from non-object images, thus emphasising discriminative features. This provides a way of obtaining perceptual joint clusters of object images and features. We tackle the problem by simultaneously boosting multiple strong classifiers, which compete for images according to their expertise.
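The competition idea can be illustrated with a simplified stand-in. The sketch below does not implement the boosting procedure described above. Instead it uses logistic-regression experts trained by gradient descent and reassigns each object image to whichever expert wins it, with every expert sharing the full set of non-object images. The function names and the farthest-point initialisation are illustrative assumptions.

```python
import numpy as np

def train_logreg(X, y, iters=500, lr=0.1):
    """Batch gradient-descent logistic regression; returns weights and bias."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y                      # gradient of the log-loss w.r.t. the logits
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def competing_experts(X_pos, X_neg, k, rounds=10):
    """Hard-assignment stand-in (not boosting): k linear experts compete for object images."""
    # Deterministic farthest-point initialisation of the object-image partition.
    centers = [X_pos[0]]
    for _ in range(1, k):
        dist = np.min([((X_pos - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X_pos[np.argmax(dist)])
    assign = np.argmin([((X_pos - c) ** 2).sum(axis=1) for c in centers], axis=0)
    for _ in range(rounds):
        experts = []
        for j in range(k):
            Xj = X_pos[assign == j]
            X = np.vstack([Xj, X_neg])          # every expert sees all non-object images
            y = np.r_[np.ones(len(Xj)), np.zeros(len(X_neg))]
            experts.append(train_logreg(X, y))
        scores = np.stack([X_pos @ w + b for w, b in experts], axis=1)
        new_assign = scores.argmax(axis=1)      # each object image goes to its most confident expert
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign, experts
```

Each expert only has to separate its own group of object images from the non-object images. As a result, two groups that no single linear boundary could cover end up in different clusters. Loosely speaking, each expert's weight vector indicates which features it relies on, which plays the role of the feature side of the co-clustering in this simplified picture.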
We propose a new supervised learning framework for visual object counting tasks, such as estimating the number of cells in a microscopic image or the number of humans in surveillance video frames. We focus on the practically attractive case in which the training images are annotated with dots (one dot per object). Our goal is to accurately estimate the count. However, we evade the hard task of learning to detect and localize individual object instances. Instead, we cast the problem as estimating an image density whose integral over any image region gives the count of objects within that region.
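Here is a minimal sketch of the density formulation. It assumes a common construction of the target density: a normalised 2D Gaussian placed at each annotated dot. The abstract does not specify the kernel, so `sigma` and the helper names are hypothetical. Each dot contributes exactly unit mass, so summing the density over a region recovers the number of dots inside it, except where a kernel straddles the region's border. The learning step, which predicts this density from image features, is not shown.

```python
import numpy as np

def density_from_dots(shape, dots, sigma=2.0):
    """Target density map: one normalised Gaussian per (row, col) dot annotation."""
    H, W = shape
    density = np.zeros(shape)
    r = int(np.ceil(3 * sigma))  # truncate each kernel at 3 sigma
    for y, x in dots:
        y0, y1 = max(0, y - r), min(H, y + r + 1)
        x0, x1 = max(0, x - r), min(W, x + r + 1)
        yy, xx = np.mgrid[y0:y1, x0:x1]
        g = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma ** 2))
        density[y0:y1, x0:x1] += g / g.sum()  # renormalise so each dot adds mass 1, even at borders
    return density

def count_in_region(density, rows, cols):
    """The count in a region is the integral (here, the sum) of the density over it."""
    return float(density[rows, cols].sum())
```

For example, `count_in_region(d, slice(0, 25), slice(None))` gives the estimated number of objects in the top half of a 50-row image.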
Probabilistic approaches to computer vision typically assume a centralized setting, with the algorithm granted access to all observed data points. However, many problems in wide-area surveillance can benefit from distributed modeling, either because of physical or computational constraints. Most distributed models to date use algebraic approaches (such as distributed SVD) and as a result cannot explicitly deal with missing data. In this work we present an approach to estimation and learning of generative probabilistic models in a distributed context where certain sensor data can be missing. In particular, we show how traditional centralized models, such as probabilistic PCA and missing-data PPCA, can be learned when the data is distributed across a network of sensors.
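For reference, the centralized, fully observed model that the distributed method generalizes has a closed-form maximum-likelihood solution: Tipping and Bishop's probabilistic PCA. Let the sample covariance have eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_D. Then σ² is the mean of the D − q discarded eigenvalues, and W = U_q (Λ_q − σ² I)^{1/2}, where U_q holds the top q eigenvectors. The sketch below covers only this baseline. It does not handle missing entries or distribute any computation across sensors, and the function name is hypothetical.

```python
import numpy as np

def ppca_ml(X, q):
    """Closed-form ML probabilistic PCA for fully observed data X of shape (N, D)."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)     # ML (1/N) sample covariance
    evals, evecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1] # reorder to descending
    sigma2 = evals[q:].mean()                  # noise variance: average discarded eigenvalue
    W = evecs[:, :q] * np.sqrt(evals[:q] - sigma2)  # U_q (Lambda_q - sigma^2 I)^{1/2}
    return mu, W, sigma2
```

The fitted model covariance W Wᵀ + σ² I reproduces the top q sample eigenvalues exactly and replaces the rest with their average.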
We describe a novel technique for feature combination in the bag-of-words model of image classification. Our approach builds discriminative compound words from primitive cues learned independently from training images. Our main observation is that modeling joint-cue distributions independently is more statistically robust for typical classification problems than attempting to empirically estimate the dependent, joint-cue distribution directly. We use information-theoretic vocabulary compression to find discriminative combinations of cues; the resulting vocabulary of portmanteau words is compact, has the cue-binding property, and supports individual weighting of cues in the final image representation. Papers published at the Neural Information Processing Systems Conference.
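One way to make information-theoretic vocabulary compression concrete is a greedy agglomerative merge. At each step it joins the two words whose merger loses the least mutual information I(W; C) between words and class labels. This criterion is an assumption, not a description of the paper's exact algorithm. The rows of `counts` could equally be compound (portmanteau) words formed from pairs of primitive cue words.

```python
import itertools
import numpy as np

def mutual_information(counts):
    """I(W; C) in nats from a (num_words, num_classes) table of co-occurrence counts."""
    p = counts / counts.sum()
    expected = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)  # p(w) p(c)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / expected[nz])).sum())

def compress_vocabulary(counts, target_size):
    """Greedily merge word pairs, keeping I(W; C) as high as possible at each step."""
    counts = np.asarray(counts, dtype=float).copy()
    groups = [[i] for i in range(counts.shape[0])]
    while len(groups) > target_size:
        best = None
        for a, b in itertools.combinations(range(len(groups)), 2):
            merged = np.delete(counts, b, axis=0)  # a < b, so row a keeps its index
            merged[a] = counts[a] + counts[b]
            mi = mutual_information(merged)
            if best is None or mi > best[0]:
                best = (mi, a, b)
        _, a, b = best
        counts[a] += counts[b]
        counts = np.delete(counts, b, axis=0)
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups, counts
```

Words with the same class profile merge first, because merging them costs no information about the label. The exhaustive pair search is cubic in vocabulary size per merge, which is acceptable only for small illustrative vocabularies.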
This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of objects, simultaneously obtaining state-of-the-art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by treating the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images.
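The abstract does not spell out how images are mapped into the word space. The sketch below therefore assumes the simplest version: a ridge-regression map from image features to word vectors, trained only on seen classes. Classification then picks the nearest class word vector by cosine similarity, over seen and unseen classes alike. The function names are hypothetical. The model described above is a deep network rather than a linear map, and this sketch has no separate mechanism for deciding whether an image comes from a seen or an unseen class.

```python
import numpy as np

def fit_projection(image_feats, word_vecs, lam=1e-6):
    """Ridge regression from image features (N, F) to word vectors (N, D); returns (F, D)."""
    F = image_feats.shape[1]
    gram = image_feats.T @ image_feats + lam * np.eye(F)
    return np.linalg.solve(gram, image_feats.T @ word_vecs)

def classify(image_feats, P, class_vecs):
    """Index of the class word vector with the highest cosine similarity."""
    z = image_feats @ P
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    c = class_vecs / np.linalg.norm(class_vecs, axis=1, keepdims=True)
    return (z @ c.T).argmax(axis=1)
```

In a synthetic check where image features are a linear function of the class word vectors, a map fitted on six seen classes also places an unseen class's images next to that class's word vector.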
As massively parallel computation has become broadly available with modern GPUs, deep architectures trained on very large datasets have risen in popularity. Discriminatively trained convolutional neural networks, in particular, were recently shown to yield state-of-the-art performance on challenging image classification benchmarks such as ImageNet. However, elements of these architectures are similar to standard hand-crafted representations used in computer vision. In this paper, we explore the extent of this analogy, proposing a version of the state-of-the-art Fisher vector image encoding that can be stacked in multiple layers. This architecture significantly improves on standard Fisher vectors and obtains results competitive with deep convolutional networks at a significantly lower computational cost.
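The stacked architecture builds on the Fisher vector encoding. A single encoding can be sketched as follows, assuming a diagonal-covariance GMM that has already been fitted to local descriptors. The encoding collects first- and second-order deviations of the descriptors from each Gaussian, weighted by their soft assignments. It then applies signed square-root and L2 normalisation. Stacking such layers, as proposed above, is not shown, and the function name is hypothetical.

```python
import numpy as np

def fisher_vector(x, weights, means, sigmas):
    """x: (N, D) descriptors; weights: (K,); means, sigmas: (K, D) of a diagonal GMM."""
    n = x.shape[0]
    diff = (x[:, None, :] - means[None, :, :]) / sigmas[None, :, :]           # (N, K, D)
    log_gauss = -0.5 * (diff ** 2 + np.log(2 * np.pi * sigmas[None] ** 2)).sum(axis=2)
    log_post = np.log(weights)[None, :] + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)                           # stabilise exp
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                                 # soft assignments (N, K)
    # First- and second-order statistics per Gaussian, each of shape (K, D).
    u = (gamma[:, :, None] * diff).sum(axis=0) / (n * np.sqrt(weights)[:, None])
    v = (gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (n * np.sqrt(2 * weights)[:, None])
    fv = np.concatenate([u.ravel(), v.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                                    # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)                                  # L2 normalisation
```

The output has 2KD dimensions regardless of how many descriptors an image contains. This fixed size is what lets the encoding feed a linear classifier, or another encoding layer.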
How to leverage the temporal dimension is a key question in video analysis. Recent works suggest an efficient approach to video feature learning, namely factorizing 3D convolutions into separate spatial and temporal convolutions. The temporal convolution, however, comes with an implicit assumption: that the feature maps across time steps are well aligned, so that features at the same locations can be aggregated. This assumption may be overly strong in practical applications, especially in action recognition, where motion serves as a crucial cue. In this work, we propose a new CNN architecture, TrajectoryNet, which replaces the existing temporal convolution with trajectory convolution, a new operation for integrating features along the temporal dimension.
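The difference from a plain temporal convolution can be sketched with a temporal kernel of size 3. Instead of reading frames t-1 and t+1 at the same pixel, the operation reads them at the positions a per-pixel trajectory points to. Real trajectories have sub-pixel offsets, which would need bilinear sampling. This sketch assumes integer displacements (for example, rounded optical flow), and the function name is hypothetical.

```python
import numpy as np

def trajectory_conv(feats, weights, fwd, bwd):
    """feats: (T, H, W, C); weights: (3, C, C_out) for frames t-1, t, t+1;
    fwd / bwd: (T, H, W, 2) integer (dy, dx) displacements to frame t+1 / t-1."""
    T, H, W, C = feats.shape
    out = np.zeros((T, H, W, weights.shape[2]))
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    for t in range(T):
        for k, (dt, flow) in enumerate([(-1, bwd), (0, None), (1, fwd)]):
            s = t + dt
            if s < 0 or s >= T:
                continue  # zero padding in time
            if flow is None:
                sy, sx = ys, xs
            else:
                # Follow the trajectory instead of staying at the same location.
                sy = np.clip(ys + flow[t, ..., 0], 0, H - 1)
                sx = np.clip(xs + flow[t, ..., 1], 0, W - 1)
            sampled = feats[s, sy, sx]          # (H, W, C) features along the trajectory
            out[t] += sampled @ weights[k]
    return out
```

With all-zero displacements this reduces to an ordinary size-3 temporal convolution. That shows the aligned-feature assumption is a special case of the trajectory version.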
This article uses Google Cloud AutoML Vision to develop an end-to-end medical image classification model for pneumonia detection from chest X-ray images. Under the hood, AutoML Vision uses neural architecture search and transfer learning to find the network architecture and hyperparameter configuration that minimize the model's loss function. The dataset is hosted on Kaggle and can be accessed at Chest X-Ray Images (Pneumonia). To get started, go to the cloud console (https://cloud.google.com/), then set up the project APIs, permissions, and a Cloud Storage bucket to store the image files and other modeling assets.
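Before importing into AutoML Vision, the images in the bucket need a CSV that lists each image's Cloud Storage URI and label. The sketch below builds such a file from a local copy of the Kaggle dataset, under two assumptions:

- the local folder layout is `chest_xray/{train,val,test}/{NORMAL,PNEUMONIA}/*.jpeg`;
- the import CSV accepts an optional leading TRAIN/VALIDATION/TEST column.

Check both, along with the bucket URI, against the downloaded data and the current AutoML import documentation. The files still have to be copied to the matching paths in the bucket, for example with `gsutil -m cp -r`.

```python
import csv
from pathlib import Path

# Assumed mapping from local split folders to AutoML set tags.
SPLITS = {"train": "TRAIN", "val": "VALIDATION", "test": "TEST"}

def build_import_csv(local_root, bucket_uri, out_csv, pattern="*.jpeg"):
    """Write rows of [set, gs:// URI, label], mirroring local paths under bucket_uri."""
    root = Path(local_root)
    rows = []
    for split_dir, split_tag in SPLITS.items():
        split_path = root / split_dir
        if not split_path.is_dir():
            continue  # skip a split that is not present locally
        for label_dir in sorted(p for p in split_path.iterdir() if p.is_dir()):
            for image in sorted(label_dir.glob(pattern)):
                rel = image.relative_to(root).as_posix()
                rows.append([split_tag, f"{bucket_uri.rstrip('/')}/{rel}", label_dir.name])
    with open(out_csv, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows
```

The label is taken from each image's parent folder name (NORMAL or PNEUMONIA). The dataset's own train/val/test split is preserved rather than letting the service resplit the data.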
AI is a general field that encompasses machine learning and deep learning. The history of artificial intelligence in its modern sense begins in the 1950s, with the work of Alan Turing and the Dartmouth workshop, which brought together the field's first enthusiasts and at which the basic principles of AI as a science were formulated. The field then went through several cycles of surging interest followed by downturns (the so-called "AI winters") before becoming one of the key areas of world science today. Although there are many examples and applications of artificial intelligence in use today, a large community of developers still wonders how, or where, to start developing AI-driven applications. This article aims to be a starting point for those eager to start developing AI- or ML-driven applications.
In this paper, we introduce STREETS, a novel traffic flow dataset built from publicly available web cameras in the suburbs of Chicago, IL. We seek to address the limitations of existing datasets in this area. Many such datasets lack a coherent traffic network graph describing the relationships between sensors. The datasets that do provide a graph depict traffic flow in urban population centers or highway systems and use costly sensors such as induction loops. These contexts differ from that of suburban traffic.
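A traffic network graph of the kind described above can be represented as a directed adjacency matrix over sensors, where an edge means traffic flows from one camera's view to another's. The sensor IDs, edges, and helper names below are hypothetical. They only illustrate the structure, not the STREETS data format.

```python
import numpy as np

def adjacency_from_edges(sensor_ids, edges):
    """Directed adjacency: A[i, j] = 1 if traffic flows from sensor i to sensor j."""
    index = {s: i for i, s in enumerate(sensor_ids)}
    A = np.zeros((len(sensor_ids), len(sensor_ids)), dtype=int)
    for src, dst in edges:
        A[index[src], index[dst]] = 1
    return A

def upstream(sensor_ids, A, sensor):
    """Sensors whose traffic feeds directly into the given sensor."""
    j = sensor_ids.index(sensor)
    return [sensor_ids[i] for i in np.flatnonzero(A[:, j])]
```

With such a graph, a model predicting flow at one camera can draw on the readings of the cameras directly upstream of it. That relationship is what a dataset without a sensor graph cannot express.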