Vision
AI@NICTA
Barnes, Nick (NICTA) | Baumgartner, Peter (NICTA) | Caetano, Tiberio (NICTA) | Durrant-Whyte, Hugh (NICTA) | Klein, Gerwin (NICTA) | Sanderson, Penelope (University of Queensland) | Sattar, Abdul (Griffith University) | Stuckey, Peter (The University of Melbourne) | Thiebaux, Sylvie (The Australian National University) | Van Hentenryck, Pascal (The University of Melbourne) | Walsh, Toby (NICTA)
NICTA is Australia's Information and Communications Technology (ICT) Centre of Excellence. It is the largest organization in Australia dedicated to ICT research. While it has close links with local universities, it is in fact an independent, not-for-profit company in the business of doing research, commercializing that research, and training PhD students to do that research. Much of the work taking place at NICTA involves various topics in artificial intelligence. In this article, we survey some of the AI work being undertaken at NICTA.
Towards an Empathizing and Adaptive Storyteller System
Bae, Byung Chull (IT University of Copenhagen) | Brunete, Alberto (Carlos III University) | Malik, Usman (National University of Sciences and Technology) | Dimara, Evanthia (Université Paris-Sud) | Jermsurawong, Jermsak (New York University Abu Dhabi) | Mavridis, Nikolaos (New York University Abu Dhabi)
This paper describes our ongoing effort to build an empathizing and adaptive storyteller system. The system under development aims to deliver a story effectively by combining emotional expressions generated by an avatar or a humanoid robot with the listener's responses, which are monitored in real time. We conducted a pilot study and analyzed the results in two ways: first, through a survey questionnaire based on the participants' subjective ratings; second, through automated video analysis of the participants' emotional facial expressions and eye blinking. The questionnaire results show that male participants tend to empathize more with a story character when a virtual storyteller is present than with audio-only narration. The video analysis suggests that the participants' eye-blink rate varies inversely with their attention.
Examples of Artificial Perceptions in Optical Character Recognition and Iris Recognition
Noaica, Cristina M., Badea, Robert, Motoc, Iulia M., Ghica, Claudiu G., Rosoiu, Alin C., Popescu-Bodorin, Nicolaie
This paper assumes the hypothesis that human learning is perception based, and consequently, that the learning process and perceptions should not be represented and investigated independently or modeled in different simulation spaces. To preserve the analogy between artificial and human learning, the former is assumed here to be based on artificial perception. Hence, instead of applying or developing a Computational Theory of (human) Perceptions, we choose to mirror human perceptions in a numeric (computational) space as artificial perceptions and to analyze the interdependence between artificial learning and artificial perception in the same numeric space, using one of the simplest tools of Artificial Intelligence and Soft Computing, namely the perceptron. As practical applications, we work through two examples: Optical Character Recognition and Iris Recognition. In both cases a simple Turing test shows that artificial perceptions of the difference between two characters and between two irides are fuzzy, whereas the corresponding human perceptions are, in fact, crisp.
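To make the perceptron-based "artificial perception" concrete, here is a minimal sketch, not the authors' code: a single classic perceptron learns to separate two tiny character bitmaps. The 5x5 glyphs, learning rate, and epoch cap are all illustrative choices.

    import numpy as np

    def train_perceptron(X, y, epochs=100, lr=1.0):
        """Classic perceptron rule; labels y in {-1, +1}."""
        w = np.zeros(X.shape[1] + 1)                 # weights plus bias
        Xb = np.hstack([X, np.ones((len(X), 1))])    # append bias input
        for _ in range(epochs):
            errors = 0
            for xi, yi in zip(Xb, y):
                if yi * (w @ xi) <= 0:               # misclassified: update
                    w += lr * yi * xi
                    errors += 1
            if errors == 0:                          # converged (separable data)
                break
        return w

    # Toy 5x5 bitmaps for 'I' and 'O' (1 = ink), flattened to feature vectors.
    I = np.array([[0, 0, 1, 0, 0]] * 5, dtype=float).ravel()
    O = np.array([[1, 1, 1, 1, 1],
                  [1, 0, 0, 0, 1],
                  [1, 0, 0, 0, 1],
                  [1, 0, 0, 0, 1],
                  [1, 1, 1, 1, 1]], dtype=float).ravel()
    X, y = np.stack([I, O]), np.array([-1, 1])
    w = train_perceptron(X, y)
    print(np.sign(np.hstack([X, np.ones((2, 1))]) @ w))   # -> [-1.  1.]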
A Bayesian Nonparametric Approach to Image Super-resolution
Polatkan, Gungor, Zhou, Mingyuan, Carin, Lawrence, Blei, David, Daubechies, Ingrid
Super-resolution methods form high-resolution images from low-resolution images. In this paper, we develop a new Bayesian nonparametric model for super-resolution. Our method uses a beta-Bernoulli process to learn a set of recurring visual patterns, called dictionary elements, from the data. Because it is nonparametric, the number of elements found is also determined from the data. We evaluate the method on both benchmark and natural images, comparing with several other models from the research literature. We perform large-scale human evaluation experiments to assess the visual quality of the results. In a first implementation, we use Gibbs sampling to approximate the posterior. However, this algorithm is not feasible for large-scale data. To circumvent this, we then develop an online variational Bayes (VB) algorithm, which finds high-quality dictionaries in a fraction of the time needed by the Gibbs sampler.
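As a rough illustration of the generative step behind a beta-Bernoulli dictionary model (a hedged sketch, not the paper's model or its inference code), each patch below switches dictionary atoms on or off via Bernoulli draws whose probabilities come from a Beta prior, so rarely-used atoms can effectively drop out. All sizes and hyperparameters are invented.

    import numpy as np

    rng = np.random.default_rng(0)
    K, P, N = 32, 64, 10                  # candidate atoms, patch dim, patches
    D = rng.normal(size=(P, K))           # dictionary: one column per element
    pi = rng.beta(1.0 / K, 1.0, size=K)   # Beta prior -> sparse usage probs

    patches = []
    for _ in range(N):
        z = rng.random(K) < pi            # which atoms this patch switches on
        s = rng.normal(size=K)            # their weights
        x = D @ (z * s) + 0.01 * rng.normal(size=P)   # sparse combo + noise
        patches.append(x)

    # expected number of active atoms per patch given these probabilities
    print(f"expected active atoms per patch: {pi.sum():.2f}")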
Contextually Guided Semantic Labeling and Search for 3D Point Clouds
Anand, Abhishek, Koppula, Hema Swetha, Joachims, Thorsten, Saxena, Ashutosh
RGB-D cameras, which give an RGB image together with depths, are becoming increasingly popular for robotic perception. In this paper, we address the task of detecting commonly found objects in the 3D point clouds of indoor scenes obtained from such cameras. Our method uses a graphical model that captures various features and contextual relations, including local visual appearance and shape cues, object co-occurrence relationships, and geometric relationships. With a large number of object classes and relations, the model's parsimony becomes important, and we address that by using multiple types of edge potentials. We train the model using a maximum-margin learning approach. In our experiments over a total of 52 3D scenes of homes and offices (composed from about 550 views), we get a performance of 84.06% and 73.38% in labeling office and home scenes, respectively, for 17 object classes each. We also present a method for a robot to search for an object using the learned model and the contextual information available from the current labelings of the scene. We applied this algorithm successfully on a mobile robot for the task of finding 12 object classes in 10 different offices and achieved a precision of 97.56% with 78.43% recall.
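The following is a minimal sketch, under invented features and weights, of the node-plus-edge scoring that a pairwise graphical model over scene segments performs; the brute-force MAP search stands in for the paper's actual inference, and the two-label, three-segment example is hypothetical.

    import itertools
    import numpy as np

    def score(labels, node_feats, edges, w_node, w_edge):
        """Sum of node potentials plus edge potentials for one joint labeling."""
        s = sum(w_node[l] @ node_feats[i] for i, l in enumerate(labels))
        s += sum(w_edge[labels[i], labels[j]] for i, j in edges)
        return s

    def argmax_labeling(n, num_labels, *args):
        """Brute-force MAP for tiny graphs; real systems use smarter inference."""
        return max(itertools.product(range(num_labels), repeat=n),
                   key=lambda lab: score(lab, *args))

    rng = np.random.default_rng(1)
    node_feats = rng.normal(size=(3, 4))            # 3 segments, 4 features each
    edges = [(0, 1), (1, 2)]                        # which segments touch
    w_node = rng.normal(size=(2, 4))                # per-label feature weights
    w_edge = np.array([[0.5, -0.2], [-0.2, 0.3]])   # label co-occurrence affinities
    print(argmax_labeling(3, 2, node_feats, edges, w_node, w_edge))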
A Unified Approach for Modeling and Recognition of Individual Actions and Group Activities
Recognizing group activities is challenging due to the difficulties in isolating individual entities, finding the respective roles played by the individuals, and representing the complex interactions among the participants. Individual actions and group activities in videos can be represented in a common framework, as they share the following feature: both are composed of a set of low-level features describing motions, e.g., optical flow for each pixel or a trajectory for each feature point, according to a set of composition constraints in both the temporal and spatial dimensions. In this paper, we present a unified model to assess the similarity between two given individual or group activities. Our approach avoids explicitly extracting individual actors or identifying and representing inter-person interactions. With the proposed approach, retrieval from a video database can be performed through Query-by-Example, and activities can be recognized by querying with videos containing known activities. The suggested video matching process can be performed in an unsupervised manner. We demonstrate the performance of our approach by recognizing a set of human actions and football plays.
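One simple way to picture matching activities represented as sets of motion features (a hedged stand-in, not the paper's unified model) is a symmetric average nearest-neighbour distance between two sets of point trajectories:

    import numpy as np

    def traj_dist(a, b):
        """L2 distance between two equal-length trajectories (T x 2 arrays)."""
        return np.linalg.norm(a - b)

    def set_similarity(A, B):
        d_ab = np.mean([min(traj_dist(a, b) for b in B) for a in A])
        d_ba = np.mean([min(traj_dist(a, b) for a in A) for b in B])
        return -(d_ab + d_ba) / 2          # higher = more similar

    rng = np.random.default_rng(2)
    query = [rng.normal(size=(20, 2)) for _ in range(5)]       # 5 point tracks
    candidate = [t + 0.1 * rng.normal(size=t.shape) for t in query]
    other = [rng.normal(size=(20, 2)) for _ in range(5)]
    # the near-copy of the query scores higher than an unrelated set
    print(set_similarity(query, candidate) > set_similarity(query, other))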
Information-theoretic Dictionary Learning for Image Classification
Qiu, Qiang, Patel, Vishal M., Chellappa, Rama
We present a two-stage approach for learning dictionaries for object classification tasks based on the principle of information maximization. The proposed method seeks a dictionary that is compact, discriminative, and generative. In the first stage, dictionary atoms are selected from an initial dictionary by maximizing the mutual information measure on dictionary compactness, discrimination and reconstruction. In the second stage, the selected dictionary atoms are updated for improved reconstructive and discriminative power using a simple gradient ascent algorithm on mutual information. Experiments using real datasets demonstrate the effectiveness of our approach for image classification tasks.
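The skeleton below illustrates the greedy first stage in spirit: atoms are added one at a time according to the largest gain of a selection criterion. The criterion used here, drop in residual reconstruction error, is a stand-in for the paper's mutual-information measure, and all dictionary sizes are invented.

    import numpy as np

    def greedy_select(D, X, k):
        """Pick k columns of dictionary D to represent signals X (columns)."""
        chosen = []
        for _ in range(k):
            best, best_err = None, np.inf
            for j in range(D.shape[1]):
                if j in chosen:
                    continue
                S = D[:, chosen + [j]]
                # least-squares codes for this candidate sub-dictionary
                C, *_ = np.linalg.lstsq(S, X, rcond=None)
                err = np.linalg.norm(X - S @ C)
                if err < best_err:
                    best, best_err = j, err
            chosen.append(best)
        return chosen

    rng = np.random.default_rng(3)
    D = rng.normal(size=(16, 40))                     # initial dictionary
    X = D[:, [3, 7, 11]] @ rng.normal(size=(3, 50))   # data spanned by 3 atoms
    print(sorted(greedy_select(D, X, 3)))             # likely recovers [3, 7, 11]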
Multidimensional Membership Mixture Models
Jiang, Yun, Lim, Marcus, Saxena, Ashutosh
We present the multidimensional membership mixture (M3) models, where every dimension of the membership represents an independent mixture model and each data point is generated jointly from the selected mixture components. This is helpful when the data has a certain shared structure. For example, three unique means and three unique variances can effectively form a Gaussian mixture model with nine components, while requiring only six parameters to fully describe it. In this paper, we present three instantiations of M3 models (together with their learning and inference algorithms): infinite, finite, and hybrid, depending on whether the number of mixtures is fixed or not. They are built upon Dirichlet process mixture models, latent Dirichlet allocation, and a combination of the two, respectively. We then consider two applications: topic modeling and learning 3D object arrangements. Our experiments show that our M3 models achieve better performance using fewer topics than many classic topic models. We also observe that topics from the different dimensions of M3 models are meaningful and orthogonal to each other.
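The abstract's nine-components-from-six-parameters example can be worked directly; the sketch below (with illustrative means, variances, and uniform mixing weights) builds the nine-component 1-D Gaussian mixture from three shared means and three shared variances.

    import itertools
    import numpy as np

    means = np.array([-4.0, 0.0, 4.0])       # 3 parameters
    variances = np.array([0.25, 1.0, 4.0])   # 3 parameters

    # every (mean, variance) pair is a mixture component
    components = list(itertools.product(means, variances))
    print(len(components))                   # 9 components from 6 parameters

    rng = np.random.default_rng(4)
    def sample(n):
        """Draw from the mixture: pick a (mean, variance) pair jointly."""
        idx = rng.integers(len(components), size=n)
        mu = np.array([components[i][0] for i in idx])
        var = np.array([components[i][1] for i in idx])
        return rng.normal(mu, np.sqrt(var))

    print(sample(5))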
Learning Games from Videos Guided by Descriptive Complexity
Kaiser, Lukasz (LIAFA, CNRS and Universite Paris Diderot)
In recent years, several systems have been proposed that learn the rules of a simple card or board game solely from visual demonstration. These systems were constructed for specific games and rely on substantial background knowledge. We introduce a general system for learning board game rules from videos and demonstrate it on several well-known games. The presented algorithm requires only a few demonstrations and minimal background knowledge and, having learned the rules, automatically derives position evaluation functions and can play the learned games competitively. Our main technique is based on descriptive complexity, i.e., the logical means necessary to define a set of interest. We compute formulas defining allowed moves and final positions in a game in different logics and select the most adequate ones. We show that this method is well suited for board games and that there is strong theoretical evidence that it will generalize to other problems.
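As a toy picture of the formula-selection idea (plain Python predicates standing in for the logical formulas the real system searches over), one can enumerate candidate move rules from simplest to most complex and keep the first rule consistent with every observed legal and illegal move:

    def adjacent(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1

    def same_row(a, b):
        return a[0] == b[0]

    # candidates ordered by "complexity" (simplest first)
    candidates = [("same_row", same_row), ("adjacent", adjacent)]

    # observations: (from_square, to_square, was_legal)
    observed = [((0, 0), (0, 1), True),
                ((0, 0), (1, 0), True),
                ((0, 0), (0, 3), False)]

    for name, rule in candidates:
        if all(rule(a, b) == legal for a, b, legal in observed):
            print("learned move rule:", name)   # -> adjacent
            break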
Visual Saliency Map from Tensor Analysis
Li, Bing (Chinese Academy of Sciences) | Xiong, Weihua (Omnivision Corporation) | Hu, Weiming (Chinese Academy of Sciences)
Modeling the visual saliency map of an image provides important information for image semantic understanding in many applications. Most existing computational visual saliency models follow a bottom-up framework that generates an independent saliency map in each selected visual feature space and then combines them in a proper way. Two big challenges that these methods must address explicitly are (1) which features should be extracted for all pixels of the input image, and (2) how to dynamically determine the importance of the saliency map generated in each feature space. To address these problems, we present a novel saliency map computational model based on tensor decomposition and reconstruction. Tensor representation and analysis not only explicitly represent an image's color values but also capture two important relationships inherent to a color image: one reflects the spatial correlations between pixels, and the other represents the interplay between color channels. Therefore, a saliency map generator based on the proposed model can adaptively find the most suitable features and their combination coefficients for each pixel. Experiments on a synthetic image set and a real image set show that our method is superior or comparable to other prevailing saliency map models.
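A hedged sketch of the general idea, not the paper's exact model: treat the image as a height x width x channel tensor, reconstruct it from the leading components of each mode unfolding (a truncated HOSVD), and read large per-pixel residuals as salient. The image, ranks, and patch location below are invented.

    import numpy as np

    def unfold(T, mode):
        return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

    def mode_mult(T, M, mode):
        """Multiply tensor T by matrix M along the given mode."""
        moved = np.moveaxis(T, mode, 0)
        return np.moveaxis(np.tensordot(M, moved, axes=1), 0, mode)

    rng = np.random.default_rng(5)
    y = np.linspace(0.0, 1.0, 32)
    img = np.stack([np.outer(y, y)] * 3, axis=2)                   # smooth background
    img[10:14, 10:14, :] += rng.normal(scale=0.5, size=(4, 4, 3))  # distinctive patch

    # Truncated HOSVD: keep leading singular vectors of each mode unfolding.
    ranks = (4, 4, 3)
    U = []
    for m in range(3):
        u, _, _ = np.linalg.svd(unfold(img, m), full_matrices=False)
        U.append(u[:, :ranks[m]])

    core = img
    for m in range(3):
        core = mode_mult(core, U[m].T, m)     # project onto each mode basis
    recon = core
    for m in range(3):
        recon = mode_mult(recon, U[m], m)     # map back to image space

    saliency = np.linalg.norm(img - recon, axis=2)    # per-pixel residual
    print(f"patch residual: {saliency[10:14, 10:14].mean():.4f}, "
          f"image average: {saliency.mean():.4f}")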