Deep Learning
Transitive Hashing Network for Heterogeneous Multimedia Retrieval
Cao, Zhangjie (Tsinghua University) | Long, Mingsheng (Tsinghua University) | Wang, Jianmin (Tsinghua University) | Yang, Qiang (Hong Kong University of Science and Technology)
Hashing is widely applied to large-scale multimedia retrieval due to its storage and retrieval efficiency. Cross-modal hashing enables efficient retrieval of one modality from a database relevant to a query of another modality. Existing work on cross-modal hashing assumes that the heterogeneous relationship across modalities is available for learning to hash. This paper relaxes this strict assumption by only requiring the heterogeneous relationship in some auxiliary dataset different from the query or database domain. We design a novel hybrid deep architecture, transitive hashing network (THN), to jointly learn cross-modal correlation from the auxiliary dataset and align the data distribution of the auxiliary dataset with that of the query or database domain, which generates compact transitive hash codes for efficient cross-modal retrieval. Comprehensive empirical evidence validates that the proposed THN approach yields state-of-the-art retrieval performance on standard multimedia benchmarks, namely NUS-WIDE and ImageNet-YahooQA.
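As a rough illustration of why compact hash codes make retrieval cheap, the sketch below ranks database items by Hamming distance to a query code. It shows only the generic lookup step, not the THN model itself; the 8-bit codes, the item names and the function names are all hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hash codes stored as ints."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query_code, database):
    """Return (item_id, distance) pairs sorted by distance to the query."""
    scored = [(item_id, hamming_distance(query_code, code))
              for item_id, code in database.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical 8-bit codes: one text query and three image entries.
image_codes = {"img_a": 0b10110010, "img_b": 0b10110011, "img_c": 0b01001101}
text_query = 0b10110110

ranking = rank_by_hamming(text_query, image_codes)
print(ranking)
```

Because each comparison is a single XOR plus a bit count, a scan over a large database stays inexpensive even without an index.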
Volumetric ConvNets with Mixed Residual Connections for Automated Prostate Segmentation from 3D MR Images
Yu, Lequan (The Chinese University of Hong Kong) | Yang, Xin (The Chinese University of Hong Kong) | Chen, Hao (The Chinese University of Hong Kong) | Qin, Jing (The Hong Kong Polytechnic University) | Heng, Pheng Ann (The Chinese University of Hong Kong)
Automated prostate segmentation from 3D MR images is very challenging due to large variations of prostate shape and indistinct prostate boundaries. We propose a novel volumetric convolutional neural network (ConvNet) with mixed residual connections to cope with this challenging problem. Compared with previous methods, our volumetric ConvNet has two compelling advantages. First, it is implemented in a 3D manner and can fully exploit the 3D spatial contextual information of input data to perform efficient, precise and volume-to-volume prediction. Second and more important, the novel combination of residual connections (i.e., long and short) can greatly improve the training efficiency and discriminative capability of our network by enhancing the information propagation within the ConvNet both locally and globally. While the forward propagation of location information can improve the segmentation accuracy, the smooth backward propagation of gradient flow can accelerate the convergence speed and enhance the discrimination capability. Extensive experiments on the open MICCAI PROMISE12 challenge dataset corroborated the effectiveness of the proposed volumetric ConvNet with mixed residual connections. Our method ranked first in the challenge, outperforming other competitors by a large margin with respect to most evaluation metrics. The proposed volumetric ConvNet is general enough and can be easily extended to other medical image analysis tasks, especially ones with limited training data.
Towards Better Understanding the Clothing Fashion Styles: A Multimodal Deep Learning Approach
Ma, Yihui (Tsinghua University) | Jia, Jia (Tsinghua University) | Zhou, Suping (Beijing University of Posts and Telecommunications) | Fu, Jingtian (Tsinghua University) | Liu, Yejun (Tsinghua University) | Tong, Zijian (Sogou Corporation)
In this paper, we aim to better understand clothing fashion styles. Two challenges remain: 1) how to quantitatively describe the fashion styles of various clothing, and 2) how to model the subtle relationship between visual features and fashion styles, especially considering clothing collocations. Using the words that people usually use to describe clothing fashion styles on shopping websites, we build a Fashion Semantic Space (FSS) based on Kobayashi's aesthetics theory to describe clothing fashion styles quantitatively and universally. Then we propose a novel fashion-oriented multimodal deep learning based model, Bimodal Correlative Deep Autoencoder (BCDA), to capture the internal correlation in clothing collocations. Employing the benchmark dataset we build with 32133 full-body fashion show images, we use BCDA to map the visual features to the FSS. The experimental results indicate that our model outperforms (+13% in terms of MSE) several alternative baselines, confirming that our model can better understand clothing fashion styles. To further demonstrate the advantages of our model, we conduct some interesting case studies, including fashion trend analyses of brands, clothing collocation recommendation, etc.
Learning to Act by Predicting the Future
Dosovitskiy, Alexey, Koltun, Vladlen
We present an approach to sensorimotor control in immersive environments. Our approach utilizes a high-dimensional sensory stream and a lower-dimensional measurement stream. The cotemporal structure of these streams provides a rich supervisory signal, which enables training a sensorimotor control model by interacting with the environment. The model is trained using supervised learning techniques, but without extraneous supervision. It learns to act based on raw sensory input from a complex three-dimensional environment. The presented formulation enables learning without a fixed goal at training time, and pursuing dynamically changing goals at test time. We conduct extensive experiments in three-dimensional simulations based on the classical first-person game Doom. The results demonstrate that the presented approach outperforms sophisticated prior formulations, particularly on challenging tasks. The results also show that trained models successfully generalize across environments and goals. A model trained using the presented approach won the Full Deathmatch track of the Visual Doom AI Competition, which was held in previously unseen environments.
Telugu OCR Framework using Deep Learning
Achanta, Rakesh, Hastie, Trevor
In this paper, we address the task of Optical Character Recognition (OCR) for the Telugu script. We present an end-to-end framework that segments the text image, classifies the characters and extracts lines using a language model. The segmentation is based on mathematical morphology. The classification module, which is the most challenging task of the three, is a deep convolutional neural network. The language is modelled as a third-degree Markov chain at the glyph level. Telugu script is a complex alphasyllabary and the language is agglutinative, making the problem hard. In this paper we apply the latest advances in neural networks to achieve state-of-the-art error rates. We also review convolutional neural networks in great detail and expound the statistical justification behind the many tricks needed to make Deep Learning work.
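To make the glyph-level Markov chain concrete, here is a minimal sketch that assumes "third degree" means conditioning each glyph on the previous three. It uses plain maximum-likelihood counts with no smoothing, and Latin letters stand in for glyph labels; the function names and toy corpus are illustrative and not taken from the paper.

```python
from collections import Counter, defaultdict

ORDER = 3  # assumption: condition on the previous three glyphs


def train(sequences):
    """Count how often each glyph follows each 3-glyph context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pad the start so the first glyphs also have a full context.
        padded = ["<s>"] * ORDER + list(seq)
        for i in range(ORDER, len(padded)):
            context = tuple(padded[i - ORDER:i])
            counts[context][padded[i]] += 1
    return counts


def probability(counts, context, glyph):
    """Maximum-likelihood P(glyph | context); 0.0 if the context is unseen."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[glyph] / sum(seen.values())


# Toy corpus: Latin letters stand in for glyph labels.
model = train(["abcd", "abce", "abcd"])
p_d = probability(model, "abc", "d")  # "d" follows "abc" in 2 of 3 sequences
print(p_d)
```

In an OCR pipeline, probabilities like these could be combined with classifier scores to prefer glyph sequences that are plausible in the language; a practical model would also need smoothing for unseen contexts.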
Small Boxes Big Data: A Deep Learning Approach to Optimize Variable Sized Bin Packing
Mao, Feng, Blanco, Edgar, Fu, Mingang, Jain, Rohit, Gupta, Anurag, Mancel, Sebastien, Yuan, Rong, Guo, Stephen, Kumar, Sai, Tian, Yayang
Bin Packing problems have been widely studied because of their broad applications in different domains. Known as a set of NP-hard problems, they have different variations and many heuristics have been proposed for obtaining approximate solutions. Specifically, for the 1D variable sized bin packing problem, the two key sets of optimization heuristics are the bin assignment and the bin allocation. Usually the performance of a single static optimization heuristic cannot beat that of a dynamic one which is tailored for each bin packing instance. Building such an adaptive system requires modeling the relationship between bin features and packing performance profiles. The primary drawbacks of traditional machine learning approaches for this task are the natural limitations of feature engineering, such as the curse of dimensionality and feature selection quality. We introduce a deep learning approach to overcome the drawbacks by applying a large training data set, auto feature selection and fast, accurate labeling. We show in this paper how to build such a system by both theoretical formulation and engineering practices. Our prediction system achieves up to 89% training accuracy and 72% validation accuracy to select the best heuristic that can generate a better quality bin packing solution.
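The point that no single static heuristic wins on every instance can be illustrated with two textbook heuristics, next-fit and first-fit decreasing. This sketch simplifies the setting to one bin capacity (the paper addresses variable sized bins) and assumes every item fits in an empty bin; these are classic heuristics chosen for illustration, not the ones evaluated by the authors.

```python
def next_fit(items, capacity):
    """Keep one open bin; start a new bin when the next item does not fit."""
    bins = [[]]
    for item in items:
        if sum(bins[-1]) + item > capacity:
            bins.append([])
        bins[-1].append(item)
    return bins


def first_fit_decreasing(items, capacity):
    """Sort items largest first; put each in the first bin with enough room."""
    bins = []
    for item in sorted(items, reverse=True):
        for current in bins:
            if sum(current) + item <= capacity:
                current.append(item)
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([item])
    return bins


items = [4, 8, 1, 4, 2, 1]
nf_bins = next_fit(items, capacity=10)
ffd_bins = first_fit_decreasing(items, capacity=10)
print(len(nf_bins), nf_bins)
print(len(ffd_bins), ffd_bins)
```

On this instance the two heuristics produce different bin counts, which is the kind of per-instance gap a heuristic selector would try to predict from instance features.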
Learning without Forgetting
When building a unified vision system or gradually adding new capabilities to a system, the usual assumption is that training data for all tasks is always available. However, as the number of tasks grows, storing and retraining on such data becomes infeasible. A new problem arises where we add new capabilities to a Convolutional Neural Network (CNN), but the training data for its existing capabilities are unavailable. We propose our Learning without Forgetting method, which uses only new task data to train the network while preserving the original capabilities. Our method performs favorably compared to commonly used feature extraction and fine-tuning adaptation techniques and performs similarly to multitask learning that uses the original task data we assume to be unavailable. A more surprising observation is that Learning without Forgetting may be able to replace fine-tuning with similar old and new task datasets for improved new task performance.
Google's 'DeepMind' AI Understands The Benefits Of Betrayal
It's looking increasingly likely that artificial intelligence (AI) will be the harbinger of the next technological revolution. When it develops to the point wherein it is able to learn, think, and even "feel" without the input of a human – a truly "smart" AI – then everything we know will change, almost overnight. That's why it's so interesting to keep track of major milestones in the development of AIs that exist today, including that of Google's DeepMind neural network. It's already besting humanity in the gaming world, and a new in-house study reveals that Google is decidedly unsure whether or not the AI tends to prefer cooperative behaviors over aggressive, competitive ones. A team of Google acolytes set up two relatively simple scenarios in which to test whether neural networks are more likely to work together or destroy each other when faced with a resource problem.
Google's new AI has learned to become 'highly aggressive' in stressful situations
Late last year, famed physicist Stephen Hawking issued a warning that the continued advancement of artificial intelligence will either be "the best, or the worst thing, ever to happen to humanity". We've all seen the Terminator movies, and the apocalyptic nightmare that the self-aware AI system, Skynet, wrought upon humanity, and now results from recent behavior tests of Google's new DeepMind AI system are making it clear just how careful we need to be when building the robots of the future. In tests late last year, Google's DeepMind AI system demonstrated an ability to learn independently from its own memory, and beat the world's best Go players at their own game. It's since been figuring out how to seamlessly mimic a human voice. Now, researchers have been testing its willingness to cooperate with others, and have revealed that when DeepMind feels like it's about to lose, it opts for "highly aggressive" strategies to ensure that it comes out on top. The Google team ran 40 million turns of a simple 'fruit gathering' computer game that asks two DeepMind 'agents' to compete against each other to gather as many virtual apples as they can.
Machine Learning: Is exploring the learning rate manually still necessary with an exponentially decaying learning rate?
If we have an initial learning rate that is high enough and a suitable decay factor for exponentially decaying the learning rate over a certain number of epochs, is it still necessary for us to manually explore the learning rate? If all goes well, I believe the learning rate can automatically be sampled over a huge range as the epochs pass. However, if we start off with a less than optimal learning rate, assuming the loss does not diverge to infinity, would the final loss be worse than if we had started with the optimal learning rate, even if we could reach the optimal learning rate by decaying the initial learning rate over time? Does the answer differ for a convex versus a non-convex loss? Specifically for deep learning problems, is an exponentially decaying learning rate able to sample the learning rate better than manual exploration?
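For reference, the schedule being asked about can be written down in a few lines. This is a minimal sketch assuming the common form lr_t = lr_0 * gamma^t with one decay step per epoch; the function names and the numbers are illustrative only.

```python
import math


def exponential_decay(initial_lr, decay_factor, epoch):
    """Learning rate after `epoch` decay steps: initial_lr * decay_factor ** epoch."""
    return initial_lr * decay_factor ** epoch


def epochs_to_reach(initial_lr, decay_factor, target_lr):
    """Smallest number of epochs after which the rate is at or below target_lr."""
    return math.ceil(math.log(target_lr / initial_lr) / math.log(decay_factor))


# Illustrative values: start at 1.0 and halve the rate every epoch.
schedule = [exponential_decay(1.0, 0.5, e) for e in range(5)]
steps = epochs_to_reach(1.0, 0.5, 0.001)
print(schedule)
print(steps)
```

Written this way, it is visible that the schedule passes through each rate exactly once, in a fixed order, and at a point in training determined by the initial rate and the decay factor, so those two values still have to be chosen somehow.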