Goto

Collaborating Authors

 Deep Learning


Healthy Cognitive Aging: A Hybrid Random Vector Functional-Link Model for the Analysis of Alzheimer’s Disease

AAAI Conferences

Alzheimer's disease (AD) is a genetically complex neurodegenerative disease, which leads to irreversible brain damage, severe cognitive problems and ultimately death. A number of clinical trials and study initiatives have been set up to investigate AD pathology, leading to large amounts of high dimensional heterogeneous data (biomarkers) for analysis. This paper focuses on combining clinical features from different modalities, including medical imaging, cerebrospinal fluid (CSF), etc., to diagnose AD and predict potential progression. Due to privacy and legal issues involved with clinical research, the study cohort (number of patients) is relatively small, compared to thousands of available biomarkers (predictors). We propose a hybrid pathological analysis model, which integrates manifold learning and Random Vector functional-link network (RVFL) so as to achieve better ability to extract discriminant information with limited training materials. Furthermore, we model (current and future) cognitive healthiness as a regression problem about age. By comparing the difference between predicted age and actual age, we manage to show statistical differences between different pathological stages. Verification tests are conducted based on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. Extensive comparison is made against different machine learning algorithms, i.e. Support Vector Machine (SVM), Random Forest (RF), Decision Tree and Multilayer Perceptron (MLP). Experimental results show that our proposed algorithm achieves better results than the comparison targets, which indicates promising robustness for practical clinical implementation.


Deep Gaussian Process for Crop Yield Prediction Based on Remote Sensing Data

AAAI Conferences

Agricultural monitoring, especially in developing countries, can help prevent famine and support humanitarian efforts. A central challenge is yield estimation, i.e., predicting crop yields before harvest. We introduce a scalable, accurate, and inexpensive method to predict crop yields using publicly available remote sensing data. Our approach improves existing techniques in three ways. First, we forego hand-crafted features traditionally used in the remote sensing community and propose an approach based on modern representation learning ideas. We also introduce a novel dimensionality reduction technique that allows us to train a Convolutional Neural Network or Long-short Term Memory network and automatically learn useful features even when labeled training data are scarce. Finally, we incorporate a Gaussian Process component to explicitly model the spatio-temporal structure of the data and further improve accuracy. We evaluate our approach on county-level soybean yield prediction in the U.S. and show that it outperforms competing techniques.


Combining Satellite Imagery and Open Data to Map Road Safety

AAAI Conferences

Improving road safety is critical for the sustainable development of cities. A road safety map is a powerful tool that can help prevent future traffic accidents. However, accurate mapping requires accurate data collection, which is both expensive and labor intensive. Satellite imagery is increasingly becoming abundant, higher in resolution and affordable. Given the recent successes deep learning has achieved in the visual recognition field, we are interested in investigating whether it is possible to use deep learning to accurately predict road safety directly from raw satellite imagery. To this end, we propose a deep learning-based mapping approach that leverages open data to learn from raw satellite imagery robust deep models able to predict accurate city-scale road safety maps at an affordable cost. To empirically validate the proposed approach, we trained a deep model on satellite images obtained from over 647 thousand traffic-accident reports collected over a period of four years by the New York city Police Department. The best model predicted road safety from raw satellite imagery with an accuracy of 78%. We also used the New York city model to predict for the city of Denver a city-scale map indicating road safety in three levels. Compared to a map made from three years' worth of data collected by the Denver city Police Department, the map predicted from raw satellite imagery has an accuracy of 73%.


Leveraging Video Descriptions to Learn Video Question Answering

AAAI Conferences

We propose a scalable approach to learn video-based question answering (QA): to answer a free-form natural language question about the contents of a video. Our approach automatically harvests a large number of videos and descriptions freely available online. Then, a large number of candidate QA pairs are automatically generated from descriptions rather than manually annotated. Next, we use these candidate QA pairs to train a number of video-based QA methods extended from MN (Sukhbaatar et al. 2015), VQA (Antol et al. 2015), SA (Yao et al. 2015), and SS (Venugopalan et al. 2015). In order to handle non-perfect candidate QA pairs, we propose a self-paced learning procedure to iteratively identify them and mitigate their effects in training. Finally, we evaluate performance on manually generated video-based QA pairs. The results show that our self-paced learning procedure is effective, and the extended SS model outperforms various baselines.


Leveraging Saccades to Learn Smooth Pursuit: A Self-Organizing Motion Tracking Model Using Restricted Boltzmann Machines

AAAI Conferences

In this paper, we propose a biologically-plausible model to explain the emergence of motion tracking behaviour in early development using unsupervised learning. The model's training is biased by a concept called retinal constancy, which measures how similar visual contents are between successive frames. This biasing is similar to a reward in reinforcement learning, but is less explicit, as it modulates the model's learning rate instead of being a learning signal itself. The model is a two-layer deep network. The first layer learns to encode visual motion, and the second layer learns to relate that motion to gaze movements, which it perceives and creates through bi-directional nodes. By randomly generating gaze movements to traverse the local visual space, desirable correlations are developed between visual motion and the appropriate gaze to nullify that motion such that maximal retinal constancy is achieved. Biologically, this is similar to using saccades to look around and learning from moments where a target and the saccade move together such that the image stays the same on the retina, and developing smooth pursuit behaviour to perform this action in the future. Restricted Boltzmann machines are used to implement this model because they can form a deep belief network, perform online learning, and act generatively. These properties all have biological equivalents and coincide with the biological plausibility of using saccades as leverage to learn smooth pursuit. This method is unique because it uses general machine learning algorithms, and their inherent generative properties, to learn from real-world data. It also implements a biological theory, uses motion instead of recognition via local searches, without temporal filtering, and learns in a fully unsupervised manner. Its tracking performance after being trained on real-world images with simulated motion is compared to its tracking performance after being trained on natural video. Results show that this model is able to successfully follow targets in natural video, despite partial occlusions, scale changes, and nonlinear motion.


Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

AAAI Conferences

Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question: Are there any benefits to combining Inception architectures with residual connections? Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual and one Inception-v4 networks, we achieve 3.08% top-5 error on the test set of the ImageNet classification (CLS) challenge.


Depth CNNs for RGB-D Scene Recognition: Learning from Scratch Better than Transferring from RGB-CNNs

AAAI Conferences

Scene recognition with RGB images has been extensively studied and has reached very remarkable recognition levels, thanks to convolutional neural networks (CNN) and large scene datasets. In contrast, current RGB-D scene data is much more limited, so often leverages RGB large datasets, by transferring pretrained RGB CNN models and fine-tuning with the target RGB-D dataset. However, we show that this approach has the limitation of hardly reaching bottom layers, which is key to learn modality-specific features. In contrast, we focus on the bottom layers, and propose an alternative strategy to learn depth features combining local weakly supervised training from patches followed by global fine tuning with images. This strategy is capable of learning very discriminative depth-specific features with limited depth images, without resorting to Places-CNN. In addition we propose a modified CNN architecture to further match the complexity of the model and the amount of data available. For RGB-D scene recognition, depth and RGB features are combined by projecting them in a common space and further leaning a multilayer classifier, which is jointly optimized in an end-to-end network. Our framework achieves state-of-the-art accuracy on NYU2 and SUN RGB-D in both depth only and combined RGB-D data.


An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data

AAAI Conferences

Human action recognition is an important task in computer vision. Extracting discriminative spatial and temporal features to model the spatial and temporal evolutions of different actions plays a key role in accomplishing this task. In this work, we propose an end-to-end spatial and temporal attention model for human action recognition from skeleton data. We build our model on top of the Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), which learns to selectively focus on discriminative joints of skeleton within each frame of the inputs and pays different levels of attention to the outputs of different frames. Furthermore, to ensure effective training of the network, we propose a regularized cross-entropy loss to drive the model learning process and develop a joint training strategy accordingly. Experimental results demonstrate the effectiveness of the proposed model, both on the small human action recognition dataset of SBU and the currently largest NTU dataset.


Title Learning Latent Subevents in Activity Videos Using Temporal Attention Filters

AAAI Conferences

In this paper, we newly introduce the concept of temporal attention filters, and describe how they can be used for human activity recognition from videos. Many high-level activities are often composed of multiple temporal parts (e.g., sub-events) with different duration/speed, and our objective is to make the model explicitly learn such temporal structure using multiple attention filters and benefit from them. Our temporal filters are designed to be fully differentiable, allowing end-of-end training of the temporal filters together with the underlying frame-based or segment-based convolutional neural network architectures. This paper presents an approach of learning a set of optimal static temporal attention filters to be shared across different videos, and extends this approach to dynamically adjust attention filters per testing video using recurrent long short-term memory networks (LSTMs). This allows our temporal attention filters to learn latent sub-events specific to each activity. We experimentally confirm that the proposed concept of temporal attention filters benefits the activity recognition, and we visualize the learned latent sub-events.


Fully Convolutional Neural Networks with Full-Scale-Features for Semantic Segmentation

AAAI Conferences

In this work, we propose a novel method to involve full-scale-features into the fully convolutional neural networks (FCNs) for Semantic Segmentation. Current works on FCN has brought great advances in the task of semantic segmentation, but the receptive field, which represents region areas of input volume connected to any output neuron, limits the available information of output neuron's prediction accuracy. We investigate how to involve the full-scale or full-image features into FCNs to enrich the receptive field. Specially, the full-scale feature network (FFN) extends the full-connected network and makes an end-to-end unified training structure. It has two appealing properties. First, the introduction of full-scale-features is beneficial for prediction. We build a unified extracting network and explore several fusion functions for concatenating features. Amounts of experiments have been carried out to prove that full-scale-features makes fair accuracy raising. Second, FFN is applicable to many variants of FCN which could be regarded as a general strategy to improve the segmentation accuracy. Our proposed method is evaluated on PASCAL VOC 2012, and achieves a state-of-art result.