Deep Learning
Contextual RNN-GANs for Abstract Reasoning Diagram Generation
Kulharia, Viveka (Indian Institute of Technology, Kanpur) | Ghosh, Arnab (Indian Institute of Technology, Kanpur) | Mukerjee, Amitabha (Indian Institute of Technology, Kanpur) | Namboodiri, Vinay (Indian Institute of Technology, Kanpur) | Bansal, Mohit (University of North Carolina, Chapel Hill)
Understanding object motions and transformations is a core problem in computer science. Modeling sequences of evolving images may provide better representations and models of motion and may ultimately be used for forecasting or simulation. Diagrammatic Abstract Reasoning is an avenue in which diagrams evolve in complex patterns and one needs to infer the underlying pattern sequence and generate the next image in the sequence. For this, we develop a novel Contextual Generative Adversarial Network based on Recurrent Neural Networks (Context-RNN-GANs), where both the generator and the discriminator modules are based on contextual history and the adversarial discriminator guides the generator to produce realistic images for the particular time step in the image sequence. We employ the Context-RNN-GAN model (and its variants) on a novel dataset of Diagrammatic Abstract Reasoning as well as perform initial evaluations on a next-frame prediction task of videos. Empirically, we show that our Context-RNN-GAN model performs competitively with 10th-grade human performance but there is still scope for interesting improvements as compared to college-grade human performance.
Question Difficulty Prediction for READING Problems in Standard Tests
Huang, Zhenya (University of Science and Technology of China) | Liu, Qi (University of Science and Technology of China) | Chen, Enhong (University of Science and Technology of China) | Zhao, Hongke (University of Science and Technology of China) | Gao, Mingyong (iFLYTEK Co., Ltd.) | Wei, Si (iFLYTEK Co., Ltd.) | Su, Yu (Anhui University) | Hu, Guoping (iFLYTEK Co., Ltd.)
Standard tests aim to evaluate the performance of examinees using different tests with consistent difficulties. Thus, a critical demand is to predict the difficulty of each test question before the test is conducted. Existing studies are usually based on the judgments of education experts (e.g., teachers), which may be subjective and labor intensive. In this paper, we propose a novel Test-aware Attention-based Convolutional Neural Network (TACNN) framework to automatically solve this Question Difficulty Prediction (QDP) task for READING problems (a typical problem style in English tests) in standard tests. Specifically, given the abundant historical test logs and text materials of questions, we first design a CNN-based architecture to extract sentence representations for the questions. Then, we utilize an attention strategy to qualify the difficulty contribution of each sentence to questions. Considering the incomparability of question difficulties in different tests, we propose a test-dependent pairwise strategy for training TACNN and generating the difficulty prediction value. Extensive experiments on a real-world dataset not only show the effectiveness of TACNN, but also give interpretable insights to track the attention information for questions.
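The test-dependent pairwise idea can be illustrated with a minimal sketch. It assumes that observed difficulties are only comparable within one test, so training pairs are formed only between questions of the same test, and the model is penalized on the difference between two predictions rather than on absolute values. The function names, the record layout, and the squared-error form of the loss are illustrative assumptions, not details taken from the abstract.

```python
from itertools import combinations


def same_test_pairs(records):
    """Build training pairs from (test_id, question_id, difficulty) records.

    Only questions from the same test are paired, because difficulties
    observed in different tests are assumed not to be comparable.
    Returns a list of (question_a, question_b, observed_gap).
    """
    by_test = {}
    for test_id, question_id, difficulty in records:
        by_test.setdefault(test_id, []).append((question_id, difficulty))

    pairs = []
    for items in by_test.values():
        for (qa, da), (qb, db) in combinations(items, 2):
            pairs.append((qa, qb, da - db))
    return pairs


def pairwise_squared_loss(predicted, pairs):
    """Mean squared error between observed and predicted difficulty gaps.

    `predicted` maps question_id to a model output (hypothetical here).
    """
    if not pairs:
        return 0.0
    total = 0.0
    for qa, qb, observed_gap in pairs:
        predicted_gap = predicted[qa] - predicted[qb]
        total += (observed_gap - predicted_gap) ** 2
    return total / len(pairs)
```

In this sketch a question that appears alone in its test contributes no pairs, which reflects the assumption that its difficulty cannot be anchored against any other question.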
A Hybrid Collaborative Filtering Model with Deep Structure for Recommender Systems
Dong, Xin (Ctrip Travel Network Technology (Shanghai) Co., Limited.) | Yu, Lei (Ctrip Travel Network Technology (Shanghai) Co., Limited.) | Wu, Zhonghuo (Ctrip Travel Network Technology (Shanghai) Co., Limited.) | Sun, Yuxia (Ctrip Travel Network Technology (Shanghai) Co., Limited.) | Yuan, Lingfeng (Ctrip Travel Network Technology (Shanghai) Co., Limited.) | Zhang, Fangxi (Ctrip Travel Network Technology (Shanghai) Co., Limited.)
Collaborative filtering (CF) is a widely used approach in recommender systems to solve many real-world problems. Traditional CF-based methods employ the user-item matrix, which encodes the individual preferences of users for items, to learn to make recommendations. In real applications, the rating matrix is usually very sparse, causing CF-based methods to degrade significantly in recommendation performance. In this case, some improved CF methods utilize the increasing amount of side information to address the data sparsity problem as well as the cold start problem. However, the learned latent factors may not be effective due to the sparse nature of the user-item matrix and the side information. To address this problem, we utilize advances in learning effective representations in deep learning, and propose a hybrid model which jointly performs deep learning of users' and items' latent factors from side information and collaborative filtering from the rating matrix. Extensive experimental results on three real-world datasets show that our hybrid model outperforms other methods in effectively utilizing side information and achieves performance improvement.
Don't Forget the Quantifiable Relationship between Words: Using Recurrent Neural Network for Short Text Topic Discovery
Lu, Heng-Yang (Nanjing University) | Xie, Lu-Yao (Nanjing University) | Kang, Ning (Nanjing University) | Wang, Chong-Jun (Nanjing University) | Xie, Jun-Yuan (Nanjing University)
Short texts have become ubiquitous in daily life, especially since the emergence of social networks. There are countless short texts in online media such as Twitter, online Q&A sites and so on. Discovering topics is quite valuable in various application domains such as content recommendation and text characterization. Traditional topic models like LDA are widely applied to many kinds of tasks, but in the short text scenario these models may perform poorly due to the lack of words. Recently, a popular model named BTM has used word co-occurrence relationships to address the sparsity problem and has proved effective. However, both BTM and its extended models ignore the internal relationships between words. From our perspective, more closely related words should appear in the same topic. Based on this idea, we propose a model named RIBS-TM, which makes use of an RNN for relationship learning and IDF for filtering high-frequency words. Experiments on two real-world short text datasets show the utility of our model.
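As a rough illustration of IDF-based filtering of high-frequency words, the sketch below computes the standard inverse document frequency, log(N / df), over tokenized short texts and drops words whose score falls below a threshold. The function names and the threshold parameter are hypothetical, and the abstract does not state which IDF variant or cutoff RIBS-TM uses.

```python
import math


def idf_scores(documents):
    """Return {word: log(N / document_frequency)} for tokenized documents."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def filter_low_idf(documents, min_idf):
    """Remove words whose IDF is below min_idf (assumed threshold)."""
    scores = idf_scores(documents)
    return [[w for w in doc if scores[w] >= min_idf] for doc in documents]
```

A word that occurs in every document gets an IDF of zero under this definition, so any positive threshold removes it.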
Predicting Latent Narrative Mood Using Audio and Physiologic Data
AlHanai, Tuka Waddah (Massachusetts Institute of Technology) | Ghassemi, Mohammad Mahdi (Massachusetts Institute of Technology)
Inferring the latent emotive content of a narrative requires consideration of para-linguistic cues (e.g. pitch), linguistic content (e.g. vocabulary) and the physiological state of the narrator (e.g. heart-rate). In this study we utilized a combination of auditory, text, and physiological signals to predict the mood (happy or sad) of 31 narrations from subjects engaged in personal story-telling. We extracted 386 audio and 222 physiological features (using the Samsung Simband) from the data. A subset of 4 audio, 1 text, and 5 physiologic features were identified using Sequential Forward Selection (SFS) for inclusion in a Neural Network (NN). These features included subject movement, cardiovascular activity, energy in speech, probability of voicing, and linguistic sentiment (i.e. negative or positive). We explored the effects of introducing our selected features at various layers of the NN and found that the location of these features in the network topology had a significant impact on model performance. To ensure the real-time utility of the model, classification was performed over 5-second intervals. We evaluated our model's performance using leave-one-subject-out cross-validation and compared the performance to 20 baseline models and a NN with all features included in the input layer.
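Leave-one-subject-out cross-validation can be sketched as follows: every fold holds out all samples (for example, all 5-second intervals) from one subject and trains on the rest, so no subject contributes to both sides of a fold. The function name and the representation of samples as a list of subject identifiers are assumptions made for illustration.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per subject.

    subject_ids[i] is the subject that produced sample i.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test
```

For instance, with samples from subjects `["a", "a", "b", "c"]` the generator produces three folds, one per subject, and the two samples from subject `a` are always kept together.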
Examples-Rules Guided Deep Neural Network for Makeup Recommendation
Alashkar, Taleb (Northeastern University) | Jiang, Songyao (Northeastern University) | Wang, Shuyang (Northeastern University) | Fu, Yun (Northeastern University)
In this paper, we consider a fully automatic makeup recommendation system and propose a novel examples-rules guided deep neural network approach. The framework consists of three stages. First, makeup-related facial traits are classified into structured coding. Second, these facial traits are fed into an examples-rules guided deep neural recommendation model which jointly makes use of pairs of Before-After images and makeup artist knowledge. Finally, to visualize the recommended makeup style, an automatic makeup synthesis system is developed as well. To this end, a new Before-After facial makeup database is collected and labeled manually, and the knowledge of makeup artists is modeled by a knowledge base system. The performance of this framework is evaluated through extensive experimental analyses. The experiments validate the automatic facial traits classification, the recommendation effectiveness in statistical and perceptual ways, and the makeup synthesis accuracy, which outperforms the state-of-the-art methods by a large margin. It is also worth noting that, to the best of our knowledge, the proposed framework is a pioneering fully automatic makeup recommendation system.
Efficient Hyperparameter Optimization for Deep Learning Algorithms Using Deterministic RBF Surrogates
Ilievski, Ilija (National University of Singapore) | Akhtar, Taimoor (National University of Singapore) | Feng, Jiashi (National University of Singapore) | Shoemaker, Christine Annette (National University of Singapore)
Automatically searching for optimal hyperparameter configurations is of crucial importance for applying deep learning algorithms in practice. Recently, Bayesian optimization has been proposed for optimizing hyperparameters of various machine learning algorithms. Those methods adopt probabilistic surrogate models like Gaussian processes to approximate and minimize the validation error function of hyperparameter values. However, probabilistic surrogates require accurate estimates of sufficient statistics (e.g., covariance) of the error distribution and thus need many function evaluations with a sizeable number of hyperparameters. This makes them inefficient for optimizing hyperparameters of deep learning algorithms, which are highly expensive to evaluate. In this work, we propose a new deterministic and efficient hyperparameter optimization method that employs radial basis functions as error surrogates. The proposed mixed integer algorithm, called HORD, searches the surrogate for the most promising hyperparameter values through dynamic coordinate search and requires many fewer function evaluations. HORD does well in low dimensions but it is exceptionally better in higher dimensions. Extensive evaluations on MNIST and CIFAR-10 for four deep neural networks demonstrate HORD significantly outperforms the well-established Bayesian optimization methods such as GP, SMAC, and TPE. For instance, on average, HORD is more than 6 times faster than GP-EI in obtaining the best configuration of 19 hyperparameters.
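To make the idea of a deterministic RBF surrogate concrete, here is a minimal sketch that fits an interpolating surrogate to already-evaluated hyperparameter points and their validation errors. The cubic kernel, the linear polynomial tail, and all function names are assumptions for illustration; the abstract only states that radial basis functions are used, and the dynamic coordinate search that proposes new candidates is not shown.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_cubic_rbf(points, values):
    """Fit s(x) = sum_i lam_i * ||x - p_i||^3 + w . x + c through the data.

    Needs at least d + 1 affinely independent points in d dimensions.
    Returns a callable surrogate.
    """
    n, d = len(points), len(points[0])
    size = n + d + 1
    a = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            a[i][j] = math.dist(points[i], points[j]) ** 3
        for k in range(d):
            a[i][n + k] = points[i][k]
            a[n + k][i] = points[i][k]
        a[i][n + d] = 1.0
        a[n + d][i] = 1.0
    rhs = list(values) + [0.0] * (d + 1)
    coeffs = solve_linear_system(a, rhs)
    lam, tail = coeffs[:n], coeffs[n:]

    def surrogate(x):
        s = sum(l * math.dist(x, p) ** 3 for l, p in zip(lam, points))
        s += sum(t * xi for t, xi in zip(tail[:d], x))
        return s + tail[d]

    return surrogate
```

Once fitted, the surrogate is cheap to evaluate, so many candidate configurations can be scored against it before spending an expensive training run on the most promising one.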
Visual Sentiment Analysis by Attending on Local Image Regions
You, Quanzeng (University of Rochester) | Jin, Hailin (Adobe) | Luo, Jiebo (University of Rochester)
Visual sentiment analysis, which studies the emotional response of humans to visual stimuli such as images and videos, has been an interesting and challenging problem. It tries to understand the high-level content of visual data. The success of current models can be attributed to the development of robust algorithms from computer vision. Most of the existing models try to solve the problem by proposing either robust features or more complex models. In particular, visual features from the whole image or video are the main proposed inputs. Little attention has been paid to local areas, which we believe are highly relevant to humans' emotional response to the whole image. In this work, we study the impact of local image regions on visual sentiment analysis. Our proposed model utilizes the recently studied attention mechanism to jointly discover the relevant local regions and build a sentiment classifier on top of these local regions. The experimental results suggest that 1) our model is capable of automatically discovering sentimental local regions of given images and 2) it outperforms existing state-of-the-art algorithms for visual sentiment analysis.
Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture
Nishida, Kyosuke (NTT Corporation) | Sadamitsu, Kugatsu (NTT Corporation) | Higashinaka, Ryuichiro (NTT Corporation) | Matsuo, Yoshihiro (NTT Corporation)
We propose a new deep neural network architecture, TabNet, for table type classification. Table type is essential information for exploring the power of Web tables, and it is important to understand the semantic structures of tables in order to classify them correctly. A table is a matrix of texts, analogous to an image, which is a matrix of pixels, and each text consists of a sequence of tokens. Our hybrid architecture mirrors the structure of tables: its recurrent neural network (RNN) encodes a sequence of tokens for each cell to create a 3d table volume like image data, and its convolutional neural network (CNN) captures semantic features, e.g., the existence of rows describing properties, to classify tables. Experiments using Web tables with various structures and topics demonstrated that TabNet achieved considerable improvements over state-of-the-art methods specialized for table classification and other deep neural network architectures.
Multi-Task Deep Learning for User Intention Understanding in Speech Interaction Systems
Ning, Yishuang (Tsinghua University) | Jia, Jia (Tsinghua University) | Wu, Zhiyong (Tsinghua University) | Li, Runnan (Tsinghua University) | An, Yongsheng (Tsinghua University) | Wang, Yanfeng (Beijing Sougou Science and Technology Development Co., Ltd.) | Meng, Helen (The Chinese University of Hong Kong)
Speech interaction systems have been gaining popularity in recent years. The main purpose of these systems is to generate more satisfactory responses according to users' speech utterances, in which the most critical problem is to analyze user intention. Research shows that user intention conveyed through speech is not only expressed by content, but is also closely related to users' speaking manners (e.g. with or without acoustic emphasis). How to incorporate these heterogeneous attributes to infer user intention remains an open problem. In this paper, we define Intention Prominence (IP) as the semantic combination of focus by text and emphasis by speech, and propose a multi-task deep learning framework to predict IP. Specifically, we first use long short-term memory (LSTM), which is capable of modeling long short-term contextual dependencies, to detect focus and emphasis, and combine the focus and emphasis detection tasks through multi-task learning (MTL) so that each reinforces the performance of the other. We then employ a Bayesian network (BN) to incorporate multimodal features (focus, emphasis, and location reflecting users' dialect conventions) to predict IP based on feature correlations. Experiments on a data set of 135,566 utterances collected from the real-world Sogou Voice Assistant illustrate that our method can outperform the comparison methods by 6.9-24.5% in terms of F1-measure. Moreover, real-world deployment in the Sogou Voice Assistant indicates that our method can improve the performance on user intention understanding by 7%.