Education
Hierarchical LSTM for Sign Language Translation
Guo, Dan (Hefei University of Technology) | Zhou, Wengang (University of Science and Technology of China) | Li, Houqiang (University of Science and Technology of China) | Wang, Meng (Hefei University of Technology)
Continuous Sign Language Translation (SLT) is a challenging task due to its specific linguistics under sequential gesture variation without word alignment. Current hybrid HMM and CTC (Connectionist temporal classification) based models are proposed to solve frame or word level alignment. They may fail to tackle the cases with messing word order corresponding to visual content in sentences. To solve the issue, this paper proposes a hierarchical-LSTM (HLSTM) encoder-decoder model with visual content and word embedding for SLT. It tackles different granularities by conveying spatio-temporal transitions among frames, clips and viseme units. It firstly explores spatio-temporal cues of video clips by 3D CNN and packs appropriate visemes by online key clip mining with adaptive variable-length. After pooling on recurrent outputs of the top layer of HLSTM, a temporal attention-aware weighting mechanism is proposed to balance the intrinsic relationship among viseme source positions. At last, another two LSTM layers are used to separately recurse viseme vectors and translate semantic. After preserving original visual content by 3D CNN and the top layer of HLSTM, it shortens the encoding time step of the bottom two LSTM layers with less computational complexity while attaining more nonlinearity. Our proposed model exhibits promising performance on singer-independent test with seen sentences and also outperforms the comparison algorithms on unseen sentences.
Unsupervised Selection of Negative Examples for Grounded Language Learning
Pillai, Nisha (University of Maryland, Baltimore County) | Matuszek, Cynthia (University of Maryland, Baltimore County)
There has been substantial work in recent years on grounded language acquisition, in which language and sensor data are used to create a model relating linguistic constructs to the perceivable world. While powerful, this approach is frequently hindered by ambiguities, redundancies, and omissions found in natural language. We describe an unsupervised system that learns language by training visual classifiers, first selecting important terms from object descriptions, then automatically choosing negative examples from a paired corpus of perceptual and linguistic data. We evaluate the effectiveness of each stage as well as the system's performance on the overall learning task.
Hierarchical Attention Flow for Multiple-Choice Reading Comprehension
Zhu, Haichao (Harbin Institute of Technology) | Wei, Furu (Microsoft Research) | Qin, Bing (Harbin Institute of Technology) | Liu, Ting (Harbin Institute of Technology)
In this paper, we focus on multiple-choice reading comprehension which aims to answer a question given a passage and multiple candidate options. We present the hierarchical attention flow to adequately leverage candidate options to model the interactions among passages, questions and candidate options. We observe that leveraging candidate options to boost evidence gathering from the passages play a vital role in this task, which is ignored in previous works. In addition, we explicitly model the option correlations with attention mechanism to obtain better option representations, which are further fed into a bilinear layer to obtain the ranking score for each option. On a large-scale multiple-choice reading comprehension dataset (i.e. the RACE dataset), the proposed model outperforms two previous neural network baselines on both RACE-M and RACE-H subsets and yields the state-of-the-art overall results.
SkipFlow: Incorporating Neural Coherence Features for End-to-End Automatic Text Scoring
Tay, Yi (Nanyang Technological University) | Phan, Minh C. (Nanyang Technological University) | Tuan, Luu Anh (Agency for Science and Technology Research (A*Star), Institute for Infocomm Research) | Hui, Siu Cheung (Nanyang Technological University)
Deep learning has demonstrated tremendous potential for Automatic Text Scoring (ATS) tasks. In this paper, we describe a new neural architecture that enhances vanilla neural network models with auxiliary neural coherence features. Our new method proposes a new SkipFlow mechanism that models relationships between snapshots of the hidden representations of a long short-term memory (LSTM) network as it reads. Subsequently, the semantic relationships between multiple snapshots are used as auxiliary features for prediction. This has two main benefits. Firstly, essays are typically long sequences and therefore the memorization capability of the LSTM network may be insufficient. Implicit access to multiple snapshots can alleviate this problem by acting as a protection against vanishing gradients. The parameters of the SkipFlow mechanism also acts as an auxiliary memory. Secondly, modeling relationships between multiple positions allows our model to learn features that represent and approximate textual coherence. In our model, we call this neural coherence features. Overall, we present a unified deep learning architecture that generates neural coherence features as it reads in an end-to-end fashion. Our approach demonstrates state-of-the-art performance on the benchmark ASAP dataset, outperforming not only feature engineering baselines but also other deep learning models.
S-Net: From Answer Extraction to Answer Synthesis for Machine Reading Comprehension
Tan, Chuanqi (Beihang University) | Wei, Furu (Microsoft Research) | Yang, Nan (Microsoft Research) | Du, Bowen (Beihang University) | Lv, Weifeng (Beihang University) | Zhou, Ming (Microsoft Research)
In this paper, we present a novel approach to machine reading comprehension for the MS-MARCO dataset. Unlike the SQuAD dataset that aims to answer a question with exact text spans in a passage, the MS-MARCO dataset defines the task as answering a question from multiple passages and the words in the answer are not necessary in the passages. We therefore develop an extraction-then-synthesis framework to synthesize answers from extraction results. Specifically, the answer extraction model is first employed to predict the most important sub-spans from the passage as evidence, and the answer synthesis model takes the evidence as additional features along with the question and passage to further elaborate the final answers. We build the answer extraction model with state-of-the-art neural networks for single passage reading comprehension, and propose an additional task of passage ranking to help answer extraction in multiple passages. The answer synthesis model is based on the sequence-to-sequence neural networks with extracted evidences as features. Experiments show that our extraction-then-synthesis method outperforms state-of-the-art methods.
Argument Mining for Improving the Automated Scoring of Persuasive Essays
Nguyen, Huy V. (University of Pittsburgh) | Litman, Diane J. (University of Pittsburgh)
End-to-end argument mining has enabled the development of new automated essay scoring (AES) systems that use argumentative features (e.g., number of claims, number of support relations) in addition to traditional legacy features (e.g., grammar, discourse structure) when scoring persuasive essays. While prior research has proposed different argumentative features as well as empirically demonstrated their utility for AES, these studies have all had important limitations. In this paper we identify a set of desiderata for evaluating the use of argument mining for AES, introduce an end-to-end argument mining system and associated argumentative feature sets, and present the results of several studies that both satisfy the desiderata and demonstrate the value-added of argument mining for scoring persuasive essays.
Medical Exam Question Answering with Large-scale Reading Comprehension
Zhang, Xiao (Tsinghua University) | Wu, Ji (Tsinghua University) | He, Zhiyang (Tsinghua University) | Liu, Xien (iFlytek) | Su, Ying (iFlytek)
Reading and understanding text is one important component in computer aided diagnosis in clinical medicine, also being a major research problem in the field of NLP.  In this work, we introduce a question-answering task called MedQA to study answering questions in clinical medicine using knowledge in a large-scale document collection. The aim of MedQA is to answer real-world questions with large-scale reading comprehension. We propose our solution SeaReader---a modular end-to-end reading comprehension model based on LSTM networks and dual-path attention architecture. The novel dual-path attention models information flow from two perspectives and has the ability to simultaneously read individual documents and integrate information across multiple documents. In experiments our SeaReader achieved a large increase in accuracy on MedQA over competing models.  Additionally, we develop a series of novel techniques to demonstrate the interpretation of the question answering process in SeaReader.
Guiding Exploratory Behaviors for Multi-Modal Grounding of Linguistic Descriptions
Thomason, Jesse (University of Texas at Austin) | Sinapov, Jivko (Tufts University) | Mooney, Raymond J. (University of Texas at Austin) | Stone, Peter (University of Texas at Austin)
A major goal of grounded language learning research is to enable robots to connect language predicates to a robot's physical interactive perception of the world. Coupling object exploratory behaviors such as grasping, lifting, and looking with multiple sensory modalities (e.g., audio, haptics, and vision) enables a robot to ground non-visual words like ``heavy'' as well as visual words like ``red''. A major limitation of existing approaches to multi-modal language grounding is that a robot has to exhaustively explore training objects with a variety of actions when learning a new such language predicate. This paper proposes a method for guiding a robot's behavioral exploration policy when learning a novel predicate based on known grounded predicates and the novel predicate's linguistic relationship to them. We demonstrate our approach on two datasets in which a robot explored large sets of objects and was tasked with learning to recognize whether novel words applied to those objects.
CoChat: Enabling Bot and Human Collaboration for Task Completion
Luo, Xufang (Beihang University) | Lin, Zijia (Microsoft Research) | Wang, Yunhong (Beihang University) | Nie, Zaiqing (Alibaba AI Labs)
Chatbots have drawn significant attention of late in both industry and academia. For most task completion bots in the industry, human intervention is the only means of avoiding mistakes in complex real-world cases. However, to the best of our knowledge, there is no existing research work modeling the collaboration between task completion bots and human workers. In this paper, we introduce CoChat, a dialog management framework to enable effective collaboration between bots and human workers. In CoChat, human workers can introduce new actions at any time to handle previously unseen cases. We propose a memory-enhanced hierarchical RNN (MemHRNN) to handle the one-shot learning challenges caused by instantly introducing new actions in CoChat. Extensive experiments on real-world datasets well demonstrate that CoChat can relieve most of the human workers’ workload, and get better user satisfaction rates comparing to other state-of-the-art frameworks.
Customized Nonlinear Bandits for Online Response Selection in Neural Conversation Models
Liu, Bing (Carnegie Mellon University) | Yu, Tong ( Carnegie Mellon University ) | Lane, Ian ( Carnegie Mellon University ) | Mengshoel, Ole J. (Carnegie Mellon University)
Dialog response selection is an important step towards natural response generation in conversational agents. Existing work on neural conversational models mainly focuses on offline supervised learning using a large set of context-response pairs. In this paper, we focus on online learning of response selection in retrieval-based dialog systems. We propose a contextual multi-armed bandit model with a nonlinear reward function that uses distributed representation of text for online response selection. A bidirectional LSTM is used to produce the distributed representations of dialog context and responses, which serve as the input to a contextual bandit. In learning the bandit, we propose a customized Thompson sampling method that is applied to a polynomial feature space in approximating the reward. Experimental results on the Ubuntu Dialogue Corpus demonstrate significant performance gains of the proposed method over conventional linear contextual bandits. Moreover, we report encouraging response selection performance of the proposed neural bandit model using the Recall@k metric for a small set of online training samples.