National Tsing Hua University
Tap and Shoot Segmentation
Chen, Ding-Jie (National Tsing Hua University) | Chien, Jui-Ting (National Tsing Hua University) | Chen, Hwann-Tzong (National Tsing Hua University) | Chang, Long-Wen (National Tsing Hua University)
We present a new segmentation method that leverages latent photographic information available at the moment of taking pictures. Photography on a portable device is often done by tapping to focus before shooting the picture. This tap-and-shoot interaction for photography not only specifies the region of interest but also yields useful focus/defocus cues for image segmentation. However, most of the previous interactive segmentation methods address the problem of image segmentation in a post-processing scenario without considering the action of taking pictures. We propose a learning-based approach to this new tap-and-shoot scenario of interactive segmentation. The experimental results on various datasets show that, by training a deep convolutional network to integrate the selection and focus/defocus cues, our method can achieve higher segmentation accuracy in comparison with existing interactive segmentation methods.
On Organizing Online Soirees with Live Multi-Streaming
Shen, Chih-Ya (National Tsing Hua University) | Fotsing, C. P. Kankeu (Academia Sinica) | Yang, De-Nian (Academia Sinica) | Chen, Yi-Shin (National Tsing Hua University) | Lee, Wang-Chien (The Pennsylvania State University)
The popularity of live streaming has led to the explosive growth in new video contents and social communities on emerging platforms such as Facebook Live and Twitch. Viewers on these platforms are able to follow multiple streams of live events simultaneously, while engaging in discussions with friends. However, existing approaches for selecting live streaming channels still focus on satisfying individual preferences of users, without considering the need to accommodate real-time social interactions among viewers and to diversify the content of streams. In this paper, therefore, we formulate a new Social-aware Diverse and Preferred Live Streaming Channel Query (SDSQ) that jointly selects a set of diverse and preferred live streaming channels and a group of socially tight viewers. We prove that SDSQ is NP-hard and inapproximable within any factor, and design SDSSel, a 2-approximation algorithm with a guaranteed error bound. We perform a user study on Twitch with 432 participants to validate the need for SDSQ and the usefulness of SDSSel. We also conduct large-scale experiments on real datasets to demonstrate the superiority of the proposed algorithm over several baselines in terms of solution quality and efficiency.
Self-View Grounding Given a Narrated 360° Video
Chou, Shih-Han (National Tsing Hua University) | Chen, Yi-Chun (National Tsing Hua University) | Zeng, Kuo-Hao (National Tsing Hua University) | Hu, Hou-Ning (National Tsing Hua University) | Fu, Jianlong (Microsoft Research, Beijing) | Sun, Min (National Tsing Hua University)
Narrated 360° videos are typically provided in many touring scenarios to mimic real-world experience. However, previous work has shown that smart assistance (i.e., providing visual guidance) can significantly help users follow the Normal Field of View (NFoV) corresponding to the narrative. In this project, we aim to automatically ground the NFoVs of a 360° video given subtitles of the narrative (referred to as "NFoV-grounding"). We propose a novel Visual Grounding Model (VGM) to implicitly and efficiently predict the NFoVs given the video content and subtitles. Specifically, at each frame, we efficiently encode the panorama into a feature map of candidate NFoVs using a Convolutional Neural Network (CNN), and encode the subtitles into the same hidden space using an RNN with Gated Recurrent Units (GRU). Then, we apply soft attention over the candidate NFoVs to trigger a sentence decoder that minimizes the reconstruction loss between the generated and given sentences. Finally, we take the candidate NFoV with the maximum attention as the predicted NFoV, without any human supervision. To train the VGM more robustly, we also generate a reverse sentence conditioned on one minus the soft attention, so that the attention focuses on candidate NFoVs less relevant to the given sentence. The negative log reconstruction loss of the reverse sentence (referred to as the "irrelevant loss") is jointly minimized to encourage the reverse sentence to differ from the given sentence. To evaluate our method, we collect the first narrated 360° video dataset and achieve state-of-the-art NFoV-grounding performance.
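The selection step described above — softmax attention over candidate NFoV features against the sentence encoding, plus the "one minus the soft attention" reverse branch — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all feature shapes and function names are assumptions, and the real model operates on learned CNN/GRU embeddings rather than raw lists.

```python
import math

def soft_attention(nfov_features, sentence_vec):
    """Softmax of dot products between each candidate NFoV feature and the sentence encoding."""
    scores = [sum(f * s for f, s in zip(feat, sentence_vec)) for feat in nfov_features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def ground_nfov(nfov_features, sentence_vec):
    """Pick the candidate NFoV with maximum attention weight (the grounding)."""
    weights = soft_attention(nfov_features, sentence_vec)
    best = max(range(len(weights)), key=lambda i: weights[i])
    return best, weights

def reverse_attention(weights):
    """Attention for the irrelevant-loss branch: one minus the weights, renormalized."""
    inv = [1.0 - w for w in weights]
    total = sum(inv)
    return [v / total for v in inv]
```

Under this sketch, the reverse attention is largest exactly on the candidates the forward attention deems least relevant, which is what drives the reverse sentence away from the given one.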
Supporting ESL Writing by Prompting Crowdsourced Structural Feedback
Huang, Yi-Ching (National Taiwan University) | Huang, Jiunn-Chia (National Taiwan University) | Wang, Hao-Chuan (National Tsing Hua University) | Hsu, Jane Yung-jen (National Taiwan University)
Writing is challenging, especially for non-native speakers. To support English as a Second Language (ESL) writing, we propose StructFeed, which allows native speakers to annotate the topic sentence and relevant keywords in a text, and which generates writing hints based on the principle of paragraph unity. First, we compared our crowd-based method with three naive machine learning (ML) methods and achieved the best performance on identifying the topic sentence and irrelevant sentences in an article. Next, we evaluated the StructFeed system against two other feedback-generation mechanisms: feedback generated by one expert and by one crowd worker. The results showed that people who received feedback from StructFeed achieved the greatest improvement after revision.
Leveraging Video Descriptions to Learn Video Question Answering
Zeng, Kuo-Hao (Stanford University and National Tsing Hua University) | Chen, Tseng-Hung (National Tsing Hua University) | Chuang, Ching-Yao (National Tsing Hua University) | Liao, Yuan-Hong (National Tsing Hua University) | Niebles, Juan Carlos (Stanford University) | Sun, Min (National Tsing Hua University)
We propose a scalable approach to learning video-based question answering (QA): answering a free-form natural language question about the contents of a video. Our approach automatically harvests a large number of videos and descriptions freely available online. Then, a large number of candidate QA pairs are automatically generated from the descriptions rather than manually annotated. Next, we use these candidate QA pairs to train a number of video-based QA methods extended from MN (Sukhbaatar et al. 2015), VQA (Antol et al. 2015), SA (Yao et al. 2015), and SS (Venugopalan et al. 2015). In order to handle imperfect candidate QA pairs, we propose a self-paced learning procedure that iteratively identifies such pairs and mitigates their effect in training. Finally, we evaluate performance on manually generated video-based QA pairs. The results show that our self-paced learning procedure is effective, and the extended SS model outperforms various baselines.
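The self-paced idea — start training on easy (low-loss) examples and gradually admit harder, possibly noisy, automatically generated QA pairs — can be sketched with a simple threshold schedule. This is a generic illustration of self-paced learning, not the paper's exact formulation; the threshold values and growth factor are arbitrary assumptions.

```python
def self_paced_select(losses, threshold):
    """Binary weights: 1 keeps an example this round, 0 skips it."""
    return [1 if loss < threshold else 0 for loss in losses]

def self_paced_schedule(losses, init_threshold=0.5, growth=2.0, rounds=3):
    """Return the index sets selected in each round as the threshold grows.

    Early rounds train only on confident (low-loss) pairs; later rounds
    gradually admit harder, possibly noisy, candidate QA pairs.
    """
    selections, threshold = [], init_threshold
    for _ in range(rounds):
        weights = self_paced_select(losses, threshold)
        selections.append([i for i, w in enumerate(weights) if w])
        threshold *= growth
    return selections
```

In a full training loop, the per-example losses would be recomputed after each round, so a pair the model learns to fit well can move into the "easy" set over time.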
Generate Believable Causal Plots with User Preferences Using Constrained Monte Carlo Tree Search
Soo, Von-Wun (National Tsing Hua University) | Lee, Chi-Mou (National Tsing Hua University) | Chen, Tai-Hsun (National Tsing Hua University)
We construct a large-scale causal knowledge base in terms of Fabula elements by extracting causal links from ConceptNet5, an existing commonsense ontology. We design a Constrained Monte Carlo Tree Search (cMCTS) algorithm that allows users to specify positive and negative concepts that should or should not appear in the generated stories, and that can find a believable causal story plot. We demonstrate the merits of cMCTS through experiments and discuss remedy strategies for cases in which cMCTS generates incoherent causal plots.
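The constraint handling a cMCTS-style planner needs can be sketched as two checks: prune candidate plot events that mention a user-banned (negative) concept during tree expansion, and accept a finished plot only if every user-required (positive) concept appears. This is a hedged sketch of the constraint logic only, not the search itself; all names and the event-as-concept-set representation are illustrative assumptions.

```python
def is_expandable(event_concepts, negative_concepts):
    """An event node may be expanded only if it avoids every negative concept."""
    return not (set(event_concepts) & set(negative_concepts))

def satisfies_preferences(plot_events, positive_concepts):
    """A complete plot must mention every positive concept at least once."""
    mentioned = set().union(*plot_events) if plot_events else set()
    return set(positive_concepts) <= mentioned
```

In the search loop, `is_expandable` would filter children before the usual MCTS selection/expansion/rollout/backpropagation steps, and `satisfies_preferences` would gate which completed rollouts count as successful plots.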
Learning Interrogation Strategies while Considering Deceptions in Detective Interactive Stories
Chen, Guan-Yi (National Tsing Hua University) | Kao, Edward C.-C. (National Tsing Hua University) | Soo, Von-Wun (National Tsing Hua University)
How interactive characters should select appropriate dialogues remains an open issue in related research areas. In this paper we propose a reinforcement-learning approach to learning the strategy of an interrogation dialogue conducted by one virtual agent toward another. The emotion variation of the suspect agent is modeled with a hazard function, and the detective agent must learn its interrogation strategies based on the emotion state of the suspect agent. We evaluate several reinforcement-learning reward schemes to choose a proper reward for the dialogue. Our contribution is twofold. First, we propose a new reinforcement-learning framework for modeling dialogue strategies. Second, we bring background knowledge and the emotion states of agents into the dialogue strategies. The resulting dialogue strategy in our experiments is sensitive in detecting lies from the suspect, and with it the interrogator receives more correct answers.
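The reinforcement-learning machinery such a dialogue strategy could rest on can be sketched with a single tabular Q-learning update, where states might be discretized suspect emotion levels and actions the interrogator's dialogue moves. This is a generic sketch, not the paper's model: the action names and parameter values are invented for illustration.

```python
# Illustrative action set for the interrogator (hypothetical names).
ACTIONS = ["press", "empathize", "confront"]

def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    q is a dict mapping (state, action) pairs to values; missing entries
    are treated as 0.0. Returns the updated Q(s,a).
    """
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]
```

The choice of the `reward` argument is exactly what the abstract's comparison of reward schemes would vary, e.g. rewarding truthful answers elicited versus penalizing rising suspect distress.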