"I am alarmed," wrote Henry David Thoreau in "Walking," his 1862 essay, "when it happens that I have walked a mile into the woods bodily, without getting there in spirit." The point of his saunter had been to "forget all my morning occupations, and my obligations to society." Alas: "It sometimes happens I cannot easily shake off the village." With a gentle lashing of self -reproach, he asks: "What business have I in the woods, if I am thinking of something out of the woods?" Thoreau was surely being dogmatic: Must one only think arboreal thoughts on a tree-lined path?
Attention mechanisms have been widely applied in the Visual Question Answering (VQA) task, as they help to focus on the area-of-interest of both visual and textual information. To answer the questions correctly, the model needs to selectively target different areas of an image, which suggests that an attention-based model may benefit from explicit attention supervision. In this work, we address the problem of adding attention supervision to VQA models. Since human attention data is scarce, we first propose a Human Attention Network (HAN) to generate human-like attention maps, trained on a recently released dataset called the Human ATtention Dataset (VQA-HAT). Then, we apply the pre-trained HAN to the VQA v2.0 dataset to automatically produce human-like attention maps for all image-question pairs. We name the resulting collection of maps the Human-Like ATtention (HLAT) dataset. Finally, we apply human-like attention supervision to an attention-based VQA model. The experiments show that adding human-like supervision yields more accurate attention together with better performance, showing a promising future for human-like attention supervision in VQA.
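As a rough illustration of what attention supervision can look like, the sketch below adds a penalty to the answer loss whenever the model's attention map diverges from a human-like target map. The abstract does not state which penalty is used, so the KL divergence, the function names, and the `weight` parameter are all assumptions made for this example.

```python
import math


def normalize(attention_map):
    """Scale non-negative attention values so they sum to 1.

    Assumes the map has at least one positive entry.
    """
    total = sum(attention_map)
    return [a / total for a in attention_map]


def attention_supervision_loss(model_attention, human_like_attention, eps=1e-12):
    """KL divergence from the human-like target map to the model's map.

    Assumption: KL is one common choice of penalty; the source does not
    say which loss is used.
    """
    p = normalize(human_like_attention)  # target distribution
    q = normalize(model_attention)       # model distribution
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def total_loss(answer_loss, model_attention, human_like_attention, weight=1.0):
    """Answer loss plus a weighted attention penalty (weight is hypothetical)."""
    penalty = attention_supervision_loss(model_attention, human_like_attention)
    return answer_loss + weight * penalty


# Toy example: attention over four image regions, flattened to a list.
human_like = [0.7, 0.1, 0.1, 0.1]
well_aligned = [0.7, 0.1, 0.1, 0.1]
misaligned = [0.1, 0.1, 0.1, 0.7]

aligned_penalty = attention_supervision_loss(well_aligned, human_like)
misaligned_penalty = attention_supervision_loss(misaligned, human_like)
```

In this toy setup the aligned map incurs essentially no penalty, while the misaligned one is penalized, which is the behavior an attention-supervision term is meant to provide.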
Tim Wu is an expert on concentrations of power. An author, activist and lawyer, he is most famous for coining the phrase "net neutrality" – the idea that the oligopoly that owns our internet infrastructure shouldn't charge differently for different kinds of data. In his new book, he targets another kind of corporate domination: the industry that monopolises our attention. According to Wu, this industry emerged from the first world war. In 1914 Germany could mobilise 4.5 million men; the best Britain could do was 700,000.
Table-to-text generation aims to generate a description for a factual table, which can be viewed as a set of field-value records. To encode both the content and the structure of a table, we propose a novel structure-aware seq2seq architecture which consists of a field-gating encoder and a description generator with dual attention. In the encoding phase, we update the cell memory of the LSTM unit using a field gate and its corresponding field value in order to incorporate field information into the table representation. In the decoding phase, we propose a dual attention mechanism, which contains word-level attention and field-level attention, to model the semantic relevance between the generated description and the table. We conduct experiments on the WIKIBIO dataset, which contains over 700k biographies and corresponding infoboxes from Wikipedia. The attention visualizations and case studies show that our model is capable of generating coherent and informative descriptions based on a comprehensive understanding of both the content and the structure of a table. Automatic evaluations also show that our model outperforms the baselines by a large margin. Code for this work is available at https://github.com/tyliupku/wiki2bio.
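The idea of dual attention can be sketched in a few lines. The example below computes one set of weights from word-level encoder states and another from field embeddings, then combines them. The abstract does not spell out the scoring function or the combination rule, so the dot-product scores and the renormalized elementwise product used here are assumptions, as are all function and variable names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def dual_attention(decoder_state, word_states, field_embeds):
    """Combine word-level and field-level attention over table positions.

    Assumptions: dot-product scoring, and a combination rule that
    multiplies the two weight vectors elementwise and renormalizes.
    All vectors are assumed to share the same dimension.
    """
    word_weights = softmax([dot(decoder_state, h) for h in word_states])
    field_weights = softmax([dot(decoder_state, f) for f in field_embeds])
    joint = [w * f for w, f in zip(word_weights, field_weights)]
    total = sum(joint)
    weights = [j / total for j in joint]
    # Context vector: weighted sum of the word-level encoder states.
    dim = len(word_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, word_states)) for d in range(dim)
    ]
    return weights, context


# Toy table with three positions and 2-dimensional vectors.
decoder_state = [1.0, 0.0]
word_states = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
field_embeds = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

weights, context = dual_attention(decoder_state, word_states, field_embeds)
```

Under this combination rule, a table position receives a high final weight only when both its word content and its field match the decoder state, which is one way to let structure and content jointly guide generation.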
Early knowledge-based systems did not incorporate high-bandwidth I/O because of the performance limitations of computers of that era. Today, intelligent agents and robots running on much more powerful computers can incorporate vision, sound, network, sonar and other modes of input. These additional inputs provide much more information about the environment, but they also bring additional problems related to the control of perception. Perceptual input streams (called modes in the psychology literature) can have greatly varying bandwidth. In people, the sense of touch has a low bandwidth, while the sense of vision has a very high bandwidth.