Shen, Fumin
Exact Adversarial Attack to Image Captioning via Structured Output Learning with Latent Variables
Xu, Yan, Wu, Baoyuan, Shen, Fumin, Fan, Yanbo, Zhang, Yong, Shen, Heng Tao, Liu, Wei
In this work, we study the robustness of a CNN+RNN based image captioning system subjected to adversarial noises. We propose to fool an image captioning system into generating targeted partial captions for an image polluted by adversarial noises, even if the targeted captions are totally irrelevant to the image content. A partial caption is one in which the words at some locations are observed, while the words at other locations are not restricted. This is the first work to study exact adversarial attacks of targeted partial captions. Due to the sequential dependencies among words in a caption, we formulate the generation of adversarial noises for targeted partial captions as a structured output learning problem with latent variables. Both the generalized expectation maximization algorithm and structural SVMs with latent variables are then adopted to optimize the problem. The proposed methods generate very successful attacks on three popular CNN+RNN based image captioning models. Furthermore, the proposed attack methods are used to understand the inner mechanism of image captioning systems, providing guidance for further improving automatic image captioning systems towards human captioning.
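The notion of a partial caption in the abstract above can be illustrated with a small sketch. This is only an illustration of the definition, not the attack itself: the function name `matches_partial_caption` and the dictionary representation (position mapped to required word) are assumptions made here, not taken from the paper.

```python
def matches_partial_caption(caption, partial):
    """Check whether a generated caption satisfies a targeted partial caption.

    caption: list of words produced by a captioning model.
    partial: dict mapping a word position to the word observed there
             (hypothetical representation); positions not listed are
             unrestricted, i.e. they play the role of latent variables.
    """
    for position, word in partial.items():
        # An observed position beyond the caption's end cannot be satisfied.
        if position >= len(caption) or caption[position] != word:
            return False
    return True


generated = "a red bus on a street".split()
target = {1: "red", 2: "bus"}  # positions 0, 3, 4, 5 are unrestricted
print(matches_partial_caption(generated, target))
```

An attack in this setting would count as successful only when every observed position matches, whatever words fill the unrestricted positions.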
Discovering and Distinguishing Multiple Visual Senses for Polysemous Words
Yao, Yazhou (University of Technology Sydney) | Zhang, Jian (University of Technology Sydney) | Shen, Fumin (University of Electronic Science and Technology of China) | Yang, Wankou (Southeast University) | Huang, Pu (Nanjing University of Posts and Telecommunications) | Tang, Zhenmin (Nanjing University of Science and Technology)
To reduce the dependence on labeled data, there have been increasing research efforts on learning visual classifiers by exploiting web images. One issue that limits their performance is polysemy. In this work, we present a novel framework that addresses polysemy by allowing sense-specific diversity in search results. Specifically, we first discover a list of possible semantic senses to retrieve sense-specific images. Then we merge visually similar semantic senses and prune noise by using the retrieved images. Finally, we train a visual classifier for each selected semantic sense and use the learned sense-specific classifiers to distinguish multiple visual senses. Extensive experiments on classifying images into sense-specific categories and re-ranking search results demonstrate the superiority of our proposed approach.
Compressed K-Means for Large-Scale Clustering
Shen, Xiaobo (Nanjing University of Science and Technology) | Liu, Weiwei (University of Technology Sydney) | Tsang, Ivor (University of Technology Sydney) | Shen, Fumin (University of Electronic Science and Technology of China) | Sun, Quan-Sen (Nanjing University of Science and Technology)
Large-scale clustering has been widely used in many applications, and has received much attention. Most existing clustering methods suffer from both expensive computation and memory costs when applied to large-scale datasets. In this paper, we propose a novel clustering method, dubbed compressed k-means (CKM), for fast large-scale clustering. Specifically, high-dimensional data are compressed into short binary codes, which are well suited for fast clustering. CKM enjoys two key benefits: 1) storage can be significantly reduced by representing data points as binary codes; 2) distance computation is very efficient using the Hamming metric between binary codes. We propose to jointly learn binary codes and clusters within one framework. Extensive experimental results on four large-scale datasets, including two million-scale datasets, demonstrate that CKM outperforms the state-of-the-art large-scale clustering methods in terms of both computation and memory cost, while achieving comparable clustering accuracy.
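To make the role of the Hamming metric concrete, the following is a minimal sketch of one k-means-style iteration over data that is already in binary form. It is not the CKM algorithm: CKM learns the codes and clusters jointly, whereas this sketch assumes the codes are given. The function names and the majority-vote center update are assumptions chosen for illustration.

```python
import numpy as np


def hamming_distances(codes, centers):
    """Pairwise Hamming distances between {0,1} codes and {0,1} centers."""
    # Compare every code with every center bit by bit and count mismatches.
    return (codes[:, None, :] != centers[None, :, :]).sum(axis=2)


def assign_and_update(codes, centers):
    """One iteration: assign codes to the nearest center, then update centers."""
    labels = hamming_distances(codes, centers).argmin(axis=1)
    new_centers = centers.copy()
    for k in range(centers.shape[0]):
        members = codes[labels == k]
        if len(members) > 0:
            # A bitwise majority vote minimizes the total Hamming distance
            # from the center to its members (ties are broken towards 1).
            new_centers[k] = (members.mean(axis=0) >= 0.5).astype(centers.dtype)
    return labels, new_centers


codes = np.array([[0, 0, 0, 0],
                  [0, 0, 0, 1],
                  [1, 1, 1, 1],
                  [1, 1, 1, 0]], dtype=np.uint8)
centers = np.array([[0, 0, 0, 0],
                    [1, 1, 1, 1]], dtype=np.uint8)
labels, centers = assign_and_update(codes, centers)
print(labels)
print(centers)
```

In practice the codes would be packed into machine words so that the distance reduces to an XOR followed by a bit count, which is where the speed and storage benefits mentioned in the abstract come from; the unpacked arrays here are used only for readability.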