Multiple Pretext-Task for Self-Supervised Learning via Mixing Multiple Image Transformations

Shin'ya Yamaguchi, Sekitoshi Kanai, Tetsuya Shioda, Shoichiro Takeda
NTT, Tokyo, Japan
{shinya.yamaguchi.mw,sekitoshi.kanai.fu,tetsuya.shioda.yf,shoichiro.takeda.us}@hco.ntt.co.jp

Abstract

Self-supervised learning is one of the most promising approaches to learning representations that capture semantic features in images without any manual annotation cost. To learn useful representations, a self-supervised model solves a pretext task, which is defined by the data itself. Among the many pretext tasks, the rotation prediction task (Rotation) achieves better representations for solving various target tasks despite its simple implementation. However, we found that Rotation can fail to capture semantic features related to image textures and colors. To tackle this problem, we introduce a learning technique called multiple pretext-task for self-supervised learning (MP-SSL), which solves multiple pretext tasks in addition to Rotation simultaneously. To capture features of textures and colors, we employ image enhancement transformations (e.g., sharpening and solarizing) as the additional pretext tasks. MP-SSL efficiently trains a model by leveraging a Frank-Wolfe based multi-task training algorithm. Our experimental results show that MP-SSL models outperform Rotation on multiple standard benchmarks and achieve state-of-the-art performance on Places-205.

1. Introduction

Convolutional neural networks (CNNs) [27, 16, 44] are widely adopted to solve many target tasks in computer vision applications such as object recognition [30], semantic segmentation [4], and object detection [42]. However, these successes depend on supervised training of CNNs with vast amounts of labeled data [43], which is expensive and often impractical because of the manual annotation cost. Since the cost of labeled data limits the practical applications of CNNs, a number of studies focus on training techniques that alleviate the need for large labeled datasets; these techniques include transfer learning, semi-supervised learning, and self-supervised learning.

Figure 1: A demonstration describing our motivation to modify self-supervised learning by predicting rotations of images (Rotation).
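As a rough illustration of the multi-pretext-task idea (a sketch under our own assumptions, not the authors' implementation), the following PyTorch code trains a shared encoder with one prediction head per pretext task, using rotation prediction and solarization prediction as the two tasks. The ResNet-18 backbone, the choice of transformations, and the fixed task weights are illustrative; in particular, the fixed weighting here stands in for the Frank-Wolfe based multi-task weighting described above.

import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms.functional as TF

# Shared backbone; the classification layer is replaced so it outputs features.
backbone = models.resnet18()
feat_dim = backbone.fc.in_features
backbone.fc = nn.Identity()

# One linear head per pretext task.
rotation_head = nn.Linear(feat_dim, 4)   # predicts 0/90/180/270 degrees
solarize_head = nn.Linear(feat_dim, 2)   # predicts solarized vs. original

params = (list(backbone.parameters())
          + list(rotation_head.parameters())
          + list(solarize_head.parameters()))
optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def rotation_batch(x):
    # Build rotated copies of x and the corresponding rotation labels.
    rotated = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(x.size(0))
    return torch.cat(rotated), labels

def solarize_batch(x, threshold=0.5):
    # Build original/solarized copies of x and binary labels.
    solarized = TF.solarize(x, threshold)
    labels = torch.cat([torch.zeros(x.size(0)), torch.ones(x.size(0))]).long()
    return torch.cat([x, solarized]), labels

def training_step(x, w_rot=0.5, w_sol=0.5):
    # One multi-pretext-task update; the fixed weights w_rot and w_sol are a
    # simplification of the Frank-Wolfe based task weighting in the paper.
    xr, yr = rotation_batch(x)
    xs, ys = solarize_batch(x)
    loss_rot = criterion(rotation_head(backbone(xr)), yr)
    loss_sol = criterion(solarize_head(backbone(xs)), ys)
    loss = w_rot * loss_rot + w_sol * loss_sol
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random images standing in for unlabeled data.
images = torch.rand(8, 3, 32, 32)
print(training_step(images))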