This article is about most probably the next generation of neural networks for all computer vision applications: The transformer architecture. You've certainly already heard about this architecture in the field of natural language processing, or NLP, mainly with GPT3 that made a lot of noise in 2020. Transformers can be used as a general-purpose backbone for many different applications and not only NLP. In a couple of minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer by Ze Lio et al. from Microsoft Research . This article may be less flashy than usual as it doesn't really show the actual results of a precise application.
Transformer and its numerous variants achieve excellent performance today in various machine learning applications including sequence-to-sequence modeling, language modeling and computer vision tasks. The baseline transformer is still one of the most common choices for language modeling. Most transformer architectures comprise a basic transformer block in both the encoder and the decoder parts. A basic transformer block employs several layers of multi-head attention-based mechanisms to perform its task. One of the major differences between the transformer variants and the baseline transformer is the number of multi-head attention layers they incorporate.
Facebook AI has built a new architecture for video understanding called TimeSformer. The video architecture is purely based on Transformers. Transformers have become the dominant approach for many natural language processing (NLP) applications such as Machine Translation and General language understanding. TimeSformer was proven to achieve the best-reported numbers on multiple challenging action recognition benchmarks, including the Kinetics-400 action recognition data set. Compared with modern 3D convolutional neural networks, it is nearly three times faster to train requires less than one-tenth of computing inference.
The famous paper "Attention is all you need" in 2017 changed the way we were thinking about attention. Nonetheless, 2020 was definitely the year of transformers! From natural language now they are into computer vision tasks. How did we go from attention to self-attention? Why does the transformer work so damn well? What are the critical components for its success? Read on and find out! In my opinion, transformers are not so hard to grasp.
Real-world data often exhibit imbalanced distributions, where certain target values have significantly fewer observations. Existing techniques for dealing with imbalanced data focus on targets with categorical indices, i.e., different classes. However, many tasks involve continuous targets, where hard boundaries between classes do not exist. We define Deep Imbalanced Regression (DIR) as learning from such imbalanced data with continuous targets, dealing with potential missing data for certain target values, and generalizing to the entire target range. Motivated by the intrinsic difference between categorical and continuous label space, we propose distribution smoothing for both labels and features, which explicitly acknowledges the effects of nearby targets, and calibrates both label and learned feature distributions. We curate and benchmark large-scale DIR datasets from common real-world tasks in computer vision, natural language processing, and healthcare domains. Extensive experiments verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for practical imbalanced regression problems. Code and data are available at https://github.com/YyzHarry/imbalanced-regression.
Recent years have witnessed an increasing attention around style transfer problems (Gatys et al., 2016; Wynen et al., 2018; Jing et al., 2020) in machine learning. In a nutshell, style transfer refers to the transformation of an object according to a target style. It has found numerous applications in computer vision (Ulyanov et al., 2016; Choi et al., 2018; Puy and Pérez, 2019; Yao et al., 2020), natural language processing (Fu et al., 2018) as well as audio signal processing (Grinstein et al., 2018) where objects at hand are contents in which style is inherently part of their perception. Style transfer is one of the key components of data augmentation (Mikołajczyk and Grochowski, 2018) as a means to artificially generate meaningful additional data for the training of deep neural networks. Besides, it has also been shown to be useful for counterbalancing bias in data by producing stylized contents with a well-chosen style (see for instance Geirhos et al. (2019)) in image recognition.
Assessing advertisements, specifically on the basis of user preferences and ad quality, is crucial to the marketing industry. Although recent studies have attempted to use deep neural networks for this purpose, these studies have not utilized image-related auxiliary attributes, which include embedded text frequently found in ad images. We, therefore, investigated the influence of these attributes on ad image preferences. First, we analyzed large-scale real-world ad log data and, based on our findings, proposed a novel multi-step modality fusion network (M2FN) that determines advertising images likely to appeal to user preferences. Our method utilizes auxiliary attributes through multiple steps in the network, which include conditional batch normalization-based low-level fusion and attention-based high-level fusion. We verified M2FN on the AVA dataset, which is widely used for aesthetic image assessment, and then demonstrated that M2FN can achieve state-of-the-art performance in preference prediction using a real-world ad dataset with rich auxiliary attributes.
When a bug manifests in a user-facing application, it is likely to be exposed through the graphical user interface (GUI). Given the importance of visual information to the process of identifying and understanding such bugs, users are increasingly making use of screenshots and screen-recordings as a means to report issues to developers. However, when such information is reported en masse, such as during crowd-sourced testing, managing these artifacts can be a time-consuming process. As the reporting of screen-recordings in particular becomes more popular, developers are likely to face challenges related to manually identifying videos that depict duplicate bugs. Due to their graphical nature, screen-recordings present challenges for automated analysis that preclude the use of current duplicate bug report detection techniques. To overcome these challenges and aid developers in this task, this paper presents Tango, a duplicate detection technique that operates purely on video-based bug reports by leveraging both visual and textual information. Tango combines tailored computer vision techniques, optical character recognition, and text retrieval. We evaluated multiple configurations of Tango in a comprehensive empirical evaluation on 4,860 duplicate detection tasks that involved a total of 180 screen-recordings from six Android apps. Additionally, we conducted a user study investigating the effort required for developers to manually detect duplicate video-based bug reports and compared this to the effort required to use Tango. The results reveal that Tango's optimal configuration is highly effective at detecting duplicate video-based bug reports, accurately ranking target duplicate videos in the top-2 returned results in 83% of the tasks. Additionally, our user study shows that, on average, Tango can reduce developer effort by over 60%, illustrating its practicality.
Many adversarial attacks and defenses have recently been proposed for Deep Neural Networks (DNNs). While most of them are in the white-box setting, which is impractical, a new class of query-based hard-label (QBHL) black-box attacks pose a significant threat to real-world applications (e.g., Google Cloud, Tencent API). Till now, there has been no generalizable and practical approach proposed to defend against such attacks. This paper proposes and evaluates PredCoin, a practical and generalizable method for providing robustness against QBHL attacks. PredCoin poisons the gradient estimation step, an essential component of most QBHL attacks. PredCoin successfully identifies gradient estimation queries crafted by an attacker and introduces uncertainty to the output. Extensive experiments show that PredCoin successfully defends against four state-of-the-art QBHL attacks across various settings and tasks while preserving the target model's overall accuracy. PredCoin is also shown to be robust and effective against several defense-aware attacks, which may have full knowledge regarding the internal mechanisms of PredCoin.
This paper considers a video caption generating network referred to as Semantic Grouping Network (SGN) that attempts (1) to group video frames with discriminating word phrases of partially decoded caption and then (2) to decode those semantically aligned groups in predicting the next word. As consecutive frames are not likely to provide unique information, prior methods have focused on discarding or merging repetitive information based only on the input video. The SGN learns an algorithm to capture the most discriminating word phrases of the partially decoded caption and a mapping that associates each phrase to the relevant video frames - establishing this mapping allows semantically related frames to be clustered, which reduces redundancy. In contrast to the prior methods, the continuous feedback from decoded words enables the SGN to dynamically update the video representation that adapts to the partially decoded caption. Furthermore, a contrastive attention loss is proposed to facilitate accurate alignment between a word phrase and video frames without manual annotations. The SGN achieves state-of-the-art performances by outperforming runner-up methods by a margin of 2.1%p and 2.4%p in a CIDEr-D score on MSVD and MSR-VTT datasets, respectively. Extensive experiments demonstrate the effectiveness and interpretability of the SGN.