Natural Language Processing (NLP) has tremendous real-world applications in information extraction, natural language understanding, and natural language generation. Comparing the similarity between natural language texts is essential to many information extraction applications such as Google search, Spotify's Podcast search, and Home Depot's product search. The semantic textual similarity (STS) problem attempts to compare two texts and decide whether they are similar in meaning. It is a notoriously hard problem because of the nuances of natural language: two texts can be similar in meaning despite not having a single word in common! While this challenge has existed for a long time, recent advancements in NLP have produced many algorithms to tackle it. Some methods take in two texts as input and directly provide a score on how similar the two texts are.
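To make the "no word in common" point concrete, here is a minimal sketch of a purely lexical baseline: cosine similarity over word counts. The function name `bow_cosine` and the example sentences are illustrative assumptions, not part of any library or of the methods discussed here.

```python
import math
import re
from collections import Counter


def bow_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(re.findall(r"[a-z0-9']+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z0-9']+", text_b.lower()))
    # Only words present in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Similar meaning, no shared words: the lexical score is zero.
paraphrase_score = bow_cosine("The film was great", "I loved that movie")
# Opposite meaning, three of four words shared: the lexical score is high.
overlap_score = bow_cosine("The film was great", "The film was terrible")
```

A word-overlap score ranks the contradictory pair far above the paraphrase pair, which is exactly the failure that semantic similarity methods try to avoid.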
Multi-scale deep CNN architectures [1, 2, 3] successfully capture both fine- and coarse-level image descriptors for the visual similarity task, but they come with expensive memory overhead and latency. In this paper, we propose a competing novel CNN architecture, called MILDNet, whose merit is that it is far more compact (about 3 times smaller). Inspired by the fact that successive CNN layers represent the image with increasing levels of abstraction, we compressed our deep ranking model to a single CNN by coupling activations from multiple intermediate layers along with the last layer. Trained on the famous Street2shop dataset, we demonstrate that our approach performs as well as the current state-of-the-art models with only one third of the parameters, model size, and training time, and with a significant reduction in inference time. The significance of intermediate layers for the image retrieval task has also been shown on the popular Holidays, Oxford, and Paris datasets. So even though our experiments are done in the e-commerce domain, the approach is applicable to other domains as well. We further did an ablation study to validate our hypothesis by checking the impact of adding each intermediate layer. With this we also present two more useful variants of MILDNet: a mobile model (12 times smaller) for on-edge devices and a compactly featured model (512-d feature embeddings) for systems with less RAM and to reduce the ranking cost. Further, we present an intuitive way to automatically create a tailored in-house triplet training dataset, which is very hard to create manually. This solution can also be deployed as an all-inclusive visual similarity solution. Finally, we present our entire production-level architecture, which currently powers visual similarity at Fynd.
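The idea of coupling activations from several intermediate layers with the last layer can be sketched as follows. This is a minimal illustration under stated assumptions, not the MILDNet implementation: the use of global average pooling, the L2 normalization, the layer shapes, and the names `global_average_pool` and `multi_layer_descriptor` are all hypothetical choices made for the example, and random arrays stand in for real CNN activations.

```python
from typing import Sequence

import numpy as np


def global_average_pool(feature_map: np.ndarray) -> np.ndarray:
    """Collapse an (H, W, C) activation map into a C-dimensional vector."""
    return feature_map.mean(axis=(0, 1))


def multi_layer_descriptor(feature_maps: Sequence[np.ndarray]) -> np.ndarray:
    """Pool each layer's activations, concatenate them, and L2-normalize."""
    pooled = [global_average_pool(fm) for fm in feature_maps]
    descriptor = np.concatenate(pooled)
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor


rng = np.random.default_rng(0)
# Hypothetical activations: two intermediate layers plus the last layer.
maps = [
    rng.random((56, 56, 64)),   # early layer: fine-grained detail
    rng.random((28, 28, 128)),  # middle layer
    rng.random((7, 7, 512)),    # last layer: coarse, abstract features
]
descriptor = multi_layer_descriptor(maps)  # one vector of length 64 + 128 + 512
```

Because every layer comes from the same forward pass, a single network supplies both fine and coarse descriptors, which is the source of the savings relative to running several networks at different scales.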
The 2020 European Conference on Computer Vision took place online, from 23 to 28 August, and consisted of 1360 papers, divided into 104 orals, 160 spotlights, and the remaining 1096 papers as posters. As is the case in recent years with ML and CV conferences, the huge number of papers can be overwhelming at times. Similar to my CVPR2020 post, to get a grasp of the general trends of the conference this year, I will present in this blog post a sort of snapshot of the conference by summarizing some papers (& listing some) that grabbed my attention. Disclaimer: This post is not a representation of the papers and subjects presented in ECCV 2020; it is just a personal overview of what I found interesting. The statistics presented in this section are taken from the official Opening & Awards presentation. Let's start with some general statistics: The trends of earlier years continued, with a more than 200% increase in submitted papers compared to the 2018 conference, and with a similar number of papers to CVPR 2020. As expected, this increase is joined by a corresponding increase in the number of reviewers and area chairs to accommodate this expansion. As expected, the majority of the accepted papers focus on topics related to deep learning, recognition, detection, and understanding. Similar to CVPR 2020, we see an increasing interest in growing areas such as label-efficient methods (e.g., unsupervised learning) and low-level vision. In terms of institutions, similar to ICML this year, Google takes the lead with 180 authors, followed by The Chinese University of Hong Kong with 140 authors and Peking University with 110 authors. In the next sections, we'll present some paper summaries by subject. The task of object detection consists of localizing and classifying the objects visible in an input image.
The popular framework for object detection consists of pre-defining a set of boxes (i.e., a set of geometric priors like anchors or region proposals), which are first classified, followed by a regression step to adjust the dimensions of the predefined boxes, and then a post-processing step to remove duplicate predictions.
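The post-processing step that removes duplicate predictions is commonly greedy non-maximum suppression (NMS). Here is a minimal NumPy sketch of it; the function names, the 0.5 IoU threshold, and the toy boxes are assumptions for illustration, and the box format is assumed to be (x1, y1, x2, y2).

```python
import numpy as np


def iou(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Clip at zero so that non-overlapping boxes get zero intersection.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.5) -> list:
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it."""
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep


boxes = np.array(
    [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float
)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)  # the second box overlaps the first and is dropped
```

In this toy example the first two boxes have an IoU of about 0.82, so only the higher-scoring one survives, while the distant third box is kept.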
We present a similar image retrieval (SIR) platform that is used to quickly discover visually similar products in a catalog of millions. Given the size, diversity, and dynamism of our catalog, product search poses many challenges. It can be addressed by building supervised models that tag product images with labels representing themes and later retrieving the images by label. This approach suffices for common and perennial themes like "white shirt" or "lifestyle image of TV". It does not work for new themes such as "e-cigarettes", hard-to-define ones such as "image with a promotional badge", or ones with a short relevance span such as "Halloween costumes". SIR is ideal for such cases because it allows us to search by an example, not a pre-defined theme. We describe the steps - embedding computation, encoding, and indexing - that power the approximate nearest neighbor search back-end. We also highlight two applications of SIR. The first one is related to the detection of products with various types of potentially objectionable themes. This application is run with a sense of urgency, hence the typical time frame to train and bootstrap a model is not permitted. Also, these themes are often short-lived based on current trends, hence spending resources to build a lasting model is not justified. The second application is a variant item detection system where SIR helps discover visual variants that are hard to find through text search. We analyze the performance of SIR in the context of these applications.
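The encode-then-index pattern behind search by example can be sketched with a toy index. This is an illustration under stated assumptions, not the SIR back-end: the abstract does not say which encoding or index is used, so the sketch substitutes sign-of-random-projection binary codes compared by Hamming distance. The class name `HammingIndex`, the 64-bit code length, and the random vectors standing in for image embeddings are all hypothetical.

```python
import numpy as np


class HammingIndex:
    """Toy index: random-projection sign codes compared by Hamming distance."""

    def __init__(self, dim: int, n_bits: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Each column is a random hyperplane; one bit per hyperplane.
        self.planes = rng.standard_normal((dim, n_bits))
        self.codes = np.empty((0, n_bits), dtype=bool)

    def encode(self, embeddings: np.ndarray) -> np.ndarray:
        """Map float embeddings to compact binary codes."""
        return (np.atleast_2d(embeddings) @ self.planes) > 0

    def add(self, embeddings: np.ndarray) -> None:
        """Encode a batch of embeddings and append the codes to the index."""
        self.codes = np.vstack([self.codes, self.encode(embeddings)])

    def search(self, query: np.ndarray, k: int = 5) -> list:
        """Return (item index, Hamming distance) for the k closest codes."""
        q = self.encode(query)[0]
        distances = np.count_nonzero(self.codes != q, axis=1)
        top = np.argsort(distances, kind="stable")[:k]
        return [(int(i), int(distances[i])) for i in top]


rng = np.random.default_rng(1)
catalog = rng.standard_normal((1000, 128))  # stand-in for image embeddings
index = HammingIndex(dim=128)
index.add(catalog)
results = index.search(catalog[7], k=3)  # query by example: item 7 itself
```

The search is approximate because Hamming distance between codes only estimates the angle between the original embeddings. This sketch still scans every code linearly; a production system would add a sublinear index structure on top of the compact codes.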
There has been significant interest recently in learning multilingual word embeddings -- in which semantically similar words across languages have similar embeddings. State-of-the-art approaches have relied on expensive labeled data, which is unavailable for low-resource languages, or have involved post-hoc unification of monolingual embeddings. In the present paper, we investigate the efficacy of multilingual embeddings learned from weakly-supervised image-text data. In particular, we propose methods for learning multilingual embeddings using image-text data, by enforcing similarity between the representations of the image and that of the text. Our experiments reveal that even without using any expensive labeled data, a bag-of-words-based embedding model trained on image-text data achieves performance comparable to the state-of-the-art on crosslingual semantic similarity tasks.
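One way to picture "enforcing similarity between the representations of the image and that of the text" with a bag-of-words text model is sketched below. This is a hedged illustration only: the abstract does not state the training objective, so the in-batch softmax loss here is an assumption, as are the function names, the tiny English and German vocabulary, and the random vectors standing in for learned word and image embeddings.

```python
import numpy as np


def bow_embed(tokens, vocab, word_vectors):
    """Bag-of-words text embedding: the mean of the known words' vectors."""
    ids = [vocab[t] for t in tokens if t in vocab]
    if not ids:
        return np.zeros(word_vectors.shape[1])
    return word_vectors[ids].mean(axis=0)


def l2_normalize(x):
    """Scale each row to unit length (guarding against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def in_batch_softmax_loss(image_emb, text_emb):
    """Assumed objective: each image should score highest with its own text."""
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching image-text pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


# Hypothetical shared vocabulary covering two languages.
vocab = {"dog": 0, "hund": 1, "cat": 2, "katze": 3}
rng = np.random.default_rng(0)
word_vectors = rng.standard_normal((len(vocab), 8))
captions = [["a", "dog"], ["eine", "katze"]]  # out-of-vocabulary words are skipped
text_emb = np.stack([bow_embed(c, vocab, word_vectors) for c in captions])
image_emb = rng.standard_normal((2, 8))  # stand-in for image encoder outputs
loss = in_batch_softmax_loss(image_emb, text_emb)
```

Under such an objective, captions in different languages that describe the same image are pulled toward the same image representation, so words like "dog" and "hund" would end up with similar embeddings without any parallel text.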