Multi-scale deep CNN architectures [1, 2, 3] successfully capture both fine- and coarse-level image descriptors for the visual similarity task, but they come with expensive memory overhead and latency. In this paper, we propose a competing novel CNN architecture, called MILDNet, whose merit is that it is vastly more compact (about 3 times smaller). Inspired by the fact that successive CNN layers represent the image with increasing levels of abstraction, we compress our deep ranking model into a single CNN by coupling activations from multiple intermediate layers with those of the last layer. Training on the well-known Street2shop dataset, we demonstrate that our approach performs as well as the current state-of-the-art models with only one third of the parameters, model size, and training time, and with a significant reduction in inference time. The significance of intermediate layers for the image retrieval task has also been shown on the popular Holidays, Oxford, and Paris datasets. So even though our experiments are done in the e-commerce domain, the approach is applicable to other domains as well. We further conduct an ablation study to validate our hypothesis by checking the impact of adding each intermediate layer. Along with this, we present two more useful variants of MILDNet: a mobile model (12 times smaller) for on-edge devices, and a compactly featured model (512-d feature embeddings) for systems with less RAM and to reduce the ranking cost. Further, we present an intuitive way to automatically create a tailored in-house triplet training dataset, which is very hard to create manually. This solution can also be deployed as an all-inclusive visual similarity solution. Finally, we present our entire production-level architecture, which currently powers visual similarity at Fynd.
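The idea of coupling intermediate-layer activations with the last layer and training on triplets can be illustrated with a minimal sketch. It is not the MILDNet implementation: the use of global average pooling, L2 normalization, squared Euclidean distance, the hinge form of the loss, the margin value, and all function names are assumptions made for illustration, and the feature maps are toy lists rather than real CNN outputs.

```python
import math

def global_avg_pool(feature_map):
    """Average each channel of one layer; feature_map is a list of channels,
    each channel a flat list of activations (assumed layout)."""
    return [sum(channel) / len(channel) for channel in feature_map]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

def multi_layer_embedding(feature_maps):
    """Concatenate pooled activations from several layers into one embedding.
    feature_maps lists the chosen intermediate layers plus the last layer."""
    pooled = []
    for feature_map in feature_maps:
        pooled.extend(global_avg_pool(feature_map))
    return l2_normalize(pooled)

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_hinge_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is farther from the anchor than the positive
    by at least the margin (hypothetical margin value)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, margin + gap)

# Toy example: one 2-channel intermediate layer and one 1-channel last layer.
embedding = multi_layer_embedding([[[1.0, 3.0], [0.0, 2.0]], [[4.0]]])
```

The embedding length equals the total number of channels across the selected layers, which is why adding layers enlarges the descriptor without adding a second network.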
Few-shot learning refers to understanding new concepts from only a few examples. We propose an information retrieval-inspired approach to this problem, motivated by the increased importance of maximally leveraging all the available information in this low-data regime. We define a training objective that aims to extract as much information as possible from each training batch by effectively optimizing over all relative orderings of the batch points simultaneously. In particular, we view each batch point as a 'query' that ranks the remaining ones based on its predicted relevance to them, and we define a model within the framework of structured prediction to optimize mean Average Precision over these rankings. Our method achieves impressive results on the standard few-shot classification benchmarks while also being capable of few-shot retrieval.
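The quantity being optimized can be made concrete with a small sketch that treats each batch point as a query and computes mean Average Precision over the resulting rankings. This only evaluates mAP; it is not the structured-prediction training procedure described in the abstract. The dot-product relevance score and the function names are assumptions for illustration.

```python
def average_precision(scores, relevant):
    """AP of one ranking: items are sorted by descending score, and precision
    is averaged over the ranks at which relevant items appear."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if relevant[idx]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def dot(a, b):
    """Dot product, used here as an assumed relevance score."""
    return sum(x * y for x, y in zip(a, b))

def batch_mean_average_precision(embeddings, labels):
    """Each point in turn is the query; all other points are ranked by their
    score against it, and points sharing the query's label count as relevant."""
    aps = []
    for q, (q_emb, q_label) in enumerate(zip(embeddings, labels)):
        others = [i for i in range(len(labels)) if i != q]
        scores = [dot(q_emb, embeddings[i]) for i in others]
        relevant = [labels[i] == q_label for i in others]
        if any(relevant):  # skip queries with no same-class point in the batch
            aps.append(average_precision(scores, relevant))
    return sum(aps) / len(aps) if aps else 0.0

# Toy batch with two well-separated classes.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
batch_labels = [0, 0, 1, 1]
batch_map = batch_mean_average_precision(batch, batch_labels)
```

Because every point serves as a query, a batch of N points yields N rankings rather than a single classification decision per point, which is the sense in which the batch is used more fully.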
Few-shot models have become a popular topic of research in recent years. They offer the possibility to determine class membership for unseen examples using just a handful of examples for each class. Such models are trained on a wide range of classes and their respective examples, learning a decision metric in the process. Types of few-shot models include matching networks and prototypical networks. We show a new way of training prototypical few-shot models for just a single class. These models can predict the likelihood that an unseen query belongs to a group of examples without any given counterexamples. The difficulty here lies in the fact that no relative distance to other classes can be calculated via softmax. We solve this problem by introducing a "null class" centered around zero, and enforcing centering with batch normalization. Training on the commonly used Omniglot data set, we obtain a classification accuracy of 0.98 on the matched test set, and of 0.8 on unmatched MNIST data. On the more complex MiniImageNet data set, test accuracy is 0.8. In addition, we propose a novel Gaussian layer for distance calculation in a prototypical network, which takes the distribution of the support examples, rather than just their centroid, into account. This extension shows promising results when a higher number of support examples is available.
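The role of the null class can be sketched as a two-way softmax between the class prototype and a fixed prototype at the origin. This is a minimal illustration under assumptions: squared Euclidean distance is used as the metric, the embedding network and the batch normalization that keeps embeddings centered are omitted, and the function names are hypothetical.

```python
import math

def centroid(vectors):
    """Prototype of the single class: mean of the support embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def membership_probability(query, support):
    """Softmax over two logits: negative distance to the class prototype and
    negative distance to the null class, a prototype fixed at the origin."""
    prototype = centroid(support)
    null_prototype = [0.0] * len(prototype)
    logit_class = -squared_distance(query, prototype)
    logit_null = -squared_distance(query, null_prototype)
    shift = max(logit_class, logit_null)  # subtract the max for numerical stability
    exp_class = math.exp(logit_class - shift)
    exp_null = math.exp(logit_null - shift)
    return exp_class / (exp_class + exp_null)

# Toy support set whose prototype lies away from the origin.
support_set = [[2.0, 2.0], [2.0, 2.0]]
p_member = membership_probability([2.0, 2.0], support_set)
```

A query that is equally far from the prototype and from the origin receives probability 0.5, so the origin acts as the stand-in for "every other class" that a single-class setting otherwise lacks.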
During the last years, many advances have been made in tasks like 3D model retrieval, 3D model classification, and 3D model segmentation. The typical 3D representations such as point clouds, voxels, and polygon meshes are mostly suitable for rendering purposes, while their use for cognitive processes (retrieval, classification, segmentation) is limited due to their high redundancy and complexity. We propose a deep learning architecture to handle 3D models as an input. We combine this architecture with other standard architectures like Convolutional Neural Networks and autoencoders for computing 3D model embeddings. Our goal is to represent a 3D model as a vector with enough information to substitute the 3D model for high-level tasks. Since this vector is a learned representation which tries to capture the relevant information of a 3D model, we show that the embedding representation conveys semantic information that helps to deal with the similarity assessment of 3D objects. Our experiments show the benefit of computing the embeddings of a 3D model data set and use them for effective 3D Model Retrieval.
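Once each 3D model is replaced by a vector, retrieval reduces to a nearest-neighbor search over those vectors. The sketch below shows that step only, under assumptions: cosine similarity is chosen as the similarity measure, the embeddings are hand-written toy values rather than outputs of the proposed architecture, and the model names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_embedding, database, k=3):
    """Return the k database entries most similar to the query as
    (name, similarity) pairs, best match first."""
    scored = [(name, cosine_similarity(query_embedding, emb))
              for name, emb in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical precomputed embeddings, one vector per 3D model.
model_embeddings = {
    "chair_a": [1.0, 0.0, 0.0],
    "chair_b": [0.9, 0.1, 0.0],
    "table": [0.0, 1.0, 0.0],
}
top_matches = retrieve([1.0, 0.0, 0.0], model_embeddings, k=2)
```

The search never touches the meshes, voxels, or point clouds themselves, which is the practical benefit of substituting the vector for the 3D model.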
Classification systems typically act in isolation, meaning they are required to implicitly memorize the characteristics of all candidate classes in order to classify. The cost of this is increased memory usage and poor sample efficiency. We propose a model which instead verifies using reference images during the classification process, reducing the burden of memorization. The model uses iterative nondifferentiable queries in order to classify an image. We demonstrate that such a model is feasible to train and can match baseline accuracy while being more parameter efficient. However, we show that finding the correct balance between image recognition and verification is essential to pushing the model towards desired behavior, suggesting that a pipeline of recognition followed by verification is a more promising approach.