SimSiam


Contrastive Self-Supervised Learning at the Edge: An Energy Perspective

Famá, Fernanda, Pereira, Roberto, Kalalas, Charalampos, Dini, Paolo, Qendro, Lorena, Kawsar, Fahim, Malekzadeh, Mohammad

arXiv.org Artificial Intelligence

Abstract--While contrastive learning (CL) shows considerable promise in self-supervised representation learning, its deployment on resource-constrained devices remains largely underexplored. The substantial computational demands required for training conventional CL frameworks pose a set of challenges, particularly in terms of energy consumption, data availability, and memory usage. We conduct an evaluation of four widely used CL frameworks: SimCLR, MoCo, SimSiam, and Barlow Twins. We focus on the practical feasibility of these CL frameworks for edge and fog deployment, and introduce a systematic benchmarking strategy that includes energy profiling and reduced training data conditions. Our findings reveal that SimCLR, contrary to its perceived computational cost, demonstrates the lowest energy consumption across various data regimes. Finally, we also extend our analysis by evaluating lightweight neural architectures when paired with CL frameworks. Our study aims to provide insights into the resource implications of deploying CL in edge/fog environments with limited processing capabilities and opens several research directions for its future optimization.

Over the years, a variety of contrastive learning (CL) approaches have been developed, including popular frameworks such as SimCLR [1], MoCo [2], BYOL [3], SimSiam [4], and Barlow Twins [5], each offering specific advantages and trade-offs. These frameworks aim to learn representations by distinguishing between similar (positive) and dissimilar (negative) samples in a latent space. While some methods rely on large negative sample sets to achieve high-quality representations, others bypass the need for negative pairs through momentum encoders or predictor networks.
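
The positive/negative distinction described above can be illustrated with a minimal NT-Xent (SimCLR-style) loss; the batch size, embedding dimension, and temperature below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """Minimal NT-Xent contrastive loss for a batch of paired views.

    z1, z2: (N, D) embeddings of two augmented views of the same N images.
    Each (z1[i], z2[i]) is a positive pair; every other embedding in the
    batch serves as a negative.
    """
    z = np.concatenate([z1, z2], axis=0)               # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine space
    sim = z @ z.T / temperature                        # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    # index of each row's positive partner
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
loss_random = nt_xent_loss(z1, rng.normal(size=(8, 16)))
loss_aligned = nt_xent_loss(z1, z1 + 0.01 * rng.normal(size=(8, 16)))
print(loss_aligned < loss_random)  # aligned positive pairs give a lower loss
```

Well-aligned positive pairs drive the loss down, which is exactly the behavior the gradient exploits during training.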


Self-Supervised Representation Learning as Mutual Information Maximization

Sabby, Akhlaqur Rahman, Sui, Yi, Wu, Tongzi, Cresswell, Jesse C., Wu, Ga

arXiv.org Artificial Intelligence

Self-supervised representation learning (SSRL) has demonstrated remarkable empirical success, yet its underlying principles remain insufficiently understood. While recent works attempt to unify SSRL methods by examining their information-theoretic objectives or summarizing their heuristics for preventing representation collapse, architectural elements like the predictor network, stop-gradient operation, and statistical regularizer are often viewed as empirically motivated additions. In this paper, we adopt a first-principles approach and investigate whether the learning objective of an SSRL algorithm dictates its possible optimization strategies and model design choices. In particular, by starting from a variational mutual information (MI) lower bound, we derive two training paradigms, namely Self-Distillation MI (SDMI) and Joint MI (JMI), each imposing distinct structural constraints and covering a set of existing SSRL algorithms. SDMI inherently requires alternating optimization, making stop-gradient operations theoretically essential. In contrast, JMI admits joint optimization through symmetric architectures without such components. Under the proposed formulation, predictor networks in SDMI and statistical regularizers in JMI emerge as tractable surrogates for the MI objective. We show that many existing SSRL methods are specific instances or approximations of these two paradigms. This paper provides a theoretical explanation behind the choices of different architectural components of existing SSRL methods, beyond heuristic conveniences.
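
The role of the stop-gradient operation in the SDMI paradigm can be sketched with a toy SimSiam/BYOL-style objective; this is a numpy illustration of the asymmetric structure, not the paper's derivation.

```python
import numpy as np

def stop_gradient(x):
    # In an autodiff framework this would be detach()/stop_gradient; with
    # plain numpy a copy suffices, since the point is the asymmetry between
    # the two branches, not actual gradient flow.
    return x.copy()

def self_distillation_loss(online_pred, target_proj):
    """Negative cosine similarity between the online branch's predictor
    output and the stop-gradient'ed target projection, as in the
    self-distillation family that SDMI covers."""
    p = online_pred / np.linalg.norm(online_pred, axis=1, keepdims=True)
    t = stop_gradient(target_proj)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    return float(-(p * t).sum(axis=1).mean())

z = np.random.default_rng(1).normal(size=(4, 8))
loss_same = self_distillation_loss(z, z)  # identical branches give -1.0
print(loss_same)
```

Because the target branch is held fixed inside the loss, optimization necessarily alternates between the two branches, which is the structural constraint SDMI makes explicit.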


Replay-free Online Continual Learning with Self-Supervised MultiPatches

Cignoni, Giacomo, Cossu, Andrea, Gomez-Villa, Alex, van de Weijer, Joost, Carta, Antonio

arXiv.org Artificial Intelligence

Online Continual Learning (OCL) methods train a model on a non-stationary data stream where only a few examples are available at a time, often leveraging replay strategies. However, usage of replay is sometimes forbidden, especially in applications with strict privacy regulations. Therefore, we propose Continual MultiPatches (CMP), an effective plug-in for existing OCL self-supervised learning strategies that avoids the use of replay samples. CMP generates multiple patches from a single example and projects them into a shared feature space, where patches coming from the same example are pushed together without collapsing into a single point. CMP surpasses replay and other SSL-based strategies on OCL streams, challenging the role of replay as a go-to solution for self-supervised OCL.
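
The "pull patches together without collapsing" idea can be caricatured as follows; the hinge on the spread of per-example centroids is a VICReg-style stand-in chosen for illustration, not CMP's actual anti-collapse mechanism, and all shapes are hypothetical.

```python
import numpy as np

def cmp_style_loss(feats, var_margin=1.0):
    """feats: (B, P, D) -- P patch features for each of B examples.
    Pull patches of the same example toward their centroid, while a hinge
    on the per-dimension std of the centroids across the batch penalizes
    all examples collapsing to a single point."""
    centroids = feats.mean(axis=1)                        # (B, D)
    attract = ((feats - centroids[:, None, :]) ** 2).mean()
    spread = centroids.std(axis=0)                        # across the batch
    anti_collapse = np.maximum(0.0, var_margin - spread).mean()
    return float(attract + anti_collapse)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 6, 16))  # 4 examples, 6 patches each
loss = cmp_style_loss(feats)
print(loss)
```

A fully collapsed batch (all features identical) zeroes the attraction term but pays the full variance hinge, so the trivial solution is not a minimizer.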


On the Importance of Embedding Norms in Self-Supervised Learning

Draganov, Andrew, Vadgama, Sharvaree, Damrich, Sebastian, Böhm, Jan Niklas, Maes, Lucas, Kobak, Dmitry, Bekkers, Erik

arXiv.org Artificial Intelligence

Self-supervised learning (SSL) allows training data representations without a supervised signal and has become an important paradigm in machine learning. Most SSL methods employ the cosine similarity between embedding vectors and hence effectively embed data on a hypersphere. While this seemingly implies that embedding norms cannot play any role in SSL, a few recent works have suggested that embedding norms have properties related to network convergence and confidence. In this paper, we resolve this apparent contradiction and systematically establish the embedding norm's role in SSL training. Using theoretical analysis, simulations, and experiments, we show that embedding norms (i) govern SSL convergence rates and (ii) encode network confidence, with smaller norms corresponding to unexpected samples. Additionally, we show that manipulating embedding norms can have large effects on convergence speed. Our findings demonstrate that SSL embedding norms are integral to understanding and optimizing network behavior.
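
The apparent contradiction is easy to see numerically: cosine similarity is invariant to rescaling, yet the norms themselves remain free to vary. A two-line numpy check:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([3.0, 4.0])
v = np.array([1.0, 2.0])

# Rescaling either embedding leaves the cosine similarity unchanged, so a
# cosine-based SSL loss only sees directions on the hypersphere...
print(cosine(u, v), cosine(10 * u, 0.5 * v))
# ...yet the norms themselves still differ, and it is exactly this "unused"
# degree of freedom that the paper ties to convergence and confidence.
print(np.linalg.norm(u), np.linalg.norm(10 * u))
```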


Label-free Monitoring of Self-Supervised Learning Progress

Xu, Isaac, Lowe, Scott, Trappenberg, Thomas

arXiv.org Artificial Intelligence

Self-supervised learning (SSL) is an effective method for exploiting unlabelled data to learn a high-level embedding space that can be used for various downstream tasks. However, existing methods to monitor the quality of the encoder -- either during training for one model or to compare several trained models -- still rely on access to annotated data. When SSL methodologies are applied to new data domains, a sufficiently large labelled dataset may not always be available. In this study, we propose several evaluation metrics which can be applied on the embeddings of unlabelled data and investigate their viability by comparing them to linear probe (LP) accuracy (a common metric which utilizes an annotated dataset). In particular, we apply $k$-means clustering and measure the clustering quality with the silhouette score and clustering agreement. We also measure the entropy of the embedding distribution. We find that while the clusters did correspond better to the ground-truth annotations as training of the network progressed, the label-free clustering metrics correlated with LP accuracy only when training with the SSL methods SimCLR and MoCo-v2, but not with SimSiam. Additionally, although entropy did not always have strong correlations with LP accuracy, this appears to be due to instability arising from early training, with the metric stabilizing and becoming more reliable at later stages of learning. Furthermore, while entropy generally decreases as learning progresses, this trend reverses for SimSiam. More research is required to establish the cause of this unexpected behaviour. Lastly, we find that while clustering-based approaches are likely only viable for same-architecture comparisons, entropy may be architecture-independent.
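
One simple label-free proxy of the kind described above is the differential entropy of a Gaussian fit to the embeddings; the Gaussian estimator here is an assumption for illustration and may differ from the paper's exact estimator.

```python
import numpy as np

def gaussian_entropy(embeddings):
    """Differential entropy of a Gaussian fitted to (N, D) embeddings:
    H = 0.5 * (D * log(2*pi*e) + log det(Sigma)).
    Higher values mean a more spread-out representation."""
    d = embeddings.shape[1]
    cov = np.cov(embeddings, rowvar=False)
    cov += 1e-6 * np.eye(d)  # regularize for numerical stability
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

rng = np.random.default_rng(0)
spread = rng.normal(scale=2.0, size=(500, 8))     # healthy, diverse embeddings
collapsed = rng.normal(scale=0.01, size=(500, 8)) # near-collapsed embeddings
print(gaussian_entropy(spread) > gaussian_entropy(collapsed))
```

A collapsing encoder shows up as a sharp drop in this quantity, with no labels needed.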


PhiNets: Brain-inspired Non-contrastive Learning Based on Temporal Prediction Hypothesis

Ishikawa, Satoki, Yamada, Makoto, Bao, Han, Takezawa, Yuki

arXiv.org Artificial Intelligence

SimSiam is a prominent self-supervised learning method that achieves impressive results in various vision tasks under static environments. However, it has two critical issues: high sensitivity to hyperparameters, especially weight decay, and unsatisfactory performance in online and continual learning, where neuroscientists believe that powerful memory functions, as in biological brains, are necessary. In this paper, we propose PhiNet, inspired by a hippocampal model based on the temporal prediction hypothesis. Unlike SimSiam, which aligns two augmented views of the original image, PhiNet integrates an additional predictor block that estimates the original image representation to imitate the CA1 region in the hippocampus. Moreover, we model the neocortex, inspired by the Complementary Learning Systems theory, with a momentum encoder block as a slow learner, which works as long-term memory. Through an analysis of the learning dynamics, we demonstrate that PhiNet benefits from the additional predictor to prevent the complete collapse of learned representations, a notorious challenge in non-contrastive learning. This dynamics analysis may partially corroborate why this hippocampal model is biologically plausible. Experimental results demonstrate that PhiNet is more robust to weight decay and performs better than SimSiam in memory-intensive tasks like online and continual learning.
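
The "slow learner" is a standard momentum (EMA) encoder; a minimal sketch of the update, with a toy momentum value and weight vector chosen only for illustration:

```python
import numpy as np

def ema_update(slow, fast, momentum=0.99):
    """Momentum-encoder update: the slow learner tracks an exponential
    moving average of the fast (online) encoder's weights."""
    return momentum * slow + (1 - momentum) * fast

slow = np.zeros(4)          # slow learner starts far from the online encoder
for step in range(500):
    fast = np.ones(4)       # pretend the online encoder has converged
    slow = ema_update(slow, fast)
print(slow)                 # approaches the online weights only gradually
```

Because the slow weights integrate over many steps, they change little per update, which is what lets the momentum branch act as a stable long-term memory.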


Real-world Instance-specific Image Goal Navigation for Service Robots: Bridging the Domain Gap with Contrastive Learning

Sakaguchi, Taichi, Taniguchi, Akira, Hagiwara, Yoshinobu, Hafi, Lotfi El, Hasegawa, Shoichi, Taniguchi, Tadahiro

arXiv.org Artificial Intelligence

Improving instance-specific image goal navigation (InstanceImageNav), which locates the identical object in a real-world environment from a query image, is essential for robotic systems to assist users in finding desired objects. The challenge lies in the domain gap between low-quality images observed by the moving robot, characterized by motion blur and low resolution, and high-quality query images provided by the user. Such domain gaps could significantly reduce the task success rate but have not been the focus of previous work. To address this, we propose a novel method called Few-shot Cross-quality Instance-aware Adaptation (CrossIA), which employs contrastive learning with an instance classifier to align features between a large set of low-quality images and a few high-quality images. This approach effectively reduces the domain gap by bringing the latent representations of cross-quality images closer on an instance basis. Additionally, the system integrates an object image collection with a pre-trained deblurring model to enhance the observed image quality. Our method fine-tunes the SimSiam model, pre-trained on ImageNet, using CrossIA. We evaluated our method's effectiveness through an InstanceImageNav task with 20 different types of instances, where the robot identifies the same instance in a real-world environment as a high-quality query image. Our experiments showed that our method improves the task success rate by up to three times compared to the baseline, a conventional approach based on SuperGlue. These findings highlight the potential of leveraging contrastive learning and image enhancement techniques to bridge the domain gap and improve object localization in robotic applications. The project website is https://emergentsystemlabstudent.github.io/DomainBridgingNav/.


Object Instance Retrieval in Assistive Robotics: Leveraging Fine-Tuned SimSiam with Multi-View Images Based on 3D Semantic Map

Sakaguchi, Taichi, Taniguchi, Akira, Hagiwara, Yoshinobu, Hafi, Lotfi El, Hasegawa, Shoichi, Taniguchi, Tadahiro

arXiv.org Artificial Intelligence

Robots that assist in daily life are required to locate specific instances of objects that match the user's desired object in the environment. This task is known as Instance-Specific Image Goal Navigation (InstanceImageNav), which requires a model capable of distinguishing between different instances within the same class. One significant challenge in robotics is that when a robot observes the same object from various 3D viewpoints, its appearance may differ greatly, making it difficult to recognize and locate the object accurately. In this study, we introduce a method, SimView, that leverages multi-view images based on a 3D semantic map of the environment and self-supervised learning by SimSiam to train an instance identification model on-site. The effectiveness of our approach is validated using a photorealistic simulator, Habitat Matterport 3D, created by scanning real home environments. Our results demonstrate a 1.7-fold improvement in task accuracy compared to CLIP, which is pre-trained with multimodal contrastive learning, for object search. This improvement highlights the benefits of our proposed fine-tuning method in enhancing the performance of assistive robots in InstanceImageNav tasks. The project website is https://emergentsystemlabstudent.github.io/MultiViewRetrieve/.


The Common Stability Mechanism behind most Self-Supervised Learning Approaches

Jha, Abhishek, Blaschko, Matthew B., Asano, Yuki M., Tuytelaars, Tinne

arXiv.org Artificial Intelligence

The last couple of years have witnessed tremendous progress in self-supervised learning (SSL), the success of which can be attributed to the introduction of useful inductive biases in the learning process to learn meaningful visual representations while avoiding collapse. These inductive biases and constraints manifest themselves in the form of different optimization formulations in the SSL techniques, e.g. by utilizing negative examples in a contrastive formulation, or exponential moving average and predictor in BYOL and SimSiam. In this paper, we provide a framework to explain the stability mechanism of these different SSL techniques: i) we discuss the working mechanism of contrastive techniques like SimCLR, and non-contrastive techniques like BYOL, SWAV, SimSiam, Barlow Twins, and DINO; ii) we provide an argument that despite their different formulations, these methods implicitly optimize a similar objective function, i.e. minimizing the magnitude of the expected representation over all data samples, or the mean of the data distribution, while maximizing the magnitude of the expected representation of individual samples over different data augmentations; iii) we provide mathematical and empirical evidence to support our framework. We formulate different hypotheses and test them using the Imagenet100 dataset.
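
The two quantities in the claimed common objective are easy to compute on toy data; the representation tensor below is a random stand-in, chosen only to show which axes each expectation runs over.

```python
import numpy as np

rng = np.random.default_rng(0)
# reps[i, a] = representation of sample i under augmentation a (toy data:
# 100 samples, 10 augmentations each, 16-dim embeddings)
reps = rng.normal(size=(100, 10, 16))

# ||E_x[z]||: magnitude of the expected representation over ALL samples and
# augmentations -- the mean of the data distribution, to be minimized.
center_norm = np.linalg.norm(reps.mean(axis=(0, 1)))

# E_x[||E_aug[z]||]: for each sample, average over its augmentations first,
# then take the magnitude -- the per-sample quantity, to be maximized.
per_sample_norm = np.linalg.norm(reps.mean(axis=1), axis=1).mean()

print(center_norm, per_sample_norm)
```

Driving the first quantity toward zero while keeping the second large is, under the paper's framework, the shared mechanism that prevents collapse across the listed methods.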


Hard View Selection for Self-Supervised Learning

Ferreira, Fabio, Rapant, Ivo, Hutter, Frank

arXiv.org Artificial Intelligence

Many Self-Supervised Learning (SSL) methods train their models to be invariant to different "views" of an image and considerable efforts were directed towards improving pre-text tasks, architectures, or robustness. However, most SSL methods remain reliant on the random sampling of operations within the image augmentation pipeline, such as the random resized crop operation. We argue that the role of the view generation and its effect on performance has so far received insufficient attention. To address this, we propose an easy, learning-free, yet powerful Hard View Selection (HVS) strategy designed to extend the random view generation to expose the pretrained model to harder samples during SSL training. It encompasses the following iterative steps: 1) randomly sample multiple views and create pairs of two views, 2) run forward passes for each view pair on the currently trained model, 3) adversarially select the pair yielding the worst loss depending on the current model state, and 4) run the backward pass with the selected pair. As a result, HVS consistently achieves accuracy improvements between 0.91% and 1.93% on ImageNet linear evaluation and similar improvements on transfer tasks across DINO, SimSiam, iBOT and SimCLR. We provide studies to shed light on the inner workings and show that, by naively using smaller resolution images for the selection step, we can significantly reduce the computational overhead while retaining performance. Surprisingly, even when accounting for the computational overhead incurred by HVS, we achieve performance gains between 0.52% and 1.02% and closely rival the 800-epoch DINO pretraining with only 300 epochs. Various approaches to learn effective and generalizable visual representations in Self-Supervised Learning (SSL) exist. Such views are generated by applying a sequence of (randomly sampled) image transformations and are usually composed of geometric (cropping, rotation, etc.) and appearance (color distortion, blurring, etc.) 
transformations. A body of literature (Chen et al., 2020a; Wu et al., 2020; Purushwalkam & Gupta, 2020; Wagner et al., 2022; Tian et al., 2020b) has illuminated the effects of image views on representation learning and identified the random resized crop (RRC) transformation as particularly influential. However, despite this finding and to the best of our knowledge, little research has gone into identifying more effective ways of selecting or generating views to improve performance.
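
The four iterative HVS steps can be sketched as follows; the crop-based "views" and the squared-distance stand-in for the model's forward pass are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def pair_loss(view_a, view_b):
    # Stand-in for a forward pass through the current model: here just the
    # mean squared distance between the two views.
    return float(((view_a - view_b) ** 2).mean())

def hard_view_selection(image, n_candidates=4, crop=8):
    """Steps 1-3 of HVS: sample several random crops, form all pairs, and
    keep the pair with the worst (highest) loss under the current model.
    The backward pass (step 4) would then use only that pair."""
    h, w = image.shape
    views = []
    for _ in range(n_candidates):
        y, x = rng.integers(0, h - crop), rng.integers(0, w - crop)
        views.append(image[y:y + crop, x:x + crop])
    pairs = [(a, b) for i, a in enumerate(views) for b in views[i + 1:]]
    return max(pairs, key=lambda p: pair_loss(*p))

image = rng.normal(size=(32, 32))  # toy "image"
hard_a, hard_b = hard_view_selection(image)
print(pair_loss(hard_a, hard_b))
```

Because the selection step only needs forward passes, running it at reduced resolution (as the paper notes) cuts most of its overhead while preserving the ranking of pairs.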