RelCon: Relative Contrastive Learning for a Motion Foundation Model for Wearable Data

Xu, Maxwell A., Narain, Jaya, Darnell, Gregory, Hallgrimsson, Haraldur, Jeong, Hyewon, Forde, Darren, Fineman, Richard, Raghuram, Karthik J., Rehg, James M., Ren, Shirley

arXiv.org Artificial Intelligence 

We present RelCon, a novel self-supervised Relative Contrastive learning approach that uses a learnable distance measure in combination with a softened contrastive loss to train a motion foundation model from wearable sensor data. The learned distance measures the semantic similarity between a pair of accelerometer time-series segments and is used to compare an anchor segment against various other sampled candidate segments. The self-supervised model is trained on 1 billion segments from 87,376 participants from a large wearables dataset. The model achieves strong performance across multiple downstream tasks, encompassing both classification and regression. To our knowledge, we are the first to show the generalizability of a self-supervised learning model with motion data from wearables across distinct evaluation tasks.

Advances in self-supervised learning (SSL) combined with the availability of large-scale datasets have resulted in a proliferation of foundation models (FMs) in computer vision (Oquab et al., 2023), NLP (OpenAI et al., 2023), and speech understanding (Yang et al., 2024). These models provide powerful, general-purpose representations for a particular domain of data and support generalization to a broad set of downstream tasks without the need for fine-tuning. For example, the image representation in the DINOv2 model (Oquab et al., 2023) was trained in an entirely self-supervised way and achieves state-of-the-art performance on multiple dense image prediction tasks, such as depth estimation and semantic segmentation, by decoding a frozen base representation with task-specific heads. In contrast to these advances, time-series data have not yet benefited from the foundation model approach, with a few exceptions (Abbaspourazad et al., 2024; Das et al., 2023). This is particularly unfortunate for problems in mobile health (mHealth) signal analysis, which encompasses data modalities such as accelerometry, PPG, and ECG (Rehg et al., 2017), as the collection of mHealth data from participants can be time-consuming and expensive. However, recent advances in self-supervised learning for mHealth signals (Abbaspourazad et al., 2024; Yuan et al., 2024; Xu et al., 2024) have shown promising performance, raising the question of whether it is now feasible to train foundation models for mHealth signals.

In this paper, we demonstrate, for the first time, the feasibility of adopting a foundation model approach for the analysis of accelerometry data across tasks. Accelerometry is an important mHealth signal modality that is used in human activity recognition (HAR) (Haresamudram et al., 2022), physical health status assessment (Xu et al., 2022), energy expenditure estimation (Stutz et al., 2024), and gait assessment (Apple, 2021), among many other tasks.
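To make the relative contrastive idea from the abstract concrete, the sketch below shows one way a softened contrastive loss over a learned distance could be wired up. It is a minimal illustration under stated assumptions, not the paper's exact formulation: the function name `relative_contrastive_loss`, the use of cosine similarity with a fixed temperature, and the random stand-in `learned_dists` values are all illustrative choices; RelCon's actual learnable distance measure and candidate sampling are defined in the paper itself.

```python
import torch
import torch.nn.functional as F

def relative_contrastive_loss(anchor_emb, cand_embs, learned_dists, temperature=0.1):
    """Minimal sketch of a relative (softened) contrastive loss.

    anchor_emb:    (D,) embedding of the anchor segment.
    cand_embs:     (N, D) embeddings of N sampled candidate segments.
    learned_dists: (N,) distances from the anchor to each candidate under some
                   learned distance measure (smaller = more semantically similar).

    Rather than a single hard positive, each candidate is treated as a positive
    relative to all candidates that are farther from the anchor under the
    learned distance, which softens the usual binary positive/negative split.
    """
    sims = F.cosine_similarity(anchor_emb.unsqueeze(0), cand_embs, dim=-1) / temperature  # (N,)
    order = torch.argsort(learned_dists)        # nearest candidate first
    sims_sorted = sims[order]

    losses = []
    for i in range(sims_sorted.shape[0] - 1):   # the farthest candidate has no negatives left
        # InfoNCE-style term: candidate i is the positive, all farther candidates are negatives.
        logits = sims_sorted[i:]
        losses.append(-F.log_softmax(logits, dim=0)[0])
    return torch.stack(losses).mean()

# Toy usage with random embeddings and random stand-in distances.
if __name__ == "__main__":
    anchor = torch.randn(128, requires_grad=True)
    candidates = torch.randn(8, 128, requires_grad=True)
    dists = torch.rand(8)                       # placeholder for a learned distance measure
    loss = relative_contrastive_loss(anchor, candidates, dists)
    loss.backward()
    print(float(loss))
```

The key design point this sketch is meant to convey is the ranking structure: candidates are ordered by the learned distance, and each one contributes an InfoNCE-style term against only the candidates ranked farther away, rather than being labeled as a hard positive or negative.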