SlotFM: A Motion Foundation Model with Slot Attention for Diverse Downstream Tasks

Junyong Park, Oron Levy, Rebecca Adaimi, Asaf Liberman, Gierad Laput, Abdelkareem Bedri

arXiv.org Artificial Intelligence 

Wearable accelerometers are used for a wide range of applications, such as gesture recognition, gait analysis, and sports monitoring. Yet most existing foundation models focus primarily on classifying common daily activities such as locomotion and exercise, limiting their applicability to the broader range of tasks that rely on other signal characteristics. SlotFM uses Time-Frequency Slot Attention, an extension of Slot Attention that processes both time and frequency representations of the raw signals. It generates multiple small embeddings (slots), each capturing different signal components, enabling task-specific heads to focus on the most relevant parts of the data. We also introduce two loss regularizers that capture local structure and frequency patterns, which improve reconstruction of fine-grained details and help the embeddings preserve task-relevant information. We evaluate SlotFM on 16 classification and regression downstream tasks that extend beyond standard human activity recognition. It outperforms existing self-supervised approaches on 13 of these tasks and achieves comparable results to the best-performing approaches on the remaining tasks. On average, our method yields a 4.5% performance gain, demonstrating strong generalization for sensing foundation models.

Advances in self-supervised learning (SSL) and large-scale datasets have enabled foundation models that support multiple tasks through shared representations (Yang et al., 2024; Oquab et al., 2023). This is particularly valuable for wearable devices, where maintaining a separate model dedicated to each task is often impractical due to memory and compute constraints. Accelerometers are among the most widely used sensors in wearables for diverse motion-related tasks. Recent studies show that SSL approaches can train foundation models effective in Human Activity Recognition (HAR) tasks such as exercise and locomotion classification (Logacjov, 2024).
However, their applicability to broader accelerometer tasks, such as gait analysis and gesture recognition, remains largely unexplored. This contrasts with domains such as audio, where foundation models have been applied beyond a single task, spanning speech-to-text, speaker identification, and emotion recognition.
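Time-Frequency Slot Attention extends the standard Slot Attention mechanism (Locatello et al., 2020), in which a fixed number of slots compete for input features through attention normalized over the slot axis. The sketch below is a minimal NumPy illustration of that core competitive update only; it omits the GRU/MLP slot refinement of the full method and the paper's time- and frequency-branch inputs, and all function and parameter names are my own, not SlotFM's.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, n_slots=4, n_iters=3, seed=0):
    """Simplified Slot Attention update.

    inputs: (n, d) array of per-timestep (or per-frequency-bin) features.
    Returns (n_slots, d) slot embeddings. Each iteration, slots compete
    for inputs (softmax over the slot axis), then each slot becomes the
    attention-weighted mean of the inputs it won.
    """
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    slots = rng.normal(size=(n_slots, d))   # random slot initialization
    scale = d ** -0.5                       # dot-product scaling
    for _ in range(n_iters):
        logits = scale * slots @ inputs.T   # (n_slots, n) similarity
        attn = softmax(logits, axis=0)      # slots compete per input
        attn = attn / attn.sum(axis=1, keepdims=True)  # weighted mean
        slots = attn @ inputs               # update slots from inputs
    return slots
```

Because attention is normalized across slots rather than across inputs, each input feature is effectively assigned to the slots that best explain it, which is what lets different slots capture different signal components for downstream task heads.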