 activity recognition






RAG-HAR: Retrieval Augmented Generation-based Human Activity Recognition

Sivaroopan, Nirhoshan, Karunarathna, Hansi, Madarasingha, Chamara, Jayasumana, Anura, Thilakarathna, Kanchana

arXiv.org Artificial Intelligence

Abstract: Human Activity Recognition (HAR) underpins applications in healthcare, rehabilitation, fitness tracking, and smart environments, yet existing deep learning approaches demand dataset-specific training, large labeled corpora, and significant computational resources. We introduce RAG-HAR, a training-free retrieval-augmented framework that leverages large language models (LLMs) for HAR. RAG-HAR computes lightweight statistical descriptors, retrieves semantically similar samples from a vector database, and uses this contextual evidence to perform LLM-based activity identification. We further enhance RAG-HAR by applying prompt optimization and introducing an LLM-based activity descriptor that generates context-enriched vector databases for delivering accurate and highly relevant contextual information. With these mechanisms, RAG-HAR achieves state-of-the-art performance across six diverse HAR benchmarks. RAG-HAR moves beyond known behaviors, enabling the recognition and meaningful labelling of multiple unseen human activities.
Human Activity Recognition (HAR) from wearable sensor data enables continuous monitoring, anomaly detection, and personalized interventions across healthcare [3], rehabilitation [31], fitness [28], and smart environments [14]. Despite wide-ranging applications, HAR remains challenging due to inter-subject variability, differences in sensor placement, device heterogeneity, and subtle distinctions between activities that exhibit similar motion patterns [39]. These challenges create a strong need for accurate, generalizable, and cost-efficient solutions. Deep learning (DL) has become the dominant paradigm for HAR, with convolutional neural networks (CNNs) [6], [43], recurrent architectures [15], [17], and attention-based models [2] achieving state-of-the-art (SOTA) performance on benchmark datasets.
However, DL-based HAR faces three critical limitations: (i) costly and time-consuming training procedures tailored to each dataset; (ii) performance degradation under domain shift across subjects, sensor placements, or devices; and (iii) heavy dependence on large labeled datasets [7], [35]. Despite advances in DL, these limitations leave HAR without a practical solution that is simultaneously training-free, generalizable, and scalable. To address this gap, this paper explores a fundamentally different paradigm: leveraging Large Language Models (LLMs) as reasoning engines for HAR.
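
The retrieve-then-prompt flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the descriptor set (mean, standard deviation, minimum, maximum), the cosine-similarity ranking, the in-memory list standing in for a vector database, and the prompt wording are all assumptions made for illustration, and no LLM is actually called.

```python
import math
from statistics import mean, pstdev

def describe(window):
    """Summarise one 1-D sensor window as [mean, std, min, max] (assumed descriptor set)."""
    return [mean(window), pstdev(window), min(window), max(window)]

def cosine(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, store, k=2):
    """Return the k (label, descriptor) entries most similar to the query descriptor."""
    return sorted(store, key=lambda e: cosine(query, e[1]), reverse=True)[:k]

def build_prompt(query, neighbours):
    """Format the retrieved evidence as text that could be sent to an LLM (hypothetical wording)."""
    lines = ["Labelled reference windows (mean, std, min, max):"]
    for label, desc in neighbours:
        lines.append(f"- {label}: {[round(v, 2) for v in desc]}")
    lines.append(f"Unlabelled window: {[round(v, 2) for v in query]}")
    lines.append("Which activity is the unlabelled window most likely to be?")
    return "\n".join(lines)

# Toy in-memory "vector database" of labelled accelerometer windows (made-up values).
store = [
    ("walking", describe([0.9, 1.4, 0.7, 1.5, 0.8, 1.3])),
    ("sitting", describe([1.0, 1.01, 0.99, 1.0, 1.0, 1.01])),
    ("running", describe([0.2, 2.6, 0.1, 2.8, 0.3, 2.5])),
]
query = describe([0.85, 1.45, 0.75, 1.4, 0.9, 1.35])
neighbours = retrieve(query, store, k=2)
prompt = build_prompt(query, neighbours)
print(prompt)
```

Because the labelled windows live only in the store, adding a new activity in this sketch means appending entries rather than retraining a model, which is the property the training-free framing relies on.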


PRISM: Lightweight Multivariate Time-Series Classification through Symmetric Multi-Resolution Convolutional Layers

Zucchi, Federico, Lampert, Thomas

arXiv.org Artificial Intelligence

Multivariate time series classification supports applications from wearable sensing to biomedical monitoring and demands models that can capture both short-term patterns and longer-range temporal dependencies. Despite recent advances, Transformer and CNN models often remain computationally heavy and rely on many parameters. This work presents PRISM (Per-channel Resolution Informed Symmetric Module), a lightweight fully convolutional classifier. Operating in a channel-independent manner, in its early stage it applies a set of multi-resolution symmetric convolutional filters. This symmetry enforces structural constraints inspired by linear-phase FIR filters from classical signal processing, effectively halving the number of learnable parameters within the initial layers while preserving the full receptive field. Across the diverse UEA multivariate time-series archive as well as specific benchmarks in human activity recognition, sleep staging, and biomedical signals, PRISM matches or outperforms state-of-the-art CNN and Transformer models while using significantly fewer parameters and markedly lower computational cost. By bringing a principled signal processing prior into a modern neural architecture, PRISM offers an effective and computationally economical solution for multivariate time series classification.
1. Introduction
Multivariate time series, characterised by intricate temporal dependencies, are common in finance, healthcare, environmental science, and human activity recognition. Deep learning has improved analysis and classification for such data, yet state-of-the-art models often incur high computational cost, heavy parameterisation, and limited robustness in realistic data regimes. Transformer architectures, adapted from NLP for long-range dependencies, have been applied to time series. Despite promising results, their extensive parameter counts can lead to overfitting and high memory use [1].
In practice, self-attention can struggle with noisy, redundant signals [2, 3].
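
The parameter saving from symmetric filters mentioned in the abstract can be sketched as follows. This is an illustrative assumption, not PRISM's actual layer: the helper names `symmetric_kernel` and `conv1d_valid`, the odd kernel lengths, and the example weights are all hypothetical. Each kernel of length 2m+1 is built by mirroring m+1 free weights, so roughly half the weights are learnable while the kernel still spans its full width.

```python
def symmetric_kernel(half):
    """Mirror free weights into an odd-length symmetric kernel.

    half = [w0 (centre), w1, ..., wm]  ->  [wm, ..., w1, w0, w1, ..., wm]
    """
    return half[:0:-1] + half

def conv1d_valid(signal, kernel):
    """Sliding dot product without padding; for a symmetric kernel this equals convolution."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# Two resolutions applied to the same single channel (channel-independent processing).
free_weights = {
    "short": [0.5, 0.25],       # 2 free weights -> kernel of length 3
    "long": [0.5, 0.25, 0.1],   # 3 free weights -> kernel of length 5
}
kernels = {name: symmetric_kernel(w) for name, w in free_weights.items()}

channel = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {name: conv1d_valid(channel, k) for name, k in kernels.items()}

for name, k in kernels.items():
    print(name, "kernel length", len(k), "free weights", len(free_weights[name]))
```

In a trainable version only the `half` weights would be parameters and the mirroring would be applied on every forward pass; the linear-phase property of the resulting filter follows from the symmetry alone.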


Hi-OSCAR: Hierarchical Open-set Classifier for Human Activity Recognition

McCarthy, Conor, Quirijnen, Loes, van Zandwijk, Jan Peter, Geradts, Zeno, Worring, Marcel

arXiv.org Artificial Intelligence

Within Human Activity Recognition (HAR), there is an insurmountable gap between the range of activities performed in life and those that can be captured in an annotated sensor dataset used in training. Failure to properly handle unseen activities seriously undermines any HAR classifier's reliability. Additionally, within HAR not all classes are equally dissimilar: some significantly overlap or encompass other sub-activities. Based on these observations, we arrange activity classes into a structured hierarchy. From there, we propose Hi-OSCAR, a Hierarchical Open-set Classifier for Activity Recognition that can identify known activities at state-of-the-art accuracy while simultaneously rejecting unknown activities. This not only enables open-set classification, but also allows for unknown classes to be localized to the nearest internal node, providing insight beyond a binary "known/unknown" classification. To facilitate this and future open-set HAR research, we collected a new dataset: NFI_FARED. NFI_FARED contains data from multiple subjects performing nineteen activities from a range of contexts, including daily living, commuting, and rapid movements. The dataset is fully public and available for download.
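
One way to picture localizing an unknown activity to an internal node is a thresholded descent through the class hierarchy. The abstract does not state how Hi-OSCAR does this, so everything here is an assumption for illustration: the toy hierarchy, the per-node confidence scores, the fixed threshold, and the function name `localise` are hypothetical.

```python
# Toy activity hierarchy (hypothetical): internal nodes map to their children.
HIERARCHY = {
    "activity": ["stationary", "locomotion"],
    "stationary": ["sitting", "standing"],
    "locomotion": ["walking", "running"],
}

def localise(scores, root="activity", threshold=0.6):
    """Descend from the root while the best child is confident enough.

    Returns (node, is_known). Stopping at an internal node means the sample is
    rejected as unknown but still localised to that part of the hierarchy.
    """
    node = root
    while node in HIERARCHY:
        best = max(HIERARCHY[node], key=lambda c: scores.get(c, 0.0))
        if scores.get(best, 0.0) < threshold:
            return node, False   # not confident in any child: unknown under `node`
        node = best
    return node, True            # reached a leaf: known activity

# A confident sample reaches a leaf.
known = localise({"locomotion": 0.9, "stationary": 0.1, "walking": 0.8, "running": 0.1})
# An unseen activity (say, skipping) looks like locomotion but matches no leaf well.
unknown = localise({"locomotion": 0.9, "stationary": 0.1, "walking": 0.4, "running": 0.35})
print(known, unknown)
```

The second result carries more information than a binary reject: the sample is flagged as unknown, yet it is still placed under the locomotion branch.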


DySTAN: Joint Modeling of Sedentary Activity and Social Context from Smartphone Sensors

Sneh, Aditya, Sahu, Nilesh Kumar, Gupta, Snehil, Lone, Haroon R.

arXiv.org Artificial Intelligence

Accurately recognizing human context from smartphone sensor data remains a significant challenge, especially in sedentary settings where activities such as studying, attending lectures, relaxing, and eating exhibit highly similar inertial patterns. Furthermore, social context plays a critical role in understanding user behavior, yet is often overlooked in mobile sensing research. To address these gaps, we introduce LogMe, a mobile sensing application that passively collects smartphone sensor data (accelerometer, gyroscope, magnetometer, and rotation vector) and prompts users for hourly self-reports capturing both sedentary activity and social context. Using this dual-label dataset, we propose DySTAN (Dynamic Cross-Stitch with Task Attention Network), a multi-task learning framework that jointly classifies both context dimensions from shared sensor inputs. It integrates task-specific layers with cross-task attention to model subtle distinctions effectively. DySTAN improves sedentary activity macro F1 scores by 21.8% over a single-task CNN-BiLSTM-GRU (CBG) model and by 8.2% over the strongest multi-task baseline, Sluice Network (SN). These results demonstrate the importance of modeling multiple, co-occurring context dimensions to improve the accuracy and robustness of mobile context recognition.


Saga: Capturing Multi-granularity Semantics from Massive Unlabelled IMU Data for User Perception

Li, Yunzhe, Hu, Facheng, Zhu, Hongzi, Zhang, Shifan, Zhang, Liang, Chang, Shan, Guo, Minyi

arXiv.org Artificial Intelligence

Inertial measurement units (IMUs) have been prevalently used in a wide range of mobile perception applications such as activity recognition and user authentication, where a large amount of labelled data is normally required to train a satisfactory model. However, it is difficult to label micro-activities in massive IMU data due to the difficulty of interpreting raw IMU data and the lack of ground truth. In this paper, we propose a novel fine-grained user perception approach, called Saga, which only needs a small amount of labelled IMU data to achieve stunning user perception accuracy. The core idea of Saga is to first pre-train a backbone feature extraction model, utilizing the rich semantic information of different levels embedded in the massive unlabelled IMU data. Meanwhile, for a specific downstream user perception application, Bayesian Optimization is employed to determine the optimal weights for pre-training tasks involving different semantic levels. We implement Saga on five typical mobile phones and evaluate Saga on three typical tasks on three IMU datasets. Results show that when only using about 100 training samples per class, Saga can achieve over 90% of the accuracy of the full-fledged model trained on over ten thousand training samples with no additional system overhead.
Recent years have witnessed a broad range of user perception applications utilizing inertial measurement units (IMUs), including user authentication [1]-[4], activity recognition [5]-[7], and health monitoring [8], [9]. However, the efficacy of such applications hinges on the availability of expensive and accurately labelled IMU data, which is a requirement often deemed impractical [6], [10]. Given the huge amount of raw IMU data easily generated on mobile devices, it is natural to ask whether users of such mobile devices can be well perceived with very few or even no labelled IMU data, referred to as the IMU-based user perception (IUP) problem.
A practical solution to this problem needs to meet the following three rigid requirements. First, the solution can access plenty of unlabelled IMU data but should only require a small amount of labelled data. Second, the solution should be able to achieve high accuracy over multiple user perception tasks simultaneously to meet the diverse user perception needs.


MMA: A Momentum Mamba Architecture for Human Activity Recognition with Inertial Sensors

Nguyen, Thai-Khanh, Vo, Uyen, Nguyen, Tan M., Vo, Thieu N., Le, Trung-Hieu, Pham, Cuong

arXiv.org Artificial Intelligence

Human activity recognition (HAR) from inertial sensors is essential for ubiquitous computing, mobile health, and ambient intelligence. Conventional deep models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and transformers have advanced HAR but remain limited by vanishing or exploding gradients, high computational cost, and difficulty in capturing long-range dependencies. Structured state-space models (SSMs) like Mamba address these challenges with linear complexity and effective temporal modeling, yet they are restricted to first-order dynamics without stable long-term memory mechanisms. We introduce Momentum Mamba, a momentum-augmented SSM that incorporates second-order dynamics to improve stability of information flow across time steps, robustness, and long-sequence modeling. Two extensions further expand its capacity: Complex Momentum Mamba for frequency-selective memory scaling. Experiments on multiple HAR benchmarks demonstrate consistent gains over vanilla Mamba and Transformer baselines in accuracy, robustness, and convergence speed. With only moderate increases in training cost, momentum-augmented SSMs offer a favorable accuracy-efficiency balance, establishing them as a scalable paradigm for HAR and a promising principal framework for broader sequence modeling applications.
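
The difference between first-order and second-order state dynamics can be shown with a scalar toy recurrence. The abstract does not give Momentum Mamba's update equations, so the heavy-ball style form below is an assumption for illustration only: the velocity state `v`, the coefficients `alpha` and `beta`, and both function names are hypothetical, and the input-dependent (selective) parameters of Mamba are omitted.

```python
def first_order_scan(xs, a, b):
    """Plain linear state-space recurrence: h_t = a * h_{t-1} + b * x_t."""
    h, out = 0.0, []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out

def momentum_scan(xs, a, b, alpha=1.0, beta=0.5):
    """Assumed momentum variant: the input first accumulates in a velocity state.

    v_t = beta * v_{t-1} + alpha * b * x_t
    h_t = a * h_{t-1} + v_t
    """
    h, v, out = 0.0, 0.0, []
    for x in xs:
        v = beta * v + alpha * b * x
        h = a * h + v
        out.append(h)
    return out

impulse = [1.0, 0.0, 0.0, 0.0]
plain = first_order_scan(impulse, a=0.5, b=1.0)
with_momentum = momentum_scan(impulse, a=0.5, b=1.0, beta=0.5)
print(plain)
print(with_momentum)
```

With `beta=0` the velocity state carries nothing forward and the update reduces to the first-order recurrence; with `beta>0` a single input keeps feeding the hidden state over later steps, so its influence decays more slowly.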