recognition
An Elastic Shape Variational Autoencoder for Skeleton Pose Trajectories
Rahman, Arafat, Kumar, Shashwat, Barnes, Laura E., Srivastava, Anuj
Deep generative models provide flexible frameworks for modeling complex, structured data such as images, videos, 3D objects, and text. However, when applied to sequences of human skeletons, standard variational autoencoders (VAEs) often allocate substantial capacity to nuisance factors (such as camera orientation, subject scale, viewpoint, and execution speed) rather than the intrinsic geometry of shapes and their motion. We propose the Elastic Shape Variational Autoencoder (ES-VAE), a geometry-aware generative model for skeletal trajectories that leverages the transported square-root velocity field (TSRVF) representation on Kendall's shape manifold. This representation inherently removes rigid translations, rotations, and global scaling of shapes, as well as temporal rate variability of sequences, isolating the underlying shape dynamics. The ES-VAE encoder maps skeletal sequences to a low-dimensional latent space incorporating the Riemannian logarithm map, while the decoder reconstructs sequences using the corresponding exponential map. We demonstrate the effectiveness of ES-VAE on two datasets. First, we analyze skeletal gait cycles to predict clinical mobility scores and classify subjects into healthy and post-stroke groups. Second, we evaluate action recognition on the NTU RGB+D dataset. Across both settings, ES-VAE consistently outperforms standard VAEs and a range of sequence modeling baselines, including temporal convolutional networks, transformers, and graph convolutional networks. More broadly, ES-VAE provides a principled framework for learning generative models of longitudinal data on pose shape manifolds, offering improved latent representation and downstream performance compared to existing deep learning approaches.
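To make the invariances named in the abstract concrete, here is a minimal sketch of how a single skeleton frame can be stripped of translation, scale, and rotation in the Kendall shape sense. It is not the authors' implementation: the function names `to_preshape`, `align_rotation`, and `shape_distance` are hypothetical, and the sketch covers only per-frame shape, not the TSRVF representation or the removal of temporal rate variability.

```python
import numpy as np

def to_preshape(X):
    """Center the landmarks and scale to unit Frobenius norm.

    X has shape (num_joints, dim). Centering removes translation and
    unit-norm scaling removes global scale.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0, keepdims=True)
    norm = np.linalg.norm(Xc)
    if norm == 0.0:
        raise ValueError("all landmarks coincide; shape is undefined")
    return Xc / norm

def align_rotation(Xp, Yp):
    """Rotate preshape Yp to best match preshape Xp (rotations only).

    Solves the orthogonal Procrustes problem via SVD and flips one
    singular direction if needed so the result is a proper rotation.
    """
    U, _, Vt = np.linalg.svd(Yp.T @ Xp)
    if np.linalg.det(U @ Vt) < 0:
        U[:, -1] *= -1.0
    return Yp @ (U @ Vt)

def shape_distance(X, Y):
    """Geodesic distance between two preshapes after optimal rotation."""
    Xp, Yp = to_preshape(X), to_preshape(Y)
    inner = np.sum(Xp * align_rotation(Xp, Yp))
    return float(np.arccos(np.clip(inner, -1.0, 1.0)))
```

Under these assumptions, a skeleton and a translated, rescaled, rotated copy of it have a shape distance of approximately zero, which is the sense in which such factors are treated as nuisance variation.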
Is Big Brother watching you shop? - podcast
From supermarkets to corner shops, live facial recognition could be coming to retailers near you. Live facial recognition is being hailed as a powerful new frontier in the fight against crime, not only by police but by private companies too. Retailers from supermarkets to corner shops hope it will help them fight back against shoplifting. And the technology doesn't always get it right. With more police forces wanting to take up the technology, what could the consequences be?
Samsung's Bespoke update is big step towards a useful AI for your fridge
The idea of installing a software update on your fridge already feels kind of weird, let alone one centered around improving its AI capabilities. But that's exactly what's happening to Samsung's line of Bespoke refrigerators this week, and to my surprise this patch is making major strides at providing truly useful machine learning in a modern-day icebox. As a quick recap, Samsung has offered AI-powered features like automatic food recognition and meal planning on its Bespoke refrigerators for a couple of years already. However, as I found out after reviewing its flagship model late last year, the company's AI capabilities are still very much a work in progress. Previously, the fridge could recognize around 60 different kinds of fresh foods (like fruits and veggies) alongside another 50 or so packaged goods like yogurt or popcorn.
M5HisDoc: A Large-scale Multi-style Chinese Historical Document Analysis Benchmark
Recognizing and organizing text in correct reading order plays a crucial role in historical document analysis and preservation. While existing methods have shown promising performance, they often struggle with challenges such as diverse layouts, low image quality, style variations, and distortions. This is primarily because current benchmarks do not account for these issues, which hinders the development and evaluation of historical document analysis and recognition (HDAR) methods in complex real-world scenarios. To address this gap, this paper introduces a complex multi-style Chinese historical document analysis benchmark, named M5HisDoc. The M5 indicates five properties of style, i.e., Multiple layouts, Multiple document types, Multiple calligraphy styles, Multiple backgrounds, and Multiple challenges.
Focal Modulation Networks
We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation module for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to fuse the aggregated context into the query. Extensive experiments show FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational cost on the tasks of image classification, object detection, and semantic segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K.
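The three components listed in the abstract can be illustrated with a simplified sketch on a 1D token sequence. This is not the FocalNets implementation: the 1D setting, the kernel sizes, the GELU approximation, the added global-average context level, and all names (`init_params`, `focal_modulation`, `Wq`, `Wz`, `Wg`, `Wh`) are assumptions made for illustration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def depthwise_conv1d(z, kernel):
    """Convolve each channel independently; z is (T, C), kernel is (K, C), K odd."""
    K, T = kernel.shape[0], z.shape[0]
    zp = np.pad(z, ((K // 2, K // 2), (0, 0)))  # zero "same" padding
    out = np.zeros_like(z)
    for k in range(K):
        out += zp[k:k + T] * kernel[k]
    return out

def init_params(channels, num_levels, rng):
    """Random weights for the sketch (hypothetical layout)."""
    s = 1.0 / np.sqrt(channels)
    return {
        "Wq": rng.normal(size=(channels, channels)) * s,        # query projection
        "Wz": rng.normal(size=(channels, channels)) * s,        # context projection
        "Wg": rng.normal(size=(channels, num_levels + 1)) * s,  # one gate per level
        "Wh": rng.normal(size=(channels, channels)) * s,        # modulator projection
        # kernel size grows with level: short-range to long-range context
        "kernels": [rng.normal(size=(3 + 2 * l, channels)) * 0.1
                    for l in range(num_levels)],
    }

def focal_modulation(x, params, num_levels=2):
    """x is (T, C); returns (T, C)."""
    q = x @ params["Wq"]
    z = x @ params["Wz"]
    gates = x @ params["Wg"]  # (T, num_levels + 1), depends on token content
    agg = np.zeros_like(z)
    # (i) hierarchical contextualization with stacked depth-wise convolutions
    for l in range(num_levels):
        z = gelu(depthwise_conv1d(z, params["kernels"][l]))
        # (ii) gated aggregation: each token weights each context level
        agg += gates[:, l:l + 1] * z
    # assumed extra level: global context from average pooling
    agg += gates[:, num_levels:num_levels + 1] * z.mean(axis=0, keepdims=True)
    # (iii) element-wise modulation of the query by the aggregated context
    return q * (agg @ params["Wh"])
```

Unlike self-attention, this sketch never forms a token-by-token similarity matrix: context is gathered by convolutions and pooling, and then injected into each query by an element-wise product.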
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model & task scaling. We conduct extensive empirical studies and reveal the following key insights: 1) performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model.