Goto

Collaborating Authors

 Inductive Learning


SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation

arXiv.org Artificial Intelligence

Abstract--Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods focus on training innovative architectural designs on confined datasets. In this work, we investigate the impact of scaling up EHPS towards a family of generalist foundation models. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. Ultimately, we achieve diminishing returns at 10M training instances from diverse data sources. To exclude the influence of algorithmic design, we base our experiments on two minimalist architectures: SMPLer-X, which consists of an intermediate step for hand and face localization, and SMPLest-X, an even simpler version that reduces the network to its bare essentials and highlights significant advances in the capture of articulated hands. Moreover, our finetuning strategy turns the generalist into specialist models, allowing them to achieve further performance boosts. Notably, our foundation models consistently deliver state-of-the-art results on seven benchmarks such as AGORA, UBody, EgoBody, and our proposed SynHand dataset for comprehensive hand evaluation. This task typically uses parametric human performance across a basket of key benchmarks, in order to models (e.g., SMPL-X [1]) as a powerful representation provide a holistic measurement of generalization capabilities. of the human body, face, and hands. With a flurry of Our study underscores the importance of harnessing a diverse datasets entering the scene in recent years [2], [3], multitude of datasets to capitalize on their complementary [4], [5], [6], [7], [8], [9], [10], [11], providing the community nature. Moreover, we contribute a new dataset, SynHand, new opportunities to study various aspects such as capture to provide the community with a long-awaiting benchmark environment, pose distribution, body visibility, and camera for comprehensive hand pose evaluation in a whole-body views. Yet, the state-of-the-art methods channel their attention setting. SynHand features diverse hand poses in close-up towards advancements in architectural designs and human shots, accurately annotated as part of the wholebody remain tethered to a limited selection of these datasets, SMPL-X labels. Accordingly, we establish a systematic benchmark results across various scenarios.


Deeply Learning the Messages in Message Passing Inference

Neural Information Processing Systems

Deep structured output learning shows great promise in tasks like semantic image segmentation. We proffer a new, efficient deep structured model learning scheme, in which we show how deep Convolutional Neural Networks (CNNs) can be used to directly estimate the messages in message passing inference for structured prediction with Conditional Random Fields CRFs). With such CNN message estimators, we obviate the need to learn or evaluate potential functions for message calculation. This confers significant efficiency for learning, since otherwise when performing structured learning for a CRF with CNN potentials it is necessary to undertake expensive inference for every stochastic gradient iteration. The network output dimension of message estimators is the same as the number of classes, rather than exponentially growing in the order of the potentials.


Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning

Neural Information Processing Systems

We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the art methods intrinsically rely on negative pairs, BYOL achieves a new state of the art without them.


Sharper Generalization Bounds for Pairwise Learning

Neural Information Processing Systems

Pairwise learning refers to learning tasks with loss functions depending on a pair of training examples, which includes ranking and metric learning as specific examples. Recently, there has been an increasing amount of attention on the generalization analysis of pairwise learning to understand its practical behavior. However, the existing stability analysis provides suboptimal high-probability generalization bounds. In this paper, we provide a refined stability analysis by developing generalization bounds which can be \sqrt{n} -times faster than the existing results, where n is the sample size. This implies excess risk bounds of the order O(n {-1/2}) (up to a logarithmic factor) for both regularized risk minimization and stochastic gradient descent. We also introduce a new on-average stability measure to develop optimistic bounds in a low noise setting.


Improving Barely Supervised Learning by Discriminating Unlabeled Samples with Super-Class

Neural Information Processing Systems

In semi-supervised learning (SSL), a common practice is to learn consistent information from unlabeled data and discriminative information from labeled data to ensure both the immutability and the separability of the classification model. Existing SSL methods suffer from failures in barely-supervised learning (BSL), where only one or two labels per class are available, as the insufficient labels cause the discriminative information being difficult or even infeasible to learn. To bridge this gap, we investigate a simple yet effective way to leverage unlabeled samples for discriminative learning, and propose a novel discriminative information learning module to benefit model training. Specifically, we formulate the learning objective of discriminative information at the super-class level and dynamically assign different classes into different super-classes based on model performance improvement. On top of this on-the-fly process, we further propose a distribution-based loss to learn discriminative information by utilizing the similarity relationship between samples and super-classes.


Bridging the Gap from Asymmetry Tricks to Decorrelation Principles in Non-contrastive Self-supervised Learning

Neural Information Processing Systems

While they are attractive since they do not need negative samples, it necessitates some mechanism to avoid collapsing into a trivial solution. Currently, there are two approaches to collapse prevention. One uses an asymmetric architecture on a joint embedding of input, e.g., BYOL and SimSiam, and the other imposes decorrelation criteria on the same joint embedding, e.g., Barlow-Twins and VICReg. The latter methods have theoretical support from information theory as to why they can learn good representation. However, it is not fully understood why the former performs equally well. In this paper, focusing on BYOL/SimSiam, which uses the stop-gradient and a predictor as asymmetric tricks, we present a novel interpretation of these tricks; they implicitly impose a constraint that encourages feature decorrelation similar to Barlow-Twins/VICReg.


Improving the Efficiency of Self-Supervised Adversarial Training through Latent Clustering-Based Selection

arXiv.org Artificial Intelligence

Compared with standard learning, adversarially robust learning is widely recognized to demand significantly more training examples. Recent works propose the use of self-supervised adversarial training (SSAT) with external or synthetically generated unlabeled data to enhance model robustness. However, SSAT requires a substantial amount of extra unlabeled data, significantly increasing memory usage and model training times. To address these challenges, we propose novel methods to strategically select a small subset of unlabeled data essential for SSAT and robustness improvement. Our selection prioritizes data points near the model's decision boundary based on latent clustering-based techniques, efficiently identifying a critical subset of unlabeled data with a higher concentration of boundary-adjacent points. While focusing on near-boundary data, our methods are designed to maintain a balanced ratio between boundary and non-boundary data points to avoid overfitting. Our experiments on image benchmarks show that integrating our selection strategies into self-supervised adversarial training can largely reduce memory and computational requirements while achieving high model robustness. In particular, our latent clustering-based selection method with k-means is the most effective, achieving nearly identical test-time robust accuracies with 5 to 10 times less external or generated unlabeled data when applied to image benchmarks. Additionally, we validate the generalizability of our approach across various application scenarios, including a real-world medical dataset for COVID-19 chest X-ray classification.


Learning discrete distributions: user vs item-level privacy

Neural Information Processing Systems

Much of the literature on differential privacy focuses on item-level privacy, where loosely speaking, the goal is to provide privacy per item or training example. However, recently many practical applications such as federated learning require preserving privacy for all items of a single user, which is much harder to achieve. Therefore understanding the theoretical limit of user-level privacy becomes crucial. We study the fundamental problem of learning discrete distributions over k symbols with user-level differential privacy. If each user has m samples, we show that straightforward applications of Laplace or Gaussian mechanisms require the number of users to be \mathcal{O}(k/(m\alpha 2) k/\epsilon\alpha) to achieve an \ell_1 distance of \alpha between the true and estimated distributions, with the privacy-induced penalty k/\epsilon\alpha independent of the number of samples per user m .


Beyond neural scaling laws: beating power law scaling via data pruning

Neural Information Processing Systems

Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how in theory we can break beyond power law scaling and potentially even reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this improved scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling in practice on ResNets trained on CIFAR-10, SVHN, and ImageNet. Next, given the importance of finding high-quality pruning metrics, we perform the first large-scale benchmarking study of ten different data pruning metrics on ImageNet.


Measuring and Reducing Model Update Regression in Structured Prediction for NLP

Neural Information Processing Systems

Recent advance in deep learning has led to rapid adoption of machine learning based NLP models in a wide range of applications. Despite the continuous gain in accuracy, backward compatibility is also an important aspect for industrial applications, yet it received little research attention. Backward compatibility requires that the new model does not regress on cases that were correctly handled by its predecessor. This work studies model update regression in structured prediction tasks. We choose syntactic dependency parsing and conversational semantic parsing as representative examples of structured prediction tasks in NLP.