Inductive Learning
SuperPADL: Scaling Language-Directed Physics-Based Control with Progressive Supervised Distillation
Juravsky, Jordan, Guo, Yunrong, Fidler, Sanja, Peng, Xue Bin
Physically-simulated models for human motion can generate high-quality responsive character animations, often in real-time. Natural language serves as a flexible interface for controlling these models, allowing expert and non-expert users to quickly create and edit their animations. Many recent physics-based animation methods, including those that use text interfaces, train control policies using reinforcement learning (RL). However, scaling these methods beyond several hundred motions has remained challenging. Meanwhile, kinematic animation models are able to successfully learn from thousands of diverse motions by leveraging supervised learning methods. Inspired by these successes, in this work we introduce SuperPADL, a scalable framework for physics-based text-to-motion that leverages both RL and supervised learning to train controllers on thousands of diverse motion clips. SuperPADL is trained in stages using progressive distillation, starting with a large number of specialized experts using RL. These experts are then iteratively distilled into larger, more robust policies using a combination of reinforcement learning and supervised learning. Our final SuperPADL controller is trained on a dataset containing over 5000 skills and runs in real time on a consumer GPU. Moreover, our policy can naturally transition between skills, allowing for users to interactively craft multi-stage animations. We experimentally demonstrate that SuperPADL significantly outperforms RL-based baselines at this large data scale.
A Self-Supervised Learning Pipeline for Demographically Fair Facial Attribute Classification
Ramachandran, Sreeraj, Rattani, Ajita
Published research highlights the presence of demographic bias in automated facial attribute classification. The proposed bias mitigation techniques are mostly based on supervised learning, which requires a large amount of labeled training data for generalizability and scalability. However, labeled data is limited, requires laborious annotation, poses privacy risks, and can perpetuate human bias. In contrast, self-supervised learning (SSL) capitalizes on freely available unlabeled data, rendering trained models more scalable and generalizable. However, these label-free SSL models may also introduce biases by sampling false negative pairs, especially at low-data regimes 200K images) under low compute settings. Further, SSL-based models may suffer from performance degradation due to a lack of quality assurance of the unlabeled data sourced from the web. This paper proposes a fully self-supervised pipeline for demographically fair facial attribute classifiers. Leveraging completely unlabeled data pseudolabeled via pre-trained encoders, diverse data curation techniques, and meta-learning-based weighted contrastive learning, our method significantly outperforms existing SSL approaches proposed for downstream image classification tasks. Extensive evaluations on the FairFace and CelebA datasets demonstrate the efficacy of our pipeline in obtaining fair performance over existing baselines. Thus, setting a new benchmark for SSL in the fairness of facial attribute classification.
Accessing Vision Foundation Models at ImageNet-level Costs
Zhang, Yitian, Ma, Xu, Bai, Yue, Wang, Huan, Fu, Yun
Vision foundation models are renowned for their generalization ability due to massive training data. Nevertheless, they demand tremendous training resources, and the training data is often inaccessible, e.g., CLIP, DINOv2, posing great challenges to developing derivatives that could advance research in this field. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs with surprising ability, facilitating the accessibility of training foundation models for the broader research community. Leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 15 benchmarks and outperforms other vision foundation models including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B) and SynCLR-L/14 (600M). Code is available at here.
Learning to Represent Surroundings, Anticipate Motion and Take Informed Actions in Unstructured Environments
Contemporary robots have become exceptionally skilled at achieving specific tasks in structured environments. However, they often fail when faced with the limitless permutations of real-world unstructured environments. This motivates robotics methods which learn from experience, rather than follow a pre-defined set of rules. In this thesis, we present a range of learning-based methods aimed at enabling robots, operating in dynamic and unstructured environments, to better understand their surroundings, anticipate the actions of others, and take informed actions accordingly. In the first part of the thesis, we investigate methods which leverage learning to represent the structure and motion in a robot's operating environment, in a continuous manner.
Learning Unlabeled Clients Divergence via Anchor Model Aggregation for Federated Semi-supervised Learning
Elbatel, Marawan, Wang, Hualiang, Chen, Jixiang, Wang, Hao, Li, Xiaomeng
Federated semi-supervised learning (FedSemi) refers to scenarios where there may be clients with fully labeled data, clients with partially labeled, and even fully unlabeled clients while preserving data privacy. However, challenges arise from client drift due to undefined heterogeneous class distributions and erroneous pseudo-labels. Existing FedSemi methods typically fail to aggregate models from unlabeled clients due to their inherent unreliability, thus overlooking unique information from their heterogeneous data distribution, leading to sub-optimal results. In this paper, we enable unlabeled client aggregation through SemiAnAgg, a novel Semi-supervised Anchor-Based federated Aggregation. SemiAnAgg learns unlabeled client contributions via an anchor model, effectively harnessing their informative value. Our key idea is that by feeding local client data to the same global model and the same consistently initialized anchor model (i.e., random model), we can measure the importance of each unlabeled client accordingly. Extensive experiments demonstrate that SemiAnAgg achieves new state-of-the-art results on four widely used FedSemi benchmarks, leading to substantial performance improvements: a 9% increase in accuracy on CIFAR-100 and a 7.6% improvement in recall on the medical dataset ISIC-18, compared with prior state-of-the-art. Code is available at: https://github.com/xmed-lab/SemiAnAgg.
Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data
Feng, Tuo, Wang, Wenguan, Quan, Ruijie, Yang, Yi
Current 3D self-supervised learning methods of 3D scenes face a data desert issue, resulting from the time-consuming and expensive collecting process of 3D scene data. Conversely, 3D shape datasets are easier to collect. Despite this, existing pre-training strategies on shape data offer limited potential for 3D scene understanding due to significant disparities in point quantities. To tackle these challenges, we propose Shape2Scene (S2S), a novel method that learns representations of large-scale 3D scenes from 3D shape data. We first design multiscale and high-resolution backbones for shape and scene level 3D tasks, i.e., MH-P (point-based) and MH-V (voxel-based). MH-P/V establishes direct paths to highresolution features that capture deep semantic information across multiple scales. This pivotal nature makes them suitable for a wide range of 3D downstream tasks that tightly rely on high-resolution features. We then employ a Shape-to-Scene strategy (S2SS) to amalgamate points from various shapes, creating a random pseudo scene (comprising multiple objects) for training data, mitigating disparities between shapes and scenes. Finally, a point-point contrastive loss (PPC) is applied for the pre-training of MH-P/V. In PPC, the inherent correspondence (i.e., point pairs) is naturally obtained in S2SS. Extensive experiments have demonstrated the transferability of 3D representations learned by MH-P/V across shape-level and scene-level 3D tasks. MH-P achieves notable performance on well-known point cloud datasets (93.8% OA on ScanObjectNN and 87.6% instance mIoU on ShapeNetPart). MH-V also achieves promising performance in 3D semantic segmentation and 3D object detection.
Triple-Encoders: Representations That Fire Together, Wire Together
Erker, Justus-Jonas, Mai, Florian, Reimers, Nils, Spanakis, Gerasimos, Gurevych, Iryna
Curved Contrastive Learning, a representation learning method that encodes relative distances between utterances into the embedding space via a bi-encoder, has recently shown promising results for dialog modeling at far superior efficiency. While high efficiency is achieved through independently encoding utterances, this ignores the importance of contextualization. To overcome this issue, this study introduces triple-encoders, which efficiently compute distributed utterance mixtures Figure 1: Comparison of our Triple Encoder to Henderson from these independently encoded utterances et al. (2020) and Erker et al. (2023). Similar to CCL through a novel hebbian inspired co-occurrence we only need to encode and compute similarity scores learning objective in a self-organizing manner, of the latest utterance. At the same time, we achieve without using any weights, i.e., merely through contextualization through pairwise mean-pooling with local interactions. Empirically, we find that previous encoded utterances combining the advantages triple-encoders lead to a substantial improvement of both previous works. Our analysis shows that the over bi-encoders, and even to better zeroshot co-occurrence training pushes representations that occur generalization than single-vector representation (fire) together closer together, leading to stronger models without requiring re-encoding.
SemiAdv: Query-Efficient Black-Box Adversarial Attack with Unlabeled Images
Fan, Mingyuan, Liu, Yang, Chen, Cen, Liu, Ximeng
Adversarial attack has garnered considerable attention due to its profound implications for the secure deployment of robots in sensitive security scenarios. To potentially push for advances in the field, this paper studies the adversarial attack in the black-box setting and proposes an unlabeled data-driven adversarial attack method, called SemiAdv. Specifically, SemiAdv achieves the following breakthroughs compared with previous works. First, by introducing the semi-supervised learning technique into the adversarial attack, SemiAdv substantially decreases the number of queries required for generating adversarial samples. On average, SemiAdv only needs to query a few hundred times to launch an effective attack with more than 90% success rate. Second, many existing black-box adversarial attacks require massive labeled data to mitigate the difference between the local substitute model and the remote target model for a good attack performance. While SemiAdv relaxes this limitation and is capable of utilizing unlabeled raw data to launch an effective attack. Finally, our experiments show that SemiAdv saves up to 12x query accesses for generating adversarial samples while maintaining a competitive attack success rate compared with state-of-the-art attacks.
From Easy to Hard: Learning Curricular Shape-aware Features for Robust Panoptic Scene Graph Generation
Shi, Hanrong, Li, Lin, Xiao, Jun, Zhuang, Yueting, Chen, Long
Panoptic Scene Graph Generation (PSG) aims to generate a comprehensive graph-structure representation based on panoptic segmentation masks. Despite remarkable progress in PSG, almost all existing methods neglect the importance of shape-aware features, which inherently focus on the contours and boundaries of objects. To bridge this gap, we propose a model-agnostic Curricular shApe-aware FEature (CAFE) learning strategy for PSG. Specifically, we incorporate shape-aware features (i.e., mask features and boundary features) into PSG, moving beyond reliance solely on bbox features. Furthermore, drawing inspiration from human cognition, we propose to integrate shape-aware features in an easy-to-hard manner. To achieve this, we categorize the predicates into three groups based on cognition learning difficulty and correspondingly divide the training process into three stages. Each stage utilizes a specialized relation classifier to distinguish specific groups of predicates. As the learning difficulty of predicates increases, these classifiers are equipped with features of ascending complexity. We also incorporate knowledge distillation to retain knowledge acquired in earlier stages. Due to its model-agnostic nature, CAFE can be seamlessly incorporated into any PSG model. Extensive experiments and ablations on two PSG tasks under both robust and zero-shot PSG have attested to the superiority and robustness of our proposed CAFE, which outperforms existing state-of-the-art methods by a large margin.
BaCon: Boosting Imbalanced Semi-supervised Learning via Balanced Feature-Level Contrastive Learning
Feng, Qianhan, Xie, Lujing, Fang, Shijie, Lin, Tong
Semi-supervised Learning (SSL) reduces the need for extensive annotations in deep learning, but the more realistic challenge of imbalanced data distribution in SSL remains largely unexplored. In Class Imbalanced Semi-supervised Learning (CISSL), the bias introduced by unreliable pseudo-labels can be exacerbated by imbalanced data distributions. Most existing methods address this issue at instance-level through reweighting or resampling, but the performance is heavily limited by their reliance on biased backbone representation. Some other methods do perform feature-level adjustments like feature blending but might introduce unfavorable noise. In this paper, we discuss the bonus of a more balanced feature distribution for the CISSL problem, and further propose a Balanced Feature-Level Contrastive Learning method (BaCon). Our method directly regularizes the distribution of instances' representations in a well-designed contrastive manner. Specifically, class-wise feature centers are computed as the positive anchors, while negative anchors are selected by a straightforward yet effective mechanism. A distribution-related temperature adjustment is leveraged to control the class-wise contrastive degrees dynamically. Our method demonstrates its effectiveness through comprehensive experiments on the CIFAR10-LT, CIFAR100-LT, STL10-LT, and SVHN-LT datasets across various settings. For example, BaCon surpasses instance-level method FixMatch-based ABC on CIFAR10-LT with a 1.21% accuracy improvement, and outperforms state-of-the-art feature-level method CoSSL on CIFAR100-LT with a 0.63% accuracy improvement. When encountering more extreme imbalance degree, BaCon also shows better robustness than other methods.