Goto

Collaborating Authors

 augmented view


Align Y our Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization

Neural Information Processing Systems

TPT does not explicitly align the pre-trained CLIP to become aware of the test sample distribution. For the effective test-time adaptation of V -L foundation models, it is crucial to bridge the distribution gap between the pre-training dataset and the downstream evaluation set for high zero-shot generalization.





SwapPrompt: Test-Time Prompt Adaptation for Vision-Language Models

Neural Information Processing Systems

Test-time adaptation (TTA) is a special and practical setting in unsupervised domain adaptation, which allows a pre-trained model in a source domain to adapt to unlabeled test data in another target domain. To avoid the computation-intensive backbone fine-tuning process, the zero-shot generalization potentials of the emerging pre-trained vision-language models (e.g., CLIP, CoOp) are leveraged to only tune the run-time prompt for unseen test domains. However, existing solutions have yet to fully exploit the representation capabilities of pre-trained models as they only focus on the entropy-based optimization and the performance is far below the supervised prompt adaptation methods, e.g., CoOp. In this paper, we propose SwapPrompt, a novel framework that can effectively leverage the self-supervised contrastive learning to facilitate the test-time prompt adaptation. SwapPrompt employs a dual prompts paradigm, i.e., an online prompt and a target prompt that averaged from the online prompt to retain historical information. In addition, SwapPrompt applies a swapped prediction mechanism, which takes advantage of the representation capabilities of pre-trained models to enhance the online prompt via contrastive learning. Specifically, we use the online prompt together with an augmented view of the input image to predict the class assignment generated by the target prompt together with an alternative augmented view of the same image. The proposed SwapPrompt can be easily deployed on vision-language models without additional requirement, and experimental results show that it achieves state-of-the-art test-time adaptation performance on ImageNet and nine other datasets. It is also shown that SwapPrompt can even achieve comparable performance with supervised prompt adaptation methods.


AdaptSSR: Pre-training User Model with Augmentation-Adaptive Self-Supervised Ranking

Neural Information Processing Systems

User modeling, which aims to capture users' characteristics or interests, heavily relies on task-specific labeled data and suffers from the data sparsity issue. Several recent studies tackled this problem by pre-training the user model on massive user behavior sequences with a contrastive learning task. Generally, these methods assume different views of the same behavior sequence constructed via data augmentation are semantically consistent, i.e., reflecting similar characteristics or interests of the user, and thus maximizing their agreement in the feature space. However, due to the diverse interests and heavy noise in user behaviors, existing augmentation methods tend to lose certain characteristics of the user or introduce noisy behaviors. Thus, forcing the user model to directly maximize the similarity between the augmented views may result in a negative transfer.


Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning

Neural Information Processing Systems

We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the art methods intrinsically rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using the standard linear evaluation protocol with a standard ResNet-50 architecture and 79.6% with a larger ResNet. We also show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks.


RSA: Reducing Semantic Shift from Aggressive Augmentations for Self-supervised Learning

Neural Information Processing Systems

Most recent self-supervised learning methods learn visual representation by contrasting different augmented views of images. Compared with supervised learning, more aggressive augmentations have been introduced to further improve the diversity of training pairs. However, aggressive augmentations may distort images' structures leading to a severe semantic shift problem that augmented views of the same image may not share the same semantics, thus degrading the transfer performance. To address this problem, we propose a new SSL paradigm, which counteracts the impact of semantic shift by balancing the role of weak and aggressively augmented pairs. Specifically, semantically inconsistent pairs are of minority, and we treat them as noisy pairs.


AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch

Shao, Weichuang, Liao, Iman Yi, Maul, Tomas Henrique Bode, Chandesa, Tissa

arXiv.org Artificial Intelligence

Recent foundational models, SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo, achieve top-tier results across standard audio benchmarks but are limited by fixed input rates and durations, hindering their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a training-from-scratch framework that eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multiview learning for robustness, (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding, (3) dual CLS + TAL tokens for bidirectional context representation, and (4) test-time adaptation/augmentation (TTA^2) to improve inference reliability. Experiments on five public benchmarks, AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene, show that AMAuT achieves accuracies up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. Thus, AMAuT presents a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.


Social and Physical Attributes-Defined Trust Evaluation for Effective Collaborator Selection in Human-Device Coexistence Systems

Zhu, Botao, Wang, Xianbin

arXiv.org Machine Learning

In human-device coexistence systems, collaborations among devices are determined by not only physical attributes such as network topology but also social attributes among human users. Consequently, trust evaluation of potential collaborators based on these multifaceted attributes becomes critical for ensuring the eventual outcome. However, due to the high heterogeneity and complexity of physical and social attributes, efficiently integrating them for accurate trust evaluation remains challenging. To overcome this difficulty, a canonical correlation analysis-enhanced hypergraph self-supervised learning (HSLCCA) method is proposed in this research. First, by treating all attributes as relationships among connected devices, a relationship hypergraph is constructed to comprehensively capture inter-device relationships across three dimensions: spatial attribute-related, device attribute-related, and social attribute-related. Next, a self-supervised learning framework is developed to integrate these multi-dimensional relationships and generate device embeddings enriched with relational semantics. In this learning framework, the relationship hypergraph is augmented into two distinct views to enhance semantic information. A parameter-sharing hypergraph neural network is then utilized to learn device embeddings from both views. To further enhance embedding quality, a CCA approach is applied, allowing the comparison of data between the two views. Finally, the trustworthiness of devices is calculated based on the learned device embeddings. Extensive experiments demonstrate that the proposed HSLCCA method significantly outperforms the baseline algorithm in effectively identifying trusted devices.