AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch
Shao, Weichuang, Liao, Iman Yi, Maul, Tomas Henrique Bode, Chandesa, Tissa
–arXiv.org Artificial Intelligence
Recent foundational models, SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo, achieve top-tier results across standard audio benchmarks but are limited by fixed input rates and durations, hindering their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a training-from-scratch framework that eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multiview learning for robustness, (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding, (3) dual CLS + TAL tokens for bidirectional context representation, and (4) test-time adaptation/augmentation (TTA^2) to improve inference reliability. Experiments on five public benchmarks, AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene, show that AMAuT achieves accuracies up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. Thus, AMAuT presents a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.
arXiv.org Artificial Intelligence
Nov-25-2025
- Country:
- Asia > Malaysia (0.04)
- Europe
- Italy > Calabria
- Catanzaro Province > Catanzaro (0.04)
- United Kingdom > England
- Nottinghamshire > Nottingham (0.04)
- Italy > Calabria
- Genre:
- Research Report > New Finding (0.68)
- Technology: