sample rate
Slot-VLM: Object-Event Slots for Video-Language Modeling
Video-Language Models (VLMs), powered by the advancements in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is the development of an effective method to encapsulate video content into a set of representative tokens to align with LLMs. In this work, we introduce Slot-VLM, a new framework designed to generate semantically decomposed video tokens, in terms of object-wise and event-wise visual representations, to facilitate LLM inference.
AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch
Shao, Weichuang, Liao, Iman Yi, Maul, Tomas Henrique Bode, Chandesa, Tissa
Recent foundational models, SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo, achieve top-tier results across standard audio benchmarks but are limited by fixed input rates and durations, hindering their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a training-from-scratch framework that eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multiview learning for robustness, (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding, (3) dual CLS + TAL tokens for bidirectional context representation, and (4) test-time adaptation/augmentation (TTA^2) to improve inference reliability. Experiments on five public benchmarks, AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene, show that AMAuT achieves accuracies up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. Thus, AMAuT presents a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.
- Asia > Malaysia (0.04)
- Europe > United Kingdom > England > Nottinghamshire > Nottingham (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects
Hasan, Jakir, Dipta, Shubhashis Roy
Real-time speech assistants are becoming increasingly popular for ensuring improved accessibility to information. Bengali, being a low-resource language with a high regional dialectal diversity, has seen limited progress in developing such systems. Existing systems are not optimized for real-time use and focus only on standard Bengali. In this work, we present BanglaTalk, the first real-time speech assistance system for Bengali regional dialects. BanglaTalk follows the client-server architecture and uses the Real-time Transport Protocol (RTP) to ensure low-latency communication. To address dialectal variation, we introduce a dialect-aware ASR system, BRDialect, developed by fine-tuning the IndicWav2Vec model in ten Bengali regional dialects. It outperforms the baseline ASR models by 12.41-33.98% on the RegSpeech12 dataset. Furthermore, BanglaTalk can operate at a low bandwidth of 24 kbps while maintaining an average end-to-end delay of 4.9 seconds. Low bandwidth usage and minimal end-to-end delay make the system both cost-effective and interactive for real-time use cases, enabling inclusive and accessible speech technology for the diverse community of Bengali speakers. Code is available in https://github.com/Jak57/BanglaTalk
- Asia > Bangladesh > Rangpur Division > Rangpur District > Rangpur (0.04)
- North America > United States > Maryland > Baltimore County (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- (2 more...)
- Health & Medicine (0.68)
- Information Technology (0.67)
- Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
- Asia > Russia (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.85)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.51)
- Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
- Asia > Russia (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.85)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.51)
Firstly, we thank all reviewers for the helpful comments and suggestions
Firstly, we thank all reviewers for the helpful comments and suggestions. We will add citations in Table 4. We haven't conducted experiments in language modeling and image density estimation Admittedly, modeling the intra-step correlation would require extra computation time. We will add this discussion in the revised version. We are not entirely sure about the motivation of the multi-frame setting.
Slot-VLM: Object-Event Slots for Video-Language Modeling
Video-Language Models (VLMs), powered by the advancements in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is the development of an effective method to encapsulate video content into a set of representative tokens to align with LLMs. In this work, we introduce Slot-VLM, a new framework designed to generate semantically decomposed video tokens, in terms of object-wise and event-wise visual representations, to facilitate LLM inference. In order to take into account both the spatial object details and the varied temporal dynamics, we build OE-Slots with two branches: the Object-Slots branch and the Event-Slots branch. The Object-Slots branch focuses on extracting object-centric slots from features of high spatial resolution but low frame sample rate, emphasizing detailed object information.