AITopics | audio classification

Collaborating Authors

audio classification

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

CoLLA T: On Adding Fine-grained Audio Understanding to Language Models using Token-Level Locked-Language Tuning

Neural Information Processing SystemsNov-19-2025, 19:36:17 GMT

Humans can easily understand various audio concepts, but conventional audio classification models fail due to their inability to predict unseen classes during training.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.96)

Add feedback

Prototypical Contrastive Learning For Improved Few-Shot Audio Classification

Sgouropoulos, Christos, Nikou, Christos, Vlachos, Stefanos, Theiou, Vasileios, Foukanelis, Christos, Giannakopoulos, Theodoros

arXiv.org Artificial IntelligenceOct-28-2025

Few-shot learning has emerged as a powerful paradigm for training models with limited labeled data, addressing challenges in scenarios where large-scale annotation is impractical. While extensive research has been conducted in the image domain, few-shot learning in audio classification remains relatively underexplored. In this work, we investigate the effect of integrating supervised contrastive loss into prototypical few shot training for audio classification. In detail, we demonstrate that angular loss further improves the performance compared to the standard contrastive loss. Our method leverages SpecAugment followed by a self-attention mechanism to encapsulate diverse information of augmented input versions into one unified embedding. We evaluate our approach on MetaAudio, a benchmark including five datasets with predefined splits, standardized preprocessing, and a comprehensive set of few-shot learning models for comparison. The proposed approach achieves state-of-the-art performance in a 5-way, 5-shot setting.

artificial intelligence, dataset, machine learning, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/MLSP62443.2025.11204215

2509.10074

Country: Europe (0.47)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

CoLLA T: On Adding Fine-grained Audio Understanding to Language Models using Token-Level Locked-Language Tuning

Neural Information Processing SystemsOct-9-2025, 07:05:15 GMT

Humans can easily understand various audio concepts, but conventional audio classification models fail due to their inability to predict unseen classes during training.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.96)

Add feedback

Large Language Models Implicitly Learn to See and Hear Just By Reading

Verma, Prateek, Pilanci, Mert

arXiv.org Artificial IntelligenceSep-24-2025

This paper presents a fascinating find: By training an auto-regressive LLM model on text tokens, the text model inherently develops internally an ability to understand images and audio, thereby developing the ability to see and hear just by reading. Popular audio and visual LLM models fine-tune text LLM models to give text output conditioned on images and audio embeddings. On the other hand, our architecture takes in patches of images, audio waveforms or tokens as input. It gives us the embeddings or category labels typical of a classification pipeline. We show the generality of text weights in aiding audio classification for datasets FSD-50K and GTZAN. Further, we show this working for image classification on CIFAR-10 and Fashion-MNIST, as well on image patches. This pushes the notion of text-LLMs learning powerful internal circuits that can be utilized by activating necessary connections for various applications rather than training models from scratch every single time.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2505.17091

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Full-Frequency Temporal Patching and Structured Masking for Enhanced Audio Classification

Makineni, Aditya, Geng, Baocheng, Tian, Qing

arXiv.org Artificial IntelligenceSep-1-2025

Transformers and State-Space Models (SSMs) have advanced audio classification by modeling spectrograms as sequences of patches. However, existing models such as the Audio Spectrogram Transformer (AST) and Audio Mamba (AuM) adopt square patching from computer vision, which disrupts continuous frequency patterns and produces an excessive number of patches, slowing training, and increasing computation. We propose Full-Frequency Temporal Patching (FFTP), a patching strategy that better matches the time-frequency asymmetry of spectrograms by spanning full frequency bands with localized temporal context, preserving harmonic structure, and significantly reducing patch count and computation. We also introduce SpecMask, a patch-aligned spectrogram augmentation that combines full-frequency and localized time-frequency masks under a fixed masking budget, enhancing temporal robustness while preserving spectral continuity. When applied on both AST and AuM, our patching method with SpecMask improves mAP by up to +6.76 on AudioSet-18k and accuracy by up to +8.46 on SpeechCommandsV2, while reducing computation by up to 83.26%, demonstrating both performance and efficiency gains.

artificial intelligence, machine learning, spectrogram, (14 more...)

arXiv.org Artificial Intelligence

2508.21243

Country: North America > United States > Alabama (0.14)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

15,500 Seconds: Lean UAV Classification Using EfficientNet and Lightweight Fine-Tuning

Berg, Andrew P., Zhang, Qian, Wang, Mia Y.

arXiv.org Artificial IntelligenceAug-15-2025

As unmanned aerial vehicles (UAVs) become increasingly prevalent in both consumer and defense applications, the need for reliable, modality-specific classification systems grows in urgency. This paper addresses the challenge of data scarcity in UAV audio classification by expanding on prior work through the integration of pre-trained deep learning models, parameter-efficient fine-tuning (PEFT) strategies, and targeted data augmentation techniques. Using a custom dataset of 3,100 UAV audio clips (15,500 seconds) spanning 31 distinct drone types, we evaluate the performance of transformer-based and convolutional neural network (CNN) architectures under various fine-tuning configurations. Experiments were conducted with five-fold cross-validation, assessing accuracy, training efficiency, and robustness. Results show that full fine-tuning of the EfficientNet-B0 model with three augmentations achieved the highest validation accuracy (95.95), outperforming both the custom CNN and transformer-based models like AST. These findings suggest that combining lightweight architectures with PEFT and well-chosen augmentations provides an effective strategy for UAV audio classification on limited datasets. Future work will extend this framework to multimodal UAV classification using visual and radar telemetry.

artificial intelligence, arxiv preprint arxiv, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2506.11049

Country:

North America > United States > Indiana > Wayne County > Richmond (0.16)
North America > United States > South Carolina > Charleston County > Charleston (0.15)
North America > United States > Indiana > Tippecanoe County > West Lafayette (0.15)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.15)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Robotics & Automation (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Improving Audio Classification by Transitioning from Zero- to Few-Shot

Taylor, James, Mack, Wolfgang

arXiv.org Artificial IntelligenceJul-29-2025

State-of-the-art audio classification often employs a zero-shot approach, which involves comparing audio embeddings with embeddings from text describing the respective audio class. These embeddings are usually generated by neural networks trained through contrastive learning to align audio and text representations. Identifying the optimal text description for an audio class is challenging, particularly when the class comprises a wide variety of sounds. This paper examines few-shot methods designed to improve classification accuracy beyond the zero-shot approach. Specifically, audio embeddings are grouped by class and processed to replace the inherently noisy text embeddings. Our results demonstrate that few-shot classification typically outperforms the zero-shot baseline.

classification, large language model, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2507.20036

Country: Europe > United Kingdom (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.78)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Add feedback

4,500 Seconds: Small Data Training Approaches for Deep UAV Audio Classification

Berg, Andrew P., Zhang, Qian, Wang, Mia Y.

arXiv.org Artificial IntelligenceJul-25-2025

Unmanned aerial vehicle (UAV) usage is expected to surge in the coming decade, raising the need for heightened security measures to prevent airspace violations and security threats. This study investigates deep learning approaches to UAV classification focusing on the key issue of data scarcity. To investigate this we opted to train the models using a total of 4,500 seconds of audio samples, evenly distributed across a 9-class dataset. We leveraged parameter efficient fine-tuning (PEFT) and data augmentations to mitigate the data scarcity. This paper implements and compares the use of convolutional neural networks (CNNs) and attention-based transformers. Our results show that, CNNs outperform transformers by 1-2\% accuracy, while still being more computationally efficient. These early findings, however, point to potential in using transformers models; suggesting that with more data and further optimizations they could outperform CNNs. Future works aims to upscale the dataset to better understand the trade-offs between these approaches.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.5220/0000195100003967

2505.23782

Country: North America > United States > South Carolina (0.14)

Genre: Research Report > New Finding (0.54)

Industry: Information Technology > Security & Privacy (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Comparative Analysis of Image, Video, and Audio Classifiers for Automated News Video Segmentation

Attard, Jonathan, Seychell, Dylan

arXiv.org Artificial IntelligenceMar-27-2025

News videos require efficient content organisation and retrieval systems, but their unstructured nature poses significant challenges for automated processing. This paper presents a comprehensive comparative analysis of image, video, and audio classifiers for automated news video segmentation. This work presents the development and evaluation of multiple deep learning approaches, including ResNet, ViViT, AST, and multimodal architectures, to classify five distinct segment types: advertisements, stories, studio scenes, transitions, and visualisations. Using a custom-annotated dataset of 41 news videos comprising 1,832 scene clips, our experiments demonstrate that image-based classifiers achieve superior performance (84.34\% accuracy) compared to more complex temporal models. Notably, the ResNet architecture outperformed state-of-the-art video classifiers while requiring significantly fewer computational resources. Binary classification models achieved high accuracy for transitions (94.23\%) and advertisements (92.74\%). These findings advance the understanding of effective architectures for news video segmentation and provide practical insights for implementing automated content organisation systems in media applications. These include media archiving, personalised content delivery, and intelligent video search.

artificial intelligence, classification, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2503.21848

Country:

Europe > Middle East > Malta > Northern Region > Western District > Attard (0.05)
Europe > Middle East > Malta > Eastern Region > Northern Harbour District > Msida (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

An Empirical Study of the Impact of Federated Learning on Machine Learning Model Accuracy

Yang, Haotian, Wang, Zhuoran, Chou, Benson, Xu, Sophie, Wang, Hao, Wang, Jingxian, Zhang, Qizhen

arXiv.org Artificial IntelligenceMar-26-2025

Federated Learning (FL) enables distributed ML model training on private user data at the global scale. Despite the potential of FL demonstrated in many domains, an in-depth view of its impact on model accuracy remains unclear. In this paper, we investigate, systematically, how this learning paradigm can affect the accuracy of state-of-the-art ML models for a variety of ML tasks. We present an empirical study that involves various data types: text, image, audio, and video, and FL configuration knobs: data distribution, FL scale, client sampling, and local and global computations. Our experiments are conducted in a unified FL framework to achieve high fidelity, with substantial human efforts and resource investments. Based on the results, we perform a quantitative analysis of the impact of FL, and highlight challenging scenarios where applying FL degrades the accuracy of the model drastically and identify cases where the impact is negligible. The detailed and extensive findings can benefit practical deployments and future development of FL.

accuracy, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2503.20768

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Austria > Vienna (0.14)
Asia > Singapore (0.05)
(16 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Energy (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback