
Audio Processing


CAK: Emergent Audio Effects from Minimal Deep Learning

Rockman, Austin

arXiv.org Artificial Intelligence

We demonstrate that a single 3x3 convolutional kernel can produce emergent audio effects when trained on 200 samples from a personalized corpus. We achieve this through two key techniques: (1) Conditioning Aware Kernels (CAK), where output = input + (learned_pattern x control), with a soft-gate mechanism supporting identity preservation at zero control; and (2) AuGAN (Audit GAN), which reframes adversarial training from "is this real?" to "did you apply the requested value?" Rather than learning to generate or detect forgeries, our networks cooperate to verify control application, discovering unique transformations. The learned kernel exhibits a diagonal structure that creates frequency-dependent temporal shifts, capable of producing musical effects based on input characteristics. Our results show the potential of adversarial training to discover audio transformations from minimal data, enabling new approaches to effect design.
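The residual form the abstract gives, output = input + (learned_pattern x control) with a soft gate, can be sketched in a few lines. This is a minimal NumPy illustration of that equation, not the authors' implementation; the 3x3 convolution, gate shape, and `gate_sharpness` parameter are assumptions for the sake of the example.

```python
import numpy as np

def cak(x, kernel, control, gate_sharpness=10.0):
    """Sketch of a Conditioning Aware Kernel step:
    output = input + (learned_pattern * control), with a soft gate
    so that control = 0 returns the input unchanged."""
    # 2-D "same" convolution of the spectrogram x with a 3x3 kernel
    pad = np.pad(x, 1)
    pattern = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            pattern[i, j] = np.sum(pad[i:i + 3, j:j + 3] * kernel)
    gate = np.tanh(gate_sharpness * control)  # soft gate: ~0 at control = 0
    return x + pattern * gate * control

x = np.random.default_rng(0).standard_normal((8, 8))
k = np.eye(3) * 0.1  # a diagonal kernel, echoing the learned structure
assert np.allclose(cak(x, k, 0.0), x)      # identity preserved at zero control
assert not np.allclose(cak(x, k, 0.7), x)  # effect applied otherwise
```

The soft gate is what makes the zero-control case an exact identity, which is the property the abstract highlights.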


Audio Processing using Pattern Recognition for Music Genre Classification

Chatterjee, Sivangi, Ganguly, Srishti, Bose, Avik, Prasad, Hrithik Raj, Ghosal, Arijit

arXiv.org Artificial Intelligence

This project explores the application of machine learning techniques for music genre classification using the GTZAN dataset, which contains 100 audio files per genre. Motivated by the growing demand for personalized music recommendations, we focused on classifying five genres--Blues, Classical, Jazz, Hip Hop, and Country--using a variety of algorithms including Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, and Artificial Neural Networks (ANN) implemented via Keras. The ANN model demonstrated the best performance, achieving a validation accuracy of 92.44%. We also analyzed key audio features such as spectral roll-off, spectral centroid, and MFCCs, which helped enhance the model's accuracy. Future work will expand the model to cover all ten genres, investigate advanced methods like Long Short-Term Memory (LSTM) networks and ensemble approaches, and develop a web application for real-time genre classification and playlist generation. This research aims to contribute to improving music recommendation systems and content curation.
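One of the features the abstract credits with improving accuracy, the spectral centroid, is simple enough to show directly. Below is a NumPy-only sketch (a library such as librosa would normally be used); the toy signals are illustrative, not from GTZAN.

```python
import numpy as np

def spectral_centroid(signal, sr):
    """Spectral centroid: the magnitude-weighted mean frequency of a frame.
    Brighter material (e.g. hi-hat-heavy hip hop) pushes the centroid up."""
    mag = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return np.sum(freqs * mag) / np.sum(mag)

sr = 22050
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 220 * t)    # 220 Hz tone: low centroid
high = np.sin(2 * np.pi * 4400 * t)  # 4.4 kHz tone: high centroid
assert spectral_centroid(low, sr) < spectral_centroid(high, sr)
```

Per-frame values of features like this are typically averaged over a clip and fed to the classifiers listed above (Logistic Regression, KNN, Random Forest, ANN).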


Meta-Learning in Audio and Speech Processing: An End to End Comprehensive Review

Raimon, Athul, Masti, Shubha, Sateesh, Shyam K, Vengatagiri, Siyani, Das, Bhaskarjyoti

arXiv.org Artificial Intelligence

This survey overviews meta-learning approaches used in audio and speech processing. Meta-learning is used where model performance must be maximized with a minimum of annotated samples, making it well suited to low-sample audio processing. Although the field has made significant contributions, audio meta-learning still lacks comprehensive survey papers. We present a systematic review of meta-learning methodologies in audio processing, including audio-specific discussions of data augmentation, feature extraction, preprocessing techniques, meta-learners, and task selection strategies, together with important audio datasets and crucial real-world use cases. Through this extensive review, we aim to provide valuable insights and identify future research directions at the intersection of meta-learning and audio processing.
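The task selection the survey discusses typically means episodic (N-way, K-shot) sampling: each "task" is a small classification problem with a support set to adapt on and a query set to evaluate on. A minimal sketch, with toy embeddings standing in for audio features and all names hypothetical:

```python
import numpy as np

def sample_episode(features, labels, n_way=3, k_shot=2, k_query=2, rng=None):
    """Sample one N-way K-shot episode: a support set the meta-learner
    adapts on and a query set it is evaluated on."""
    rng = rng or np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support += list(idx[:k_shot])
        query += list(idx[k_shot:k_shot + k_query])
    return (features[support], labels[support]), (features[query], labels[query])

# toy corpus: 5 "audio" classes, 10 embeddings each
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 16))
y = np.repeat(np.arange(5), 10)
(sx, sy), (qx, qy) = sample_episode(X, y, rng=rng)
assert sx.shape == (6, 16) and qx.shape == (6, 16)
assert set(sy) == set(qy) and len(set(sy)) == 3
```

Meta-learners such as prototypical networks or MAML are then trained across many episodes drawn this way rather than on a single fixed label set.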


Emotion Talk: Emotional Support via Audio Messages for Psychological Assistance

Almada, Fabrycio Leite Nakano, Mariano, Kauan Divino Pouso, Dutra, Maykon Adriell, Monteiro, Victor Emanuel da Silva

arXiv.org Artificial Intelligence

This paper presents "Emotion Talk," a system designed to provide continuous emotional support through audio messages for psychological assistance. The primary objective is to offer consistent support to patients outside traditional therapy sessions by analyzing audio messages to detect emotions and generate appropriate responses. The solution focuses on Portuguese-speaking users, ensuring that the system is linguistically and culturally relevant. This system aims to complement and enhance the psychological follow-up process conducted by therapists, providing immediate and accessible assistance, especially in emergency situations where rapid response is crucial. Experimental results demonstrate the effectiveness of the proposed system, highlighting its potential in applications of psychological support.


NEUROSEC: FPGA-Based Neuromorphic Audio Security

Isik, Murat, Vishwamith, Hiruna, Sur, Yusuf, Inadagbo, Kayode, Dikmen, I. Can

arXiv.org Artificial Intelligence

Neuromorphic systems, inspired by the complexity and functionality of the human brain, have attracted academic and industrial attention due to their unparalleled potential across a wide range of applications. While their capabilities herald innovation, it is imperative to underscore that these computational paradigms, like their traditional counterparts, are not impervious to security threats. Although neuromorphic methodologies for image and video processing have been rigorously explored, neuromorphic audio processing remains in its early stages. Our results highlight the robustness and precision of our FPGA-based neuromorphic system. Specifically, our system achieves a commendable balance between desired signal and background noise, efficient spike rate encoding, and strong resilience against adversarial attacks such as FGSM and PGD. A standout feature of our framework is its 94% detection rate, which, compared to other methodologies, underscores its greater capability to identify and mitigate threats at a commendable SNR of 5.39 dB. Furthermore, neuromorphic computing and hardware security serve many sensor domains in mission-critical and privacy-preserving applications.
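FGSM, one of the attacks the system is evaluated against, perturbs each input sample by a small step in the sign of the loss gradient. A real attack takes that gradient from the target model; the sketch below uses an analytic toy loss so the mechanics are visible, and everything beyond the FGSM update rule itself is an assumption.

```python
import numpy as np

def fgsm(x, grad, eps=0.01):
    """Fast Gradient Sign Method: move each sample by eps in the
    direction that increases the loss, bounded in the L-infinity norm."""
    return x + eps * np.sign(grad)

# toy setup: linear "classifier" loss L(x) = w @ x, so grad_x L = w
rng = np.random.default_rng(0)
x = rng.standard_normal(128)  # clean audio frame
w = rng.standard_normal(128)  # gradient of the loss w.r.t. x
x_adv = fgsm(x, w, eps=0.01)
assert np.max(np.abs(x_adv - x)) <= 0.01 + 1e-12  # bounded perturbation
assert w @ x_adv > w @ x                          # loss strictly increases
```

PGD is essentially this step applied iteratively with projection back into the eps-ball, which is why it is the stronger of the two attacks.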


Neural Harmonium: An Interpretable Deep Structure for Nonlinear Dynamic System Identification with Application to Audio Processing

Helwani, Karim, Soltanmohammadi, Erfan, Goodwin, Michael M.

arXiv.org Artificial Intelligence

Improving the interpretability of deep neural networks has recently gained increased attention, especially when the power of deep learning is leveraged to solve problems in physics. Interpretability helps us understand a model's ability to generalize and reveals its limitations. In this paper, we introduce a causal, interpretable deep structure for modeling dynamic systems. Our proposed model makes use of harmonic analysis by modeling the system in the time-frequency domain while maintaining high temporal and spectral resolution. Moreover, the model is built in an order-recursive manner, which allows for fast, robust, and exact second-order optimization without an explicit Hessian calculation. To circumvent the resulting high dimensionality of the building blocks of our system, a neural network is designed to identify the frequency interdependencies. The proposed model is illustrated and validated on nonlinear system identification problems as required for audio signal processing tasks. Crowd-sourced experimentation contrasting the performance of the proposed approach with other state-of-the-art solutions on an acoustic echo cancellation scenario confirms the effectiveness of our method for real-life applications.
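The time-frequency representation the model operates in is obtained by a short-time Fourier transform: frame the signal, window each frame, and FFT it. A minimal NumPy sketch of that transform (not the authors' architecture; window and hop sizes are arbitrary choices here):

```python
import numpy as np

def stft(x, win=256, hop=128):
    """Short-time Fourier transform: frame, window, and FFT the signal,
    yielding a (frames x frequency-bins) time-frequency representation."""
    window = np.hanning(win)
    frames = [x[i:i + win] * window for i in range(0, len(x) - win + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)  # 1 kHz tone
S = stft(x)
peak_bin = np.argmax(np.abs(S).mean(axis=0))
assert abs(peak_bin * sr / 256 - 1000) < sr / 256  # energy peaks near 1 kHz
```

The trade-off between `win` and `hop` is exactly the temporal-versus-spectral resolution balance the abstract says the model maintains.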


Asca: less audio data is more insightful

Li, Xiang, Chen, Junhao, Li, Chao, Lv, Hongwu

arXiv.org Artificial Intelligence

Audio recognition in specialized areas such as birdsong and submarine acoustics faces challenges in large-scale pre-training due to the limited samples imposed by sampling environments and specificity requirements. While the Transformer model excels at audio recognition, its dependence on vast amounts of data becomes restrictive in resource-limited settings. Addressing this, we introduce the Audio Spectrogram Convolution Attention (ASCA) model based on CoAtNet, integrating a Transformer-convolution hybrid architecture, a novel network design, and attention techniques, further augmented with data augmentation and regularization strategies. On BirdCLEF2023 and AudioSet (Balanced), ASCA achieved accuracies of 81.2% and 35.1%, respectively, significantly outperforming competing methods. The unique structure of our model enriches its output, enabling generalization across various audio detection tasks. Our code can be found at https://github.com/LeeCiang/ASCA.
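The Transformer half of a convolution-attention hybrid like this rests on scaled dot-product self-attention over spectrogram patch embeddings. A minimal NumPy sketch of that mechanism (illustrative only; ASCA's actual blocks follow CoAtNet and live in the linked repository):

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: each patch embedding attends to all
    others, weighted by softmax-normalized similarity scores."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
patches = rng.standard_normal((16, 32))     # 16 spectrogram patches, dim 32
out = attention(patches, patches, patches)  # self-attention
assert out.shape == (16, 32)
```

In a CoAtNet-style hybrid, convolutional stages first extract local spectrogram structure cheaply, and attention stages like this one then model global context, which is what helps in the low-data regimes the paper targets.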


Self-driving cars offer chance to re-imagine sound systems

#artificialintelligence

As auto manufacturers continue developing tomorrow's self-driving cars, a parallel design process is taking shape. As soon as engineers perfect the autonomous vehicle, a new kind of passenger experience will take center stage, one that includes the sophisticated integration of rich, enveloping audio and video. Self-driving cars won't be used only for transportation. Passengers will conveniently be able to work as they travel, holding video meetings and conference calls. Or they will enjoy music, TV, and movies much as they would in their own home theaters.


NeurIPS 2020

#artificialintelligence

Back in February, when AI conferences were still held in person, Turing Award winners Geoffrey Hinton, Yann LeCun and Yoshua Bengio shared a stage in New York at an AAAI event, which Synced covered in detail. LeCun told the audience that, after decades of skepticism, he had finally joined Hinton in support of the idea that self-supervised learning may usher in AI's next revolution. Unlike supervised learning, which requires manual data labelling, self-supervised learning (SSL) is an approach that can automatically generate labels. Recent improvements in self-supervised training methods have established SSL as a serious alternative to traditional supervised training. Google's language representation model ALBERT, for example, utilizes a self-supervised training framework to leverage large amounts of text. It's no surprise, then, that NeurIPS 2020 (the Conference on Neural Information Processing Systems) would find itself at the forefront of this trend.


The promise of AI in audio processing – Towards Data Science

#artificialintelligence

We have seen a rise of AI technologies for image and video processing. Even though things tend to take a little longer to make it to the world of audio, here too we have seen impressive technological advances. In this article, I will summarize some of these advances, outline further potential of AI in audio processing, and describe some of the possible pitfalls and challenges we might encounter in pursuing this cause. The kicker for my interest in AI use cases for audio processing was the publication of Google DeepMind's "WaveNet," a deep learning model for generating audio recordings [1], released at the end of 2016. Using an adapted network architecture, a dilated convolutional neural network, DeepMind researchers succeeded in generating very convincing text-to-speech and some interesting music-like recordings trained on classical piano recordings.
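The dilated convolution at the heart of WaveNet is easy to sketch: each output sample mixes past inputs spaced `dilation` steps apart, so stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially with depth. A minimal NumPy illustration (the function name and two-tap filter are my own, not WaveNet's):

```python
import numpy as np

def dilated_causal_conv(x, taps, dilation):
    """Causal 1-D convolution with dilated taps: output at time t depends
    only on inputs at t, t - dilation, t - 2*dilation, ..."""
    y = np.zeros_like(x)
    for k, w in enumerate(taps):
        shift = k * dilation
        if shift == 0:
            y += w * x
        else:
            y[shift:] += w * x[:-shift]  # shift = causal: only past inputs
    return y

x = np.zeros(16)
x[0] = 1.0                                 # unit impulse
y = dilated_causal_conv(x, [1.0, 1.0], 4)  # taps at lags 0 and 4
assert y[0] == 1.0 and y[4] == 1.0 and y.sum() == 2.0
```

Causality (no access to future samples) is what lets such a network generate audio one sample at a time.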