Enhanced Sound Event Localization and Detection in Real 360-degree audio-visual soundscapes

Roman, Adrian S., Balamurugan, Baladithya, Pothuganti, Rithik

Jan-29-2024–arXiv.org Artificial Intelligence

This technical report details our work towards building an enhanced audio-visual sound event localization and detection (SELD) network. We build on top of the audio-only SELDnet23 model and adapt it to be audio-visual by merging both audio and video information prior to the gated recurrent unit (GRU) of the audio-only network.

For this reason, the sound localization performance strongly depends on the video content [10]. This makes models prone to erroneous SELD on frames with no audio or uncorrelated audio activity. We introduce a visual branch into the audio-only SELDnet23 baseline from the Classification of Acoustic Scenes and ...
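To make the idea of merging audio and video information prior to the GRU more concrete, here is a minimal sketch in plain Python. The report excerpt does not say how the two streams are combined or aligned, so both choices here are assumptions: the fusion is a simple per-frame concatenation along the feature axis, and each audio frame is paired with the video frame covering the same relative position in time. The function name `fuse_before_gru` and the toy feature sizes are hypothetical.

```python
from typing import List

Frame = List[float]


def fuse_before_gru(audio: List[Frame], video: List[Frame]) -> List[Frame]:
    """Concatenate audio and video features frame by frame.

    Assumption: the video stream has fewer frames than the audio stream,
    so each video frame is repeated to cover its share of audio frames.
    The result is the sequence a recurrent layer (e.g. a GRU) would consume.
    """
    if not audio or not video:
        raise ValueError("both streams need at least one frame")
    t_audio, t_video = len(audio), len(video)
    fused: List[Frame] = []
    for t in range(t_audio):
        # Index of the video frame at the same relative time position.
        v = min(t * t_video // t_audio, t_video - 1)
        # List concatenation joins the two feature vectors end to end.
        fused.append(audio[t] + video[v])
    return fused


# Toy example: 4 audio frames (3 features each), 2 video frames (2 features each).
audio = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
video = [[1.0, 2.0], [3.0, 4.0]]
fused = fuse_before_gru(audio, video)
print(len(fused), len(fused[0]))  # prints: 4 5
```

In this sketch the fused sequence keeps the audio frame rate and widens each frame by the video feature size, so a recurrent layer placed after it would see both modalities at every time step.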

