Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries
Sulun, Serkan, Viana, Paula, Davies, Matthew E. P.
We introduce EMSYNC, a video-based symbolic music generation model that aligns music with a video's emotional content and temporal boundaries. It follows a two-stage framework, where a pretrained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate and align musical chords with scene cuts. Unlike existing models, our approach retains event-based encoding, ensuring fine-grained timing control and expressive musical nuances. We also propose a mapping scheme to bridge the video emotion classifier, which produces discrete emotion categories, with the emotion-conditioned MIDI generator, which operates on continuous-valued valence-arousal inputs. In subjective listening tests, EMSYNC outperforms state-of-the-art models across all subjective metrics, for both music-theory-aware participants and general listeners.
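The abstract only names these two mechanisms; as a minimal sketch of what they could look like, the Python below maps discrete emotion categories to continuous valence-arousal coordinates and computes signed boundary offsets from musical event times to the nearest scene cut. The function names, emotion labels, and coordinate values are illustrative assumptions, not EMSYNC's actual implementation.

```python
# Hypothetical sketch of the two conditioning mechanisms described in the
# abstract. Category labels and valence-arousal values are made-up
# placeholders, not the paper's mapping.
from typing import List, Tuple

# Assumed bridge from a classifier's discrete emotion categories to
# continuous (valence, arousal) coordinates in [-1, 1]^2.
EMOTION_TO_VA = {
    "joy":     ( 0.8,  0.6),
    "sadness": (-0.7, -0.5),
    "anger":   (-0.6,  0.8),
    "calm":    ( 0.5, -0.6),
}

def emotion_to_valence_arousal(category: str) -> Tuple[float, float]:
    """Map a discrete emotion label to a (valence, arousal) pair."""
    return EMOTION_TO_VA[category]

def boundary_offsets(event_times: List[float],
                     scene_cuts: List[float]) -> List[float]:
    """For each musical event time, return the signed offset (seconds) to
    the nearest scene cut. A positive offset means the cut is still ahead,
    so a generator conditioned on it can anticipate the boundary (e.g. by
    placing a chord exactly on the cut)."""
    return [min((cut - t for cut in scene_cuts), key=abs)
            for t in event_times]

if __name__ == "__main__":
    print(emotion_to_valence_arousal("joy"))              # (0.8, 0.6)
    print(boundary_offsets([0.0, 2.5, 5.1], [2.4, 5.0]))  # approx [2.4, -0.1, -0.1]
```

In this sketch the offsets, rather than absolute timestamps, are what a conditional generator would consume alongside the valence-arousal pair, which matches the abstract's claim that the model can anticipate upcoming scene cuts while retaining event-based timing.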
arXiv.org Artificial Intelligence
Feb-14-2025
- Country:
- Europe > United Kingdom
- England > Oxfordshire > Oxford (0.04)
- North America > United States
- Nebraska (0.04)
- New York > New York County
- New York City (0.04)
- Genre:
- Research Report > Promising Solution (0.34)
- Industry:
- Leisure & Entertainment (1.00)
- Media > Music (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Cognitive Science > Emotion (0.68)
- Machine Learning > Neural Networks (0.46)
- Representation & Reasoning (0.66)
- Vision (1.00)