Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis

Zhiqi Huang1* Dan Luo1* Jun Wang2† Huan Liao1 Zhiheng Li1† Zhiyong Wu1†

*Equal Contribution    †Corresponding author    1Tsinghua University    2Tencent AI Lab

arxiv[Paper]      github[Code]      huggingface[Demo]      YouTube[YouTube]

Abstract

We introduce a framework for video-to-audio synthesis that addresses two problems: desynchronization between audio and video, and loss of semantic content in the generated audio. By incorporating a Semantic Alignment Adapter and a Temporal Synchronization Adapter, our method significantly improves semantic integrity and the precision of beat-point synchronization, particularly in fast-paced action sequences. The model uses a contrastive audio-visual pre-trained encoder and is trained on video paired with high-quality audio, which improves the quality of the generated audio. The dual-adapter design gives users finer control over audio semantics and beat effects, and they can adjust the controller to refine the result. Extensive experiments substantiate the effectiveness of our framework in achieving seamless audio-visual alignment.

Method



Given a silent video, Rhythmic Foley generates audio that is both semantically aligned with the visual content and precisely timed to match key moments. The framework builds on a video-to-audio synthesis model. During training, we use an audio-visual alignment encoder to encode the data and train two adapters separately. The semantic alignment adapter integrates audio event information from Mini-Gemini with the approximate temporal intervals of these events, as detected by a visual detector. The temporal synchronization adapter receives rhythmic cues extracted by an energy detector. At inference, users who need finer control can supply temporal conditions through our annotation tool. By merging the two adapters and tuning their respective weights, the model generates high-quality sound effects.
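
The two ideas in this paragraph, energy-based rhythmic cues and weighted merging of adapter outputs, can be illustrated with a small sketch. This is not the authors' implementation: the page does not specify how the energy detector works or how the adapters are merged. The sketch assumes a short-time RMS energy detector with a fixed threshold and a simple weighted sum of adapter outputs. All function names (`frame_energy`, `detect_onsets`, `merge_adapters`) and parameters (`frame_len`, `hop`, `threshold`, `w_sem`, `w_temp`) are hypothetical.

```python
import numpy as np


def frame_energy(audio, frame_len=1024, hop=512):
    """Short-time RMS energy of a mono signal, one value per frame.

    Assumption: RMS over fixed-size overlapping frames stands in for
    the "energy detector"; the actual detector is not specified.
    """
    n_frames = 1 + max(0, (len(audio) - frame_len) // hop)
    frames = np.stack(
        [audio[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    return np.sqrt(np.mean(frames ** 2, axis=1))


def detect_onsets(energy, threshold=0.1):
    """Indices of frames where energy first rises above the threshold."""
    active = energy > threshold
    # A rising edge is an active frame whose predecessor was inactive.
    previous = np.concatenate(([False], active[:-1]))
    return np.flatnonzero(active & ~previous)


def merge_adapters(base, semantic, temporal, w_sem=1.0, w_temp=1.0):
    """Weighted sum of two adapter outputs added to a base feature.

    Assumption: "merging" is modeled as a linear combination with
    user-tunable weights; the real merge rule may differ.
    """
    return base + w_sem * semantic + w_temp * temporal


# Toy signal: one second at 16 kHz with two short bursts.
audio = np.zeros(16000)
audio[4096:6144] = 0.5
audio[10240:12288] = 0.5
onsets = detect_onsets(frame_energy(audio)).tolist()
```

In this toy setting, raising `w_temp` relative to `w_sem` would emphasize the timing signal over the semantic one, which mirrors the weight tuning described above.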

Demos

BibTeX

@article{huang2024rhythmicfoleyframeworkseamless,
title={Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis},
author={Zhiqi Huang and Dan Luo and Jun Wang and Huan Liao and Zhiheng Li and Zhiyong Wu},
year={2024}
}