Text-driven Foley Sound Generation With Latent Diffusion Model

Yi Yuan
Haohe Liu
Xubo Liu
Xiyuan Kang
Mark D. Plumbley
Wenwu Wang

University of Surrey, UK


Abstract

Foley sound generation aims to synthesise the background sound for multimedia content, which involves computationally modelling sound effects with specialised techniques. Previous models usually employ a large development set with labels as input (e.g., single numbers or one-hot vectors). In this work, we propose a diffusion-based system for Foley sound generation. To alleviate the data scarcity issue, our model is first pre-trained on large-scale datasets and then fine-tuned for this task via transfer learning. We have observed that the feature embedding extracted by the input encoder can significantly affect the performance of the generation model. Hence, a trainable component for adapting the input embedding is introduced into the system to seek the optimal embedding for each class. In addition, we adopt a filtering strategy to further refine the output, i.e., selecting the best results from the generated candidate clips according to the similarity score between the sound and the target label. With the proposed method, our submitted system ranked 1st in Task 7 of the DCASE 2023 Challenge (DCASE2023-T7). Ablation studies show that the proposed techniques significantly improve the final performance of the system. Our system is fully open-sourced.
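As an illustration of the filtering step, the sketch below ranks the candidate clips generated for a class by the cosine similarity between their audio embeddings and the embedding of the target label text, keeping only the top-scoring clips. This is a minimal sketch assuming CLAP-style audio and text embeddings have already been computed; the function name and candidate count are illustrative rather than the exact interface of our released code.

import torch
import torch.nn.functional as F

def select_best_candidates(audio_embs: torch.Tensor, text_emb: torch.Tensor, top_k: int = 1) -> torch.Tensor:
    """Rank generated clips by audio-text cosine similarity and return the indices of the top-k clips.

    audio_embs: (num_candidates, dim) CLAP-style audio embeddings of the candidate clips.
    text_emb:   (dim,) CLAP-style text embedding of the target label description.
    """
    scores = F.cosine_similarity(audio_embs, text_emb.unsqueeze(0), dim=-1)  # (num_candidates,)
    return scores.topk(top_k).indices

# Toy usage with random embeddings standing in for real CLAP outputs.
if __name__ == "__main__":
    torch.manual_seed(0)
    candidate_embs = torch.randn(16, 512)  # e.g. 16 generated clips for one class
    label_emb = torch.randn(512)           # embedding of the wrapped text, e.g. "A dog bark"
    print(select_best_candidates(candidate_embs, label_emb, top_k=4))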

Foley sound generation system

Our proposed system is based on a structure widely used for sound generation, which consists of an encoder, a generator, a decoder and a vocoder. Instead of directly using labels as input, we employ a wrapping strategy that generates a text description for each label as the initial input to the system, e.g. the text "someone using keyboard" for the label "3, Keyboard". A trainable embedding is then concatenated with this input embedding, guided by the LDM loss, to seek the most suitable embedding for each sound class. During the sampling stage, the system takes the text embedding from CLAP as the condition and uses the LDM to generate the intermediate latent representation (tokens) of the sound features. The mel-spectrogram is then decoded from these tokens by the VAE decoder and reconstructed into a waveform by the GAN vocoder.
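The sampling stage described above can be summarised with the sketch below. It assumes the pretrained components (CLAP text encoder, LDM, VAE decoder and GAN vocoder) are available as callables passed into the function; the wrapped text prompts match those listed in the results section, while the function names, tensor shapes and stand-in components are placeholders rather than the exact API of our released code.

import torch

# Wrapped text prompts for the seven DCASE2023-T7 class labels (as listed below).
LABEL_TO_TEXT = {
    0: "A dog bark",
    1: "Foot steps",
    2: "Gun shot",
    3: "Someone using keyboard",
    4: "A moving motor",
    5: "Rain",
    6: "A man cough",
}

def generate_foley(label, encode_text, class_embeddings, sample_latent, decode_mel, vocode):
    """Sketch of the sampling pipeline; every component is passed in as a callable."""
    text = LABEL_TO_TEXT[label]                                   # wrap the label into a text prompt
    text_emb = encode_text(text)                                  # CLAP text embedding, shape (d_text,)
    cond = torch.cat([text_emb, class_embeddings[label]], dim=0)  # append the trainable per-class embedding
    latent = sample_latent(cond)                                  # LDM sampling of the latent tokens
    mel = decode_mel(latent)                                      # VAE decoder -> mel-spectrogram
    return vocode(mel)                                            # GAN vocoder -> waveform

# Toy usage with zero-tensor stand-ins for the real pretrained components.
if __name__ == "__main__":
    class_embeddings = torch.nn.Embedding(7, 128).weight          # one trainable embedding per class
    waveform = generate_foley(
        label=0,
        encode_text=lambda s: torch.zeros(512),
        class_embeddings=class_embeddings,
        sample_latent=lambda cond: torch.zeros(8, 16, 16),
        decode_mel=lambda z: torch.zeros(64, 500),
        vocode=lambda mel: torch.zeros(88200),                    # waveform stand-in
    )
    print(waveform.shape)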



Sound generation results

We present sound samples generated by our proposed system for seven different classes:

(1) Dog bark

(2) Footstep

(3) Gunshot

(4) Keyboard

(5) Moving motor vehicle

(6) Rain

(7) Sneeze/cough





Input label: "0, DogBark"

Wrapped text: "A dog bark"
sample 1
sample 2
sample 3
sample 4

Input label: "1, Footstep"

Wrapped text: "Foot steps"
sample 1
sample 2
sample 3
sample 4

Input label: "2, GunShot"

Wrapped text: "Gun shot"
sample 1
sample 2
sample 3
sample 4

Input label: "3, Keyboard"

Wrapped text: "Someone using keyboard"
sample 1
sample 2
sample 3
sample 4

Input label: "4, Moving_motor_vehicle"

Wrapped text: "A moving motor"
sample 1
sample 2
sample 3
sample 4

Input label: "5, Rain"

Wrapped text: "Rain"
sample 1
sample 2
sample 3
sample 4

Input label: "6, Sneeze_cough"

Wrapped text: "A man cough"
sample 1
sample 2
sample 3
sample 4


Technical Report

Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang

Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7


[Bibtex]   [ArXiv]   [Code]




Acknowledgements

This research was partly supported by a research scholarship from the China Scholarship Council (CSC) No. 202208060240, the British Broadcasting Corporation Research and Development (BBC R&D), the Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 "AI for Sound", and a PhD scholarship from the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.