DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
Yi Yuan¹, Xubo Liu¹, Haohe Liu¹, Xiyuan Kang¹, Zhuo Chen², Yuxuan Wang², Mark D. Plumbley¹, Wenwu Wang¹
¹CVSSP, University of Surrey, Guildford, UK
²Bytedance, Shenzhen, China
Abstract
With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short of precisely controlling the fine-grained acoustic characteristics of specific sounds. As a result, users who need specific sound content may find it challenging to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework that is designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the customized systems. The experiments show that the proposed model, DreamAudio, generates audio samples that are highly consistent with the customized audio features and well aligned with the input text prompts. Furthermore, DreamAudio offers comparable performance in general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as the benchmark for customized generation tasks.
Customized Text-to-Audio Generation
We are given only a few audio-language reference pairs $\{(\mathbf{a}_1, \mathbf{t}_1), (\mathbf{a}_2, \mathbf{t}_2), \dots, (\mathbf{a}_k, \mathbf{t}_k)\}$, where each reference audio clip $\mathbf{a}_i$ is associated with a caption $\mathbf{t}_i$ for $i = 1, \dots, k$. These reference concepts can be either user-defined or automatically retrieved audio-caption pairs. Formally, our task is to generate the target audio $\mathbf{a}$ conditioned on the input textual prompt $c_t$ and the reference audios $\mathbf{a}_i$ with their corresponding captions $\mathbf{t}_i$.
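The inputs of the task formulated above can be sketched as a small data structure. This is only an illustrative sketch and not the DreamAudio implementation: the names `ReferencePair`, `CTTARequest`, and `build_request`, as well as the file names, are hypothetical and chosen here for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ReferencePair:
    """One reference concept (a_i, t_i). Hypothetical container."""
    audio_path: str  # location of the reference audio clip a_i (assumed to be a file)
    caption: str     # caption t_i describing the clip


@dataclass(frozen=True)
class CTTARequest:
    """All conditioning inputs for one customized generation. Hypothetical container."""
    prompt: str                            # input textual prompt c_t
    references: Tuple[ReferencePair, ...]  # the k reference pairs


def build_request(prompt: str, references: Sequence[ReferencePair]) -> CTTARequest:
    """Validate the inputs and bundle them into one request."""
    if not prompt.strip():
        raise ValueError("the text prompt must be non-empty")
    if not references:
        raise ValueError("at least one reference pair is required")
    for ref in references:
        if not ref.caption.strip():
            raise ValueError(f"reference {ref.audio_path!r} has an empty caption")
    return CTTARequest(prompt=prompt.strip(), references=tuple(references))


if __name__ == "__main__":
    request = build_request(
        "a Godzilla roar, followed by a small dog barks",
        [
            ReferencePair("ref_dog.wav", "a small dog barks"),
            ReferencePair("ref_godzilla.wav", "a Godzilla roar"),
        ],
    )
    print(len(request.references), "reference pairs for:", request.prompt)
```

A generation model would take such a request as its conditioning input; how the prompt and the reference pairs are encoded and fused is specific to the model and is not shown here.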
Demos on Customized-Concatenation
Demos on Customized-Overlay
Demos on Customized-Fantasy
Customized-Fantasy (Full)
WAV | Caption | Reference 1 | Reference 2 | Reference 3 |
---|---|---|---|---|
fancy_1.wav | a Godzilla roar, and then a faint explosion rumbles in the distance, followed by a small dog barks. | a small dog barks | a Godzilla roar | a faint explosion rumbles in the distance |
fancy_2.wav | a child plays with Mario coin sounds, followed by someone cheers. | a child plays with | Mario coin sounds | someone cheers |
fancy_3.wav | two minion chatter with each other, then a train horn sounds in the distance, followed by the champagne popped at the end. | a train horn sounds in the distance | Minion chatter with each other | the champagne has been popped |
fancy_4.wav | someone close the door, then a train passes by and a cat knocks a glass off the table. | a train passes by | a cat knocks a glass off the table | the sound of a closing door |
fancy_5.wav | the doorbell rings, and a mother is playing with baby, followed by a Mario jump sound. | a mother is playing with baby | the doorbell rings | a Mario jump sound |
fancy_6.wav | an airplane flies overhead, followed by someone running and taking photos. | an airplane flies overhead | someone running | a camera shutter clicks |
fancy_7.wav | the microwave dings, then a Pikachu thunderbolt effect plays, and finally someone closes the door. | someone closes the door | a Pikachu thunderbolt effect plays | the microwave dings |
fancy_8.wav | footsteps clatter on tile, a T-Rex roars and a cat hisses in surprise. | footsteps clatter on tile | a T-Rex roars | a cat hisses in surprise |
fancy_9.wav | horror music plays and a car driving through, followed by a Godzilla roar. | horror music plays | a Godzilla roar | a car driving through |
fancy_10.wav | a bird is chirping outside while someone closes the door, followed by a child cries. | a child cries | someone closes the door | a bird chirps outside |
fancy_11.wav | it's raining heavily outside, then an umbrella snaps open, followed by a distant car horn honking twice. | heavy rain outside | an umbrella snaps open | a distant car horn honks twice |
fancy_12.wav | a baby is smiling happily, a Transformer changes into a car, and start the engine. | a baby smiles happily | a Transformer changes into a car | start the engine |
fancy_13.wav | a car drives by fast and a Minion make the ambulance sounds, with small bird chirps. | a small bird chirps | a Minion make the ambulance sounds | a car drives by quickly |
fancy_14.wav | the audience starts to clap hands, then a sudden gunshot is heard, followed by a Street Fighter KO sound. | a sudden gunshot | a Street Fighter KO sound | the audience clap hands |
fancy_15.wav | a bus engine revs outside with a horn, and a dog barking. | a bus engine revs outside | the bus horn sounds | a dog barking |
fancy_16.wav | Some laser gun shooting with the Avengers theme playing, and then a door is closed. | a door is opened | the Avengers theme music is heard | laser gun shooting |
fancy_17.wav | someone uses the keys and closes the door, then a bicycle bell rings outside. | someone unlocks and closes the door | keys jingle | a bicycle bell rings outside |
fancy_18.wav | someone running down and knock the door, a faint Poke Ball recall whoosh is heard. | someone locks the front door | a faint Poke Ball recall whoosh is heard | someone running down the driveway |
fancy_19.wav | with some elegant music, the waiter pops the champagne and pours it into glasses. | elegant music | the waiter pops the champagne | champagne is poured |
fancy_20.wav | someone flicks a lighter, followed by a firework blast. | flicks a lighter | a firework blast | |
fancy_21.wav | a Godzilla is fighting with exciting music, followed by minions talk | a Godzilla is fighting | minions talk | exciting music |
fancy_22.wav | a battle music with some Star Wars laser gun sounds, and then the planet explodes. | battle music | Star Wars laser gun sounds | the planet explodes |
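For readers who want to work with the table above programmatically, the following is a minimal sketch of how a pipe-delimited row could be turned into a record. It assumes the column order shown in the table header (WAV, Caption, then up to three references) and treats an empty reference cell, as in `fancy_20.wav`, as absent. The names `FantasyEntry` and `parse_row` are hypothetical and are not part of any released DreamAudio tooling.

```python
from typing import List, NamedTuple


class FantasyEntry(NamedTuple):
    """One row of the Customized-Fantasy table. Hypothetical record type."""
    wav: str               # file name of the generated sample
    caption: str           # text prompt used for generation
    references: List[str]  # captions of the reference concepts (empty cells dropped)


def parse_row(row: str) -> FantasyEntry:
    """Parse one 'wav | caption | ref 1 | ref 2 | ref 3 |' table row."""
    cells = [cell.strip() for cell in row.split("|")]
    if len(cells) < 2 or not cells[0] or not cells[1]:
        raise ValueError(f"row needs a WAV name and a caption: {row!r}")
    # Everything after the caption is a reference; skip blank cells.
    references = [cell for cell in cells[2:] if cell]
    return FantasyEntry(wav=cells[0], caption=cells[1], references=references)


if __name__ == "__main__":
    row = ("fancy_20.wav | someone flicks a lighter, followed by a firework blast. "
           "| flicks a lighter | a firework blast | |")
    entry = parse_row(row)
    print(entry.wav, len(entry.references))
```

Splitting on the pipe character is sufficient here because none of the captions in the table contain one; a general Markdown table parser would also need to handle escaped pipes.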
Demos on AudioCaps
Acknowledgement
This research was partly supported by a research scholarship from the China Scholarship Council (CSC), funded by British Broadcasting Corporation Research and Development (BBC R&D), Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 'AI for Sound', and a PhD scholarship from the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.
Page updated on 8 September 2025