DreamAudio: Customized Text-to-Audio Generation with Diffusion Models



Yi Yuan¹, Xubo Liu¹, Haohe Liu¹, Xiyuan Kang¹, Zhuo Chen²,
Yuxuan Wang², Mark D. Plumbley¹, Wenwu Wang¹

¹CVSSP, University of Surrey, Guildford, UK     ²Bytedance, Shenzhen, China


Paper | Code (coming soon)


Abstract

With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short of precisely controlling the fine-grained acoustic characteristics of specific sounds. As a result, users who need specific sound content may find it challenging to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the customized systems. The experiments show that the proposed model, DreamAudio, generates audio samples that are highly consistent with the customized audio features and well aligned with the input text prompts. Furthermore, DreamAudio offers comparable performance in general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as the benchmark for customized generation tasks.

Customized Text-to-Audio Generation

We are given only a few audio-language reference pairs $\{(\mathbf{a}_1, \mathbf{t}_1), (\mathbf{a}_2, \mathbf{t}_2), \dots, (\mathbf{a}_k, \mathbf{t}_k)\}$, where each reference audio clip $\mathbf{a}_i$ ($i = 1, \dots, k$) is associated with a caption $\mathbf{t}_i$. These reference concepts can be either user-defined or automatically retrieved audio-caption pairs. Formally, our task is to generate the target audio $\mathbf{a}$ conditioned on the input textual prompt $c_t$ and the reference audio clips $\mathbf{a}_i$ with their corresponding captions $\mathbf{t}_i$.
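The inputs of this task can be written down as a small data structure. The following is a minimal sketch only, not the DreamAudio implementation: the names `ReferencePair`, `CTTARequest`, and `missing_concepts` are hypothetical, the waveform representation is an assumption, and the substring check is just a rough sanity test, since demo prompts often paraphrase the reference captions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ReferencePair:
    """One reference concept: an audio clip and its caption (hypothetical type)."""
    audio: Sequence[float]  # mono waveform samples; representation is an assumption
    caption: str


@dataclass(frozen=True)
class CTTARequest:
    """Task inputs: a text prompt plus a few reference pairs (hypothetical type)."""
    prompt: str
    references: List[ReferencePair]

    def __post_init__(self) -> None:
        # The task is defined over at least one captioned reference clip.
        if not self.references:
            raise ValueError("CTTA needs at least one reference pair")
        for ref in self.references:
            if not ref.caption.strip():
                raise ValueError("every reference clip needs a caption")


def missing_concepts(request: CTTARequest) -> List[str]:
    """Return reference captions that do not appear verbatim in the prompt."""
    prompt = request.prompt.lower()
    return [r.caption for r in request.references
            if r.caption.lower() not in prompt]


if __name__ == "__main__":
    silence = [0.0] * 16000  # placeholder waveform, not real audio
    request = CTTARequest(
        prompt="someone flicks a lighter, followed by a firework blast",
        references=[
            ReferencePair(silence, "flicks a lighter"),
            ReferencePair(silence, "a firework blast"),
        ],
    )
    print(missing_concepts(request))
```

A generation system would consume such a request and return the target audio; how the references condition the model is specified in the paper, not in this sketch.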

Figure 1: Framework of DreamAudio.



Demos on Customized-Concatenation

Each row pairs a text query with audio generated by DreamAudio, Tango, ReAudioLDM, AudioGen, and MakeAnAudio, plus up to four reference clips (Reference 1 to Reference 4) with their captions.

Text Query: "a slap and background noise followed by a bell rings in the background followed by a sound effect and another sound effect"
Reference captions: "a slap and background noise" | "a bell rings in the background" | "a sound effect" | "another sound effect"

Text Query: "a woman and man are speaking, and a slapping sound is heard along with human sounds and mechanisms"
Reference captions: "a woman and man are speaking, and a slapping sound is heard"

Text Query: "men speaking, a horse neighing, and music, then and are heard followed by boing sounds"
Reference captions: "men speaking and a horse neighing" | "Music" | "Boing sounds"

Text Query: "the sound of drilling then firework rocket is being fired ,followed by water is hitting a sauna heater"
Reference captions: "the sound of drilling" | "firework rocket is being fired"

Text Query: "a guitar is playing a note , follow by sound of engine running and people drilling, and then a bike is breaking on concrete"
Reference captions: "a guitar is playing a note" | "sound of engine running"

Text Query: "children playing then pig bleating then a car horn"
Reference captions: "pig bleating"



Demos on Customized-Overlay

Each row pairs a text query with audio generated by DreamAudio, Tango, ReAudioLDM, AudioGen, and MakeAnAudio, plus up to three reference clips (Reference Base, Reference Front, Reference Back) with their captions.

Text Query: "a door is being unlocked with a key then an alarm clock chirps and mechanisms sound at the sametime, sound of airplane engine"
Reference Base: "an alarm clock chirps and mechanisms sound"
Reference Front: "a door is being unlocked with a key"
Reference Back: "sound of airplane engine"

Text Query: "a cymbal is crashing then a woman is singing with noise"
Reference Base: "a woman is singing with noise"
Reference Front: "a cymbal is crashing"

Text Query: "people are drilling in the background, men speak, mechanisms make noise, then a sound is being played"
Reference Base: "men speak, mechanisms make noise"
Reference Front: "people are drilling"
Reference Back: "a sound is being played"

Text Query: "Instant explosions with food processor blends and people speaking, birds chirping and a child is speaking in the background, someone is hitting a beer bottle with a hammer"
Reference Base: "food processor blends and people speaking, birds chirping and a child is speaking"
Reference Front: "instant explosions"
Reference Back: "someone is hitting a beer bottle with a hammer"

Text Query: "amen is being repeated in the background, rain is hitting a surface with some sounds of dog bark"
Reference Base: "rain is hitting a surface"
Reference Front: "amen is being repeated"
Reference Back: "some sounds of dog bark"

Text Query: "jazzy is played while chewing and background noise are heard then dog barking"
Reference Base: "chewing and background noise are heard"
Reference Front: "jazzy is played"
Reference Back: "dog barking"



Demos on Customized-Fantasy

Each row pairs a text query with audio generated by DreamAudio and up to three reference clips (Reference 1 to Reference 3) with their captions.

Text Query: "a gozila is fighting with exciting music, followed by minions talk"
Reference captions: "a gozila is fighting" | "minion talk" | "exciting music"

Text Query: "with some elegant music, the waiter pops the champagne and pours it into glasses"
Reference captions: "elegant music" | "the waiter pops the champagne" | "champagne is poured"

Text Query: "someone close the door, then a train passes by and a cat knocks a glass off the table"
Reference captions: "a train passes by" | "cat knocks a glass off the table" | "closing door"

Text Query: "the microwave dings, then a Pikachu thunderbolt effect plays, and finally someone closes the door"
Reference captions: "someone closes the door" | "a Pikachu thunderbolt effect plays" | "the microwave dings"

Text Query: "horror music plays and a car dring through, followed by a Godzilla roar"
Reference captions: "horror music plays" | "a Godzilla roar" | "a car dring through"

Text Query: "a car drives by fast and a Minion make the ambluence sounds, with small bid chrips"
Reference captions: "a small bird chirps" | "a Minion make the ambluence sounds" | "a car drives by quickly"

Text Query: "the audience starts to clap hands, then a sudden gunshot is heard, followed by a Street Fighter KO sound"
Reference captions: "a sudden gunshot" | "a Street Fighter KO sound" | "the audience clap hands"

Text Query: "a bus engine revs outside with a horn, and a dog barking"
Reference captions: "a bus engine revs outside" | "the bus horn sounds" | "dog barking"

Text Query: "some lazer gun shootingwith the Avengers theme playing, and then a door is closed"
Reference captions: "a door is opened" | "the Avengers theme music" | "lazer gun shooting"

Text Query: "someone flicks a lighter, followed by a firework blast"
Reference captions: "flicks a lighter" | "a firework blast"



Customized-Fantasy (Full)

WAV | Caption | Reference 1 | Reference 2 | Reference 3
fancy_1.wav | a Godzilla roar, and then a faint explosion rumbles in the distance, followed by a small dog barks. | a small dog barks | a Godzilla roar | a faint explosion rumbles in the distance
fancy_2.wav | a child plays with Mario coin sounds, followed by someone cheers. | a child plays with Mario coin sounds | someone cheers |
fancy_3.wav | two minion chatter with each other, then a train horn sounds in the distance, followed by the champagne popped at the end. | a train horn sounds in the distance | Minion chatter wuth each other | the champagne has been popped
fancy_4.wav | someone close the door, then a train passes by and a cat knocks a glass off the table. | a train passes by | a cat knocks a glass off the table | the sound of a closing door
fancy_5.wav | the doorbell rings, and a mother is playing with baby, followed by a Mario jump sound. | a mother is playing with baby | the doorbell rings | a Mario jump sound
fancy_6.wav | an airplane flies overhead, followed by someone running and taking photos. | an airplane flies overhead | someone running | a camera shutter clicks
fancy_7.wav | the microwave dings, then a Pikachu thunderbolt effect plays, and finally someone closes the door. | someone closes the door | a Pikachu thunderbolt effect plays | the microwave dings
fancy_8.wav | footsteps clatter on tile, a T-Rex roars and a cat hisses in surprise. | footsteps clatter on tile | a T-Rex roars | a cat hisses in surprise
fancy_9.wav | horror music plays and a car dring through, followed by a Godzilla roar. | horror music plays | a Godzilla roar | a car dring through
fancy_10.wav | a bird is achirping outside while someone closes the door, followed by a child cries. | a child cries | someone closes the door | a bird chirps outside
fancy_11.wav | it's raining heavily outside, then an umbrella snaps open, followed by a distant car horn honking twice. | heavy rain outside | an umbrella snaps open | a distant car horn honks twice
fancy_12.wav | a baby is smiling happily, a Transformer changes into a car, and start the engine. | a baby smiles happily | a Transformer changes into a car | start the engine
fancy_13.wav | a car drives by fast and a Minion make the ambluence sounds, with small bid chrips. | a small bird chirps | a Minion make the ambluence sounds | a car drives by quickly
fancy_14.wav | the audience starts to clap hands, then a sudden gunshot is heard, followed by a Street Fighter KO sound. | a sudden gunshot | a Street Fighter KO sound | the audience clap hands
fancy_15.wav | a bus engine revs outside with a horn, and a dog barking. | a bus engine revs outside | the bus horn sounds | a dogd barking
fancy_16.wav | Some lazer gun shootingwith the Avengers theme playing, and then a door is closed. | a door is opened | the Avengers theme music is heard | lazer gun shooting
fancy_17.wav | someone uses the keys and closes the door, then a bicycle bell rings outside. | someone unlocks and closes the door | keys jingle | a bicycle bell rings outside
fancy_18.wav | someone running down and knock the door, a faint Poke Ball recall whoosh is heard. | someone locks the front door | a faint Poke Ball recall whoosh is heard | someone running down the driveway
fancy_19.wav | with some elegant music, the waiter pops the champagne and pours it into glasses. | elegant music | the waiter pops the champagne | champagne is poured
fancy_20.wav | someone flicks a lighter, followed by a firework blast. | flicks a lighter | a firework blast |
fancy_21.wav | a gozila is fighting with exciting music, followed by minions talk | a gozila is fighting | minions talk | exciting music
fancy_22.wav | a battle music with some Star Wars laser gun sounds, and then the planet explodes. | battle music | Star Wars laser gun sounds | the planet explodes


Demos on AudioCaps

Each row pairs a text query with the ground-truth audio and the audio generated by DreamAudio.

Text Query: "rustling pigeons coo"
Text Query: "a vehicle engine revving as a crowd of people talk"
Text Query: "water splashing and trickling as wind blows into a microphone while a man speaks over a radio"
Text Query: "a dog barking as a man is talking while birds chirp and wind blows into a microphone"
Text Query: "A sound of vibrating motor"
Text Query: "an aircraft engine running then slowing down after a plastic click"
Text Query: "a man is speaking followed by a tap and motorcycle turning on"
Text Query: "some child speaking in the distant and a toilet flushing"
Text Query: "ocean waves crashing as a man talks in the distance and wind heavily blows into a microphone"
Text Query: "a man gives a speech followed by applause"



Acknowledgement

This research was partly supported by a research scholarship from the China Scholarship Council (CSC), funded by British Broadcasting Corporation Research and Development (BBC R&D), Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 'AI for Sound', and a PhD scholarship from the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.




Page updated on 8 September 2025