DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

Yi Yuan¹, Xubo Liu¹, Haohe Liu¹, Xiyuan Kang¹, Zhuo Chen²,
Yuxuan Wang², Mark D. Plumbley¹, Wenwu Wang¹

¹CVSSP, University of Surrey, Guildford, UK ²Bytedance, Shenzhen, China

Abstract

With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short of precisely controlling the fine-grained acoustic characteristics of specific sounds. As a result, users who need specific sound content may find it challenging to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework that enables the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the customized systems. The experiments show that the proposed model, DreamAudio, generates audio samples that are highly consistent with the customized audio features and well aligned with the input text prompts. Furthermore, DreamAudio offers comparable performance in general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as a benchmark for customized generation tasks.

Customized Text-to-Audio Generation

We are given only a few audio-language reference pairs $\{(\mathbf{a}_1, \mathbf{t}_1), (\mathbf{a}_2, \mathbf{t}_2), \dots, (\mathbf{a}_k, \mathbf{t}_k)\}$, where each reference audio clip $\mathbf{a}_i$ is associated with a caption $\mathbf{t}_i$ for $i = 1, \dots, k$. These reference concepts can be either user-defined or automatically retrieved audio-caption pairs. Formally, the task is to generate the target audio $\mathbf{a}$ conditioned on the input text prompt $c_t$ and the reference audio clips $\mathbf{a}_1, \dots, \mathbf{a}_k$ together with their corresponding captions $\mathbf{t}_1, \dots, \mathbf{t}_k$.
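To make the inputs of the task definition above concrete, the following is a minimal Python sketch of how a CTTA request could be represented: one text prompt plus a few reference audio-caption pairs. The names `ReferencePair`, `CTTARequest`, and `build_request` are hypothetical and chosen only for illustration; they are not part of DreamAudio, and waveforms are stood in for by plain lists of float samples.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ReferencePair:
    """One reference concept: an audio clip and its caption (hypothetical type)."""
    audio: Tuple[float, ...]  # mono waveform samples; a stand-in for real audio
    caption: str


@dataclass(frozen=True)
class CTTARequest:
    """Inputs to customized text-to-audio generation (hypothetical type)."""
    prompt: str                      # the input text prompt
    references: List[ReferencePair]  # the few reference audio-caption pairs

    def __post_init__(self) -> None:
        # The task assumes a non-empty prompt and at least one reference pair.
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if not self.references:
            raise ValueError("at least one reference pair is required")
        for ref in self.references:
            if not ref.caption.strip():
                raise ValueError("every reference clip needs a caption")


def build_request(
    prompt: str,
    audios: Sequence[Sequence[float]],
    captions: Sequence[str],
) -> CTTARequest:
    """Pair each reference clip with its caption and bundle them with the prompt."""
    if len(audios) != len(captions):
        raise ValueError("each reference audio needs exactly one caption")
    pairs = [ReferencePair(tuple(a), c) for a, c in zip(audios, captions)]
    return CTTARequest(prompt=prompt, references=pairs)


if __name__ == "__main__":
    # Toy example using captions from the Customized-Fantasy demos;
    # the sample values are placeholders, not real audio.
    request = build_request(
        "someone flicks a lighter, followed by a firework blast",
        audios=[[0.0, 0.1, -0.1], [0.2, -0.2, 0.0]],
        captions=["flicks a lighter", "a firework blast"],
    )
    print(len(request.references), "reference pairs for:", request.prompt)
```

A generation model would consume such a request and return the target audio. That step is deliberately left out here, since its interface is not described on this page.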

Demos on Customized-Concatenation

Text Query	DreamAudio	Tango	ReAudioLDM	AudioGen	MakeAnAudio	Reference 1	Reference 2	Reference 3	Reference 4
"a slap and background noise followed by a bell rings in the background followed by a sound effect and another sound effect"						"a slap and background noise"	"a bell rings in the background"	"a sound effect"	"another sound effect"
"a woman and man are speaking, and a slapping sound is heard along with human sounds and mechanisms"						"a woman and man are speaking, and a slapping sound is heard"

Text Query	Reference 1	Reference 2	Reference 3
"men speaking, a horse neighing, and music, then and are heard followed by boing sounds"	"men speaking and a horse neighing"	"Music"	"Boing sounds"
"the sound of drilling then firework rocket is being fired ,followed by water is hitting a sauna heater"	"the sound of drilling"	"firework rocket is being fired"
"a guitar is playing a note , follow by sound of engine running and people drilling, and then a bike is breaking on concrete"	"a guitar is playing a note"	"sound of engine running"
"children playing then pig bleating then a car horn"	"pig bleating"

Demos on Customized-Overlay

Text Query	DreamAudio	Tango	ReAudioLDM	AudioGen	MakeAnAudio	Reference Base	Reference Front	Reference Back
"a door is being unlocked with a key then an alarm clock chirps and mechanisms sound at the sametime, sound of airplane engine"						"an alarm clock chirps and mechanisms sound"	"a door is being unlocked with a key"	"sound of airplane engine"
"a cymbal is crashing then a woman is singing with noise"						"a woman is singing with noise"	"a cymbal is crashing"

Text Query	Reference Base	Reference Front	Reference Back
"people are drilling in the background, men speak, mechanisms make noise, then a sound is being played"	"men speak, mechanisms make noise"	"people are drilling"	"a sound is being played"
"Instant explosions with food processor blends and people speaking, birds chirping and a child is speaking in the background, someone is hitting a beer bottle with a hammer"	"food processor blends and people speaking, birds chirping and a child is speaking"	"instant explosions"	"someone is hitting a beer bottle with a hammer"
"amen is being repeated in the background, rain is hitting a surface with some sounds of dog bark"	"rain is hitting a surface"	"amen is being repeated"	"some sounds of dog bark"
"jazzy is played while chewing and background noise are heard then dog barking"	"chewing and background noise are heard"	"jazzy is played"	"dog barking"

Demos on Customized-Fantasy

Text Query	DreamAudio	Reference 1	Reference 2	Reference 3
"a gozila is fighting with exciting music, followed by minions talk"		"a gozila is fighting"	"minion talk"	"exciting music"
"with some elegant music, the waiter pops the champagne and pours it into glasses"		"elegant music"	"the waiter pops the champagne"	"champagne is poured"

Text Query	Reference 1	Reference 2	Reference 3
"someone close the door, then a train passes by and a cat knocks a glass off the table"	"a train passes by"	"cat knocks a glass off the table"	"closing door"
"the microwave dings, then a Pikachu thunderbolt effect plays, and finally someone closes the door"	"someone closes the door"	"a Pikachu thunderbolt effect plays"	"the microwave dings"
"horror music plays and a car dring through, followed by a Godzilla roar"	"horror music plays"	"a Godzilla roar"	"a car dring through"
"a car drives by fast and a Minion make the ambluence sounds, with small bid chrips"	"a small bird chirps"	"a Minion make the ambluence sounds"	"a car drives by quickly"
"the audience starts to clap hands, then a sudden gunshot is heard, followed by a Street Fighter KO sound"	"a sudden gunshot"	"a Street Fighter KO sound"	"the audience clap hands"
"a bus engine revs outside with a horn, and a dog barking"	"a bus engine revs outside"	"the bus horn sounds"	"dog barking"
"some lazer gun shootingwith the Avengers theme playing, and then a door is closed"	"a door is opened"	"the Avengers theme music"	"lazer gun shooting"
"someone flicks a lighter, followed by a firework blast"	"flicks a lighter"	"a firework blast"

Customized-Fantasy (Full)

WAV	Caption	Reference 1	Reference 2	Reference 3
fancy_1.wav	a Godzilla roar, and then a faint explosion rumbles in the distance, followed by a small dog barks.	a small dog barks	a Godzilla roar	a faint explosion rumbles in the distance
fancy_2.wav	a child plays with Mario coin sounds, followed by someone cheers.	a child plays with	Mario coin sounds	someone cheers
fancy_3.wav	two minion chatter with each other, then a train horn sounds in the distance, followed by the champagne popped at the end.	a train horn sounds in the distance	Minion chatter wuth each other	the champagne has been popped
fancy_4.wav	someone close the door, then a train passes by and a cat knocks a glass off the table.	a train passes by	a cat knocks a glass off the table	the sound of a closing door
fancy_5.wav	the doorbell rings, and a mother is playing with baby, followed by a Mario jump sound.	a mother is playing with baby	the doorbell rings	a Mario jump sound
fancy_6.wav	an airplane flies overhead, followed by someone running and taking photos.	an airplane flies overhead	someone running	a camera shutter clicks
fancy_7.wav	the microwave dings, then a Pikachu thunderbolt effect plays, and finally someone closes the door.	someone closes the door	a Pikachu thunderbolt effect plays	the microwave dings
fancy_8.wav	footsteps clatter on tile, a T-Rex roars and a cat hisses in surprise.	footsteps clatter on tile	a T-Rex roars	a cat hisses in surprise
fancy_9.wav	horror music plays and a car dring through, followed by a Godzilla roar.	horror music plays	a Godzilla roar	a car dring through
fancy_10.wav	a bird is achirping outside while someone closes the door, followed by a child cries.	a child cries	someone closes the door	a bird chirps outside
fancy_11.wav	it's raining heavily outside, then an umbrella snaps open, followed by a distant car horn honking twice.	heavy rain outside	an umbrella snaps open	a distant car horn honks twice
fancy_12.wav	a baby is smiling happily, a Transformer changes into a car, and start the engine.	a baby smiles happily	a Transformer changes into a car	start the engine
fancy_13.wav	a car drives by fast and a Minion make the ambluence sounds, with small bid chrips.	a small bird chirps	a Minion make the ambluence sounds	a car drives by quickly
fancy_14.wav	the audience starts to clap hands, then a sudden gunshot is heard, followed by a Street Fighter KO sound.	a sudden gunshot	a Street Fighter KO sound	the audience clap hands
fancy_15.wav	a bus engine revs outside with a horn, and a dog barking.	a bus engine revs outside	the bus horn sounds	a dogd barking
fancy_16.wav	Some lazer gun shootingwith the Avengers theme playing, and then a door is closed.	a door is opened	the Avengers theme music is heard	lazer gun shooting
fancy_17.wav	someone uses the keys and closes the door, then a bicycle bell rings outside.	someone unlocks and closes the door	keys jingle	a bicycle bell rings outside
fancy_18.wav	someone running down and knock the door, a faint Poke Ball recall whoosh is heard.	someone locks the front door	a faint Poke Ball recall whoosh is heard	someone running down the driveway
fancy_19.wav	with some elegant music, the waiter pops the champagne and pours it into glasses.	elegant music	the waiter pops the champagne	champagne is poured
fancy_20.wav	someone flicks a lighter, followed by a firework blast.	flicks a lighter	a firework blast
fancy_21.wav	a gozila is fighting with exciting music, followed by minions talk	a gozila is fighting	minions talk	exciting music
fancy_22.wav	a battle music with some Star Wars laser gun sounds, and then the planet explodes.	battle music	Star Wars laser gun sounds	the planet explodes

Demos on AudioCaps

Text Query	Ground Truth	DreamAudio
"rustling pigeons coo"
"a vehicle engine revving as a crowd of people talk"

Text Query	Ground Truth	DreamAudio
"water splashing and trickling as wind blows into a microphone while a man speaks over a radio"
"a dog barking as a man is talking while birds chirp and wind blows into a microphone"
"A sound of vibrating motor"
"an aircraft engine running then slowing down after a plastic click"
"a man is speaking followed by a tap and motorcycle turning on"
"some child speaking in the distant and a toilet flushing"
"ocean waves crashing as a man talks in the distance and wind heavily blows into a microphone"
"a man gives a speech followed by applause"

Acknowledgement

This research was partly supported by a research scholarship from the China Scholarship Council (CSC), funded by British Broadcasting Corporation Research and Development (BBC R&D), Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 'AI for Sound', and a PhD scholarship from the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.

Page updated on 8 September 2025
