This is an unofficial repository for the paper "Speak in the Scene: Diffusion-based Acoustic Scene Transfer toward Immersive Speech Generation", Interspeech 2024.
Acoustic Scene Transfer (AST) is a novel task in generative speech processing that aims to transfer the acoustic scene of a speech signal to diverse environments.
```
pip install git+https://github.com/rechawine/Acoustic-Scene-Transfer.git
```
- Generate audio with an audio prompt and a content prompt:

  ```
  python generate.py --audio_prompt "cloned_acoustic_prompt.wav" --cont_prompt "content_prompt.wav" --desc_guidance_scale 9 --cont_guidance_scale 1
  ```

  Generated audio will be saved in the default output folder `./outputs`.
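If you want to render several content prompts against the same acoustic prompt, a small driver script can loop over `generate.py`. The sketch below is only an illustration: the `.wav` file names are placeholders, and the flags are the ones shown in the command above.

```python
# batch_generate.py -- illustrative sketch; the .wav file names are placeholders.
import subprocess
from pathlib import Path

AUDIO_PROMPT = "cloned_acoustic_prompt.wav"   # acoustic scene reference
CONTENT_PROMPTS = ["content_prompt.wav"]      # replace with your own content files

for cont in CONTENT_PROMPTS:
    # Invoke the repository's generation CLI for each content prompt.
    subprocess.run(
        [
            "python", "generate.py",
            "--audio_prompt", AUDIO_PROMPT,
            "--cont_prompt", cont,
            "--desc_guidance_scale", "9",
            "--cont_guidance_scale", "1",
        ],
        check=True,
    )
    print("done:", cont, "-> outputs saved under", Path("outputs").resolve())
```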
It's crucial to appropriately adjust the weights for dual classifier-free guidance. We find that this adjustment greatly influences the likelihood of obtaining satisfactory results. Here are some key tips:
- Some weight settings are more effective for certain prompts than others. Experiment with the weights to find the ideal combination that suits your specific use case.
- A value of 7 for both `desc_guidance_scale` and `cont_guidance_scale` is a good starting point.
- If you feel that the generated audio doesn't align well with the provided content prompt, try decreasing the `desc_guidance_scale` and increasing the `cont_guidance_scale`.
- If you feel that the generated audio doesn't align well with the provided description prompt, try decreasing the `cont_guidance_scale` and increasing the `desc_guidance_scale`.
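For intuition on how the two weights interact, dual classifier-free guidance combines an unconditional prediction with condition-specific ones, each scaled by its own weight. The snippet below is a minimal sketch of one common formulation and is not taken from this repository's code; the `eps_*` tensors are placeholders for the model's noise predictions.

```python
import torch

def dual_cfg(eps_uncond, eps_desc, eps_cont, desc_scale=7.0, cont_scale=7.0):
    """Blend noise predictions with two independent guidance weights.

    eps_uncond : prediction with both conditions dropped
    eps_desc   : prediction conditioned on the description/acoustic prompt only
    eps_cont   : prediction conditioned on the content prompt only

    This mirrors the standard CFG update eps_uncond + w * (eps_cond - eps_uncond),
    applied once per condition; the actual model may combine terms differently.
    """
    return (
        eps_uncond
        + desc_scale * (eps_desc - eps_uncond)
        + cont_scale * (eps_cont - eps_uncond)
    )

# Toy usage with random tensors standing in for real predictions.
shape = (1, 8, 16, 16)
guided = dual_cfg(torch.randn(shape), torch.randn(shape), torch.randn(shape))
```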
View the full list of options with the following command:

```
python generate.py -h
```
The CSV files for the processed dataset used to train AST are provided below. These files include transcriptions generated with the Whisper model.
- `source_wav.csv` (English/Chinese speech/vocal segments from TTS/SVS datasets)
- `as_noise.csv` (non-speech segments from AudioSet)
- `source_rir.csv` (RIRs from the Voicefixer training set)
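Before training, it can help to confirm that the entries in these CSV files actually resolve on your machine. The sketch below assumes each CSV has a `file_path` column (as mentioned in the training steps below) and that `DATA_ROOT` is the dataset root you set in the YAML config; both are placeholders here.

```python
import pandas as pd
from pathlib import Path

DATA_ROOT = Path("/path/to/your/dataset")  # placeholder dataset root

for csv_name in ["source_wav.csv", "as_noise.csv", "source_rir.csv"]:
    df = pd.read_csv(csv_name)
    # Collect rows whose file_path does not exist under the dataset root.
    missing = [p for p in df["file_path"] if not (DATA_ROOT / p).exists()]
    print(f"{csv_name}: {len(df)} rows, {len(missing)} missing files")
```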
If you wish to train the model by yourself, follow these steps:
- Configuration Setup (the trickiest part):
  - Navigate to the `configs` folder to find the necessary configuration files. For example, `VoiceLDM-M.yaml` is used for training the VoiceLDM-M model in the paper.
  - Prepare the CSV files used for training. You can download them here.
  - Examine the YAML file and adjust `"paths"` and `"noise_paths"` to the root path of your dataset. Also, take a look at the CSV files and ensure that the `file_path` entries match the actual file path names in your dataset.
  - Update the paths for `cv_csv_path1`, `cv_csv_path2`, `as_speech_en_csv_path`, `voxceleb_csv_path`, `as_noise_csv_path`, and `noise_demand_csv_path` in the YAML file. You may leave any of these blank if you do not wish to use the corresponding CSV file and training data (a quick config check is sketched after these steps).
  - You may also adjust other parameters, such as the batch size, according to your system's capabilities.
- Configure Huggingface Accelerate:
  - Set up Accelerate by running:

    ```
    accelerate config
    ```

    This will enable support for CPU, single-GPU, and multi-GPU training. Follow the on-screen instructions to configure your hardware settings.
- Start Training:
  - Launch the training process with the following example command:

    ```
    accelerate launch train.py --config config/VoiceLDM-M.yaml
    ```

  - Training checkpoints will be automatically saved in the `results` folder.
- Running Inference:
  - Once training is complete, you can perform inference using the trained model by specifying the checkpoint path. For example:

    ```
    python generate.py --ckpt_path results/VoiceLDM-M/checkpoints/checkpoint_49/pytorch_model.bin --desc_prompt "She is talking in a park." --cont_prompt "Good morning! How are you feeling today?"
    ```
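As mentioned in the configuration step above, the snippet below is a rough sketch of how you might sanity-check a training YAML before launching a run. The key names are the ones listed in this README (`paths`, `noise_paths`, and the `*_csv_path` entries); the exact layout of the config file is an assumption and has not been verified against the code.

```python
import yaml
from pathlib import Path

CSV_KEYS = [
    "cv_csv_path1", "cv_csv_path2", "as_speech_en_csv_path",
    "voxceleb_csv_path", "as_noise_csv_path", "noise_demand_csv_path",
]

with open("configs/VoiceLDM-M.yaml") as f:
    cfg = yaml.safe_load(f)

# Dataset root paths referenced by the data loaders.
for key in ("paths", "noise_paths"):
    print(key, "->", cfg.get(key))

# CSV paths; blank entries are allowed if that data source is unused.
for key in CSV_KEYS:
    value = cfg.get(key)
    if not value:
        print(f"{key}: not set (skipped)")
    elif not Path(value).exists():
        print(f"{key}: WARNING, file not found at {value}")
    else:
        print(f"{key}: ok")
```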
This work would not have been possible without the following repositories: