AST

This is an unofficial repository for the paper "Speak in the Scene: Diffusion-based Acoustic Scene Transfer toward Immersive Speech Generation" (Interspeech 2024).

Acoustic Scene Transfer (AST) is a novel task in generative speech processing that aims to transfer the acoustic scenes of speech signals to diverse environments.

🔧 Installation

Install directly from GitHub

pip install git+https://github.com/rechawine/Acoustic-Scene-Transfer.git

📖 Usage

  • Generate audio with an audio prompt and a content prompt:
python generate.py --audio_prompt "cloned_acoustic_prompt.wav" --cont_prompt "content_prompt.wav" --desc_guidance_scale 9 --cont_guidance_scale 1

Generated audio is saved to the default output folder, ./outputs.
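To compare several guidance-scale settings on the same prompts, the command above can be generated programmatically. The following is a minimal sketch that only builds command lines and does not run them. It uses only the flags shown in the usage example; the helper name build_commands and the scale values are illustrative assumptions, and whether successive runs overwrite files in ./outputs depends on generate.py, so check its options first.

```python
import itertools
import shlex


def build_commands(audio_prompt, cont_prompt, desc_scales, cont_scales):
    """Return one generate.py command line per (desc, cont) guidance-scale pair."""
    commands = []
    for desc, cont in itertools.product(desc_scales, cont_scales):
        args = [
            "python", "generate.py",
            "--audio_prompt", audio_prompt,
            "--cont_prompt", cont_prompt,
            "--desc_guidance_scale", str(desc),
            "--cont_guidance_scale", str(cont),
        ]
        # shlex.join quotes arguments so the line is safe to paste into a shell
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # Illustrative scale values, not recommendations from the authors
    for cmd in build_commands(
        "cloned_acoustic_prompt.wav", "content_prompt.wav", [5, 7, 9], [1, 7]
    ):
        print(cmd)
```

Printing the commands rather than executing them keeps the sketch independent of the model and lets you review each line before running it.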

💡 Tips for Better Audio Generation

Dual Classifier-Free Guidance Matters!

It's crucial to appropriately adjust the weights for dual classifier-free guidance. We find that this adjustment greatly influences the likelihood of obtaining satisfactory results. Here are some key tips:

  1. Different prompts respond best to different weight settings. Experiment with the weights to find the combination that suits your use case.

  2. Setting both desc_guidance_scale and cont_guidance_scale to 7 is a good starting point.

  3. If the generated audio doesn't align well with the provided content prompt, try decreasing desc_guidance_scale and increasing cont_guidance_scale.

  4. If the generated audio doesn't align well with the provided description prompt, try decreasing cont_guidance_scale and increasing desc_guidance_scale.
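To see why the two scales trade off against each other, it can help to look at how guidance with two conditions is typically combined. The sketch below shows one generic formulation: each condition contributes its own direction away from the unconditional prediction, weighted by its scale. This is an assumption for illustration, not necessarily the exact formula used in this repository or the paper, and the function and argument names are hypothetical.

```python
def dual_cfg(eps_uncond, eps_desc, eps_cont, desc_scale, cont_scale):
    """Combine three noise predictions using two independent guidance weights.

    eps_uncond: prediction with both conditions dropped
    eps_desc:   prediction with only the description (acoustic) condition
    eps_cont:   prediction with only the content condition
    All three are equal-length sequences of floats (a stand-in for tensors).
    """
    return [
        u + desc_scale * (d - u) + cont_scale * (c - u)
        for u, d, c in zip(eps_uncond, eps_desc, eps_cont)
    ]


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    desc_only = [1.0, 1.0]   # differs from uncond in the first component
    cont_only = [0.0, 2.0]   # differs from uncond in the second component
    print(dual_cfg(uncond, desc_only, cont_only, 7, 7))  # [7.0, 8.0]
    print(dual_cfg(uncond, desc_only, cont_only, 9, 1))  # [9.0, 2.0]
```

In this toy example, raising desc_scale and lowering cont_scale moves the result further along the description direction and less along the content direction, which matches the adjustments suggested in tips 3 and 4.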

⚙️ Full List of Options

View the full list of options with the following command:

python generate.py -h

💾 Data

The CSV files below describe the processed dataset used to train AST. They include transcriptions generated with the Whisper model.

Speech Segments

  • source_wav.csv (English/Chinese speech/vocal segments from TTS/SVS datasets)

Non-Speech Segments

  • as_noise.csv (Non-speech segments from AudioSet)
  • source_rir.csv (RIR from Voicefixer training set)
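Before wiring these files into a training config, it can be useful to inspect their columns and a few rows. The following is a minimal sketch using only the Python standard library. The helper name summarize_csv is hypothetical, and no particular column names are assumed; the function simply reports whatever header the file contains.

```python
import csv


def summarize_csv(path, preview_rows=3):
    """Return (column names, number of rows, first few rows) for a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        # fieldnames is None for an empty file, so fall back to an empty list
        columns = list(reader.fieldnames or [])
    return columns, len(rows), rows[:preview_rows]


if __name__ == "__main__":
    # Assumes source_wav.csv has been downloaded to the working directory
    columns, count, preview = summarize_csv("source_wav.csv")
    print("columns:", columns)
    print("rows:", count)
    for row in preview:
        print(row)
```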

🧠 Training

If you wish to train the model yourself, follow these steps:

  1. Configuration Setup (The trickiest part):

    • Navigate to the configs folder to find the necessary configuration files. For example, VoiceLDM-M.yaml is used for training the VoiceLDM-M model in the paper.
    • Prepare the CSV files used for training. You can download them here.
    • Examine the YAML file and adjust the "paths" and "noise_paths" to the root path of your dataset. Also check the CSV files and ensure that each file_path entry matches the actual file path in your dataset.
    • Update the paths for cv_csv_path1, cv_csv_path2, as_speech_en_csv_path, voxceleb_csv_path, as_noise_csv_path, and noise_demand_csv_path in the YAML file. You may leave a path blank if you do not wish to use the corresponding CSV file and training data.
    • You may also adjust other parameters such as the batch size according to your system's capabilities.
  2. Configure Huggingface Accelerate:

    • Set up Accelerate by running:
      accelerate config
      This enables CPU, single-GPU, and multi-GPU training. Follow the on-screen instructions to configure your hardware settings.
  3. Start Training:

    • Launch the training process with the following example command:
      accelerate launch train.py --config config/VoiceLDM-M.yaml
    • Training checkpoints will be automatically saved in the results folder.
  4. Running Inference:

    • Once training is complete, you can perform inference using the trained model by specifying the checkpoint path. For example:
      python generate.py --ckpt_path results/VoiceLDM-M/checkpoints/checkpoint_49/pytorch_model.bin --desc_prompt "She is talking in a park." --cont_prompt "Good morning! How are you feeling today?" 
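For the configuration step, a quick sanity check before launching training is to confirm that every file_path entry in a CSV resolves to a real file under the dataset root. The following is a minimal sketch under the assumption that file_path values are relative to the root set in the YAML "paths" entry; if your CSVs store absolute paths, adjust accordingly. The helper name find_missing_files and the example paths are hypothetical.

```python
import csv
import os


def find_missing_files(csv_path, root, column="file_path"):
    """Return the entries in `column` that do not exist as files under `root`."""
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            rel = row[column]
            # Assumption: entries are relative to the dataset root
            if not os.path.isfile(os.path.join(root, rel)):
                missing.append(rel)
    return missing


if __name__ == "__main__":
    # Placeholder paths; replace with your own CSV and dataset root
    missing = find_missing_files("as_noise.csv", "/path/to/dataset/root")
    print(f"{len(missing)} missing file(s)")
    for rel in missing[:10]:
        print("  ", rel)
```

Running a check like this once per CSV tends to surface path mismatches much earlier than a failed data-loader call partway through training.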

🙏 Acknowledgements

This work would not have been possible without the following repositories:

HuggingFace Diffusers

HuggingFace Transformers

VoiceLDM