Update Info for MaskGCT and Vevo #387

Merged: 4 commits, Jan 23, 2025
3 changes: 2 additions & 1 deletion README.md
@@ -34,6 +34,7 @@
 In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent metrics in generation tasks. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building **large-scale datasets** for speech synthesis.

 ## 🚀 News
+- **2025/01/23**: [MaskGCT](https://arxiv.org/abs/2409.00750) and [Vevo](https://openreview.net/pdf?id=anQDiQZhDP) got accepted by ICLR 2025! 🎉
 - **2024/12/22**: We release the reproduction of **Vevo**, a zero-shot voice imitation framework with controllable timbre and style. Vevo can be applied into a series of speech generation tasks, including VC, TTS, AC, and more. The released pre-trained models are trained on [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieve SOTA zero-shot VC performance. [![arXiv](https://img.shields.io/badge/OpenReview-Paper-COLOR.svg)](https://openreview.net/pdf?id=anQDiQZhDP) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/Vevo) [![WebPage](https://img.shields.io/badge/WebPage-Demo-red)](https://versavoice.github.io/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/vc/vevo/README.md)
 - **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieves SOTA zero-shot TTS performance. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2409.00750) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/maskgct) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/maskgct) [![ModelScope](https://img.shields.io/badge/ModelScope-space-purple)](https://modelscope.cn/studios/amphion/maskgct) [![ModelScope](https://img.shields.io/badge/ModelScope-model-cyan)](https://modelscope.cn/models/amphion/MaskGCT) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/tts/maskgct/README.md)
 - **2024/09/01**: [Amphion](https://arxiv.org/abs/2312.09911), [Emilia](https://arxiv.org/abs/2407.05361) and [DSFF-SVC](https://arxiv.org/abs/2310.11160) got accepted by IEEE SLT 2024! 🤗
@@ -184,7 +185,7 @@ Amphion is under the [MIT License](LICENSE). It is free for both research and co

 ```bibtex
 @inproceedings{amphion,
-author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
+author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
 title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
 booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
 year={2024}
12 changes: 3 additions & 9 deletions models/tts/debatts/try_inference_small_samples.py
@@ -306,12 +306,8 @@ def semantic2acoustic(combine_semantic_code, acoustic_code):


 device = torch.device("cuda:0")
-cfg_soundstorm_1layer = load_config(
-    "./s2a_egs/s2a_debatts_1layer.json"
-)
-cfg_soundstorm_full = load_config(
-    "./s2a_egs/s2a_debatts_full.json"
-)
+cfg_soundstorm_1layer = load_config("./s2a_egs/s2a_debatts_1layer.json")
+cfg_soundstorm_full = load_config("./s2a_egs/s2a_debatts_full.json")

 soundstorm_1layer = build_soundstorm(cfg_soundstorm_1layer, device)
 soundstorm_full = build_soundstorm(cfg_soundstorm_full, device)
@@ -333,9 +329,7 @@ def semantic2acoustic(combine_semantic_code, acoustic_code):
 safetensors.torch.load_model(soundstorm_1layer, soundstorm_1layer_path)
 safetensors.torch.load_model(soundstorm_full, soundstorm_full_path)

-t2s_cfg = load_config(
-    "./t2s_egs/t2s_debatts.json"
-)
+t2s_cfg = load_config("./t2s_egs/t2s_debatts.json")
 t2s_model_new = build_t2s_model_new(t2s_cfg, device)
 t2s_model_new_ckpt_path = "./t2s_model/model.safetensors"
 safetensors.torch.load_model(t2s_model_new, t2s_model_new_ckpt_path)
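The script resolves every config and checkpoint path relative to the current working directory, so it presumably has to be launched from the directory that contains `s2a_egs/`, `t2s_egs/`, and `t2s_model/`. A minimal pre-flight sketch, not part of this PR: `REQUIRED_FILES` and `missing_files` are hypothetical names, and the list holds only the four paths visible in these two hunks (the `soundstorm_*_path` values are not shown here, so they are left out).

```python
from pathlib import Path

# Only the paths visible in the hunks above. The soundstorm checkpoint paths
# are not shown in this diff, so they are deliberately omitted.
REQUIRED_FILES = [
    "s2a_egs/s2a_debatts_1layer.json",
    "s2a_egs/s2a_debatts_full.json",
    "t2s_egs/t2s_debatts.json",
    "t2s_model/model.safetensors",
]


def missing_files(required=REQUIRED_FILES, root=".", is_file=None):
    """Return the entries of `required` that are not files under `root`.

    `is_file` is a hypothetical hook for swapping in a different existence
    check; by default the real file system is queried.
    """
    base = Path(root)
    check = is_file or (lambda path: path.is_file())
    return [rel for rel in required if not check(base / rel)]
```

Calling `missing_files()` before the `load_config` and `load_model` lines would report an incomplete setup as a list of paths instead of a traceback from deep inside a loader.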
7 changes: 4 additions & 3 deletions models/tts/debatts/utils/g2p_new/cleaners.py
@@ -6,10 +6,11 @@
 import re
 from utils.g2p_new.mandarin import chinese_to_ipa

+
 def cjekfd_cleaners(text, language, text_tokenizers):

-    if language == 'zh':
-        return chinese_to_ipa(text, text_tokenizers['zh'])
+    if language == "zh":
+        return chinese_to_ipa(text, text_tokenizers["zh"])
     else:
-        raise Exception('Unknown or Not supported yet language: %s' % language)
+        raise Exception("Unknown or Not supported yet language: %s" % language)
         return None
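The cleaner dispatches on a language code and raises for anything other than `zh`. If more languages are added later, a lookup table is one way to keep that dispatch flat. The sketch below shows that alternative pattern and is not the code in this PR: `fake_chinese_to_ipa`, `CLEANERS`, `is_supported`, and `clean_text` are hypothetical names, and the stub stands in for the real `chinese_to_ipa` so the example runs on its own. Raising `ValueError` instead of a bare `Exception` is a choice made for the sketch.

```python
def fake_chinese_to_ipa(text, tokenizer):
    # Hypothetical stand-in for utils.g2p_new.mandarin.chinese_to_ipa;
    # it just applies the tokenizer so the example is self-contained.
    return tokenizer(text)


# Maps a language code to the function that cleans text in that language.
CLEANERS = {"zh": fake_chinese_to_ipa}


def is_supported(language):
    """Return True if a cleaner is registered for `language`."""
    return language in CLEANERS


def clean_text(text, language, text_tokenizers):
    """Look up the cleaner for `language` and apply it to `text`."""
    if not is_supported(language):
        raise ValueError("Unknown or unsupported language: %s" % language)
    return CLEANERS[language](text, text_tokenizers[language])
```

With this layout, supporting another language means adding one entry to `CLEANERS` rather than another `elif` branch.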
69 changes: 35 additions & 34 deletions models/tts/maskgct/README.md
@@ -132,12 +132,12 @@ Running this will automatically download the pretrained model from HuggingFace a
 We provide the following pretrained checkpoints:


-| Model Name | Description |
-|-------------------|-------------|
-| [Semantic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/semantic_codec) | Converting speech to semantic tokens. |
-| [Acoustic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/acoustic_codec) | Converting speech to acoustic tokens and reconstructing waveform from acoustic tokens. |
-| [MaskGCT-T2S](https://huggingface.co/amphion/MaskGCT/tree/main/t2s_model) | Predicting semantic tokens with text and prompt semantic tokens. |
-| [MaskGCT-S2A](https://huggingface.co/amphion/MaskGCT/tree/main/s2a_model) | Predicts acoustic tokens conditioned on semantic tokens. |
+| Model Name | Description |
+| --------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- |
+| [Semantic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/semantic_codec) | Converting speech to semantic tokens. |
+| [Acoustic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/acoustic_codec) | Converting speech to acoustic tokens and reconstructing waveform from acoustic tokens. |
+| [MaskGCT-T2S](https://huggingface.co/amphion/MaskGCT/tree/main/t2s_model) | Predicting semantic tokens with text and prompt semantic tokens. |
+| [MaskGCT-S2A](https://huggingface.co/amphion/MaskGCT/tree/main/s2a_model) | Predicts acoustic tokens conditioned on semantic tokens. |

 You can download all pretrained checkpoints from [HuggingFace](https://huggingface.co/amphion/MaskGCT/tree/main) or use huggingface API.
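For the "huggingface API" route mentioned in that line, one option is `snapshot_download` from the `huggingface_hub` package. A minimal sketch under assumptions: the repository id and the four sub-folder names are taken from the links in the table, while `allow_patterns_for`, `download_checkpoints`, and the `./ckpt` target directory are hypothetical. The file names inside each folder are not listed in this hunk, so the sketch fetches whole folders.

```python
REPO_ID = "amphion/MaskGCT"  # repository named in the table links

# Sub-folders named in the table links.
CHECKPOINT_DIRS = ("semantic_codec", "acoustic_codec", "t2s_model", "s2a_model")


def allow_patterns_for(dirs=CHECKPOINT_DIRS):
    """Build glob patterns that select every file under the given folders."""
    return [f"{name}/*" for name in dirs]


def download_checkpoints(local_dir="./ckpt", dirs=CHECKPOINT_DIRS):
    """Download the selected folders and return the local snapshot path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=allow_patterns_for(dirs),
        local_dir=local_dir,
    )
```

Passing a shorter `dirs` tuple, for example `("t2s_model",)`, limits the download to a single component.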

@@ -165,41 +165,42 @@ We use the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) data

 ## Evaluation Results of MaskGCT

-| System | SIM-O↑ | WER↓ | FSD↓ | SMOS↑ | CMOS↑ |
-| :--- | :---: | :---: | :---: | :---: | :---: |
-| | | **LibriSpeech test-clean** |
-| Ground Truth | 0.68 | 1.94 | | 4.05±0.12 | 0.00 |
-| VALL-E | 0.50 | 5.90 | - | 3.47 ±0.26 | -0.52±0.22 |
-| VoiceBox | 0.64 | 2.03 | 0.762 | 3.80±0.17 | -0.41±0.13 |
-| NaturalSpeech 3 | 0.67 | 1.94 | 0.786 | 4.26±0.10 | 0.16±0.14 |
-| VoiceCraft | 0.45 | 4.68 | 0.981 | 3.52±0.21 | -0.33 ±0.16 |
-| XTTS-v2 | 0.51 | 4.20 | 0.945 | 3.02±0.22 | -0.98 ±0.19 |
-| MaskGCT | 0.687(0.723) | 2.634(1.976) | 0.886 | 4.27±0.14 | 0.10±0.16 |
-| MaskGCT(gt length) | 0.697 | 2.012 | 0.746 | 4.33±0.11 | 0.13±0.13 |
-| | | **SeedTTS test-en** |
-| Ground Truth | 0.730 | 2.143 | | 3.92±0.15 | 0.00 |
-| CosyVoice | 0.643 | 4.079 | 0.316 | 3.52±0.17 | -0.41 ±0.18 |
-| XTTS-v2 | 0.463 | 3.248 | 0.484 | 3.15±0.22 | -0.86±0.19 |
-| VoiceCraft | 0.470 | 7.556 | 0.226 | 3.18±0.20 | -1.08 ±0.15 |
-| MaskGCT | 0.717(0.760) | 2.623(1.283) | 0.188 | 4.24 ±0.12 | 0.03 ±0.14 |
-| MaskGCT(gt length) | 0.728 | 2.466 | 0.159 | 4.13 ±0.17 | 0.12 ±0.15 |
-| | | **SeedTTS test-zh** |
-| Ground Truth | 0.750 | 1.254 | | 3.86 ±0.17 | 0.00 |
-| CosyVoice | 0.750 | 4.089 | 0.276 | 3.54 ±0.12 | -0.45 ±0.15 |
-| XTTS-v2 | 0.635 | 2.876 | 0.413 | 2.95 ±0.18 | -0.81 ±0.22 |
-| MaskGCT | 0.774(0.805) | 2.273(0.843) | 0.106 | 4.09 ±0.12 | 0.05 ±0.17 |
-| MaskGCT(gt length) | 0.777 | 2.183 | 0.101 | 4.11 ±0.12 | 0.08±0.18 |
+| System | SIM-O↑ | WER↓ | FSD↓ | SMOS↑ | CMOS↑ |
+| :----------------- | :----------: | :------------------------: | :---: | :--------: | :---------: |
+| | | **LibriSpeech test-clean** |
+| Ground Truth | 0.68 | 1.94 | | 4.05±0.12 | 0.00 |
+| VALL-E | 0.50 | 5.90 | - | 3.47 ±0.26 | -0.52±0.22 |
+| VoiceBox | 0.64 | 2.03 | 0.762 | 3.80±0.17 | -0.41±0.13 |
+| NaturalSpeech 3 | 0.67 | 1.94 | 0.786 | 4.26±0.10 | 0.16±0.14 |
+| VoiceCraft | 0.45 | 4.68 | 0.981 | 3.52±0.21 | -0.33 ±0.16 |
+| XTTS-v2 | 0.51 | 4.20 | 0.945 | 3.02±0.22 | -0.98 ±0.19 |
+| MaskGCT | 0.687(0.723) | 2.634(1.976) | 0.886 | 4.27±0.14 | 0.10±0.16 |
+| MaskGCT(gt length) | 0.697 | 2.012 | 0.746 | 4.33±0.11 | 0.13±0.13 |
+| | | **SeedTTS test-en** |
+| Ground Truth | 0.730 | 2.143 | | 3.92±0.15 | 0.00 |
+| CosyVoice | 0.643 | 4.079 | 0.316 | 3.52±0.17 | -0.41 ±0.18 |
+| XTTS-v2 | 0.463 | 3.248 | 0.484 | 3.15±0.22 | -0.86±0.19 |
+| VoiceCraft | 0.470 | 7.556 | 0.226 | 3.18±0.20 | -1.08 ±0.15 |
+| MaskGCT | 0.717(0.760) | 2.623(1.283) | 0.188 | 4.24 ±0.12 | 0.03 ±0.14 |
+| MaskGCT(gt length) | 0.728 | 2.466 | 0.159 | 4.13 ±0.17 | 0.12 ±0.15 |
+| | | **SeedTTS test-zh** |
+| Ground Truth | 0.750 | 1.254 | | 3.86 ±0.17 | 0.00 |
+| CosyVoice | 0.750 | 4.089 | 0.276 | 3.54 ±0.12 | -0.45 ±0.15 |
+| XTTS-v2 | 0.635 | 2.876 | 0.413 | 2.95 ±0.18 | -0.81 ±0.22 |
+| MaskGCT | 0.774(0.805) | 2.273(0.843) | 0.106 | 4.09 ±0.12 | 0.05 ±0.17 |
+| MaskGCT(gt length) | 0.777 | 2.183 | 0.101 | 4.11 ±0.12 | 0.08±0.18 |

 ## Citations

 If you use MaskGCT in your research, please cite the following paper:

 ```bibtex
-@article{wang2024maskgct,
-title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
+@inproceedings{wang2024maskgct,
 author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
-journal={arXiv preprint arXiv:2409.00750},
-year={2024}
+title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
+booktitle = {{ICLR}},
+publisher = {OpenReview.net},
+year = {2025}
 }

 @inproceedings{amphion,
12 changes: 7 additions & 5 deletions models/vc/vevo/README.md
@@ -85,14 +85,16 @@ Running this will automatically download the pretrained model from HuggingFace a
 If you use Vevo in your research, please cite the following papers:

 ```bibtex
-@article{vevo,
-title={Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement},
-journal={OpenReview},
-year={2024}
+@inproceedings{vevo,
+author = {Xueyao Zhang and Xiaohui Zhang and Kainan Peng and Zhenyu Tang and Vimal Manohar and Yingru Liu and Jeff Hwang and Dangna Li and Yuhao Wang and Julian Chan and Yuan Huang and Zhizheng Wu and Mingbo Ma},
+title = {Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement},
+booktitle = {{ICLR}},
+publisher = {OpenReview.net},
+year = {2025}
 }

 @inproceedings{amphion,
-author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
+author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
 title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
 booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
 year={2024}