Co-authored-by: happen <happenmass@gmail.com>
1 parent 606ecbc, commit 1bcdaf4
Showing 49 changed files with 2,862 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
# Scalar Quantize Audio Codec | ||
|
||
([Simplified Chinese](./README_zh.md) | English) | ||
|
||
[License](LICENSE) | ||
|
||
Scalar Quantize Audio Codec is a lightweight audio codec that uses scalar quantization (SQ) for efficient audio compression and reconstruction. The project aims to give developers a simple, extensible audio codec. The code is a modification of [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec) that replaces the original project's VQ module with SQ; the algorithm follows the [SimpleSpeech-2](https://arxiv.org/abs/2408.13893) paper. | ||
|
||
## Table of Contents | ||
|
||
- [Features](#features) | ||
- [Installation](#installation) | ||
- [Usage](#usage) | ||
- [Roadmap](#roadmap) | ||
- [Contribution Guide](#contribution-guide) | ||
- [License](#license) | ||
- [Acknowledgements](#acknowledgements) | ||
|
||
## Features | ||
|
||
- Implements audio compression using scalar quantization algorithms | ||
- Suited to Diffusion / Flow Matching audio generation, easing the generation burden and improving output quality | ||
|
||
## Installation | ||
|
||
Follow these steps to install and use this project: | ||
|
||
```bash | ||
git clone https://github.com/jingzhunxue/flow_mirror.git | ||
cd flow_mirror/codec/sqcodec | ||
pip install -r requirements.txt | ||
``` | ||
|
||
## Usage | ||
|
||
Coming Soon... | ||
|
||
## Roadmap | ||
|
||
We are committed to continuously improving and expanding Scalar Quantize Audio Codec to provide more powerful and flexible audio encoding and decoding solutions. Here is our development roadmap: | ||
|
||
### October 2024 | ||
|
||
#### 1.0 - Initial Release | ||
|
||
- [x] Complete the basic scalar quantization codec implementation and open-source the code | ||
- [ ] Release weights pre-trained on 120k hours of mixed Chinese and English data | ||
- [ ] Publish evaluation results and evaluation code | ||
- [ ] Provide basic documentation and example code | ||
|
||
## Contribution Guide | ||
|
||
We welcome contributions of all kinds! If you have good ideas or find any issues, please submit an [Issue](https://github.com/jingzhunxue/flow_mirror/issues) or a [Pull Request](https://github.com/jingzhunxue/flow_mirror/pulls). | ||
|
||
## License | ||
|
||
This project is licensed under the [MIT License](LICENSE). | ||
|
||
## Acknowledgements | ||
|
||
Special thanks to the following projects and papers for their inspiration and support: | ||
|
||
- [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec) | ||
- [SimpleSpeech-2](https://arxiv.org/abs/2408.13893) |
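
The README describes replacing the VQ module with scalar quantization. As a rough illustration only, the sketch below uses a common formulation: bound each latent value with `tanh`, then round it to a uniform grid. The function name `scalar_quantize` and the `scale` value are assumptions made for this example and are not taken from this commit; the project's actual code may differ, and the training-time gradient handling for the rounding step is omitted.

```python
import math


def scalar_quantize(values, scale=4):
    """Quantize each value independently onto a uniform grid in [-1, 1].

    `scale` is an illustrative assumption: the grid has 2 * scale + 1 points
    spaced 1 / scale apart. No codebook lookup is involved.
    """
    quantized = []
    for v in values:
        bounded = math.tanh(v)                          # squash into (-1, 1)
        quantized.append(round(bounded * scale) / scale)  # snap to nearest grid point
    return quantized


latent = [-2.0, -0.3, 0.0, 0.4, 3.0]
codes = scalar_quantize(latent)
print(codes)
```

Unlike vector quantization, each dimension is rounded on its own, so there is no codebook to learn or look up.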
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
# Scalar Quantize Audio Codec | ||
|
||
(Simplified Chinese | [English](./README.md)) | ||
|
||
[License](LICENSE) | ||
|
||
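Scalar Quantize Audio Codec is a lightweight audio codec that uses scalar quantization for efficient audio compression and reconstruction. The project aims to give developers a simple, extensible audio codec solution. The code is a modification of [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec) that replaces the original project's VQ module; the algorithm follows [SimpleSpeech-2](https://arxiv.org/abs/2408.13893). | ||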
|
||
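## Table of Contents | ||
|
||
- [Features](#features) | ||
- [Installation](#installation) | ||
- [Usage](#usage) | ||
- [Roadmap](#roadmap) | ||
- [Contribution Guide](#contribution-guide) | ||
- [License](#license) | ||
- [Acknowledgements](#acknowledgements) | ||
|
||
## Features | ||
|
||
- Audio compression implemented with scalar quantization | ||
- Suited to Diffusion / Flow Matching audio generation, easing the generation burden and improving output quality | ||
|
||
## Installation | ||
|
||
You can install and use this project with the following steps: | ||
|
||
```bash | ||
git clone https://github.com/jingzhunxue/flow_mirror.git | ||
cd flow_mirror/codec/sqcodec | ||
pip install -r requirements.txt | ||
``` | ||
|
||
## Usage | ||
|
||
Coming Soon... | ||
|
||
## Roadmap | ||
|
||
We are committed to continuously improving and expanding Scalar Quantize Audio Codec to provide more powerful and flexible audio encoding and decoding. Here is our development roadmap: | ||
|
||
### October 2024 | ||
|
||
#### 1.0 - Initial Release | ||
|
||
- [x] Complete the basic scalar quantization codec implementation and open-source the code | ||
- [ ] Release weights pre-trained on 120k hours of mixed Chinese and English data | ||
- [ ] Publish evaluation results and evaluation code | ||
- [ ] Provide basic documentation and example code | ||
|
||
## Contribution Guide | ||
|
||
We welcome contributions of all kinds! If you have good ideas or find any issues, please submit an [Issue](https://github.com/jingzhunxue/flow_mirror/issues) or a [Pull Request](https://github.com/jingzhunxue/flow_mirror/pulls). | ||
|
||
## License | ||
|
||
This project is licensed under the [MIT License](LICENSE). | ||
|
||
## Acknowledgements | ||
|
||
Special thanks to the following projects and papers for their inspiration and support: | ||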
|
||
- [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec) | ||
- [SimpleSpeech-2](https://arxiv.org/abs/2408.13893) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
$include: | ||
- conf/base.yml | ||
|
||
batch_size: 24 | ||
val_batch_size: 12 | ||
num_workers: 4 |
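
The configs in this commit start with a `$include` list. A minimal sketch of how such includes could be resolved, assuming included files are merged in order and the including file's own keys override them. Plain dictionaries stand in for parsed YAML files; the path `conf/example.yml` and the values given for `conf/base.yml` are hypothetical.

```python
def resolve_includes(name, files):
    """Merge a config with the files named in its `$include` list.

    `files` maps a path to an already-parsed config dict (a stand-in for
    reading YAML from disk). Included files are applied in order, then the
    including file's own keys are applied last, so they win on conflict.
    """
    config = files[name]
    merged = {}
    for included in config.get("$include", []):
        merged.update(resolve_includes(included, files))
    merged.update({k: v for k, v in config.items() if k != "$include"})
    return merged


files = {
    # Hypothetical base values, for illustration only.
    "conf/base.yml": {"batch_size": 72, "num_workers": 32},
    # Mirrors the override file shown above; the file name is a placeholder.
    "conf/example.yml": {
        "$include": ["conf/base.yml"],
        "batch_size": 24,
        "val_batch_size": 12,
        "num_workers": 4,
    },
}

resolved = resolve_includes("conf/example.yml", files)
print(resolved)
```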
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
$include: | ||
- conf/base.yml | ||
- conf/1gpu.yml |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
$include: | ||
- conf/base.yml | ||
- conf/1gpu.yml | ||
|
||
Discriminator.sample_rate: 44100 | ||
Discriminator.fft_sizes: [2048, 1024, 512] | ||
Discriminator.bands: | ||
- [0.0, 0.05] | ||
- [0.05, 0.1] | ||
- [0.1, 0.25] | ||
- [0.25, 0.5] | ||
- [0.5, 1.0] | ||
|
||
|
||
# re-weight lambdas to make up for | ||
# lost discriminators vs baseline | ||
lambdas: | ||
mel/loss: 15.0 | ||
adv/feat_loss: 5.0 | ||
adv/gen_loss: 1.0 | ||
vq/commitment_loss: 0.25 | ||
vq/codebook_loss: 1.0 |
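
The `lambdas` block assigns a weight to each named loss term. Assuming the trainer combines them as a weighted sum (the usual pattern for this kind of config), the effect of re-weighting can be sketched as below. The helper `weighted_total` and the per-term loss values are hypothetical and chosen only to make the arithmetic visible.

```python
def weighted_total(losses, lambdas):
    """Sum the loss terms that have a weight, each scaled by its lambda."""
    return sum(
        weight * losses[name]
        for name, weight in lambdas.items()
        if name in losses
    )


# Weights copied from the config above.
lambdas = {
    "mel/loss": 15.0,
    "adv/feat_loss": 5.0,
    "adv/gen_loss": 1.0,
    "vq/commitment_loss": 0.25,
    "vq/codebook_loss": 1.0,
}

# Hypothetical loss values for a single training step.
losses = {
    "mel/loss": 0.2,
    "adv/feat_loss": 0.1,
    "adv/gen_loss": 1.0,
    "vq/commitment_loss": 0.4,
    "vq/codebook_loss": 0.4,
}

total = weighted_total(losses, lambdas)
print(total)
```

Raising `adv/feat_loss` increases the share of the total that comes from the feature-matching term, which is one way to compensate when fewer discriminators contribute to it.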
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
$include: | ||
- conf/base.yml | ||
- conf/1gpu.yml | ||
|
||
Discriminator.sample_rate: 44100 | ||
Discriminator.fft_sizes: [2048, 1024, 512] | ||
Discriminator.bands: | ||
- [0.0, 0.2] | ||
- [0.2, 0.4] | ||
- [0.4, 0.6] | ||
- [0.6, 0.8] | ||
- [0.8, 1.0] | ||
|
||
|
||
# re-weight lambdas to make up for | ||
# lost discriminators vs baseline | ||
lambdas: | ||
mel/loss: 15.0 | ||
adv/feat_loss: 5.0 | ||
adv/gen_loss: 1.0 | ||
vq/commitment_loss: 0.25 | ||
vq/codebook_loss: 1.0 |
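
`Discriminator.bands` lists pairs of fractions between 0.0 and 1.0. Assuming each pair selects a slice of the STFT frequency bins for one sub-band discriminator, the sketch below converts the fractions into bin index ranges for each entry in `Discriminator.fft_sizes`. The helper name `band_bin_ranges` is hypothetical, and the real discriminator code may slice differently.

```python
def band_bin_ranges(fft_size, bands):
    """Convert fractional bands into (start, end) frequency-bin indices.

    A real-valued STFT with window size `fft_size` has fft_size // 2 + 1 bins.
    """
    n_bins = fft_size // 2 + 1
    return [(int(lo * n_bins), int(hi * n_bins)) for lo, hi in bands]


# Uniform bands copied from the config above.
bands = [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]

ranges_by_fft = {size: band_bin_ranges(size, bands) for size in (2048, 1024, 512)}
for size, ranges in ranges_by_fft.items():
    print(size, ranges)
```

With uniform fractions every band covers roughly the same number of bins, whereas the previous config concentrates narrower bands at low frequencies.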
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
$include: | ||
- conf/base.yml | ||
- conf/1gpu.yml | ||
|
||
lambdas: | ||
mel/loss: 1.0 | ||
waveform/loss: 1.0 | ||
vq/commitment_loss: 0.25 | ||
vq/codebook_loss: 1.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
$include: | ||
- conf/base.yml | ||
- conf/1gpu.yml | ||
|
||
train/build_dataset.folders: | ||
speech: | ||
- /data/daps/train | ||
- /data/vctk | ||
- /data/vocalset | ||
- /data/read_speech | ||
- /data/french_speech | ||
- /data/emotional_speech/ | ||
- /data/common_voice/ | ||
- /data/german_speech/ | ||
- /data/russian_speech/ | ||
- /data/spanish_speech/ | ||
music: | ||
- /data/musdb/train | ||
- /data/jamendo | ||
general: | ||
- /data/audioset/data/unbalanced_train_segments/ | ||
- /data/audioset/data/balanced_train_segments/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
$include: | ||
- conf/base.yml | ||
- conf/1gpu.yml | ||
|
||
MelSpectrogramLoss.n_mels: [80] | ||
MelSpectrogramLoss.window_lengths: [512] | ||
MelSpectrogramLoss.mel_fmin: [0] | ||
MelSpectrogramLoss.mel_fmax: [null] | ||
MelSpectrogramLoss.pow: 1.0 | ||
MelSpectrogramLoss.clamp_eps: 1.0e-5 | ||
MelSpectrogramLoss.mag_weight: 0.0 | ||
|
||
lambdas: | ||
mel/loss: 100.0 | ||
adv/feat_loss: 2.0 | ||
adv/gen_loss: 1.0 | ||
vq/commitment_loss: 0.25 | ||
vq/codebook_loss: 1.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
$include: | ||
- conf/base.yml | ||
- conf/1gpu.yml | ||
|
||
Discriminator.sample_rate: 44100 | ||
Discriminator.fft_sizes: [2048, 1024, 512] | ||
Discriminator.bands: | ||
- [0.0, 1.0] | ||
|
||
# re-weight lambdas to make up for | ||
# lost discriminators vs baseline | ||
lambdas: | ||
mel/loss: 15.0 | ||
adv/feat_loss: 5.0 | ||
adv/gen_loss: 1.0 | ||
vq/commitment_loss: 0.25 | ||
vq/codebook_loss: 1.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
$include: | ||
- conf/base.yml | ||
- conf/1gpu.yml | ||
|
||
Discriminator.sample_rate: 44100 | ||
Discriminator.rates: [] | ||
Discriminator.periods: [] | ||
Discriminator.fft_sizes: [2048, 1024, 512] | ||
Discriminator.bands: | ||
- [0.0, 0.1] | ||
- [0.1, 0.25] | ||
- [0.25, 0.5] | ||
- [0.5, 0.75] | ||
- [0.75, 1.0] | ||
|
||
lambdas: | ||
mel/loss: 15.0 | ||
adv/feat_loss: 2.66 | ||
adv/gen_loss: 1.0 | ||
vq/commitment_loss: 0.25 | ||
vq/codebook_loss: 1.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
$include: | ||
- conf/base.yml | ||
- conf/1gpu.yml | ||
|
||
Discriminator.sample_rate: 44100 | ||
Discriminator.rates: [1] | ||
Discriminator.periods: [] | ||
Discriminator.fft_sizes: [2048, 1024, 512] | ||
Discriminator.bands: | ||
- [0.0, 0.1] | ||
- [0.1, 0.25] | ||
- [0.25, 0.5] | ||
- [0.5, 0.75] | ||
- [0.75, 1.0] | ||
|
||
lambdas: | ||
mel/loss: 15.0 | ||
adv/feat_loss: 2.5 | ||
adv/gen_loss: 1.0 | ||
vq/commitment_loss: 0.25 | ||
vq/codebook_loss: 1.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
$include: | ||
- conf/base.yml | ||
- conf/1gpu.yml | ||
|
||
train/build_dataset.folders: | ||
speech_fb: | ||
- /data/daps/train | ||
speech_hq: | ||
- /data/vctk | ||
- /data/vocalset | ||
- /data/read_speech | ||
- /data/french_speech | ||
speech_uq: | ||
- /data/emotional_speech/ | ||
- /data/common_voice/ | ||
- /data/german_speech/ | ||
- /data/russian_speech/ | ||
- /data/spanish_speech/ | ||
|
||
val/build_dataset.folders: | ||
speech_hq: | ||
- /data/daps/val |