tokenizer sqcodec updated (#8)

Co-authored-by: happen <happenmass@gmail.com>
jingzhunxue · Sep 30, 2024 · 1bcdaf4
1 parent 606ecbc
commit 1bcdaf4

Showing 49 changed files with 2,862 additions and 0 deletions.
diff --git a/codec/sqcodec/README.md b/codec/sqcodec/README.md
@@ -0,0 +1,64 @@
+# Scalar Quantize Audio Codec
+
+([Simplified Chinese](./README_zh.md) | English)
+
+[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
+
+Scalar Quantize Audio Codec is a lightweight audio codec that uses scalar quantization (SQ) for efficient audio compression and reconstruction. The project aims to give developers a simple, extensible audio codec. The code is adapted from [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec), with the original project's vector quantization (VQ) module replaced by SQ. The algorithm follows the paper [SimpleSpeech-2](https://arxiv.org/abs/2408.13893).
+
+## Table of Contents
+
+- [Features](#features)
+- [Installation](#installation)
+- [Usage](#usage)
+- [Roadmap](#roadmap)
+- [Contribution Guide](#contribution-guide)
+- [License](#license)
+- [Acknowledgements](#acknowledgements)
+
+## Features
+
+- Implements audio compression with scalar quantization
+- Suited to Diffusion / Flow Matching audio generation pipelines, where it reduces generation overhead and improves results
+
+## Installation
+
+Clone the repository and install the dependencies:
+
+```bash
+git clone https://github.com/jingzhunxue/flow_mirror.git
+cd flow_mirror/codec/sqcodec
+pip install -r requirements.txt
+```
+
+## Usage
+
+Coming Soon...
+
+## Roadmap
+
+We plan to keep improving and extending Scalar Quantize Audio Codec so that it offers more capable and flexible audio encoding and decoding. The current development roadmap:
+
+### October 2024
+
+#### 1.0 - Initial Release
+
+- [x] Complete the basic scalar quantization codec implementation and open-source the code
+- [ ] Release weights pre-trained on 120k hours of mixed Chinese and English audio
+- [ ] Publish evaluation results and evaluation code
+- [ ] Provide basic documentation and example code
+
+## Contribution Guide
+
+We welcome contributions of all kinds! If you have good ideas or find any issues, please submit an [Issue](https://github.com/jingzhunxue/flow_mirror/issues) or a [Pull Request](https://github.com/jingzhunxue/flow_mirror/pulls).
+
+## License
+
+This project is licensed under the [MIT License](LICENSE).
+
+## Acknowledgements
+
+Special thanks to the following projects and papers for their inspiration and support:
+
+- [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec)
+- [SimpleSpeech-2](https://arxiv.org/abs/2408.13893)
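
The README above describes replacing VQ with scalar quantization but gives no usage example yet. As a rough illustration of the general idea only, the sketch below bounds each latent value with `tanh` and rounds it onto a uniform grid. It is not taken from this repository's code: the function name `scalar_quantize` and the `levels` parameter are hypothetical, and a trainable model would also need a gradient workaround for the rounding step, which is omitted here.

```python
import math


def scalar_quantize(latent, levels=9):
    """Quantize each value independently onto a uniform grid in [-1, 1].

    `levels` is a hypothetical parameter: the number of grid steps on each
    side of zero, giving 2 * levels + 1 possible values per dimension.
    """
    quantized = []
    for x in latent:
        bounded = math.tanh(x)  # squash the value into (-1, 1)
        quantized.append(round(bounded * levels) / levels)  # snap to the nearest grid point
    return quantized


# Example latent vector (made-up values).
codes = scalar_quantize([-3.0, -0.2, 0.0, 0.4, 2.5])
print(codes)
```

Unlike VQ, no codebook lookup is involved: every dimension is quantized on its own, so the quantized values stay in a small, bounded range.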
diff --git a/codec/sqcodec/README_zh.md b/codec/sqcodec/README_zh.md
@@ -0,0 +1,64 @@
+# Scalar Quantize Audio Codec
+
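+(Simplified Chinese | [English](./README.md))
+
+[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
+
+Scalar Quantize Audio Codec is a lightweight audio codec that uses scalar quantization to achieve efficient audio compression and reconstruction. The project aims to provide developers with a simple, extensible audio codec solution. The code is modified from [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec), replacing the VQ part of the original project; the algorithm draws on [SimpleSpeech-2](https://arxiv.org/abs/2408.13893).
+
+## Table of Contents
+
+- [Features](#features)
+- [Installation](#installation)
+- [Usage](#usage)
+- [Roadmap](#roadmap)
+- [Contribution Guide](#contribution-guide)
+- [License](#license)
+- [Acknowledgements](#acknowledgements)
+
+## Features
+
+- Audio compression implemented with scalar quantization
+- Suitable for audio generation approaches such as Diffusion / Flow Matching, easing the generation burden and improving generation quality
+
+## Installation
+
+You can install and use the project with the following steps:
+
+```bash
+git clone https://github.com/jingzhunxue/flow_mirror.git
+cd flow_mirror/codec/sqcodec
+pip install -r requirements.txt
+```
+
+## Usage
+
+Coming Soon...
+
+## Roadmap
+
+We are committed to continually improving and extending Scalar Quantize Audio Codec to provide a more powerful and flexible audio codec solution. Our development roadmap follows:
+
+### October 2024
+
+#### 1.0 - Initial Release
+
+- [x] Complete the basic scalar quantization codec implementation and open-source the code
+- [ ] Release weights pre-trained on 120,000 hours of mixed Chinese and English data
+- [ ] Publish evaluation results and evaluation code
+- [ ] Provide basic documentation and example code
+
+## Contribution Guide
+
+We welcome contributions of all kinds! If you have a good idea or have found a problem, please submit an [Issue](https://github.com/jingzhunxue/flow_mirror/issues) or a [Pull Request](https://github.com/jingzhunxue/flow_mirror/pulls).
+
+## License
+
+This project is licensed under the [MIT License](LICENSE).
+
+## Acknowledgements
+
+Special thanks to the following projects and papers for inspiring and supporting this project: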
+
+- [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec)
+- [SimpleSpeech-2](https://arxiv.org/abs/2408.13893)
diff --git a/codec/sqcodec/conf/1gpu.yml b/codec/sqcodec/conf/1gpu.yml
@@ -0,0 +1,6 @@
+$include:
+  - conf/base.yml
+
+batch_size: 24
+val_batch_size: 12
+num_workers: 4
diff --git a/codec/sqcodec/conf/ablations/baseline.yml b/codec/sqcodec/conf/ablations/baseline.yml
@@ -0,0 +1,3 @@
+$include:
+  - conf/base.yml
+  - conf/1gpu.yml
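
The configs above chain together through `$include`. A minimal sketch of how such a chain might be flattened follows, assuming that later includes override earlier ones and that the including file overrides everything it includes. The helper `resolve_config`, the in-memory `files` mapping, and the values shown for `conf/base.yml` are all hypothetical; `base.yml` is not part of this excerpt, and the real loader may behave differently.

```python
def resolve_config(path, files):
    """Return the flattened settings for `path`.

    `files` is a hypothetical in-memory stand-in for YAML files on disk:
    it maps a path to the dict that parsing that file would produce.
    """
    data = dict(files[path])
    merged = {}
    for included in data.pop("$include", []):
        # Later includes override earlier ones.
        merged.update(resolve_config(included, files))
    # The including file overrides everything it includes.
    merged.update(data)
    return merged


files = {
    # Placeholder values only; the real base.yml is not shown here.
    "conf/base.yml": {"batch_size": 72, "num_workers": 32},
    "conf/1gpu.yml": {
        "$include": ["conf/base.yml"],
        "batch_size": 24,
        "val_batch_size": 12,
        "num_workers": 4,
    },
    "conf/ablations/baseline.yml": {
        "$include": ["conf/base.yml", "conf/1gpu.yml"],
    },
}

config = resolve_config("conf/ablations/baseline.yml", files)
print(config)
```

Under these assumptions, `baseline.yml` ends up with the single-GPU batch settings, because `conf/1gpu.yml` is listed after `conf/base.yml`.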
diff --git a/codec/sqcodec/conf/ablations/diff-mb.yml b/codec/sqcodec/conf/ablations/diff-mb.yml
@@ -0,0 +1,22 @@
+$include:
+  - conf/base.yml
+  - conf/1gpu.yml
+
+Discriminator.sample_rate: 44100
+Discriminator.fft_sizes: [2048, 1024, 512]
+Discriminator.bands:
+  - [0.0, 0.05]
+  - [0.05, 0.1]
+  - [0.1, 0.25]
+  - [0.25, 0.5]
+  - [0.5, 1.0]
+
+
+# re-weight lambdas to make up for
+# lost discriminators vs baseline
+lambdas:
+  mel/loss: 15.0
+  adv/feat_loss: 5.0
+  adv/gen_loss: 1.0
+  vq/commitment_loss: 0.25
+  vq/codebook_loss: 1.0
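
The `Discriminator.bands` entries above read as fractions of the spectrum. A small sketch of one plausible interpretation is shown below, assuming each `[low, high]` pair is scaled by the number of STFT frequency bins (`fft_size // 2 + 1`) and truncated to an integer. The helper name `band_bin_ranges` is hypothetical, and the actual discriminator code may map bands differently.

```python
def band_bin_ranges(fft_size, bands):
    """Convert fractional [low, high] bands into STFT bin index ranges.

    Assumes a one-sided spectrum with fft_size // 2 + 1 frequency bins.
    """
    n_bins = fft_size // 2 + 1
    return [(int(low * n_bins), int(high * n_bins)) for low, high in bands]


# Band edges copied from diff-mb.yml.
bands = [[0.0, 0.05], [0.05, 0.1], [0.1, 0.25], [0.25, 0.5], [0.5, 1.0]]

for fft_size in [2048, 1024, 512]:
    print(fft_size, band_bin_ranges(fft_size, bands))
```

With these edges, the low end of the spectrum is split into narrower bands than the high end, in contrast to the equal-width split in `equal-mb.yml`.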
diff --git a/codec/sqcodec/conf/ablations/equal-mb.yml b/codec/sqcodec/conf/ablations/equal-mb.yml
@@ -0,0 +1,22 @@
+$include:
+  - conf/base.yml
+  - conf/1gpu.yml
+
+Discriminator.sample_rate: 44100
+Discriminator.fft_sizes: [2048, 1024, 512]
+Discriminator.bands:
+  - [0.0, 0.2]
+  - [0.2, 0.4]
+  - [0.4, 0.6]
+  - [0.6, 0.8]
+  - [0.8, 1.0]
+
+
+# re-weight lambdas to make up for
+# lost discriminators vs baseline
+lambdas:
+  mel/loss: 15.0
+  adv/feat_loss: 5.0
+  adv/gen_loss: 1.0
+  vq/commitment_loss: 0.25
+  vq/codebook_loss: 1.0
diff --git a/codec/sqcodec/conf/ablations/no-adv.yml b/codec/sqcodec/conf/ablations/no-adv.yml
@@ -0,0 +1,9 @@
+$include:
+  - conf/base.yml
+  - conf/1gpu.yml
+
+lambdas:
+  mel/loss: 1.0
+  waveform/loss: 1.0
+  vq/commitment_loss: 0.25
+  vq/codebook_loss: 1.0
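
The `lambdas` mapping above assigns a weight to each named loss term. As a hedged sketch, the total training loss could be formed as a weighted sum over the terms that appear in `lambdas`, so a term without a weight (here the adversarial ones) contributes nothing. The function `total_loss` and the loss values are made up for illustration and are not taken from the training script.

```python
def total_loss(losses, lambdas):
    """Weighted sum over the loss terms that have a weight in `lambdas`."""
    return sum(
        weight * losses[name]
        for name, weight in lambdas.items()
        if name in losses
    )


# Weights copied from no-adv.yml.
lambdas = {
    "mel/loss": 1.0,
    "waveform/loss": 1.0,
    "vq/commitment_loss": 0.25,
    "vq/codebook_loss": 1.0,
}

# Made-up per-step loss values; adv/gen_loss has no weight, so it is ignored.
losses = {
    "mel/loss": 2.0,
    "waveform/loss": 0.5,
    "vq/commitment_loss": 0.4,
    "vq/codebook_loss": 0.1,
    "adv/gen_loss": 3.0,
}

print(total_loss(losses, lambdas))
```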
diff --git a/codec/sqcodec/conf/ablations/no-data-balance.yml b/codec/sqcodec/conf/ablations/no-data-balance.yml
@@ -0,0 +1,22 @@
+$include:
+  - conf/base.yml
+  - conf/1gpu.yml
+
+train/build_dataset.folders:
+  speech:
+    - /data/daps/train
+    - /data/vctk
+    - /data/vocalset
+    - /data/read_speech
+    - /data/french_speech
+    - /data/emotional_speech/
+    - /data/common_voice/
+    - /data/german_speech/
+    - /data/russian_speech/
+    - /data/spanish_speech/
+  music:
+    - /data/musdb/train
+    - /data/jamendo
+  general:
+    - /data/audioset/data/unbalanced_train_segments/
+    - /data/audioset/data/balanced_train_segments/
diff --git a/codec/sqcodec/conf/ablations/no-low-hop.yml b/codec/sqcodec/conf/ablations/no-low-hop.yml
@@ -0,0 +1,18 @@
+$include:
+  - conf/base.yml
+  - conf/1gpu.yml
+
+MelSpectrogramLoss.n_mels: [80]
+MelSpectrogramLoss.window_lengths: [512]
+MelSpectrogramLoss.mel_fmin: [0]
+MelSpectrogramLoss.mel_fmax: [null]
+MelSpectrogramLoss.pow: 1.0
+MelSpectrogramLoss.clamp_eps: 1.0e-5
+MelSpectrogramLoss.mag_weight: 0.0
+
+lambdas:
+  mel/loss: 100.0
+  adv/feat_loss: 2.0
+  adv/gen_loss: 1.0
+  vq/commitment_loss: 0.25
+  vq/codebook_loss: 1.0
diff --git a/codec/sqcodec/conf/ablations/no-mb.yml b/codec/sqcodec/conf/ablations/no-mb.yml
@@ -0,0 +1,17 @@
+$include:
+  - conf/base.yml
+  - conf/1gpu.yml
+
+Discriminator.sample_rate: 44100
+Discriminator.fft_sizes: [2048, 1024, 512]
+Discriminator.bands:
+  - [0.0, 1.0]
+
+# re-weight lambdas to make up for
+# lost discriminators vs baseline
+lambdas:
+  mel/loss: 15.0
+  adv/feat_loss: 5.0
+  adv/gen_loss: 1.0
+  vq/commitment_loss: 0.25
+  vq/codebook_loss: 1.0
diff --git a/codec/sqcodec/conf/ablations/no-mpd-msd.yml b/codec/sqcodec/conf/ablations/no-mpd-msd.yml
@@ -0,0 +1,21 @@
+$include:
+  - conf/base.yml
+  - conf/1gpu.yml
+
+Discriminator.sample_rate: 44100
+Discriminator.rates: []
+Discriminator.periods: []
+Discriminator.fft_sizes: [2048, 1024, 512]
+Discriminator.bands:
+  - [0.0, 0.1]
+  - [0.1, 0.25]
+  - [0.25, 0.5]
+  - [0.5, 0.75]
+  - [0.75, 1.0]
+
+lambdas:
+  mel/loss: 15.0
+  adv/feat_loss: 2.66
+  adv/gen_loss: 1.0
+  vq/commitment_loss: 0.25
+  vq/codebook_loss: 1.0
diff --git a/codec/sqcodec/conf/ablations/no-mpd.yml b/codec/sqcodec/conf/ablations/no-mpd.yml
@@ -0,0 +1,21 @@
+$include:
+  - conf/base.yml
+  - conf/1gpu.yml
+
+Discriminator.sample_rate: 44100
+Discriminator.rates: [1]
+Discriminator.periods: []
+Discriminator.fft_sizes: [2048, 1024, 512]
+Discriminator.bands:
+  - [0.0, 0.1]
+  - [0.1, 0.25]
+  - [0.25, 0.5]
+  - [0.5, 0.75]
+  - [0.75, 1.0]
+
+lambdas:
+  mel/loss: 15.0
+  adv/feat_loss: 2.5
+  adv/gen_loss: 1.0
+  vq/commitment_loss: 0.25
+  vq/codebook_loss: 1.0
diff --git a/codec/sqcodec/conf/ablations/only-speech.yml b/codec/sqcodec/conf/ablations/only-speech.yml
@@ -0,0 +1,22 @@
+$include:
+  - conf/base.yml
+  - conf/1gpu.yml
+
+train/build_dataset.folders:
+  speech_fb:
+    - /data/daps/train
+  speech_hq:
+    - /data/vctk
+    - /data/vocalset
+    - /data/read_speech
+    - /data/french_speech
+  speech_uq:
+    - /data/emotional_speech/
+    - /data/common_voice/
+    - /data/german_speech/
+    - /data/russian_speech/
+    - /data/spanish_speech/
+
+val/build_dataset.folders:
+  speech_hq:
+    - /data/daps/val