tokenizer sqcodec updated (#8)
Co-authored-by: happen <happenmass@gmail.com>
Happenmass and happen authored Sep 30, 2024
1 parent 606ecbc commit 1bcdaf4
Showing 49 changed files with 2,862 additions and 0 deletions.
64 changes: 64 additions & 0 deletions codec/sqcodec/README.md
@@ -0,0 +1,64 @@
# Scalar Quantize Audio Codec

([Simplified Chinese](./README_zh.md) | English)

[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

Scalar Quantize Audio Codec is a lightweight audio codec that uses scalar quantization for efficient audio compression and reconstruction. The project aims to give developers a simple, extensible audio codec. The code is adapted from [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec), replacing that project's vector quantization (VQ) module with scalar quantization (SQ). The algorithm follows the paper [SimpleSpeech-2](https://arxiv.org/abs/2408.13893).

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Roadmap](#roadmap)
- [Contribution Guide](#contribution-guide)
- [License](#license)
- [Acknowledgements](#acknowledgements)

## Features

- Implements audio compression using scalar quantization algorithms
- Suitable for Diffusion / Flow Matching audio generation solutions, reducing generation overhead and improving results
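
As a rough illustration of what scalar quantization means here, the sketch below bounds each latent value with `tanh` and rounds it onto a uniform grid. It is a minimal sketch, assuming a tanh bound followed by uniform rounding in the spirit of SimpleSpeech-2; the function name and the `levels` parameter are hypothetical, and the quantizer in this repository may differ.

```python
import math


def scalar_quantize(latent, levels=9):
    """Quantize each latent value independently onto a uniform grid in [-1, 1].

    `levels` (hypothetical parameter) is the number of grid points per dimension.
    """
    half = (levels - 1) / 2  # 9 levels -> grid step of 1/4
    quantized = []
    for value in latent:
        bounded = math.tanh(value)  # squash into (-1, 1)
        quantized.append(round(bounded * half) / half)
    return quantized


codes = scalar_quantize([-3.0, -0.2, 0.0, 0.4, 2.5])
print(codes)  # [-1.0, -0.25, 0.0, 0.5, 1.0]
```

Unlike VQ, no codebook is learned: every dimension is rounded on its own, so the set of possible values is fixed by `levels` alone.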

## Installation

Follow these steps to install and use this project:

```bash
git clone https://github.com/jingzhunxue/flow_mirror.git
cd flow_mirror/codec/sqcodec
pip install -r requirements.txt
```

## Usage

Coming Soon...

## Roadmap

We are committed to continuously improving and expanding Scalar Quantize Audio Codec to provide more powerful and flexible audio encoding and decoding solutions. Here is our development roadmap:

### October 2024

#### 1.0 - Initial Release

- [x] Complete the basic scalar quantization codec implementation and open-source the code
- [ ] Release weights pre-trained on 120k hours of mixed Chinese and English audio
- [ ] Publish evaluation results and evaluation code
- [ ] Provide basic documentation and example code

## Contribution Guide

We welcome contributions of all kinds! If you have good ideas or find any issues, please submit an [Issue](https://github.com/jingzhunxue/flow_mirror/issues) or a [Pull Request](https://github.com/jingzhunxue/flow_mirror/pulls).

## License

This project is licensed under the [MIT License](LICENSE).

## Acknowledgements

Special thanks to the following projects and papers for their inspiration and support:

- [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec)
- [SimpleSpeech-2](https://arxiv.org/abs/2408.13893)
64 changes: 64 additions & 0 deletions codec/sqcodec/README_zh.md
@@ -0,0 +1,64 @@
# Scalar Quantize Audio Codec

(Simplified Chinese | [English](./README.md))

[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

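Scalar Quantize Audio Codec is a lightweight audio codec that uses scalar quantization for efficient audio compression and reconstruction. The project aims to give developers a simple, extensible audio codec. The code is adapted from [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec), replacing the VQ module of the original project. The algorithm follows [SimpleSpeech-2](https://arxiv.org/abs/2408.13893).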

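## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Roadmap](#roadmap)
- [Contribution Guide](#contribution-guide)
- [License](#license)
- [Acknowledgements](#acknowledgements)

## Features

- Implements audio compression based on scalar quantization
- Suitable for Diffusion / Flow Matching and similar audio generation approaches, easing the generation burden and improving generation quality

## Installation

Follow these steps to install and use this project:

```bash
git clone https://github.com/jingzhunxue/flow_mirror.git
cd flow_mirror/codec/sqcodec
pip install -r requirements.txt
```

## Usage

Coming Soon...

## Roadmap

We are committed to continuously improving and expanding Scalar Quantize Audio Codec to provide more powerful and flexible audio encoding and decoding solutions. Here is our development roadmap:

### October 2024

#### 1.0 - Initial Release

- [x] Complete the basic scalar quantization codec implementation and open-source the code
- [ ] Release weights pre-trained on 120k hours of mixed Chinese and English audio
- [ ] Publish evaluation results and evaluation code
- [ ] Provide basic documentation and example code

## Contribution Guide

We welcome contributions of all kinds! If you have good ideas or find any issues, please submit an [Issue](https://github.com/jingzhunxue/flow_mirror/issues) or a [Pull Request](https://github.com/jingzhunxue/flow_mirror/pulls).

## License

This project is licensed under the [MIT License](LICENSE).

## Acknowledgements

Special thanks to the following projects and papers for their inspiration and support: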

- [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec)
- [SimpleSpeech-2](https://arxiv.org/abs/2408.13893)
6 changes: 6 additions & 0 deletions codec/sqcodec/conf/1gpu.yml
@@ -0,0 +1,6 @@
$include:
  - conf/base.yml

batch_size: 24
val_batch_size: 12
num_workers: 4
3 changes: 3 additions & 0 deletions codec/sqcodec/conf/ablations/baseline.yml
@@ -0,0 +1,3 @@
$include:
  - conf/base.yml
  - conf/1gpu.yml
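
The `$include` key chains configs together. A minimal sketch of how such includes could resolve, assuming included files are merged in listed order and the including file's own keys win last. The `FILES` dictionary stands in for parsed YAML, and the `conf/base.yml` values are hypothetical placeholders, since that file is not shown in this diff:

```python
# In-memory stand-ins for the parsed YAML files.
# The base.yml values are hypothetical placeholders.
FILES = {
    "conf/base.yml": {"batch_size": 72, "num_workers": 32},
    "conf/1gpu.yml": {
        "$include": ["conf/base.yml"],
        "batch_size": 24,
        "val_batch_size": 12,
        "num_workers": 4,
    },
    "conf/ablations/baseline.yml": {
        "$include": ["conf/base.yml", "conf/1gpu.yml"],
    },
}


def resolve(path):
    """Merge included configs in order, then apply the file's own keys."""
    data = dict(FILES[path])
    merged = {}
    for included in data.pop("$include", []):
        merged.update(resolve(included))
    merged.update(data)
    return merged


config = resolve("conf/ablations/baseline.yml")
print(config)
```

Under these assumptions, `baseline.yml` adds nothing of its own and ends up with the single-GPU values from `conf/1gpu.yml`.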
22 changes: 22 additions & 0 deletions codec/sqcodec/conf/ablations/diff-mb.yml
@@ -0,0 +1,22 @@
$include:
  - conf/base.yml
  - conf/1gpu.yml

Discriminator.sample_rate: 44100
Discriminator.fft_sizes: [2048, 1024, 512]
Discriminator.bands:
  - [0.0, 0.05]
  - [0.05, 0.1]
  - [0.1, 0.25]
  - [0.25, 0.5]
  - [0.5, 1.0]


# re-weight lambdas to make up for
# lost discriminators vs baseline
lambdas:
  mel/loss: 15.0
  adv/feat_loss: 5.0
  adv/gen_loss: 1.0
  vq/commitment_loss: 0.25
  vq/codebook_loss: 1.0
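
To make the `Discriminator.bands` values concrete, the sketch below converts the fractional band edges into STFT bin ranges. It assumes, as in upstream Descript-Audio-Codec, that each edge is a fraction of the `fft_size // 2 + 1` frequency bins; the helper name `band_bins` is hypothetical.

```python
def band_bins(fft_size, bands):
    """Map fractional band edges to (start, end) STFT bin indices."""
    n_bins = fft_size // 2 + 1
    return [(int(lo * n_bins), int(hi * n_bins)) for lo, hi in bands]


# Band edges copied from diff-mb.yml.
DIFF_MB = [[0.0, 0.05], [0.05, 0.1], [0.1, 0.25], [0.25, 0.5], [0.5, 1.0]]

bins = band_bins(2048, DIFF_MB)
widths = [end - start for start, end in bins]
print(bins)    # [(0, 51), (51, 102), (102, 256), (256, 512), (512, 1025)]
print(widths)  # [51, 51, 154, 256, 513]
```

With these edges the lowest bands are much narrower than the highest one, so the low end of the spectrum is split more finely.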
22 changes: 22 additions & 0 deletions codec/sqcodec/conf/ablations/equal-mb.yml
@@ -0,0 +1,22 @@
$include:
  - conf/base.yml
  - conf/1gpu.yml

Discriminator.sample_rate: 44100
Discriminator.fft_sizes: [2048, 1024, 512]
Discriminator.bands:
  - [0.0, 0.2]
  - [0.2, 0.4]
  - [0.4, 0.6]
  - [0.6, 0.8]
  - [0.8, 1.0]


# re-weight lambdas to make up for
# lost discriminators vs baseline
lambdas:
  mel/loss: 15.0
  adv/feat_loss: 5.0
  adv/gen_loss: 1.0
  vq/commitment_loss: 0.25
  vq/codebook_loss: 1.0
9 changes: 9 additions & 0 deletions codec/sqcodec/conf/ablations/no-adv.yml
@@ -0,0 +1,9 @@
$include:
  - conf/base.yml
  - conf/1gpu.yml

lambdas:
  mel/loss: 1.0
  waveform/loss: 1.0
  vq/commitment_loss: 0.25
  vq/codebook_loss: 1.0
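
The `lambdas` mapping reads as a set of per-term loss weights. A minimal sketch of how it could be applied, assuming the training loss is the weighted sum of every named term present in `lambdas` and that unlisted terms are ignored. The `outputs` values are made-up numbers for illustration:

```python
def total_loss(lambdas, outputs):
    """Weighted sum over the loss terms that have a weight in `lambdas`."""
    return sum(
        weight * outputs[name]
        for name, weight in lambdas.items()
        if name in outputs
    )


# Weights copied from no-adv.yml.
NO_ADV = {
    "mel/loss": 1.0,
    "waveform/loss": 1.0,
    "vq/commitment_loss": 0.25,
    "vq/codebook_loss": 1.0,
}

# Hypothetical per-step loss values, for illustration only.
outputs = {
    "mel/loss": 2.0,
    "waveform/loss": 0.5,
    "adv/gen_loss": 3.0,
    "adv/feat_loss": 4.0,
    "vq/commitment_loss": 0.4,
    "vq/codebook_loss": 0.1,
}

loss = total_loss(NO_ADV, outputs)
print(loss)
```

Because `no-adv.yml` lists no `adv/*` keys, the adversarial terms contribute nothing under this reading, which matches the name of the ablation.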
22 changes: 22 additions & 0 deletions codec/sqcodec/conf/ablations/no-data-balance.yml
@@ -0,0 +1,22 @@
$include:
  - conf/base.yml
  - conf/1gpu.yml

train/build_dataset.folders:
  speech:
    - /data/daps/train
    - /data/vctk
    - /data/vocalset
    - /data/read_speech
    - /data/french_speech
    - /data/emotional_speech/
    - /data/common_voice/
    - /data/german_speech/
    - /data/russian_speech/
    - /data/spanish_speech/
  music:
    - /data/musdb/train
    - /data/jamendo
  general:
    - /data/audioset/data/unbalanced_train_segments/
    - /data/audioset/data/balanced_train_segments/
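
One plausible reading of the folder grouping is that each top-level key becomes its own dataset and training examples alternate across groups, so the number and granularity of groups controls how balanced the data is. The sketch below illustrates that idea only; the round-robin rule and the `pick` helper are assumptions, not the repository's loader.

```python
from collections import Counter

# Folder groups from no-data-balance.yml (speech list shortened).
GROUPS = {
    "speech": ["/data/daps/train", "/data/vctk", "/data/vocalset", "/data/read_speech"],
    "music": ["/data/musdb/train", "/data/jamendo"],
    "general": [
        "/data/audioset/data/unbalanced_train_segments/",
        "/data/audioset/data/balanced_train_segments/",
    ],
}


def pick(groups, index):
    """Round-robin across groups, then cycle through the chosen group's folders."""
    names = list(groups)
    name = names[index % len(names)]
    folders = groups[name]
    return name, folders[(index // len(names)) % len(folders)]


draws = Counter(pick(GROUPS, i)[0] for i in range(300))
print(draws)
```

Under this assumption every group receives the same share of draws regardless of how many folders it holds, so merging many speech folders into one `speech` group reduces how often each individual folder is visited.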
18 changes: 18 additions & 0 deletions codec/sqcodec/conf/ablations/no-low-hop.yml
@@ -0,0 +1,18 @@
$include:
  - conf/base.yml
  - conf/1gpu.yml

MelSpectrogramLoss.n_mels: [80]
MelSpectrogramLoss.window_lengths: [512]
MelSpectrogramLoss.mel_fmin: [0]
MelSpectrogramLoss.mel_fmax: [null]
MelSpectrogramLoss.pow: 1.0
MelSpectrogramLoss.clamp_eps: 1.0e-5
MelSpectrogramLoss.mag_weight: 0.0

lambdas:
  mel/loss: 100.0
  adv/feat_loss: 2.0
  adv/gen_loss: 1.0
  vq/commitment_loss: 0.25
  vq/codebook_loss: 1.0
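
The `clamp_eps` and `pow` settings suggest a log-magnitude distance between mel spectrograms. The sketch below shows that calculation on plain lists, assuming an L1 distance between `log10` of clamped, exponentiated mel magnitudes; with `mag_weight: 0.0` any linear-magnitude term would drop out. The function name and the sample values are hypothetical.

```python
import math


def log_mel_l1(x_mels, y_mels, clamp_eps=1.0e-5, power=1.0):
    """Mean absolute difference of log10 mel magnitudes, clamped below at clamp_eps."""
    total = 0.0
    for x, y in zip(x_mels, y_mels):
        log_x = math.log10(max(x, clamp_eps) ** power)
        log_y = math.log10(max(y, clamp_eps) ** power)
        total += abs(log_x - log_y)
    return total / len(x_mels)


# Hypothetical mel magnitudes for one frame (three bins).
reference = [1.0, 0.1, 0.0]
estimate = [1.0, 0.01, 0.0]

distance = log_mel_l1(reference, estimate)
print(distance)
```

The clamp keeps silent bins (magnitude 0) from producing `log10(0)`, which is why the zero entries above contribute nothing to the distance.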
17 changes: 17 additions & 0 deletions codec/sqcodec/conf/ablations/no-mb.yml
@@ -0,0 +1,17 @@
$include:
  - conf/base.yml
  - conf/1gpu.yml

Discriminator.sample_rate: 44100
Discriminator.fft_sizes: [2048, 1024, 512]
Discriminator.bands:
  - [0.0, 1.0]

# re-weight lambdas to make up for
# lost discriminators vs baseline
lambdas:
  mel/loss: 15.0
  adv/feat_loss: 5.0
  adv/gen_loss: 1.0
  vq/commitment_loss: 0.25
  vq/codebook_loss: 1.0
21 changes: 21 additions & 0 deletions codec/sqcodec/conf/ablations/no-mpd-msd.yml
@@ -0,0 +1,21 @@
$include:
  - conf/base.yml
  - conf/1gpu.yml

Discriminator.sample_rate: 44100
Discriminator.rates: []
Discriminator.periods: []
Discriminator.fft_sizes: [2048, 1024, 512]
Discriminator.bands:
  - [0.0, 0.1]
  - [0.1, 0.25]
  - [0.25, 0.5]
  - [0.5, 0.75]
  - [0.75, 1.0]

lambdas:
  mel/loss: 15.0
  adv/feat_loss: 2.66
  adv/gen_loss: 1.0
  vq/commitment_loss: 0.25
  vq/codebook_loss: 1.0
21 changes: 21 additions & 0 deletions codec/sqcodec/conf/ablations/no-mpd.yml
@@ -0,0 +1,21 @@
$include:
  - conf/base.yml
  - conf/1gpu.yml

Discriminator.sample_rate: 44100
Discriminator.rates: [1]
Discriminator.periods: []
Discriminator.fft_sizes: [2048, 1024, 512]
Discriminator.bands:
  - [0.0, 0.1]
  - [0.1, 0.25]
  - [0.25, 0.5]
  - [0.5, 0.75]
  - [0.75, 1.0]

lambdas:
  mel/loss: 15.0
  adv/feat_loss: 2.5
  adv/gen_loss: 1.0
  vq/commitment_loss: 0.25
  vq/codebook_loss: 1.0
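
The `no-mpd-msd.yml` and `no-mpd.yml` ablations differ only in which sub-discriminators remain. The sketch below enumerates them, assuming (as the file names and upstream Descript-Audio-Codec suggest) that `periods` selects multi-period discriminators (MPD), `rates` selects multi-scale discriminators (MSD), and `fft_sizes` selects multi-resolution spectrogram discriminators (MRD). The helper and its labels are hypothetical.

```python
def discriminator_names(periods, rates, fft_sizes):
    """List one label per sub-discriminator implied by the three settings."""
    names = [f"MPD(period={p})" for p in periods]
    names += [f"MSD(rate={r})" for r in rates]
    names += [f"MRD(fft_size={n})" for n in fft_sizes]
    return names


# Settings copied from no-mpd-msd.yml and no-mpd.yml.
no_mpd_msd = discriminator_names(periods=[], rates=[], fft_sizes=[2048, 1024, 512])
no_mpd = discriminator_names(periods=[], rates=[1], fft_sizes=[2048, 1024, 512])

print(len(no_mpd_msd), no_mpd_msd)
print(len(no_mpd), no_mpd)
```

Under this reading, `no-mpd-msd.yml` trains against three spectrogram discriminators only, while `no-mpd.yml` keeps one additional waveform-scale discriminator.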
22 changes: 22 additions & 0 deletions codec/sqcodec/conf/ablations/only-speech.yml
@@ -0,0 +1,22 @@
$include:
  - conf/base.yml
  - conf/1gpu.yml

train/build_dataset.folders:
  speech_fb:
    - /data/daps/train
  speech_hq:
    - /data/vctk
    - /data/vocalset
    - /data/read_speech
    - /data/french_speech
  speech_uq:
    - /data/emotional_speech/
    - /data/common_voice/
    - /data/german_speech/
    - /data/russian_speech/
    - /data/spanish_speech/

val/build_dataset.folders:
  speech_hq:
    - /data/daps/val