AAC clotho word-tokens recipe #4

Open
wants to merge 3 commits into base: master
56 changes: 56 additions & 0 deletions egs/clotho/aac_word/README.md
@@ -0,0 +1,56 @@
# Clotho Recipe

## Data preparation
* Data preparation during `stage 0` can be performed by appropriately setting the boolean variables in the command below. By default, all variables are set to `false` except for `--augment_audiocaps`, which is set to `true`. A description of each variable is given below.

```bash
./run.sh --stage 0 --stop_stage 0 \
--download_clothov2 true \
--download_audiocaps true \
--augment_audiocaps true \
--augment_speedperturbation false \
--download_evalmetrics true
```

#### Setting up Clotho-V2 dataset
* Download and prepare the *Clotho-V2* dataset using the command below. This should create the `data` and `clothov2_data` directories in the current recipe's root directory. The `data` directory should contain the `{dev,val,eval,recog_val,recog_eval}_clothov2` directories, and the `clothov2_data` directory should contain the `clotho_{audio,csv}_files` directories.

```bash
./run.sh --stage 0 --stop_stage 0 --download_clothov2 true
```
* Among the `data/{dev,val,eval,recog_val,recog_eval}_clothov2` directories, `dev_clothov2` is used for training, `val_clothov2` is used for validation, and `recog_{val,eval}_clothov2` are used for decoding captions.
* Since each audio sample in this dataset has 5 captions, the `wav.scp` and `text` files in the `data/{dev,val,eval}_clothov2` directories contain 5 lines for each audio sample, mapping it to its 5 captions.
* To avoid decoding the same audio sample 5 times during the decoding stage, the `wav.scp` and `text` files in the `data/recog_{val,eval}_clothov2` directories contain just one line per audio sample, mapping it to its first caption. Additionally, a `groundtruth_captions.txt` file is created in each directory, which provides all 5 ground-truth captions for each audio sample.
* The `clothov2_data/clotho_audio_files` directory contains the audio samples from the development, validation, and evaluation sets, renamed to `{dev,val,eval}file_{file-ID}.wav` respectively. A mapping from the renamed filenames to the original filenames can be found in `data/{dev,val,eval,recog_val,recog_eval}_clothov2/original_filenames.txt`. A sketch of the resulting file layout is given below.
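* For illustration, a minimal sketch of what these Kaldi-style files might look like; the utterance-ID suffix scheme, file IDs, and captions shown here are invented for illustration and may not match what the scripts actually produce:

```bash
# Hypothetical peek at the prepared files (IDs, suffixes, and captions are made up)
head -2 data/dev_clothov2/wav.scp
#   devfile_00001-caption1 clothov2_data/clotho_audio_files/devfile_00001.wav
#   devfile_00001-caption2 clothov2_data/clotho_audio_files/devfile_00001.wav
head -2 data/dev_clothov2/text
#   devfile_00001-caption1 a dog barks loudly in the distance
#   devfile_00001-caption2 the barking of a dog echoes far away
head -1 data/recog_val_clothov2/wav.scp
#   valfile_00001 clothov2_data/clotho_audio_files/valfile_00001.wav
```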

#### Setting up AudioCaps dataset
* Download and prepare the *AudioCaps* dataset using the command below.

```bash
./run.sh --stage 0 --stop_stage 0 --download_audiocaps true
```
* To augment with the *AudioCaps* dataset during stages 1 to 5 (i.e., feature, dict, and json generation, training, and decoding), add `--augment_audiocaps true` when executing `./run.sh`, as sketched below.
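* A minimal sketch of such an invocation, assuming the stage range described above:

```bash
# Run feature/dict/json generation, training, and decoding with AudioCaps augmentation enabled
./run.sh --stage 1 --stop_stage 5 --augment_audiocaps true
```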

#### Performing speed perturbation augmentation
* For speed-perturbation-based data augmentation, add `--augment_speedperturbation true` during data preparation, as sketched below.
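* A minimal sketch combining this flag with the Clotho-V2 download step described earlier:

```bash
# Prepare Clotho-V2 and additionally apply speed-perturbation augmentation
./run.sh --stage 0 --stop_stage 0 --download_clothov2 true --augment_speedperturbation true
```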

#### Setting up COCO Evaluation Metrics
* Download and set up the evaluation framework using the command below. **WARNING:** This runs `pip3 install scikit-image`.

```bash
./run.sh --stage 0 --stop_stage 0 --download_evalmetrics true
```

## Decoding
#### Caption evaluation
* By default, stage 5 decoding evaluates the decoded captions and saves a summary file, `caption_evaluation_summary.txt`, and a detailed results file, `caption_evaluation_results.txt`, to the experiment's decoding directory (e.g., `exp/dev_clothov2_pytorch_train_specaug/decode_recog_val_clothov2_decode_lm_last10/`).
* Alternatively, the decoded captions can be evaluated by executing `local/evaluate_decoded_captions.py`. This script takes two inputs, `decoded_json_path` and `groundtruth_captions_path`, and writes a text file, `caption_evaluation_results.txt`, to the same directory as `decoded_json_path`. The output file tabulates the individual metric scores for each decoded audio sample. An example invocation is provided below.

```bash
python local/evaluate_decoded_captions.py \
exp/dev_clothov2_pytorch_train_specaug/decode_recog_val_clothov2_decode_lm_last10/data.json \
data/recog_val_clothov2/groundtruth_captions.txt
```

#### Using best 10 validation epochs
* By default, stage 5 decoding averages the model parameters saved from the last 10 training epochs. To instead average the model parameters saved at the 10 epochs with the best validation scores, add `--use_valbest_average true`, as sketched below.
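* A minimal sketch, assuming decoding alone is re-run as stage 5:

```bash
# Decode while averaging the checkpoints from the 10 best validation epochs
./run.sh --stage 5 --stop_stage 5 --use_valbest_average true
```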
89 changes: 89 additions & 0 deletions egs/clotho/aac_word/cmd.sh
@@ -0,0 +1,89 @@
# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
# e.g.
# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
#
# Options:
# --time <time>: Limit the maximum time to execute.
# --mem <mem>: Limit the maximum memory usage.
# --max-jobs-run <njob>: Limit the number of parallel jobs. This is ignored for non-array jobs.
# --num-threads <nthreads>: Specify the number of CPU cores.
# --gpu <ngpu>: Specify the number of GPU devices.
# --config: Change the configuration file from the default.
#
# "JOB=1:10" is used for "array jobs" and it can control the number of parallel jobs.
# The left string of "=", i.e. "JOB", is replaced by <N> (the Nth job) in the command and the log file name,
# e.g. "echo JOB" is changed to "echo 3" for the 3rd job and to "echo 8" for the 8th job.
# Note that the range must start with a positive number, so you can't use "JOB=0:10", for example.
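# For instance (a hypothetical illustration; "decode_part.sh" is a made-up script name):
#   run.pl JOB=1:4 exp/log/decode.JOB.log ./decode_part.sh --part JOB
# expands to four parallel jobs writing their logs to exp/log/decode.1.log ... exp/log/decode.4.log.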
#
# run.pl, queue.pl, slurm.pl, and ssh.pl have a unified interface that does not depend on the backend.
# These options are mapped to backend-specific options, and the mapping is configured by
# "conf/queue.conf" and "conf/slurm.conf" by default.
# If jobs fail, your configuration might be wrong for your environment.
#
#
# The official documentation for run.pl, queue.pl, slurm.pl, and ssh.pl:
#  "Parallelization in Kaldi": http://kaldi-asr.org/doc/queue.html
# =========================================================


# Select the backend used by run.sh from "local", "sge", "slurm", or "ssh"
cmd_backend='local'

# Local machine, without any Job scheduling system
if [ "${cmd_backend}" = local ]; then

# Used for general jobs (anything other than the specific cases below)
export train_cmd="run.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="run.pl"
# Used for "*_recog.py"
export decode_cmd="run.pl"

# "qsub" (SGE, Torque, PBS, etc.)
elif [ "${cmd_backend}" = sge ]; then
# The default setting is written in conf/queue.conf.
# You must change "-q g.q" to a "queue" name appropriate for your environment.
# To list the "queue" names, type "qhost -q".
# Note that to use "--gpu *", you have to set up "complex_value" for the system scheduler.

export train_cmd="queue.pl"
export cuda_cmd="queue.pl"
export decode_cmd="queue.pl"

# "sbatch" (Slurm)
elif [ "${cmd_backend}" = slurm ]; then
# The default setting is written in conf/slurm.conf.
# You must change "-p cpu" and "-p gpu" to "partition" names appropriate for your environment.
# To list the "partition" names, type "sinfo".
# You can use "--gpu *" by default for slurm, and it is interpreted as "--gres gpu:*".
# The devices are allocated exclusively using "${CUDA_VISIBLE_DEVICES}".

export train_cmd="slurm.pl"
export cuda_cmd="slurm.pl"
export decode_cmd="slurm.pl"

elif [ "${cmd_backend}" = ssh ]; then
# You have to create ".queue/machines" to specify the host to execute jobs.
# e.g. .queue/machines
# host1
# host2
# host3
# This assumes you can log in to them without a password, i.e., you have to set up ssh keys.

export train_cmd="ssh.pl"
export cuda_cmd="ssh.pl"
export decode_cmd="ssh.pl"

# This is an example of specifying several unique options in the JHU CLSP cluster setup.
# Users can modify/add their own command options according to their cluster environments.
elif [ "${cmd_backend}" = jhu ]; then

export train_cmd="queue.pl --mem 2G"
export cuda_cmd="queue-freegpu.pl --mem 2G --gpu 1 --config conf/gpu.conf"
export decode_cmd="queue.pl --mem 4G"

else
echo "$0: Error: Unknown cmd_backend=${cmd_backend}" 1>&2
return 1
fi
1 change: 1 addition & 0 deletions egs/clotho/aac_word/conf/decode.yaml
4 changes: 4 additions & 0 deletions egs/clotho/aac_word/conf/fbank.conf
@@ -0,0 +1,4 @@
--sample-frequency=16000
--num-mel-bins=80
--frame-length=64
--frame-shift=32
10 changes: 10 additions & 0 deletions egs/clotho/aac_word/conf/gpu.conf
@@ -0,0 +1,10 @@
# Default configuration
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option mem=* -l mem_free=$0,ram_free=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1 # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
default gpu=0
option gpu=0
option gpu=* -l 'hostname=b1[12345678]*|c*,gpu=$0' -q g.q
8 changes: 8 additions & 0 deletions egs/clotho/aac_word/conf/lm.yaml
@@ -0,0 +1,8 @@
layer: 2
unit: 650
opt: sgd # or adam
sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batchsize: 64 # batch size in LM training
epoch: 25 # if the data size is large, we can reduce this
patience: 3
maxlen: 100 # if sentence length > lm_maxlen, lm_batchsize is automatically reduced
2 changes: 2 additions & 0 deletions egs/clotho/aac_word/conf/no_preprocess.yaml
@@ -0,0 +1,2 @@
process:
- type: "identity"
3 changes: 3 additions & 0 deletions egs/clotho/aac_word/conf/pitch.conf
@@ -0,0 +1,3 @@
--sample-frequency=16000
--frame-length=64
--frame-shift=32
10 changes: 10 additions & 0 deletions egs/clotho/aac_word/conf/queue.conf
@@ -0,0 +1,10 @@
# Default configuration
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option mem=* -l mem_free=$0,ram_free=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1 # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
default gpu=0
option gpu=0
option gpu=* -l gpu=$0 -q g.q
14 changes: 14 additions & 0 deletions egs/clotho/aac_word/conf/slurm.conf
@@ -0,0 +1,14 @@
# Default configuration
command sbatch --export=PATH
option name=* --job-name $0
option time=* --time $0
option mem=* --mem-per-cpu $0
option mem=0
option num_threads=* --cpus-per-task $0
option num_threads=1 --cpus-per-task 1
option num_nodes=* --nodes $0
default gpu=0
option gpu=0 -p cpu
option gpu=* -p gpu --gres=gpu:$0 -c $0 # Recommend allocating at least as many CPUs as GPUs
# note: the --max-jobs-run option is supported as a special case
# by slurm.pl and you don't have to handle it in the config file.
16 changes: 16 additions & 0 deletions egs/clotho/aac_word/conf/specaug.yaml
@@ -0,0 +1,16 @@
process:
# these three processes are a.k.a. SpecAugment
- type: "time_warp"
max_time_warp: 5
inplace: true
mode: "PIL"
- type: "freq_mask"
F: 30
n_mask: 2
inplace: true
replace_with_zero: false
- type: "time_mask"
T: 40
n_mask: 2
inplace: true
replace_with_zero: false
1 change: 1 addition & 0 deletions egs/clotho/aac_word/conf/train.yaml
7 changes: 7 additions & 0 deletions egs/clotho/aac_word/conf/tuning/decode.yaml
@@ -0,0 +1,7 @@
batchsize: 0
beam-size: 10
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc-weight: 0.5
lm-weight: 0.3
6 changes: 6 additions & 0 deletions egs/clotho/aac_word/conf/tuning/decode_rnn.yaml
@@ -0,0 +1,6 @@
lm-weight: 0.3
beam-size: 20
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc-weight: 0.6
8 changes: 8 additions & 0 deletions egs/clotho/aac_word/conf/tuning/decode_transfromer.yaml
@@ -0,0 +1,8 @@
batchsize: 0
beam-size: 10
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc-weight: 0.0
# ctc-weight: 0.5
lm-weight: 0.2
49 changes: 49 additions & 0 deletions egs/clotho/aac_word/conf/tuning/train_conformer.yaml
@@ -0,0 +1,49 @@
# network architecture
# encoder related
elayers: 12
eunits: 2048
# decoder related
dlayers: 6
dunits: 2048
# attention related
adim: 256
aheads: 4

# hybrid CTC/attention
mtlalpha: 0
# mtlalpha: 0.3

# label smoothing
lsm-weight: 0.1

# minibatch related
batch-size: 64
maxlen-in: 512 # if input length > maxlen-in, batchsize is automatically reduced
maxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced

# optimization related
sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
opt: noam
accum-grad: 1
grad-clip: 5
patience: 0
epochs: 60
dropout-rate: 0.1

# transformer specific setting
backend: pytorch
model-module: "espnet.nets.pytorch_backend.e2e_asr_conformer:E2E"
transformer-input-layer: conv2d # encoder architecture type
transformer-lr: 0.5
transformer-warmup-steps: 25000
transformer-attn-dropout-rate: 0.0
transformer-length-normalized-loss: false
transformer-init: pytorch

# conformer specific setting
transformer-encoder-pos-enc-layer-type: rel_pos
transformer-encoder-selfattn-layer-type: rel_selfattn
transformer-encoder-activation-type: swish
macaron-style: true
use-cnn-module: true
cnn-module-kernel: 15
49 changes: 49 additions & 0 deletions egs/clotho/aac_word/conf/tuning/train_conformer_large.yaml
@@ -0,0 +1,49 @@
# network architecture
# encoder related
elayers: 17
eunits: 1024
# decoder related
dlayers: 4
dunits: 1024
# attention related
adim: 512
aheads: 4

# hybrid CTC/attention
mtlalpha: 0
# mtlalpha: 0.3

# label smoothing
lsm-weight: 0.1

# minibatch related
batch-size: 32
maxlen-in: 512 # if input length > maxlen-in, batchsize is automatically reduced
maxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced

# optimization related
sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
opt: noam
accum-grad: 2
grad-clip: 5
patience: 0
epochs: 50
dropout-rate: 0.1

# transformer specific setting
backend: pytorch
model-module: "espnet.nets.pytorch_backend.e2e_asr_conformer:E2E"
transformer-input-layer: conv2d # encoder architecture type
transformer-lr: 1.0
transformer-warmup-steps: 25000
transformer-attn-dropout-rate: 0.0
transformer-length-normalized-loss: false
transformer-init: pytorch

# conformer specific setting
transformer-encoder-pos-enc-layer-type: rel_pos
transformer-encoder-selfattn-layer-type: rel_selfattn
transformer-encoder-activation-type: swish
macaron-style: true
use-cnn-module: true
cnn-module-kernel: 15