This is an improved version of Min Sang Kim's implementation of QANet that integrates deep contextualized word embeddings (ELMo). Check out his blog here and the original GitHub repository here!
Kim's best model reaches EM/F1 = 70.8/80.1 in 60k steps (6~8 hours) on an NVIDIA P100. With ELMo, this model reaches EM/F1 = 75.3/83.5 (without hyperparameter tuning) in about 12 hours of training on an NVIDIA V100. Detailed results are listed below.
The dataset used for this task is the Stanford Question Answering Dataset (SQuAD). Pretrained GloVe embeddings, obtained from Common Crawl (840B tokens), are used for words. Pretrained ELMo embeddings were trained on 5.5B tokens consisting of Wikipedia (1.9B) and all of the monolingual news crawl data from WMT 2008-2012 (3.6B).
- Python>=2.7
- NumPy
- tqdm
- TensorFlow>=1.5
- spacy==2.0.9
- bottle (only for demo)
To download and preprocess the data, run
# download SQuAD and GloVe
sh download.sh
# preprocess the data
python config.py --mode prepro
As in R-Net by HKUST-KnowComp, hyperparameters are stored in config.py. To debug/train/test/demo, run
python config.py --mode debug/train/test/demo
To evaluate the model with the official code, run
python evaluate-v1.1.py ~/data/squad/dev-v1.1.json train/{model_name}/answer/answer.json
The default directory for the TensorBoard log files is train/{model_name}/event
To build the Docker image (requires nvidia-docker), run
nvidia-docker build -t tensorflow/qanet .
Set volume mount paths and port mappings (for demo mode)
export QANETPATH={/path/to/cloned/QANet}
export CONTAINERWORKDIR=/home/QANet
export HOSTPORT=8080
export CONTAINERPORT=8080
Bash into the container
nvidia-docker run -v $QANETPATH:$CONTAINERWORKDIR -p $HOSTPORT:$CONTAINERPORT -it --rm tensorflow/qanet bash
Once inside the container, follow the commands provided above, starting with downloading the SQuAD and GloVe datasets.
- The model adopts a character-level convolution, max pooling, and highway network for input representations, similar to this paper by Yoon Kim (see the character-embedding sketch after this list).
- The encoder consists of a positional encoding, depthwise separable convolutions, self-attention, and a feed-forward layer, with layer normalization in between.
- Although the original paper uses a character dimension of 200, we observe that a smaller character dimension leads to better generalization.
- For regularization, a dropout of 0.1 is applied every 2 sub-layers and 2 blocks.
- Stochastic depth dropout is used to drop residual connections with probability increasing with network depth, since this model relies heavily on residual connections (see the sketch after this list).
- Query-to-context attention is used along with context-to-query attention, which seems to improve performance beyond what the paper reported. This may be due to the lack of diversity in the single-head self-attention (as opposed to 8 heads), whose repetitive information makes the complementary query-to-context attention more valuable.
- The learning rate increases from 0.0 to 0.001 on an inverse exponential (logarithmic) scale over the first 1000 steps, and is fixed at 0.001 thereafter (see the schedule sketch after this list).
- At inference time, the model uses shadow variables maintained by an exponential moving average of all global variables (see the EMA sketch after this list).
- This model uses a training / testing / preprocessing pipeline from R-Net for improved efficiency.
- Deep contextualized word representations are computed at runtime from character-level inputs, and are concatenated to existing char- and word-level embeddings.
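The character-level input representation mentioned above can be sketched as follows. This is a minimal illustration under assumed shapes and hyperparameters (the function name, `char_dim=64`, `num_filters=96`, and `kernel_size=5` are illustrative, not the repository's exact code):

```python
import tensorflow as tf

def char_embedding(char_ids, char_vocab_size, char_dim=64,
                   num_filters=96, kernel_size=5):
    """Hypothetical sketch: char convolution -> max pooling -> highway.

    char_ids: int32 tensor of shape [batch, seq_len, word_len].
    Returns a [batch, seq_len, num_filters] representation per word.
    """
    emb = tf.get_variable("char_emb", [char_vocab_size, char_dim])
    x = tf.nn.embedding_lookup(emb, char_ids)   # [B, S, W, char_dim]
    # Convolve along the character axis of each word, then max-pool
    # over word length to get a fixed-size vector per word.
    x = tf.layers.conv2d(x, num_filters, (1, kernel_size),
                         padding="same", activation=tf.nn.relu)
    x = tf.reduce_max(x, axis=2)                # [B, S, num_filters]
    # Highway layer: gated mix of a nonlinear transform and the identity.
    gate = tf.layers.dense(x, num_filters, activation=tf.sigmoid)
    transform = tf.layers.dense(x, num_filters, activation=tf.nn.relu)
    return gate * transform + (1.0 - gate) * x
```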
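Stochastic depth dropout over residual connections can be sketched like this. The helper name, `sublayer_fn`, and the linear decay of the drop probability with depth are assumptions for illustration:

```python
import tensorflow as tf

def layer_dropout_residual(inputs, sublayer_fn, layer_idx, total_layers,
                           final_drop_prob=0.1, is_training=True):
    """Hypothetical helper: residual connection with stochastic depth.

    The probability of dropping a sub-layer grows linearly with its
    depth, so deeper sub-layers are skipped more often during training.
    """
    drop_prob = final_drop_prob * float(layer_idx) / float(total_layers)
    outputs = sublayer_fn(inputs)
    if is_training:
        # Bernoulli gate: with probability drop_prob, skip the sub-layer
        # entirely and pass the residual input through unchanged.
        keep = tf.random_uniform([], minval=0.0, maxval=1.0) >= drop_prob
        return tf.cond(keep,
                       lambda: inputs + outputs,
                       lambda: inputs)
    # At test time every sub-layer is kept.
    return inputs + outputs
```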
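The warmup schedule fits in a few lines. This sketch uses the constants from the description above (peak 0.001, 1000 warmup steps); the logarithmic ramp matches the "inverse exponential" description, though the repository's exact expression may differ:

```python
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
step = tf.cast(global_step, tf.float32) + 1.0

base_lr = 0.001        # peak learning rate (from the description above)
warmup_steps = 1000.0  # warmup duration (from the description above)

# Logarithmic ramp from ~0 to base_lr over the first 1000 steps,
# clipped to base_lr afterwards.
learning_rate = tf.minimum(base_lr,
                           base_lr / tf.log(warmup_steps) * tf.log(step))
```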
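And the exponential-moving-average shadow variables used at inference can be maintained as below. A minimal sketch assuming TensorFlow 1.x; the helper name and decay value are illustrative, and trainable variables are shown for simplicity where the note above says all global variables:

```python
import tensorflow as tf

def train_op_with_ema(loss, learning_rate, decay=0.9999):
    """Hypothetical helper: one optimizer step, then update EMA shadows."""
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.AdamOptimizer(learning_rate).minimize(
        loss, global_step=global_step)
    ema = tf.train.ExponentialMovingAverage(decay)
    with tf.control_dependencies([train_op]):
        # Update the shadow copy of each variable after every step.
        ema_op = ema.apply(tf.trainable_variables())
    return ema_op, ema

# At inference, restore the shadow values in place of the raw weights:
#   saver = tf.train.Saver(ema.variables_to_restore())
#   saver.restore(sess, checkpoint_path)
```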
Here are the collected results from this repository and the original paper.
Model | Training Steps | Hidden Size | Attention Heads | Data Size | EM | F1 |
---|---|---|---|---|---|---|
Kim's model | 35,000 | 96 | 1 | 87k (no aug) | 69.0 | 78.6 |
Kim's model | 60,000 | 96 | 1 | 87k (no aug) | 70.4 | 79.6 |
Kim's model (reported by @jasonbw) | 60,000 | 128 | 1 | 87k (no aug) | 70.7 | 79.8 |
Kim's model (reported by @chesterkuo) | 60,000 | 128 | 8 | 87k (no aug) | 70.8 | 80.1 |
My model | 45,000 | 96 | 1 | 87k (no aug) | 73.5 | 83.5 |
Original Paper | 35,000 | 128 | 8 | 87k (no aug) | NA | 77.0 |
Original Paper | 150,000 | 128 | 8 | 87k (no aug) | 73.6 | 82.7 |
Original Paper | 340,000 | 128 | 8 | 240k (aug) | 75.1 | 83.8 |
Run TensorBoard for visualisation.
$ tensorboard --logdir=./