Training
Lando Löper edited this page Aug 14, 2020 · 4 revisions
To train a new model from scratch, first download and preprocess the training data, then run the training script.
Please follow these steps to download and preprocess the py150 dataset.
- Download and unarchive the parsed AST paths
wget http://files.srl.inf.ethz.ch/data/py150.tar.gz
tar -xzvf py150.tar.gz
- Clone the code2seq repository
git clone https://github.com/Kolkir/code2seq.git
cd code2seq/Python150kExtractor
- Extract the data
python extract.py --data_dir=<PATH_TO_PY150_FOLDER> --output_dir=<PATH_TO_EXTRACTED_FOLDER> --seed=239
- Preprocess the data for training
sh preprocess.sh <PATH_TO_EXTRACTED_FOLDER>
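Before moving on, it may be worth confirming that the extraction and preprocessing steps actually produced output, since the extracted folder is later mounted into the Docker container. The following is a minimal sketch only: `require_nonempty_dir` is a hypothetical helper name, not part of code2seq or this repository, and it checks nothing about the contents beyond the folder being non-empty.

```shell
#!/bin/sh
# Hypothetical helper (not part of code2seq): succeed only if the given path
# is an existing directory that contains at least one entry.
require_nonempty_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    # ls -A lists all entries except "." and "..", so empty output means empty dir.
    if [ -z "$(ls -A "$dir")" ]; then
        echo "directory is empty: $dir" >&2
        return 1
    fi
    return 0
}
```

Calling `require_nonempty_dir <PATH_TO_EXTRACTED_FOLDER>` after `preprocess.sh` finishes returns a non-zero status and prints a message to stderr if the folder is missing or empty.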
Once you have downloaded and preprocessed the dataset, go back to this repository.
- Build and run the docker image in a container
docker build -t code-embeddings .
docker run --gpus all --rm -it -v <PATH_TO_EXTRACTED_FOLDER>:/tmp/py150 -p 6006:6006 code-embeddings /bin/bash
- Run the training script
python ./src/train.py \
--dict <PATH_TO_EXTRACTED_DICT> \
--train <PATH_TO_EXTRACTED_TRAIN> \
--test <PATH_TO_EXTRACTED_TEST>
- (Optional) Run tensorboard for better analysis of the training run
tensorboard --logdir ./logs
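A mistyped path in the training command typically only surfaces once the script tries to open the file, so a quick check of the three inputs beforehand can save a failed start. The sketch below is an illustration under assumptions: `check_training_inputs` is a hypothetical helper, and `DICT_PATH`, `TRAIN_PATH`, and `TEST_PATH` are made-up variable names standing in for the `<PATH_TO_EXTRACTED_...>` placeholders. The actual file names depend on what the preprocessing step produced.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): report every path that is
# not a regular file, then fail if at least one was missing.
check_training_inputs() {
    missing=0
    for path in "$@"; do
        if [ ! -f "$path" ]; then
            echo "missing input file: $path" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Usage sketch; the three variables are assumptions and must be set by the reader:
# check_training_inputs "$DICT_PATH" "$TRAIN_PATH" "$TEST_PATH" && \
#     python ./src/train.py --dict "$DICT_PATH" --train "$TRAIN_PATH" --test "$TEST_PATH"
```

Separately, if TensorBoard is started inside the container and its page is not reachable from the host through the mapped port 6006, adding the `--bind_all` flag may help, since recent TensorBoard versions listen only on localhost by default.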