This is the official PyTorch implementation for the following ICSE 2025 NIER paper:
Title: UniGenCoder: Merging SEQ2SEQ and SEQ2TREE Paradigms for Unified Code Generation
Our implementation is built on the source code of CodeT5 and Tranx; we thank the authors for their work.
We recommend referring to envs.yaml (conda export) or envs.txt (pip export) for more detailed environment information.
| --task | --sub_task | Description |
| --- | --- | --- |
| code generation | nl-java | text-to-code generation on Concode data |
| code translation | cs-java | code-to-code translation from C# to Java |
- Before starting, please place the data in the correct location. The expected data directory structure is:

  ```
  data
  └── concode
      ├── train.json
      ├── dev.json
      └── test.json
  ```
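  To sanity-check the layout before training, here is a minimal sketch (the data/concode path and split file names come from the structure above; the script itself is illustrative):

  ```python
  from pathlib import Path

  # Expected Concode layout from the directory structure above.
  data_dir = Path("data/concode")
  for split in ("train.json", "dev.json", "test.json"):
      path = data_dir / split
      assert path.is_file(), f"missing {path}; place the Concode split here"
  print("Concode data layout looks correct.")
  ```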
- Generate the grammar file:

  ```bash
  cd UniGenCoder
  python build_ast.py
  ```
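  For intuition, the seq2tree paradigm generates a program's AST rather than a flat token sequence. Below is a minimal illustration using Python's built-in ast module as a stand-in (the repo's grammar is produced by build_ast.py and targets Java, not Python):

  ```python
  import ast

  # Seq2tree models generate a structured AST instead of a token sequence.
  code = "def add(a, b):\n    return a + b"
  tree = ast.parse(code)
  print(ast.dump(tree, indent=2))  # the tree-structured target
  print(ast.unparse(tree))         # an AST maps deterministically back to code
  ```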
- Prepare the seq2seq teacher and the seq2tree teacher, and obtain the corresponding best models via checkpoint averaging:

  ```bash
  cd CodeT5
  bash sh/fine_tune_concode.sh
  bash sh/average_model.sh
  bash sh/fine_tune_concode_tree.sh
  bash sh/average_model.sh
  ```
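  For reference, checkpoint averaging uniformly averages the parameters of several saved checkpoints into a single model. Here is a minimal PyTorch sketch of the general technique (the file names are illustrative; sh/average_model.sh defines the actual inputs and outputs):

  ```python
  import torch

  # Uniformly average the parameters of several checkpoints.
  # Paths are illustrative, not the script's actual interface.
  paths = ["checkpoint-1.bin", "checkpoint-2.bin", "checkpoint-3.bin"]
  states = [torch.load(p, map_location="cpu") for p in paths]

  averaged = {}
  for key in states[0]:
      # Stack the same tensor from every checkpoint and take the mean.
      averaged[key] = torch.stack([s[key].float() for s in states]).mean(dim=0)

  torch.save(averaged, "checkpoint-averaged.bin")
  ```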
- UniGenCoder backbone training:

  ```bash
  bash multitask_sh/multitask_distill_cross.sh
  bash sh/average_model.sh
  ```
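  The script name suggests the backbone is trained with knowledge distillation from the two teachers. Below is a minimal sketch of a standard distillation loss (temperature-scaled KL to the teacher's soft targets plus cross-entropy on the labels); it illustrates the general technique, not the repo's exact objective:

  ```python
  import torch
  import torch.nn.functional as F

  def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
      """Standard KD: cross-entropy on labels plus KL to the teacher's soft targets."""
      ce = F.cross_entropy(student_logits, labels)
      kd = F.kl_div(
          F.log_softmax(student_logits / T, dim=-1),
          F.softmax(teacher_logits / T, dim=-1),
          reduction="batchmean",
      ) * (T * T)
      return alpha * ce + (1.0 - alpha) * kd

  # Toy usage with random logits over a 10-symbol vocabulary.
  student = torch.randn(4, 10, requires_grad=True)
  teacher = torch.randn(4, 10)
  labels = torch.randint(0, 10, (4,))
  print(distill_loss(student, teacher, labels))
  ```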
- Prepare data for the selector:

  ```bash
  # Modify test_split_tag in the script to prepare data for a different split.
  bash sh/multitask_distill_cross_inference.sh
  ```
- UniGenCoder selector training and inference:

  ```bash
  bash multitask_sh/multitask_distill_cross_tune.sh
  ```
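  Per the paper, the selector decides for each input whether the seq2seq or the seq2tree path should generate the output. Here is a minimal sketch of such a binary selector head over pooled encoder features (the class name and dimensions are illustrative, not the repo's implementation):

  ```python
  import torch
  import torch.nn as nn

  class ParadigmSelector(nn.Module):
      """Toy binary head: 0 -> seq2seq decoding, 1 -> seq2tree decoding."""
      def __init__(self, hidden_size=768):
          super().__init__()
          self.classifier = nn.Linear(hidden_size, 2)

      def forward(self, encoder_states, attention_mask):
          # Mean-pool encoder states over non-padding positions.
          mask = attention_mask.unsqueeze(-1).float()
          pooled = (encoder_states * mask).sum(1) / mask.sum(1).clamp(min=1.0)
          return self.classifier(pooled)

  # Toy usage: route a batch of 4 inputs to one of the two paradigms.
  selector = ParadigmSelector()
  states = torch.randn(4, 16, 768)
  mask = torch.ones(4, 16, dtype=torch.long)
  choice = selector(states, mask).argmax(dim=-1)  # 0 = seq2seq, 1 = seq2tree
  print(choice)
  ```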
If you find this code useful for your research, please consider citing:
```bibtex
@article{DBLP:journals/corr/abs-2502-12490,
  author       = {Liangying Shao and
                  Yanfu Yan and
                  Denys Poshyvanyk and
                  Jinsong Su},
  title        = {UniGenCoder: Merging Seq2Seq and Seq2Tree Paradigms for Unified Code
                  Generation},
  journal      = {CoRR},
  volume       = {abs/2502.12490},
  year         = {2025},
  url          = {https://doi.org/10.48550/arXiv.2502.12490},
  doi          = {10.48550/ARXIV.2502.12490},
  eprinttype   = {arXiv},
  eprint       = {2502.12490},
  timestamp    = {Wed, 19 Mar 2025 11:49:46 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2502-12490.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
```