Word Segmentation and Morphological Parsing for Sanskrit
Requires Python >= 3.8 and < 3.11 (upper bound imposed by numpy).
This project uses Poetry for dependency management. Follow the official Poetry instructions to install it, then install the project dependencies:
cd TueSan
poetry install
To activate the virtual environment, run
poetry shell
./sanskrit/
├── graphml_dev                # auxiliary graphml data
│   ├── ddd.graphml
│   └── ...
├── final_graphml_train
│   ├── ddd.graphml
│   └── ...
├── conllu                     # DCS, from https://github.com/OliverHellwig/sanskrit/tree/master/dcs/data/conllu
│   ├── lookup
│   │   ├── dictionary.csv
│   │   ├── pos.csv
│   │   └── word-senses.csv
│   └── files
│       ├── <subfolders>
│       │   ├── xxx.conllu
│       │   └── ...
│       ├── xxx.conllu
│       └── ...
├── dcs_filtered.json          # DCS for Task 1; sentences with incomplete annotations are filtered out
├── dcs_processed.pickle       # DCS for Task 1, with 'sandhied_merged', 'labels', etc.
├── wsmp_train.json            # primary data
└── wsmp_dev.json
Data can be accessed on the server at /data/jingwen/sanskrit/.
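For orientation, here is a minimal sketch of loading these files. It assumes the JSON and pickle files follow standard layouts, that the .graphml files are readable with networkx, and that the data directory is the local ./sanskrit/ tree shown above; the key names 'sandhied_merged' and 'labels' come from the comments in the tree, everything else is illustrative.

```python
import json
import pickle
from pathlib import Path

import networkx as nx  # assumed here for reading the .graphml files

# Local data root as laid out above; on the server use /data/jingwen/sanskrit/.
DATA_DIR = Path("./sanskrit")

# Primary task data (JSON).
with open(DATA_DIR / "wsmp_train.json", encoding="utf-8") as f:
    train_data = json.load(f)

# Preprocessed DCS data for Task 1; entries are expected to carry keys such as
# 'sandhied_merged' and 'labels' (exact schema not shown here).
with open(DATA_DIR / "dcs_processed.pickle", "rb") as f:
    dcs_processed = pickle.load(f)

# Auxiliary segmentation graphs: one .graphml file per sentence.
graph_path = next((DATA_DIR / "graphml_dev").glob("*.graphml"))
graph = nx.read_graphml(graph_path)

print(len(train_data), len(dcs_processed), graph.number_of_nodes())
```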
- TODO: hyperparameter tuning, prioritizing T3 over T1; possibly with ray.tune
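If ray.tune is adopted, a search could look like the following minimal sketch using the classic `tune.run` function API. The training function, search space, and metric names below are placeholders for illustration, not part of this repository.

```python
from ray import tune


def train_model(config):
    """Placeholder trainable: replace the dummy loss with the real training/eval loop."""
    for epoch in range(10):
        # Dummy loss so the script runs standalone.
        loss = 1.0 / (config["hidden_dim"] * config["lr"] * (epoch + 1))
        tune.report(loss=loss, epoch=epoch)


analysis = tune.run(
    train_model,
    config={
        "lr": tune.loguniform(1e-4, 1e-2),        # hypothetical search space
        "hidden_dim": tune.choice([128, 256, 512]),
    },
    num_samples=20,
    metric="loss",
    mode="min",
)

print("Best config:", analysis.best_config)
```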