Wrapper of TreeTaggerWrapper is a wrapper to annotate XML files with TreeTagger. The aims of this wrapper are:
- to get always well-formed XML output
- to process files in batch mode
- to load TreeTagger parameters only once to reduce downtime
- to customize TreeTagger behaviour as needed
The wrapper comes with a preprocessing script (pre_treetagger.py
) providing character normalization. And a postprocessing script (post_treetagger.py
) fixing some issues.
This script bundle relies on a modified version of Laurent Pointal's treetaggerwrapper.
├── LICENSE: GPL-3.0
├── README.md: this file
├── post_treetagger.py: script to fix annotation
├── pre_treetagger.py: script to normalize characters
├── requirements.txt: Python dependencies
├── spanish-abbreviations: one token per line, true case and punctuation (TreeTagger expected format)
└── treetagger.py: the wrapper of the wrapper
In order to use the WoTTW
you will need to:
- clone this repository
- install python dependencies
pip install -r requirements.txt
- install TreeTagger
Run python treetagger.py -h
:
usage: treetagger.py [-h] -i INPUT -o OUTPUT -l {en,es,de} [-e ELEMENT]
[-p PATTERN] [-s] [--tokenize] [-a ABBREVIATION]
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
path to the input directory.
-o OUTPUT, --output OUTPUT
path to the output directory.
-l {en,es,de}, --language {en,es,de}
language of the version to be processed.
-e ELEMENT, --element ELEMENT
XML element containing the text to be split in
sentences.
-p PATTERN, --pattern PATTERN
glob pattern to filter files.
-s, --sentence if provided, it splits text in sentences.
--tokenize if provided, it tokenizes the text, else, it expects
one token per line.
-a ABBREVIATION, --abbreviation ABBREVIATION
path to the abbreviation file, if not provided uses
default TreeTagger's abbreviation file.
python treetagger.py -i input/directory/ -o output/directory/ -l es -e s --tokenize -p "*.xml" -a spanish-abbreviations
# preprocess files
python pre_treetagger.py -i input/directory/ -o output/directory/ -t s -g "*.xml"
# tag files with TTWoTTW
python treetagger.py -i input/directory/ -o output/directory/ -l es -e s --tokenize -a spanish-abbreviations -p "*.xml"
# postprocess files
python post_treetagger.py -i input/directory/ -o output/directory/ -l es -g "*.vrt"