This is a collection of helper classes and functions for training OpenNLP models.
The following properties are supported for creating TrainingParameters
objects:
Parameter | Type | Values | Default |
---|---|---|---|
training.algorithm |
enumeration |
maxent , perceptron |
maxent |
training.iterations |
integer |
100 | |
training.cutoff |
integer |
5 | |
training.threads |
integer |
4 |
The following properties are supported for creating SentenceDetectorFactory
objects:
Parameter | Type | Values | Default |
---|---|---|---|
language |
string |
ISO 639 | en |
abbreviations.dictionary |
path |
||
use.token.end |
boolean |
true , false |
false |
eos.characters |
string |
The following properties are supported for creating TokenizerFactory
objects:
Parameter | Type | Values | Default |
---|---|---|---|
language |
string |
ISO 639 | en |
abbreviations.dictionary |
path |
||
alphanumeric.optimization |
boolean |
true , false |
false |
alphanumeric.pattern |
regex |
^[A-Za-z0-9]+$ |
The following properties are supported for creating POSTaggerFactory
objects:
Parameter | Type | Values | Default |
---|---|---|---|
pos.dictionary |
path |
The applications have the following command line arguments:
TRAINING_FILE OUTPUT_MODEL_FILE PROPERTIES_FILE
TRAINING_FILE OUTPUT_MODEL_FILE
The TRAINING_FILE
has one of the following supported extensions:
conllx
for CoNLL-X treebank files;conllu
for CoNLL-U treebank files.
The OUTPUT_MODEL_FILE
is the name of the file to save the created model to,
e.g. en-tokens.bin
. Any missing directories will be created.
The PROPERTIES_FILE
is a Java properties file (property=value
lines). Each
application supports different sets of properies.
The following properties are supported by the
opennlp.ext.train.app.SentenceModelTrainerAppKt
application to train a
SentenceModel
:
Parameter | Type | Values | Default |
---|---|---|---|
training.algorithm |
enumeration |
maxent , perceptron |
maxent |
training.iterations |
integer |
100 | |
training.cutoff |
integer |
5 | |
training.threads |
integer |
4 | |
language |
string |
ISO 639 | en |
abbreviations.dictionary |
path |
||
use.token.end |
boolean |
true , false |
false |
eos.characters |
string |
||
sentences.per.sample |
integer |
5 | |
upos.tagset |
enumeration |
upos , upenn |
upos |
xpos.tagset |
enumeration |
upos , upenn |
upenn |
The following properties are supported by the
opennlp.ext.train.app.TokeizerModelTrainerAppKt
application to train a
TokenizerModel
:
Parameter | Type | Values | Default |
---|---|---|---|
training.algorithm |
enumeration |
maxent , perceptron |
maxent |
training.iterations |
integer |
100 | |
training.cutoff |
integer |
5 | |
training.threads |
integer |
4 | |
language |
string |
ISO 639 | en |
abbreviations.dictionary |
path |
||
alphanumeric.optimization |
boolean |
true , false |
false |
alphanumeric.pattern |
regex |
^[A-Za-z0-9]+$ |
|
multi.token.words |
enumeration |
split , join |
split |
upos.tagset |
enumeration |
upos , upenn |
upos |
xpos.tagset |
enumeration |
upos , upenn |
upenn |
The following properties are supported by the
opennlp.ext.train.app.POSModelTrainerAppKt
application to train a POSModel
:
Parameter | Type | Values | Default |
---|---|---|---|
training.algorithm |
enumeration |
maxent , perceptron |
maxent |
training.iterations |
integer |
100 | |
training.cutoff |
integer |
5 | |
training.threads |
integer |
4 | |
language |
string |
ISO 639 | en |
pos.dictionary |
path |
||
multi.token.words |
enumeration |
split , join |
split |
pos.tagset |
enumeration |
See pos.tagset. | UPOS |
upos.tagset |
enumeration |
upos , upenn |
upos |
xpos.tagset |
enumeration |
upos , upenn |
upenn |
The following properties are supported by the
opennlp.ext.train.app.LemmatizerModelTrainerAppKt
application to train a
LemmatizerModel
:
Parameter | Type | Values | Default |
---|---|---|---|
training.algorithm |
enumeration |
maxent , perceptron |
maxent |
training.iterations |
integer |
100 | |
training.cutoff |
integer |
5 | |
training.threads |
integer |
4 | |
language |
string |
ISO 639 | en |
multi.token.words |
enumeration |
split , join |
split |
pos.tagset |
enumeration |
See pos.tagset. | UPOS |
upos.tagset |
enumeration |
upos , upenn |
upos |
xpos.tagset |
enumeration |
upos , upenn |
upenn |
The pos.tagset
property can be one of the following enumeration values:
UPOS
, orCPOSTAG
-- The universal/coarse part of speech tag.XPOS
, orPOSTAG
-- The language-specific part of speech tag.UPOS-PTB
, orCPOSTAG-PTB
-- The universal part of speech tag followed by the PTB (Penn TreeBank) part of speech tag separated by a hyphen, e.g.DET-DT
.