|
NGramTaggerTrainer
generates a tag transition matrix file from part of speech training data
and a MorphAdorner word lexicon.
Usage:
ngramtaggertrainer trainingdata.tab wordlexicon.lex transitionmatrix.mat
where
- trainingdata.tab -- input training data file.
- wordlexicon.lex -- input MorphAdorner lexicon.
- transitionmatrix.mat -- output tag transition matrix file.
The training data file is a tab-separated UTF-8 file containing
the part of speech training data generated from the training texts.
Only the first two columns of the training data are used:
- The original token (spelling).
- The NUPOS part of speech.
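As a rough illustration of the two-column input described above, the following Python sketch reads the token and tag from each line and ignores any further columns. It is not MorphAdorner's own code. The function name `read_token_tag_pairs`, the sample rows, the third column, and the choice to skip lines with fewer than two columns are all assumptions made for the example.

```python
import io


def read_token_tag_pairs(stream):
    """Yield (token, tag) pairs from the first two tab-separated columns."""
    for line in stream:
        columns = line.rstrip("\r\n").split("\t")
        if len(columns) < 2:
            # Assumption: blank or short lines carry no training data.
            continue
        # Columns after the second are ignored.
        yield columns[0], columns[1]


# Hypothetical sample data; the third column is only there to show it is ignored.
sample = io.StringIO("The\tdt\tthe\nking\tn1\tking\n\nsleeps\tvvz\tsleep\n")
pairs = list(read_token_tag_pairs(sample))
```

For a real file, the stream would be opened with `open("trainingdata.tab", encoding="utf-8")`.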
The word lexicon file is in MorphAdorner lexicon format.
The output tag transition file is a UTF-8 file containing
the data needed by the MorphAdorner bigram and trigram taggers.
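To give a sense of the kind of data bigram and trigram taggers rely on, the sketch below counts tag bigrams and trigrams over a sequence of tags. It shows plain counting only. The layout of the transition matrix file and any smoothing or sentence-boundary handling MorphAdorner applies are not described here, so none of that is modeled. The function name `count_tag_ngrams` and the sample tags are hypothetical.

```python
from collections import Counter


def count_tag_ngrams(tags):
    """Count adjacent tag pairs and triples in a flat list of tags."""
    # Assumption: the tags form one continuous sequence with no
    # sentence-boundary markers.
    bigrams = Counter(zip(tags, tags[1:]))
    trigrams = Counter(zip(tags, tags[1:], tags[2:]))
    return bigrams, trigrams


# Hypothetical tag sequence for illustration.
tags = ["dt", "n1", "vvz", "dt", "n1"]
bigrams, trigrams = count_tag_ngrams(tags)
```

Dividing each count by the count of its leading tag (or leading tag pair) would give unsmoothed transition probability estimates.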
|