trainingdata.tab specifies the name of the file
containing the part of speech training data from which the word
lexicon and suffix lexicon are built.
The word lexicon contains
each spelling (and standard spellings if provided), the count for
each spelling, the parts of speech for each spelling, the counts
for each part of speech for each spelling, and the lemma for each
part of speech for each spelling (if provided). The suffix lexicon
contains a list of suffixes, their counts, and the parts of speech
associated with each suffix and the count of each part of speech.
Lemmata are stored as "*" in the suffix lexicon since there are no
lemmata for suffixes.
The training data resides in a UTF-8 text file. Each line contains
one spelling along with its part of speech tag and, optionally, its
lemma and standard spelling, separated by tabs, in the form:
spelling {tab} part-of-speech-tag {tab} lemma {tab} standard spelling
where "{tab}" specifies an ASCII tab character.
You must specify a spelling and a part of speech tag. The lemma
and standard spelling are optional. If you wish to specify a
standard spelling without specifying a lemma, enter the lemma as
"*".
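
The line format above can be sketched as a small Python parser. This is an illustration only, not the tool's own code: the names TrainingEntry and parse_training_line are hypothetical, the tags in the usage note are made up, and mapping a "*" or empty lemma field to None is an assumption about how a missing lemma might be represented.

```python
from typing import NamedTuple, Optional


class TrainingEntry(NamedTuple):
    """One line of training data (hypothetical container for this sketch)."""
    spelling: str
    pos_tag: str
    lemma: Optional[str]
    standard_spelling: Optional[str]


def parse_training_line(line: str) -> TrainingEntry:
    """Split one tab-separated training line into its two to four fields."""
    fields = line.rstrip("\r\n").split("\t")
    # The spelling and the part of speech tag are required.
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError("a spelling and a part of speech tag are required")
    # Assumption: "*" (or an empty field) means that no lemma was given.
    lemma = fields[2] if len(fields) > 2 and fields[2] not in ("", "*") else None
    # The standard spelling is the optional fourth field.
    standard = fields[3] if len(fields) > 3 and fields[3] else None
    return TrainingEntry(fields[0], fields[1], lemma, standard)
```

For example, parse_training_line("lov'd\tvvd\t*\tloved") would yield an entry with a standard spelling of "loved" and no lemma.
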
Blank lines are used to separate sentences. While the blank lines
are not needed for creating the lexicon, they are needed for creating
probability transition matrices and for part of speech tagging.
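
As a rough sketch of how blank lines delimit sentences, the Python generator below groups consecutive non-blank lines into one sentence each. The function name read_sentences is hypothetical, and treating whitespace-only lines as blank is an assumption rather than documented behavior.

```python
from typing import Iterable, Iterator, List


def read_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each sentence as the list of its non-blank training lines."""
    sentence: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == "":
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(line)
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence
```

With a file opened as open("trainingdata.tab", encoding="utf-8"), the file object can be passed directly as the lines argument.
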
The lexicon is built using both the spelling and the standard
spelling (when provided). The lemma is also stored when present.
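
The following Python sketch shows one way the word lexicon counts could be accumulated from parsed entries. It is not the tool's actual implementation: build_word_lexicon and the nested dictionary layout are hypothetical, and counting an entry only once when the spelling and standard spelling are identical is an assumption.

```python
from typing import Dict, Iterable, Optional, Tuple

# (spelling, part of speech tag, lemma or None, standard spelling or None)
Entry = Tuple[str, str, Optional[str], Optional[str]]


def build_word_lexicon(entries: Iterable[Entry]) -> Dict[str, dict]:
    """Accumulate per-spelling and per-tag counts, keeping lemmata when present."""
    lexicon: Dict[str, dict] = {}
    for spelling, pos_tag, lemma, standard in entries:
        # Index the entry under the spelling and, when provided, the
        # standard spelling. A set avoids double counting identical forms.
        forms = {spelling}
        if standard:
            forms.add(standard)
        for form in forms:
            record = lexicon.setdefault(form, {"count": 0, "tags": {}})
            record["count"] += 1
            tag_info = record["tags"].setdefault(pos_tag, {"count": 0, "lemma": None})
            tag_info["count"] += 1
            if lemma:
                tag_info["lemma"] = lemma
    return lexicon
```

Each lexicon record then holds the total count for a spelling, plus a count and an optional lemma for every part of speech tag seen with that spelling.
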