|
MorphAdorner requires training data for the part of speech
taggers. The training data consists of a UTF-8 file containing
tab-separated columns. Each input line describes a single token
(spelling, symbol, or punctuation mark) in the training text and
contains the following entries:
- The word ID. (Not needed, but helpful.)
- The original token (spelling).
- The NUPOS part of speech.
- The lemma.
- The standardized spelling.
For some purposes we generate a derived version of the
training data without the first column (the word ID).
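As a rough illustration of this layout, the following Python sketch reads five-column lines and produces the derived form without the word ID. It is not part of MorphAdorner: the function names, the field names, and the two sample lines (including the word IDs "w1" and "w2") are made up for the example.

```python
import io

# Column order as listed above; the field names themselves are hypothetical.
FIELDS = ("word_id", "token", "pos", "lemma", "standard")

def read_training_rows(stream):
    """Yield one dict per non-empty tab-separated training line."""
    for line_number, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(
                f"line {line_number}: expected {len(FIELDS)} columns, "
                f"got {len(columns)}"
            )
        yield dict(zip(FIELDS, columns))

def drop_word_id(rows):
    """Yield each row as a tab-separated line without the word ID."""
    for row in rows:
        yield "\t".join(row[name] for name in FIELDS[1:])

# Made-up sample data: one spelling and one punctuation mark.
SAMPLE = "w1\tloue\tn1\tlove\tlove\nw2\t.\t.\t.\t.\n"
rows = list(read_training_rows(io.StringIO(SAMPLE)))
derived = list(drop_word_id(rows))
```

For a real file, the stream would come from something like open(path, encoding="utf-8") instead of the in-memory sample.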
Creating training data
Normally we generate training data as follows.
- We adorn a suitable set of XML texts with MorphAdorner, using
  existing training data. The existing training data is chosen to
  be consonant in age with the new training texts.
- The MorphAdorned XML is converted to verticalized tabular form
  using the XMLToTab utility.
- We import the verticalized text into a database, spreadsheet,
  or column-aware editor to correct the initial tagging.
- We export the corrected verticalized text into a tabular format
  text file containing the five columns listed above.
- We run programs which check for various kinds of inconsistencies
  (obviously mismatched parts of speech and lemmata, etc.) and
  produce a corrected tabular file. Part of this process includes
  updating the MorphAdorner definitions of the NUPOS parts of
  speech when new ones appear in the training data.
- Rinse and repeat these steps until the training data is free of
  obvious errors.
Here are some of the checks we typically perform.
- Make sure each input line has entries for each of the fields
  listed above.
- Convert certain XML entity references to Unicode characters.
  For example, the left double quote specification "&ldquo;" is
  converted to Unicode "\u201C".
- Make sure the part of speech tag for each spelling appears in
  the list of known NUPOS tags. Unknown tags may be valid but not
  yet recognized by MorphAdorner.
- Make sure the number of part of speech tags and lemmata matches
  for each spelling.
- Compile a list of all words marked with the "zz" (unknown) part
  of speech tag for further review.
- Look for mismatches between punctuation as a token and the part
  of speech. A punctuation mark should have itself as its part of
  speech tag.
- Look for errors that have appeared in the past, such as apparent
  possessive words ending in "'s" that are marked as adjectives.
- Check that both the spelling and the standard spelling are
  capitalized for proper nouns. A few proper nouns are legitimately
  lower case, but they are rare.
- Look for "I" marked with the "z-sy" part of speech. Some of
  these are legitimate, but some have been erroneously marked in
  the past.
- Check a list of previously encountered errors and correct them
  if found. Example: if the part of speech tag is "vbzx" and the
  lemma is "it|be", change the part of speech tag to "pn31|vbzx".
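A few of these checks can be sketched in Python as shown here. This is only an illustration, not the checking programs themselves. It assumes that compound tags and lemmata are joined with "|" (as in the "it|be" example), that the set of known tags is supplied by the caller, and that ASCII punctuation is enough to recognize a punctuation token. The function names are hypothetical.

```python
import string

# Previously encountered errors: (tag, lemma) -> corrected tag.
# The single entry is the example given in the text.
KNOWN_FIXES = {("vbzx", "it|be"): "pn31|vbzx"}

def apply_known_fixes(pos, lemma):
    """Return the corrected tag if (pos, lemma) is a known error."""
    return KNOWN_FIXES.get((pos, lemma), pos)

def check_row(token, pos, lemma, known_tags):
    """Return a list of problem descriptions for one training line."""
    problems = []
    # A punctuation mark should have itself as its part of speech tag.
    # Assumption: only ASCII punctuation is recognized here.
    if token and all(ch in string.punctuation for ch in token):
        if pos != token:
            problems.append("punctuation not tagged as itself")
        return problems
    tags = pos.split("|")
    lemmata = lemma.split("|")
    if len(tags) != len(lemmata):
        problems.append("tag/lemma count mismatch")
    for tag in tags:
        if tag not in known_tags:
            problems.append(f"unknown tag: {tag}")
    if "zz" in tags:
        problems.append("unknown part of speech (zz)")
    return problems
```

Running apply_known_fixes before check_row lets known errors be repaired first, so that only the remaining problems are reported for manual review.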
Updating the lemmatizer
The training data provides lemmata for the spellings in the
training data. For spellings not in the training data, the
English lemmatizer is used. The English lemmatizer uses a list
of rules and a list of exceptions to lemmatize a spelling given
a major word class. New training data may indicate the need for
new rules or exception list entries.
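The interplay of exceptions and rules can be pictured with a small Python sketch. It is not the MorphAdorner lemmatizer: the suffix rules, the exception entry, and the word class names below are invented for illustration, and real rule sets are far larger.

```python
# Hypothetical suffix rules per major word class: (suffix, replacement),
# tried in order, first match wins.
RULES = {
    "noun": [("ies", "y"), ("s", "")],
    "verb": [("ed", ""), ("ing", "")],
}

# Hypothetical exception list: (word class, spelling) -> lemma.
EXCEPTIONS = {("noun", "mice"): "mouse"}

def lemmatize(spelling, word_class, rules=RULES, exceptions=EXCEPTIONS):
    """Lemmatize a spelling given its major word class.

    Exceptions are consulted first; otherwise the first matching
    suffix rule is applied; otherwise the spelling is returned
    lower-cased and unchanged.
    """
    word = spelling.lower()
    if (word_class, word) in exceptions:
        return exceptions[(word_class, word)]
    for suffix, replacement in rules.get(word_class, []):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```

In this picture, new training data that produces a wrong lemma from the rules is handled either by adding an exception entry or by adding or reordering a rule.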
Creating the lexicons
Once the training data is corrected, it is converted to the
format required by the MorphAdorner CreateLexicon utility.
CreateLexicon creates the word and suffix lexicons from the
training data. By convention the word lexicon file name takes
the form {corpusname}lexicon.lex and the associated suffix
lexicon takes the form {corpusname}suffixlexicon.lex.
Normally we want to merge the word lexicon produced from the
training data with other word lists such as common Latin and
French words, proper person and place names, and so on. These
auxiliary word lists will not have frequency information, just
part of speech information. For these auxiliary word lists we
use the Brill lexicon format, which contains the spelling
followed by a list of its possible parts of speech. The
MergeBrillLexicon utility merges a word list in Brill format with
a MorphAdorner lexicon. Brill lexicon entries are added with
occurrence frequencies of 1. The MergeWordLists utility is helpful
in merging Brill lexicons as well as other types of word lists.
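The effect of such a merge can be sketched in Python with in-memory dictionaries. This is not the MergeBrillLexicon implementation and does not read or write the MorphAdorner lexicon file format. It assumes that fields on a Brill line are separated by whitespace and that a tag already present in the lexicon keeps its existing count. The sample entries are illustrative only.

```python
def parse_brill_line(line):
    """Split a Brill-format line into (spelling, list of tags)."""
    spelling, *tags = line.split()
    return spelling, tags

def merge_brill(lexicon, brill_lines):
    """Merge Brill-format lines into lexicon and return it.

    lexicon maps spelling -> {tag: occurrence count}. New
    spelling/tag pairs get a count of 1; existing counts are kept
    (an assumption of this sketch).
    """
    for line in brill_lines:
        if not line.strip():
            continue
        spelling, tags = parse_brill_line(line)
        entry = lexicon.setdefault(spelling, {})
        for tag in tags:
            entry.setdefault(tag, 1)
    return lexicon

# Illustrative data: one word already in the lexicon, one new name.
lexicon = {"amor": {"fw-la": 3}}
merge_brill(lexicon, ["amor fw-la n1", "Roma np1"])
```

Because the auxiliary entries carry a count of 1, frequencies from the training data continue to dominate wherever both sources know a spelling.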
Generating probability transition matrices
The bigram and trigram part of speech taggers use a Hidden
Markov Model approach to tagging, which requires information
about the transition probabilities from one part of speech to
another. The NGramTaggerTrainer utility generates the frequency
entries required to compute the transition probabilities.
By convention, the ngram tagger transition matrix data file
names take the form {corpusname}transmat.mat.
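To make the idea of frequency entries concrete, the Python sketch below counts tag bigrams and trigrams over tagged sentences and derives an unsmoothed bigram transition probability. It does not reproduce NGramTaggerTrainer or the layout of the transition matrix file; the function names and the sample tag sequences are assumptions of the example.

```python
from collections import Counter

def count_tag_ngrams(tag_sequences):
    """Count tag bigrams and trigrams, one tag sequence per sentence."""
    bigrams, trigrams = Counter(), Counter()
    for tags in tag_sequences:
        bigrams.update(zip(tags, tags[1:]))
        trigrams.update(zip(tags, tags[1:], tags[2:]))
    return bigrams, trigrams

def bigram_probability(prev_tag, tag, bigrams):
    """Maximum-likelihood estimate of P(tag | prev_tag), no smoothing."""
    total = sum(n for (first, _), n in bigrams.items() if first == prev_tag)
    return bigrams[(prev_tag, tag)] / total if total else 0.0

# Illustrative tag sequences for two short sentences.
sequences = [["d", "n1", "vvz"], ["d", "j", "n1"]]
bigrams, trigrams = count_tag_ngrams(sequences)
p = bigram_probability("d", "n1", bigrams)
```

A practical tagger would normally smooth these estimates so that tag sequences unseen in the training data do not receive zero probability.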
Spelling maps
MorphAdorner's spelling standardizers use a variety of rules and
heuristics to map obsolete or variant spellings to standard
spellings.
An important part of the spelling standardization process is the
creation of the spelling map files. These contain one variant
and standard spelling pair per line, separated by a tab
character. By convention spelling maps take file names of the
form {corpusname}mergedspellings.tab.
Some variant spellings (e.g., bee, doe) take different standard
forms depending upon the word class of the original spelling. In
addition to the main spelling map, a subsidiary map specifies
different standardized spellings for variants depending upon
word class.
See spelling map file formats for more information.
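The lookup order described above, word class map first and main map second, can be sketched in Python as follows. The main map loader follows the one-pair-per-line, tab-separated layout. The subsidiary map is held here as a plain dictionary because its file layout is not described in this section, and the sample mappings and word class names are illustrative guesses rather than entries from a real map.

```python
def load_spelling_map(lines):
    """Build {variant: standard} from tab-separated lines."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        variant, standard = line.split("\t")
        mapping[variant] = standard
    return mapping

def standardize(spelling, word_class, main_map, by_class_map):
    """Return the standard spelling for a variant.

    The word-class-specific map wins when it has an entry for this
    spelling and class; otherwise the main map is used; otherwise
    the spelling is returned unchanged.
    """
    by_class = by_class_map.get(spelling, {})
    if word_class in by_class:
        return by_class[word_class]
    return main_map.get(spelling, spelling)

# Illustrative data only.
main_map = load_spelling_map(["loue\tlove\n", "vertue\tvirtue\n"])
by_class_map = {
    "bee": {"verb": "be", "noun": "bee"},
    "doe": {"verb": "do", "noun": "doe"},
}
```

With these sample maps, "bee" used as a verb standardizes to "be", while "bee" used as a noun is left as "bee".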
|