NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
Training A Tagger

MorphAdorner requires training data for its part of speech taggers. The training data consists of a UTF-8 file containing tab-separated columns. Each input line describes a single token (spelling, symbol, or punctuation mark) in the training text and contains the following entries:

  1. The word ID. (Not needed, but helpful.)
  2. The original token (spelling).
  3. The NUPOS part of speech.
  4. The lemma.
  5. The standardized spelling.

For some purposes we generate a derived version of the training data without the first column (the word ID).
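As an illustration of the file layout described above, here is a minimal Python sketch that reads five-column training lines and produces the derived four-column version. The names `TrainingToken`, `read_training_data`, and `drop_word_id` are hypothetical, and the sample rows are invented for the example; MorphAdorner's own utilities are not shown here.

```python
import io
from typing import NamedTuple


class TrainingToken(NamedTuple):
    """One line of training data (hypothetical helper type)."""
    word_id: str
    spelling: str
    pos: str
    lemma: str
    standard: str


def read_training_data(stream):
    """Yield one TrainingToken per non-empty tab-separated line."""
    for line_number, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(
                f"line {line_number}: expected 5 fields, found {len(fields)}"
            )
        yield TrainingToken(*fields)


def drop_word_id(tokens):
    """Yield the derived four-column lines, without the word ID."""
    for token in tokens:
        yield "\t".join(token[1:])


# Invented sample rows, for illustration only.
sample = (
    "w1\tHee\tpns31\the\tHe\n"
    "w2\tloues\tvvz\tlove\tloves\n"
    "w3\t.\t.\t.\t.\n"
)
tokens = list(read_training_data(io.StringIO(sample)))
derived = list(drop_word_id(tokens))
```

For a real file, the stream would be opened with `encoding="utf-8"` to match the format described above.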

Creating training data

Normally we generate training data as follows.

  1. We adorn a suitable set of XML texts with MorphAdorner, using existing training data. The existing training data is chosen to be of a similar age to the new training texts.

  2. The MorphAdorned XML is converted to verticalized tabular form using the XMLToTab utility.

  3. We import the verticalized text into a database, spreadsheet, or column-aware editor to correct the initial tagging.

  4. We export the corrected verticalized text into a tabular format text file containing the five columns listed above.

  5. We run programs which check for various kinds of inconsistencies (obviously mismatched parts of speech and lemmata, etc.) and produce a corrected tabular file. Part of this process includes updating the MorphAdorner definitions of the NUPOS parts of speech when new ones appear in the training data.

  6. Rinse and repeat these steps until the training data is free of obvious errors.

Here are some of the checks we typically perform.

  • Make sure each input line has entries for each of the fields listed above.

  • Convert certain XML entity references to Unicode characters. For example, the left double quote entity reference "&ldquo;" is converted to the Unicode character "\u201C".

  • Make sure the part of speech tag for each spelling appears in the list of known NUPOS tags. Unknown tags may be valid but not yet recognized by MorphAdorner.

  • Make sure the number of part of speech tags and lemmata matches for each spelling.

  • Compile a list of all words marked with the "zz" (unknown) part of speech tag for further review.

  • Look for mismatches between punctuation as a token and the part of speech. A punctuation mark should have itself as its part of speech tag.

  • Look for errors that have appeared in the past, such as apparent possessive words ending in "'s" that are marked as adjectives.

  • Check that both the spelling and the standard spelling are capitalized for proper nouns. A few proper nouns are legitimately lower case, but they are rare.

  • Look for "I" marked with the "z-sy" part of speech. Some of these are legitimate, but some have been erroneously marked in the past.

  • Check a list of previously encountered errors and correct them if found. Example: if the part of speech tag is "vbzx" and the lemma is "it|be", change the part of speech tag to "pn31|vbzx".
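A few of the checks above can be sketched in Python as follows. This is only an illustration, not the checking programs themselves: the function names are hypothetical, `KNOWN_TAGS` is a tiny stand-in for the real list of NUPOS tags, and the code assumes that "|" separates the parts of a compound tag or lemma, as the "pn31|vbzx" example suggests.

```python
import html
import unicodedata

# Hypothetical, tiny stand-in for the list of known NUPOS tags.
KNOWN_TAGS = {"pns31", "pn31", "vvz", "vbzx", "n1", "zz"}


def unescape_entities(text):
    """Convert entity references such as &ldquo; to Unicode characters."""
    return html.unescape(text)


def is_punctuation(token):
    """True when every character is in a Unicode punctuation category."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def fix_known_errors(pos, lemma):
    """Apply the previously encountered correction quoted in the list above."""
    if pos == "vbzx" and lemma == "it|be":
        return "pn31|vbzx"
    return pos


def check_token(spelling, pos, lemma, known_tags=KNOWN_TAGS):
    """Return a list of problem descriptions for one training token."""
    problems = []
    if is_punctuation(spelling):
        # A punctuation mark should have itself as its tag.
        if pos != spelling:
            problems.append("punctuation not tagged as itself")
        return problems
    tags = pos.split("|")
    lemmata = lemma.split("|")
    if len(tags) != len(lemmata):
        problems.append("tag and lemma counts differ")
    for tag in tags:
        if tag not in known_tags:
            problems.append(f"unknown tag: {tag}")
    if "zz" in tags:
        problems.append("marked zz, needs review")
    return problems


before = check_token("'tis", "vbzx", "it|be")
after = check_token("'tis", fix_known_errors("vbzx", "it|be"), "it|be")
```

Unknown tags are reported rather than rejected, since, as noted above, they may be valid but not yet recognized.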

Updating the lemmatizer

The training data provides lemmata for the spellings in the training data. For spellings not in the training data, the English lemmatizer is used. The English lemmatizer uses a list of rules and a list of exceptions to lemmatize a spelling given a major word class. New training data may indicate the need for new rules or exception list entries.
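The lookup order described above (training data first, then exceptions, then rules keyed by major word class) might be sketched like this. The rules, exceptions, and function name are invented for illustration and are far cruder than the actual English lemmatizer; they only show the shape of the approach.

```python
# Hypothetical exception list: (word class, spelling) -> lemma.
EXCEPTIONS = {
    ("noun", "children"): "child",
    ("verb", "went"): "go",
}

# Hypothetical suffix rules per major word class: (suffix, replacement).
# The first matching rule wins.
RULES = {
    "noun": [("ies", "y"), ("s", "")],
    "verb": [("ing", ""), ("ed", ""), ("s", "")],
}


def lemmatize(spelling, word_class, training_lemmata=None):
    """Return a lemma for a spelling, given its major word class."""
    key = spelling.lower()
    # 1. Lemmata taken from the training data take priority.
    if training_lemmata and key in training_lemmata:
        return training_lemmata[key]
    # 2. Then the exception list.
    if (word_class, key) in EXCEPTIONS:
        return EXCEPTIONS[(word_class, key)]
    # 3. Then the first suffix rule that applies.
    for suffix, replacement in RULES.get(word_class, []):
        if key.endswith(suffix) and len(key) > len(suffix):
            return key[: len(key) - len(suffix)] + replacement
    # 4. Otherwise the spelling is its own lemma.
    return key


training = {"loues": "love"}  # invented training-data entry
results = [
    lemmatize("loues", "verb", training),
    lemmatize("children", "noun"),
    lemmatize("ponies", "noun"),
    lemmatize("walked", "verb"),
]
```

A spelling that the rules get wrong is exactly the kind of case that would call for a new exception list entry.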

Creating the lexicons

Once the training data is corrected, it is converted to the format required by the MorphAdorner CreateLexicon utility.

CreateLexicon creates the word and suffix lexicons from the training data. By convention the word lexicon file name takes the form {corpusname}lexicon.lex and the associated suffix lexicon takes the form {corpusname}suffixlexicon.lex.

Normally we want to merge the word lexicon produced from the training data with other word lists such as common Latin and French words, proper person and place names, and so on. These auxiliary word lists will not have frequency information, just part of speech information. For these auxiliary word lists we use the Brill lexicon format, which contains the spelling followed by a list of its possible parts of speech. The MergeBrillLexicon utility merges a word list in Brill format with a MorphAdorner lexicon.

Brill lexicon entries are added with occurrence frequencies of 1.
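The merge can be pictured with a small Python sketch. It assumes an in-memory lexicon of the form spelling -> {tag: count} and whitespace-separated Brill lines; neither is the actual MorphAdorner lexicon file format, and `merge_brill` is a hypothetical name. It also assumes that a tag already counted in the training data keeps its existing frequency, which the text above does not state.

```python
def parse_brill_line(line):
    """Split a Brill-style line into (spelling, [tags])."""
    parts = line.split()
    return parts[0], parts[1:]


def merge_brill(lexicon, brill_lines):
    """Add Brill entries to the lexicon with an occurrence frequency of 1."""
    for line in brill_lines:
        if not line.strip():
            continue
        spelling, tags = parse_brill_line(line)
        entry = lexicon.setdefault(spelling, {})
        for tag in tags:
            # Assumption: counts from training data are left unchanged.
            entry.setdefault(tag, 1)
    return lexicon


# Invented example data.
lexicon = {"love": {"n1": 12, "vvb": 7}}
brill = ["Roma np1", "love vvi n1"]
merged = merge_brill(lexicon, brill)
```

Because the auxiliary entries carry a frequency of only 1, spellings attested in the training data continue to dominate wherever both sources list the same word.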

The MergeWordLists utility is helpful in merging Brill lexicons as well as other types of word lists.

Generating probability transition matrices

The bigram and trigram part of speech taggers use a Hidden Markov Model approach to tagging, which requires information about the transition probabilities from one part of speech to another. The NGramTaggerTrainer utility generates the frequency entries required to compute the transition probabilities.

By convention, the ngram tagger transition matrix data file names take the form {corpusname}transmat.mat.
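To make the relationship between frequency entries and transition probabilities concrete, here is a minimal Python sketch that counts tag bigrams and trigrams and derives an unsmoothed bigram probability. The function names, the "<s>" padding symbol, and the sample tag sequences are assumptions for illustration; this is not the NGramTaggerTrainer implementation or its file format, and a real tagger would normally apply smoothing.

```python
from collections import Counter


def count_tag_ngrams(sentences, n):
    """Count tag n-grams, padding each sentence start with '<s>'."""
    counts = Counter()
    for tags in sentences:
        padded = ["<s>"] * (n - 1) + list(tags)
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts


def transition_probability(bigrams, prev_tag, next_tag):
    """Unsmoothed estimate of P(next_tag | prev_tag) from bigram counts."""
    total = sum(c for (p, _), c in bigrams.items() if p == prev_tag)
    return bigrams[(prev_tag, next_tag)] / total if total else 0.0


# Invented tag sequences, one list per sentence.
sentences = [
    ["pns31", "vvz", "n1", "."],
    ["pns31", "vvz", "."],
]
bigrams = count_tag_ngrams(sentences, 2)
trigrams = count_tag_ngrams(sentences, 3)
p_noun_after_verb = transition_probability(bigrams, "vvz", "n1")
```

Storing the raw counts rather than the probabilities keeps the data easy to update when new training text is added.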

Spelling maps

MorphAdorner's spelling standardizers use a variety of rules and heuristics to map obsolete or variant spellings to standard spellings.

An important part of the spelling standardization process is the creation of the spelling map files. These contain one variant and standard spelling pair per line, separated by a tab character. By convention spelling maps take file names of the form {corpusname}mergedspellings.tab.

Some variant spellings (e.g., bee, doe) take different standard forms depending upon the word class of the original spelling. In addition to the main spelling map, a subsidiary map specifies different standardized spellings for variants depending upon word class.
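The interplay of the two maps can be sketched in Python as follows. The main map is parsed from tab-separated lines as described above; the subsidiary map is shown as an in-memory dictionary keyed by (variant, word class) because its file format is not given here. All names and sample entries are hypothetical.

```python
def load_spelling_map(lines):
    """Parse 'variant<TAB>standard' lines into a dictionary."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        variant, standard = line.split("\t")
        mapping[variant] = standard
    return mapping


def standardize(spelling, word_class, main_map, by_class_map):
    """Prefer a word-class-specific mapping, then fall back to the main map."""
    if (spelling, word_class) in by_class_map:
        return by_class_map[(spelling, word_class)]
    return main_map.get(spelling, spelling)


# Invented sample entries.
main_map = load_spelling_map(["loue\tlove", "vp\tup"])
by_class_map = {
    ("bee", "verb"): "be",
    ("bee", "noun"): "bee",
    ("doe", "verb"): "do",
}

bee_as_verb = standardize("bee", "verb", main_map, by_class_map)
bee_as_noun = standardize("bee", "noun", main_map, by_class_map)
```

A spelling found in neither map is returned unchanged in this sketch; the real standardizers apply further rules and heuristics at that point.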

See spelling map file formats for more information.
