NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
Creating A Lexicon

CreateLexicon creates word and suffix lexicons from training data.

Usage:

createlexicon trainingdata.tab wordlexicon.lex suffixlexicon.lex maxsuffixlength maxsuffixcount

where

  • trainingdata.tab specifies the name of the file containing the part of speech training data from which the word lexicon and suffix lexicon are built.

    The word lexicon contains each spelling (and each standard spelling, if provided), the count for each spelling, the parts of speech for each spelling, the count for each part of speech for each spelling, and the lemma for each part of speech for each spelling (if provided). The suffix lexicon contains a list of suffixes, their counts, the parts of speech associated with each suffix, and the count of each part of speech. Lemmata are stored as "*" in the suffix lexicon since there are no lemmata for suffixes.

    The training data resides in a UTF-8 text file. Each line contains one spelling along with its part of speech tag and, optionally, its lemma and standard spelling, with the fields separated by tabs, in the form:

    spelling {tab} part-of-speech-tag {tab} lemma {tab} standard spelling

    where "{tab}" specifies an ASCII tab character.

    You must specify a spelling and a part of speech tag. The lemma and standard spelling are optional. If you wish to specify a standard spelling without specifying a lemma, enter the lemma as "*".

    Blank lines are used to separate sentences. While the blank lines are not needed for creating the lexicon, they are needed for creating probability transition matrices and for part of speech tagging.

    The lexicon is built using both the spelling and the standard spelling (when provided). The lemma is also stored when present.

  • wordlexicon.lex specifies the name of the output file to receive the word lexicon.

  • suffixlexicon.lex specifies the name of the output file to receive the suffix lexicon.

  • maxsuffixlength specifies the maximum length of the suffixes generated for the suffix lexicon. The default is 6 characters.

  • maxsuffixcount specifies the maximum number of times a spelling may appear in the training data for its suffixes to be added to the suffix lexicon. The default is to include all words regardless of count.

    For some applications you may want to restrict the suffix lexicon to contain suffixes only for infrequently occurring words. Values of 10 (only include spellings which appear 10 or fewer times in the training data) or 1 (only include spellings which appear once in the training data) are popular choices.
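
To make the training data format and the two suffix parameters more concrete, here is a minimal Python sketch of the processing described above. It is not MorphAdorner's code. The function names and the sample tags are hypothetical. The sketch also assumes that a suffix must be shorter than its word and that suffix counts are summed over the qualifying spellings; the description above does not state either rule. Only part of speech counts are kept; lemmata are parsed but not stored, for brevity.

```python
from collections import Counter, defaultdict


def parse_training_line(line):
    """Return (spelling, tag, lemma, standard), or None for a blank line."""
    line = line.rstrip("\r\n")
    if not line.strip():
        return None  # blank lines only separate sentences
    fields = line.split("\t")
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError(f"spelling and tag are required: {line!r}")
    # A missing lemma is represented as "*", as in the format description.
    lemma = fields[2] if len(fields) > 2 and fields[2] else "*"
    standard = fields[3] if len(fields) > 3 and fields[3] else None
    return fields[0], fields[1], lemma, standard


def build_word_lexicon(lines):
    """Map each spelling (and standard spelling) to part of speech counts."""
    lexicon = defaultdict(Counter)
    for line in lines:
        parsed = parse_training_line(line)
        if parsed is None:
            continue
        spelling, tag, _lemma, standard = parsed
        lexicon[spelling][tag] += 1
        # Count the standard spelling too when it differs from the spelling.
        if standard and standard != spelling:
            lexicon[standard][tag] += 1
    return lexicon


def build_suffix_lexicon(word_lexicon, max_suffix_length=6, max_suffix_count=None):
    """Map each suffix to part of speech counts from qualifying spellings."""
    suffixes = defaultdict(Counter)
    for spelling, tag_counts in word_lexicon.items():
        total = sum(tag_counts.values())
        # Skip spellings that occur more often than the allowed maximum.
        if max_suffix_count is not None and total > max_suffix_count:
            continue
        # Assumption: a suffix is strictly shorter than the whole spelling.
        longest = min(max_suffix_length, len(spelling) - 1)
        for n in range(1, longest + 1):
            suffixes[spelling[-n:]].update(tag_counts)
    return suffixes


# Made-up training lines; the tags are placeholders, not a real tag set.
sample = [
    "The\tat\tthe",
    "dogs\tnns\tdog",
    "barked\tvbd\tbark",
    "",
    "Dogs\tnns\tdog\tdogs",
    "walked\tvbd\twalk",
]
words = build_word_lexicon(sample)
rare_suffixes = build_suffix_lexicon(words, max_suffix_length=2, max_suffix_count=1)
```

In this sample, "dogs" is counted twice (once as a spelling and once as the standard spelling of "Dogs"), so with a maximum count of 1 it contributes no suffixes, while "barked" and "walked" both contribute to the suffix "ed".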
