Poets that lasting marble seek,
Must carve in Latin or in Greek.
We write in sand, our language grows,
And like the tide, our work o'erflows.

-- Edmund Waller



Northwestern
MorphAdorner
    INFORMATION TECHNOLOGY  
    MorphAdorner Site Map  
MorphAdorner > Documentation > Creating A Lexicon
 
Home
 
Announcements and News
 
Download MorphAdorner
 
Documentation
 
Licenses
 
Glossary
 
Helpful References
 
Tech Talk
 

Language Recognizer
 
Lemmatizer
 
Lexicon Lookup
 
Name Recognizer
 
Parser
 
Part of Speech Tagger
 
Pluralizer
 
Sentence Splitter
 
Spelling Standardizer
 
Text Segmenter
 
Verb Conjugator
 
Word Tokenizer
 
  Creating A Lexicon
 
 

CreateLexicon creates word and suffix lexicons from training data.

Usage:

createlexicon trainingdata.tab wordlexicon.lex suffixlexicon.lex maxsuffixlength maxsuffixcount

where

  • trainingdata.tab specifies the name of the file containing the part of speech training data from which the word lexicon and suffix lexicon are built.

    The word lexicon contains each spelling (and standard spellings if provided), the count for each spelling, the parts of speech for each spelling, the counts for each part of speech for each spelling, and the lemma for each part of speech for each spelling (if provided). The suffix lexicon contains a list of suffixes, their counts, and the parts of speech associated with each suffix and the count of each part of speech. Lemmata are stored as a "*' in the suffix lexicon since there are no lemmata for suffixes.

    The training data resides in a utf-8 text file. Each line contains one tab-separated spelling along with its part of speech tag and optionally its lemma and standard spelling in the form:

    spelling {tab} part-of-speech-tag {tag} lemma {tag} standard spelling

    where "{tab}" specifies an Ascii tab character.

    You must specify a spelling and a part of speech tag. The lemma and standard spelling are optional. If you wish to specify a standard spelling without specifying a lemma, enter the lemma as "*".

    Blanks lines are used to separate sentences. While the blank lines are not needed for creating the lexicon, they are needed for creating probability transition matrices and for part of speech tagging.

    The lexicon is built using both the spelling and the standard spelling (when provided). The lemma is also stored when present.

  • wordlexicon.lex specifies the name of the output file to receive the word lexicon.

  • suffixlexicon.lex specifies the name of the output file to receive tthe suffix lexicon.

  • maxsuffixlength specifies the maximum length suffix generated for the suffix lexicon. The default is 6 characters.

  • maxsuffixcount specifies the maximum number of times a spelling must appear in order for its suffix to be added to the suffix lexicon. The default is to include all words regardless of count.

    For some applications you may want to restrict the suffix lexicon to contain suffixes only for infrequently occurring words. Values of 10 (only include spellings which appear 10 or less times in the training data) or 1 (only include spellings which appear once in the training data) are popular choices.

 

Information Technology | Academic Technologies | Scholarly Technologies 2East Resource Center |
Northwestern Home | Calendar: Plan-It Purple | Sites A-Z | Search
Academic Technologies  NU Library 2East  1970 Campus Drive  Evanston, IL 60208
E-mail: pib@northwestern.edu
Last updated Sun Mar 15 05:52:34 2009   World Wide Web Disclaimer and University Policy Statements   © 2007, 2008 Northwestern University