Poets that lasting marble seek,
Must carve in Latin or in Greek.
We write in sand, our language grows,
And like the tide, our work o'erflows.

-- Edmund Waller



Northwestern
MorphAdorner
  Creating A Suffix Lexicon
 
 

CreateSuffixLexicon creates a suffix lexicon from a word lexicon.

Usage:

createsuffixlexicon inputwordlexicon.lex suffixlexicon.lex maxsuffixlength maxsuffixcount

where

  • inputwordlexicon.lex specifies the name of the input word lexicon file, in MorphAdorner format, from which the suffix lexicon is generated.

  • suffixlexicon.lex specifies the name of the output file to receive the suffix lexicon.

  • maxsuffixlength specifies the maximum length, in characters, of any suffix generated for the suffix lexicon. The default is 6 characters.

  • maxsuffixcount specifies the maximum number of times a spelling may appear in the training data for its suffixes to be added to the suffix lexicon. The default is to include all words regardless of count.

    For some applications you may want to restrict the suffix lexicon to suffixes of infrequently occurring words only. Values of 10 (only include spellings which appear 10 or fewer times in the training data) or 1 (only include spellings which appear once in the training data) are popular choices.
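The effect of maxsuffixlength and maxsuffixcount can be illustrated with a minimal Python sketch. This is not MorphAdorner's implementation: the function name build_suffix_lexicon, the in-memory dictionary shape (spelling mapped to part of speech counts), the example tags, and the choice to record only suffixes shorter than the whole word are all assumptions made for illustration.

```python
from collections import defaultdict


def build_suffix_lexicon(word_lexicon, max_suffix_length=6, max_suffix_count=None):
    """Build a suffix -> {tag: count} table from a spelling -> {tag: count} table.

    Hypothetical in-memory shapes; the real .lex file format is not modeled here.
    """
    suffix_lexicon = defaultdict(lambda: defaultdict(int))
    for spelling, tag_counts in word_lexicon.items():
        total = sum(tag_counts.values())
        # Skip spellings that occur more often than the cutoff, if one is given.
        if max_suffix_count is not None and total > max_suffix_count:
            continue
        # Assumption: only suffixes shorter than the whole spelling are recorded.
        longest = min(max_suffix_length, len(spelling) - 1)
        for length in range(1, longest + 1):
            suffix = spelling[-length:]
            for tag, count in tag_counts.items():
                suffix_lexicon[suffix][tag] += count
    return {suffix: dict(tags) for suffix, tags in suffix_lexicon.items()}


# Illustrative data only; the tags are placeholders.
words = {
    "walking": {"vvg": 3},
    "talking": {"vvg": 1, "n1": 1},
    "the": {"dt": 500},
}
lexicon = build_suffix_lexicon(words, max_suffix_length=3, max_suffix_count=10)
```

With a cutoff of 10, the frequent spelling "the" contributes no suffixes, while "walking" and "talking" pool their counts under "g", "ng", and "ing".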

The part of speech taggers use the suffix lexicon to guess the potential parts of speech for unknown words, that is, words which do not appear in the word lexicon. For each successively shorter ending substring of the unknown word, the guesser looks up that substring in the suffix lexicon. When the substring exists in the suffix lexicon, the guesser assigns its associated parts of speech to the unknown word.
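The lookup described above can be sketched in a few lines of Python. This is an illustrative sketch rather than MorphAdorner's code: the function name guess_tags, the dictionary shape, the example tags, and the decisions to start from the longest suffix allowed by the length limit and to stop at the first match are assumptions.

```python
def guess_tags(word, suffix_lexicon, max_suffix_length=6):
    """Return (matched_suffix, tags) for an unknown word, or (None, {}) if no suffix matches.

    suffix_lexicon is a hypothetical suffix -> {tag: count} dictionary.
    """
    # Assumption: consider only suffixes shorter than the word itself.
    longest = min(max_suffix_length, len(word) - 1)
    # Try successively shorter ending substrings, longest first.
    for length in range(longest, 0, -1):
        suffix = word[-length:]
        tags = suffix_lexicon.get(suffix)
        if tags:
            # Assumption: the first (longest) matching suffix wins.
            return suffix, tags
    return None, {}


# Illustrative data only; the tags are placeholders.
suffix_lexicon = {
    "ing": {"vvg": 4, "n1": 1},
    "ly": {"av": 7},
    "s": {"n2": 12, "vvz": 5},
}
matched, tags = guess_tags("frobnicating", suffix_lexicon)
```

Here "cating", "ating", and "ting" are tried and missed before "ing" matches, so the unknown word receives the parts of speech stored for "ing".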

 

Academic Technologies  NU Library 2East  1970 Campus Drive  Evanston, IL 60208
E-mail: pib@northwestern.edu
Last updated Sun Mar 15 05:52:34 2009   © 2007, 2008 Northwestern University