MorphAdorner: Creating A Lexicon

Creating A Lexicon

CreateLexicon creates word and suffix lexicons from training data.

Usage:

createlexicon trainingdata.tab wordlexicon.lex suffixlexicon.lex maxsuffixlength maxsuffixcount

where

trainingdata.tab specifies the name of the file containing the part of speech training data from which the word lexicon and suffix lexicon are built.

The word lexicon contains each spelling (and standard spelling, if provided), the count for each spelling, the parts of speech for each spelling, the count for each part of speech for each spelling, and the lemma for each part of speech for each spelling (if provided). The suffix lexicon contains a list of suffixes, their counts, the parts of speech associated with each suffix, and the count for each part of speech. Lemmata are stored as "*" in the suffix lexicon since there are no lemmata for suffixes.
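As a rough illustration of the information a word lexicon entry holds, the following Python sketch accumulates counts per spelling and per part of speech. The function name, the nested dictionary layout, the use of "*" for a missing lemma in the word lexicon, and the sample tags are assumptions made for this example; they do not describe the .lex file format or MorphAdorner's own code.

```python
def build_word_lexicon(rows):
    """Accumulate spelling and part-of-speech counts.

    rows is an iterable of (spelling, tag, lemma) tuples; lemma may be None.
    The dictionary layout is hypothetical, chosen only for illustration.
    """
    lexicon = {}
    for spelling, tag, lemma in rows:
        entry = lexicon.setdefault(spelling, {"count": 0, "tags": {}})
        entry["count"] += 1
        # "*" as a placeholder for an unknown lemma is an assumption here.
        tag_info = entry["tags"].setdefault(tag, {"count": 0, "lemma": "*"})
        tag_info["count"] += 1
        if lemma:
            tag_info["lemma"] = lemma
    return lexicon


# Illustrative rows; the tags are example values only.
rows = [
    ("walks", "vvz", "walk"),
    ("walks", "n2", "walk"),
    ("walks", "vvz", "walk"),
]
lexicon = build_word_lexicon(rows)
```

Each spelling maps to its total count plus one record per part of speech, which mirrors the description above: counts at both levels and a lemma per part of speech.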

The training data resides in a UTF-8 text file. Each line contains one spelling followed by its part of speech tag and, optionally, its lemma and standard spelling, separated by tabs, in the form:

spelling {tab} part-of-speech-tag {tab} lemma {tab} standard spelling

where "{tab}" specifies an ASCII tab character.

You must specify a spelling and a part of speech tag. The lemma and standard spelling are optional. If you wish to specify a standard spelling without specifying a lemma, enter the lemma as "*".
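The field rules above can be sketched as a small Python parser. This is a minimal sketch, not MorphAdorner's reader: the function name and the choice to return None for an omitted or "*" lemma are assumptions made for the example.

```python
def parse_training_line(line):
    """Split one training line into (spelling, tag, lemma, standard).

    The spelling and tag are required. The lemma and standard spelling are
    optional, and a lemma of "*" is treated as "not specified".
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError("a spelling and a part of speech tag are required")
    spelling, tag = fields[0], fields[1]
    lemma = fields[2] if len(fields) > 2 and fields[2] not in ("", "*") else None
    standard = fields[3] if len(fields) > 3 and fields[3] else None
    return spelling, tag, lemma, standard


# A standard spelling without a lemma uses "*" in the lemma column.
with_standard = parse_training_line("loue\tvvi\t*\tlove\n")
minimal = parse_training_line("dog\tn1")
```

In this sketch `with_standard` carries the standard spelling "love" with no lemma, and `minimal` has neither optional field.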

Blank lines separate sentences. While the blank lines are not needed for creating the lexicon, they are needed for creating probability transition matrices and for part of speech tagging.
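To show how blank lines delimit sentences, the following Python sketch groups training lines into one list per sentence. The function name and the decision to skip runs of consecutive blank lines are assumptions for illustration only.

```python
def split_sentences(lines):
    """Group training lines into sentences separated by blank lines."""
    sentences = []
    current = []
    for line in lines:
        if line.strip() == "":
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(line.rstrip("\r\n"))
    if current:
        sentences.append(current)
    return sentences


# Illustrative input; the tags are example values only.
sample = ["The\tdt\n", "dog\tn1\n", "\n", "It\tpn\n"]
sentences = split_sentences(sample)
```

Sentence boundaries matter for anything that looks at tag sequences, which is why the lexicon builder can ignore them while a tagger cannot.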

The lexicon is built using both the spelling and the standard spelling (when provided). The lemma is also stored when present.

wordlexicon.lex specifies the name of the output file to receive the word lexicon.

suffixlexicon.lex specifies the name of the output file to receive the suffix lexicon.

maxsuffixlength specifies the maximum length of a suffix generated for the suffix lexicon. The default is 6 characters.

maxsuffixcount specifies the maximum number of times a spelling may appear in the training data for its suffixes to be added to the suffix lexicon. The default is to include all words regardless of count.

For some applications you may want to restrict the suffix lexicon to contain suffixes only for infrequently occurring words. Values of 10 (only include spellings which appear 10 or fewer times in the training data) or 1 (only include spellings which appear exactly once in the training data) are popular choices.
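The effect of maxsuffixlength and maxsuffixcount can be sketched in Python as follows. This is a hedged illustration rather than MorphAdorner's algorithm: the function and parameter names are hypothetical, and the sketch assumes that a suffix is always shorter than the spelling it comes from and that every occurrence of a qualifying spelling contributes to the suffix counts.

```python
from collections import Counter, defaultdict


def build_suffix_lexicon(rows, max_suffix_length=6, max_suffix_count=None):
    """Count part-of-speech tags per suffix.

    rows is a list of (spelling, tag) pairs. Spellings that occur more than
    max_suffix_count times are skipped; None means no limit.
    """
    spelling_counts = Counter(spelling for spelling, _ in rows)
    suffixes = defaultdict(Counter)
    for spelling, tag in rows:
        if max_suffix_count is not None and spelling_counts[spelling] > max_suffix_count:
            continue  # too frequent to contribute suffixes
        # Assumption: only proper suffixes, shorter than the whole spelling.
        longest = min(max_suffix_length, len(spelling) - 1)
        for n in range(1, longest + 1):
            suffixes[spelling[-n:]][tag] += 1
    return suffixes


# Illustrative rows; the tags are example values only.
rows = [("walking", "vvg"), ("talking", "vvg"), ("the", "dt"), ("the", "dt")]
suffix_lexicon = build_suffix_lexicon(rows, max_suffix_length=3, max_suffix_count=1)
```

With a count limit of 1, the repeated spelling "the" contributes nothing, while "walking" and "talking" each add the suffixes "g", "ng", and "ing". Restricting the lexicon this way biases the suffix statistics toward rare words, which are the ones most likely to resemble unknown words at tagging time.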
