public class CreateLexicon
extends java.lang.Object
java -Xmx512m edu.northwestern.at.morphadorner.tools.createlexicon.CreateLexicon trainingdata outputwordlexicon outputsuffixlexicon maxsuffixlength maxsuffixcount
trainingdata specifies the name of the file containing the part of speech training data from which the word lexicon and suffix lexicon are built. The word lexicon contains each spelling (and standard spellings if provided), the count for each spelling, the parts of speech for each spelling, the counts for each part of speech for each spelling, and the lemma for each part of speech for each spelling (if provided). The suffix lexicon contains a list of suffixes, their counts, the parts of speech associated with each suffix, and the count of each part of speech. Lemmata are stored as "*" in the suffix lexicon since there are no lemmata for suffixes.
The training data resides in a UTF-8 text file. Each line contains one spelling along with its part of speech tag and optionally its lemma and standard spelling, separated by tabs, in the form:
spelling <tab> part-of-speech-tag <tab> lemma <tab> standard-spelling
You must specify a spelling and a part of speech tag. The lemma and standard spelling are optional. If you wish to specify a standard spelling without specifying a lemma, enter the lemma as "*".
Blank lines are used to separate sentences. While the blank lines are not needed for creating the lexicon, they are needed for creating probability transition matrices and for part of speech tagging.
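As an illustration of the line format described above, the following is a minimal sketch of a reader for such a file. The class and method names (TrainingDataSketch, Token, parse, sentences) and the sample tags are hypothetical and are not part of MorphAdorner; representing a missing lemma or standard spelling as null is likewise an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; hypothetical names, not MorphAdorner classes. */
final class TrainingDataSketch {

    /** One training line: spelling and tag are required, the rest may be null. */
    static final class Token {
        final String spelling;
        final String partOfSpeech;
        final String lemma;            // null when absent or given as "*"
        final String standardSpelling; // null when absent

        Token(String spelling, String partOfSpeech, String lemma, String standardSpelling) {
            this.spelling = spelling;
            this.partOfSpeech = partOfSpeech;
            this.lemma = lemma;
            this.standardSpelling = standardSpelling;
        }
    }

    /** Parses one non-blank, tab-separated training line. */
    static Token parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length < 2 || fields[0].isEmpty() || fields[1].isEmpty()) {
            throw new IllegalArgumentException(
                "Spelling and part of speech tag are required: " + line);
        }
        String lemma = optional(fields, 2);
        if ("*".equals(lemma)) {
            lemma = null; // "*" is the placeholder for "no lemma given"
        }
        return new Token(fields[0], fields[1], lemma, optional(fields, 3));
    }

    /** Returns the field at index, or null if it is missing or empty. */
    private static String optional(String[] fields, int index) {
        return (index < fields.length && !fields[index].isEmpty()) ? fields[index] : null;
    }

    /** Groups token lines into sentences, using blank lines as separators. */
    static List<List<Token>> sentences(List<String> lines) {
        List<List<Token>> result = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                if (!current.isEmpty()) {
                    result.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(parse(line));
            }
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data; the tags are placeholders.
        List<String> lines = Arrays.asList(
            "I\tpron", "loue\tverb\t*\tlove", "", "Dogs\tnoun\tdog");
        System.out.println(sentences(lines).size() + " sentences");
    }
}
```

Keeping the sentence grouping separate from the line parser reflects the point above: the lexicon only needs the individual tokens, while tagging and transition matrices need the sentence boundaries.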
The lexicon is built using both the spelling and the standard spelling (when provided). The lemma is also stored when present.
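The counting described above can be pictured as a nested map from spelling to part of speech tag to count. The sketch below is only an illustration under that assumption; WordLexiconSketch and its methods are hypothetical names, and the choice to count a standard spelling only when it differs from the original spelling is a guess, not documented behavior.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; not the lexicon class MorphAdorner uses. */
final class WordLexiconSketch {

    // spelling -> (part of speech tag -> count)
    private final Map<String, Map<String, Integer>> counts = new TreeMap<>();

    /** Records one training token under its spelling and, if given, its standard spelling. */
    void add(String spelling, String partOfSpeech, String standardSpelling) {
        increment(spelling, partOfSpeech);
        // Assumption: avoid double counting when both spellings are identical.
        if (standardSpelling != null && !standardSpelling.equals(spelling)) {
            increment(standardSpelling, partOfSpeech);
        }
    }

    private void increment(String spelling, String partOfSpeech) {
        counts.computeIfAbsent(spelling, k -> new TreeMap<>())
              .merge(partOfSpeech, 1, Integer::sum);
    }

    /** Total count for a spelling across all of its parts of speech. */
    int count(String spelling) {
        Map<String, Integer> tags = counts.get(spelling);
        if (tags == null) {
            return 0;
        }
        int total = 0;
        for (int c : tags.values()) {
            total += c;
        }
        return total;
    }

    /** Count for one spelling with one part of speech tag. */
    int count(String spelling, String partOfSpeech) {
        Map<String, Integer> tags = counts.get(spelling);
        if (tags == null) {
            return 0;
        }
        Integer c = tags.get(partOfSpeech);
        return c == null ? 0 : c;
    }

    public static void main(String[] args) {
        WordLexiconSketch lexicon = new WordLexiconSketch();
        lexicon.add("loue", "verb", "love"); // made-up sample tokens
        lexicon.add("love", "verb", null);
        System.out.println(lexicon.count("love", "verb"));
    }
}
```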
outputwordlexicon specifies the name of the output file to receive the word lexicon.
outputsuffixlexicon specifies the name of the output file to receive the suffix lexicon.
maxsuffixlength specifies the maximum length suffix generated for the suffix lexicon. The default is 6.
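To make the effect of the length limit concrete, here is a minimal sketch of suffix generation bounded by a minimum and maximum length. SuffixSketch and suffixes are hypothetical names, and the rule that a suffix must be shorter than the whole spelling is an assumption made for the example rather than something this page states.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the MorphAdorner suffix generator. */
final class SuffixSketch {

    /**
     * Returns the trailing substrings of spelling whose lengths run from
     * minSuffixLength up to maxSuffixLength, shortest first.
     */
    static List<String> suffixes(String spelling, int minSuffixLength, int maxSuffixLength) {
        List<String> result = new ArrayList<>();
        // Assumption: the whole word itself is not treated as a suffix.
        int longest = Math.min(maxSuffixLength, spelling.length() - 1);
        for (int length = Math.max(1, minSuffixLength); length <= longest; length++) {
            result.add(spelling.substring(spelling.length() - length));
        }
        return result;
    }

    public static void main(String[] args) {
        // With the default maximum of 6:
        System.out.println(suffixes("walking", 1, 6));
        // prints [g, ng, ing, king, lking, alking]
    }
}
```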
maxsuffixcount specifies the maximum number of times a spelling can appear in order for its suffix to be added to the suffix lexicon. The default is to include all words regardless of count.
For some applications you may want to restrict the suffix lexicon to contain suffixes only for infrequently occurring words. Values of 10 (only include spellings which appear 10 or fewer times in the training data) or 1 (only include spellings which appear once in the training data) are popular choices.
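The restriction to infrequent words amounts to a filter over the spelling counts. The sketch below assumes the counts are already available as a map and treats the limit as inclusive, following the "10 or fewer times" wording; SuffixCountFilterSketch and eligibleSpellings are hypothetical names, and using Integer.MAX_VALUE to stand for "include all words" is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the MorphAdorner implementation. */
final class SuffixCountFilterSketch {

    /** Returns the spellings whose count does not exceed maxSuffixCount. */
    static List<String> eligibleSpellings(Map<String, Integer> spellingCounts,
                                          int maxSuffixCount) {
        List<String> eligible = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : spellingCounts.entrySet()) {
            if (entry.getValue() <= maxSuffixCount) {
                eligible.add(entry.getKey());
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>(); // made-up counts
        counts.put("the", 120);
        counts.put("dog", 10);
        counts.put("aardvark", 1);
        System.out.println(eligibleSpellings(counts, 10)); // prints [dog, aardvark]
        System.out.println(eligibleSpellings(counts, 1));  // prints [aardvark]
    }
}
```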
Modifier and Type | Field and Description |
---|---|
protected static int | maxSuffixCount - Only use words with a count less than maxSuffixCount to generate the suffix lexicon. |
protected static int | maxSuffixLength - Maximum and minimum suffix lengths to generate. |
protected static int | minSuffixLength |
protected static java.lang.String | suffixLexiconFileName - Output suffix lexicon file name. |
protected static java.lang.String | trainingDataFileName - Training data file name. |
protected static java.lang.String | wordLexiconFileName - Output word lexicon file name. |
Constructor and Description |
---|
CreateLexicon() |
Modifier and Type | Method and Description |
---|---|
protected static void | help() - Display brief help. |
protected static boolean | initialize(java.lang.String[] args) - Initialize. |
static void | main(java.lang.String[] args) - Main program. |
protected static java.lang.String trainingDataFileName
protected static java.lang.String wordLexiconFileName
protected static java.lang.String suffixLexiconFileName
protected static int maxSuffixCount
The default is to use all words regardless of word count.
protected static int maxSuffixLength
protected static int minSuffixLength
protected static void help()
protected static boolean initialize(java.lang.String[] args)
args - Command line arguments.
public static void main(java.lang.String[] args)
args - Command line arguments.
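For orientation, the command line described at the top of this page could be read into settings roughly as follows. This is a sketch under stated assumptions, not the actual body of initialize: LexiconOptionsSketch is a hypothetical class, the last two arguments are treated as optional because the page gives defaults for them, and Integer.MAX_VALUE stands in for "all words regardless of count".

```java
/** Illustrative sketch only; not the actual CreateLexicon.initialize logic. */
final class LexiconOptionsSketch {

    static final int DEFAULT_MAX_SUFFIX_LENGTH = 6;               // documented default
    static final int DEFAULT_MAX_SUFFIX_COUNT = Integer.MAX_VALUE; // assumed "no limit"

    final String trainingDataFileName;
    final String wordLexiconFileName;
    final String suffixLexiconFileName;
    final int maxSuffixLength;
    final int maxSuffixCount;

    private LexiconOptionsSketch(String trainingData, String wordLexicon,
                                 String suffixLexicon, int maxLength, int maxCount) {
        this.trainingDataFileName = trainingData;
        this.wordLexiconFileName = wordLexicon;
        this.suffixLexiconFileName = suffixLexicon;
        this.maxSuffixLength = maxLength;
        this.maxSuffixCount = maxCount;
    }

    /**
     * Returns the parsed options, or null when the three required file names
     * are not all present. Non-numeric limits raise NumberFormatException.
     */
    static LexiconOptionsSketch parse(String[] args) {
        if (args.length < 3) {
            return null;
        }
        int maxLength = args.length > 3
            ? Integer.parseInt(args[3]) : DEFAULT_MAX_SUFFIX_LENGTH;
        int maxCount = args.length > 4
            ? Integer.parseInt(args[4]) : DEFAULT_MAX_SUFFIX_COUNT;
        return new LexiconOptionsSketch(args[0], args[1], args[2], maxLength, maxCount);
    }

    public static void main(String[] args) {
        LexiconOptionsSketch options = parse(args);
        if (options == null) {
            System.err.println("Usage: trainingdata outputwordlexicon outputsuffixlexicon"
                + " [maxsuffixlength [maxsuffixcount]]");
            return;
        }
        System.out.println("Max suffix length: " + options.maxSuffixLength);
    }
}
```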