public interface Lexicon
Each line in the main lexicon file takes the following form:
spelling countspelling pos1 lemma1 countpos1 pos2 lemma2 countpos2 ...
where spelling is the spelling of a word, countspelling is the number of times the spelling appeared in the training data, pos1 is the tag for the most commonly occurring part of speech for this spelling, lemma1 is the lemma form of the spelling under pos1, countpos1 is the number of times the spelling appeared with the pos1 tag, and pos2, lemma2, countpos2, etc. are the other possible parts of speech with their lemmata and counts.
The raw counts are stored rather than probabilities so that new training data can be used to update the lexicon easily, and so that individual part of speech taggers can apply different methods of count smoothing.
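As an illustration of why raw counts are convenient, the sketch below applies add-one (Laplace) smoothing to the tag counts of a single spelling. This is only one possible smoothing method, chosen here for brevity. The class `AddOneSmoothingSketch` and its `smooth` method are hypothetical and not part of this interface, and plain `Integer` counts stand in for the library's `MutableInteger`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the Lexicon interface. */
final class AddOneSmoothingSketch {
    /**
     * Converts raw tag counts for one spelling into add-one smoothed
     * probabilities over every tag in allTags.
     */
    static Map<String, Double> smooth(Map<String, Integer> rawCounts, String[] allTags) {
        // Sum the raw counts; tags never seen with this spelling count as zero.
        int total = 0;
        for (String tag : allTags) {
            total += rawCounts.getOrDefault(tag, 0);
        }
        // Each tag gets one extra pseudo-count, so the denominator grows
        // by the size of the tag set.
        double denominator = total + allTags.length;
        Map<String, Double> probabilities = new LinkedHashMap<>();
        for (String tag : allTags) {
            probabilities.put(tag, (rawCounts.getOrDefault(tag, 0) + 1) / denominator);
        }
        return probabilities;
    }
}
```

Because the lexicon keeps counts rather than probabilities, a tagger could swap in a different smoothing scheme without changing the stored data.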
If lemmata are not available, a "*" should appear in the lemma field.
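The line format described above can be sketched as a small parser. This is a minimal illustration, not the library's loader: the class `LexiconLineSketch` is hypothetical, and it assumes the fields are separated by tab characters, which the description does not state. A lemma of "*" is stored as `null`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical parser for one lexicon line; not part of the Lexicon interface. */
final class LexiconLineSketch {
    /** One part of speech for a spelling, with its lemma and raw count. */
    static final class TagInfo {
        final String pos;
        final String lemma; // null when the lemma field holds "*"
        final int count;

        TagInfo(String pos, String lemma, int count) {
            this.pos = pos;
            this.lemma = lemma;
            this.count = count;
        }
    }

    final String spelling;
    final int spellingCount;
    final List<TagInfo> tags;

    private LexiconLineSketch(String spelling, int spellingCount, List<TagInfo> tags) {
        this.spelling = spelling;
        this.spellingCount = spellingCount;
        this.tags = tags;
    }

    /** Parses "spelling countspelling pos1 lemma1 countpos1 ..." (tab-separated, by assumption). */
    static LexiconLineSketch parse(String line) {
        String[] fields = line.split("\t");
        // Two leading fields, then one or more (pos, lemma, count) triples.
        if (fields.length < 5 || (fields.length - 2) % 3 != 0) {
            throw new IllegalArgumentException("Malformed lexicon line: " + line);
        }
        List<TagInfo> tags = new ArrayList<>();
        for (int i = 2; i < fields.length; i += 3) {
            String lemma = "*".equals(fields[i + 1]) ? null : fields[i + 1];
            tags.add(new TagInfo(fields[i], lemma, Integer.parseInt(fields[i + 2])));
        }
        return new LexiconLineSketch(
            fields[0], Integer.parseInt(fields[1]), Collections.unmodifiableList(tags));
    }
}
```

For example, `LexiconLineSketch.parse("walks\t10\tvvz\twalk\t7\tn2\t*\t3")` yields the spelling `walks` with two tags, the second of which has no lemma. The tag names in this example are illustrative only.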
Modifier and Type | Method and Description |
---|---|
boolean | containsEntry(java.lang.String entry) - Checks if lexicon contains an entry. |
java.lang.String[] | getCategories() - Get the categories, sorted in ascending order. |
java.util.Set<java.lang.String> | getCategoriesForEntry(java.util.List<java.lang.String> sentence, int entryIndex) - Get categories for an entry in a sentence. |
java.util.Set<java.lang.String> | getCategoriesForEntry(java.lang.String entry) - Get categories for an entry in the lexicon. |
java.util.Set<java.lang.String> | getCategoriesForEntry(java.lang.String entry, boolean isFirstEntry) - Get categories for an entry. |
int | getCategoryCount(java.lang.String category) - Get category count. |
int | getCategoryCount(java.lang.String entry, java.lang.String category) - Get count for an entry in a specific category. |
java.util.Map<java.lang.String,MutableInteger> | getCategoryCounts() - Get category counts. |
java.util.Map<java.lang.String,MutableInteger> | getCategoryCountsForEntry(java.lang.String entry) - Get category counts for an entry. |
java.lang.String[] | getEntries() - Get the entries, sorted in ascending order. |
int | getEntryCount(java.lang.String entry) - Get total count for an entry. |
java.lang.String | getLargestCategory(java.lang.String entry) - Get category with largest count for an entry. |
java.lang.String | getLemma(java.lang.String entry) - Get lemma for an entry. |
java.lang.String | getLemma(java.lang.String entry, java.lang.String category) - Get lemma for an entry in a specific category. |
java.lang.String[] | getLemmata(java.lang.String entry) - Get all lemmata for an entry. |
LexiconEntry | getLexiconEntry(java.lang.String entry) - Get a lexicon entry. |
int | getLexiconSize() - Get number of entries in Lexicon. |
int | getLongestEntryLength() - Get the longest entry length in the lexicon. |
int | getNumberOfCategories() - Get number of categories. |
int | getNumberOfCategoriesForEntry(java.lang.String entry) - Get number of categories for an entry. |
PartOfSpeechTags | getPartOfSpeechTags() - Get the part of speech tags list used by the lexicon. |
int | getShortestEntryLength() - Get the shortest entry length in the lexicon. |
void | loadLexicon(java.net.URL lexiconURL, boolean compressed, java.lang.String encoding) - Load entries into a lexicon. |
void | loadLexicon(java.net.URL lexiconURL, java.lang.String encoding) - Load entries into a lexicon. |
void | removeEntry(java.lang.String entry) - Remove entry. |
void | removeEntryCategory(java.lang.String entry, java.lang.String category) - Remove given category for an entry. |
void | saveLexiconToTextFile(java.lang.String lexiconFileName, java.lang.String encoding) - Save lexicon to a file. |
LexiconEntry | setLexiconEntry(java.lang.String entry, LexiconEntry entryData) - Set a lexicon entry. |
boolean | setPartOfSpeechTags(PartOfSpeechTags partOfSpeechTags) - Set the part of speech tags list used by the lexicon. |
void | updateEntryCount(java.lang.String entry, java.lang.String category, java.lang.String lemma, int entryCount) - Update entry count in lexicon for a given category. |
void loadLexicon(java.net.URL lexiconURL, boolean compressed, java.lang.String encoding) throws java.io.IOException
lexiconURL - URL for the file containing the lexicon.
compressed - true if lexicon is gzip compressed.
encoding - Character encoding of lexicon file text.
Throws: java.io.IOException
void loadLexicon(java.net.URL lexiconURL, java.lang.String encoding) throws java.io.IOException
lexiconURL - URL for the file containing the lexicon.
encoding - Character encoding of lexicon file text.
Throws: java.io.IOException
void updateEntryCount(java.lang.String entry, java.lang.String category, java.lang.String lemma, int entryCount)
entry - The entry.
category - The category.
lemma - The lemma.
entryCount - The entry count to add to the current count. Must be positive.
void removeEntryCategory(java.lang.String entry, java.lang.String category)
entry - The entry.
category - The category to remove.
void removeEntry(java.lang.String entry)
entry - The entry to remove.
LexiconEntry getLexiconEntry(java.lang.String entry)
entry - Entry for which to get lexicon information.
Note: this does NOT call the part of speech guesser.
LexiconEntry setLexiconEntry(java.lang.String entry, LexiconEntry entryData)
entry - Entry for which to set lexicon information.
entryData - The lexicon entry data.
int getLexiconSize()
java.lang.String[] getEntries()
java.lang.String[] getCategories()
boolean containsEntry(java.lang.String entry)
entry - Entry to look up.
java.util.Set<java.lang.String> getCategoriesForEntry(java.lang.String entry)
entry - Entry to look up.
java.util.Set<java.lang.String> getCategoriesForEntry(java.lang.String entry, boolean isFirstEntry)
entry - Entry to look up.
isFirstEntry - True if entry is first in sentence.
java.util.Set<java.lang.String> getCategoriesForEntry(java.util.List<java.lang.String> sentence, int entryIndex)
sentence - List of entries in sentence.
entryIndex - Index within sentence (0-based) of entry.
int getNumberOfCategoriesForEntry(java.lang.String entry)
entry - Entry for which to find number of categories.
java.util.Map<java.lang.String,MutableInteger> getCategoryCountsForEntry(java.lang.String entry)
entry - Entry to look up.
java.lang.String getLargestCategory(java.lang.String entry)
entry - Entry to look up.
int getCategoryCount(java.lang.String entry, java.lang.String category)
entry - Entry to look up.
category - Category for which to retrieve count.
java.lang.String getLemma(java.lang.String entry)
entry - Entry to look up.
java.lang.String[] getLemmata(java.lang.String entry)
entry - Entry to look up.
java.lang.String getLemma(java.lang.String entry, java.lang.String category)
entry - Entry to look up.
category - Category for which to retrieve lemma.
int getEntryCount(java.lang.String entry)
entry - Entry to look up.
int getCategoryCount(java.lang.String category)
category - Category whose number of appearances in the lexicon is returned.
java.util.Map<java.lang.String,MutableInteger> getCategoryCounts()
int getNumberOfCategories()
void saveLexiconToTextFile(java.lang.String lexiconFileName, java.lang.String encoding) throws java.io.IOException
lexiconFileName - Name of the file in which to save the lexicon.
encoding - Character encoding of lexicon file text.
Throws: java.io.IOException
PartOfSpeechTags getPartOfSpeechTags()
boolean setPartOfSpeechTags(PartOfSpeechTags partOfSpeechTags)
partOfSpeechTags - Part of speech tags list.
int getLongestEntryLength()
int getShortestEntryLength()
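
To make the count-related methods above more concrete, here is a minimal in-memory sketch of how updateEntryCount, getEntryCount, getCategoryCount and getLargestCategory might fit together. The class `CountStoreSketch` is hypothetical and is not the library's implementation. It omits lemmata, uses `Integer` in place of `MutableInteger`, and assumes that an entry's total count is the sum of its per-category counts and that ties for the largest category go to the first category in ascending order.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory count store; not the library's implementation. */
final class CountStoreSketch {
    // entry -> (category -> raw count); TreeMap keeps keys in ascending order.
    private final Map<String, Map<String, Integer>> counts = new TreeMap<>();

    /** Adds entryCount to the current count for (entry, category). */
    void updateEntryCount(String entry, String category, int entryCount) {
        if (entryCount <= 0) {
            throw new IllegalArgumentException("entryCount must be positive");
        }
        counts.computeIfAbsent(entry, k -> new TreeMap<>())
              .merge(category, entryCount, Integer::sum);
    }

    boolean containsEntry(String entry) {
        return counts.containsKey(entry);
    }

    /** Assumed here to be the sum of the entry's per-category counts. */
    int getEntryCount(String entry) {
        int total = 0;
        for (int count : categoriesOf(entry).values()) {
            total += count;
        }
        return total;
    }

    /** Returns 0 when the entry or the category is unknown. */
    int getCategoryCount(String entry, String category) {
        return categoriesOf(entry).getOrDefault(category, 0);
    }

    /** Returns the category with the largest count, or null for an unknown entry. */
    String getLargestCategory(String entry) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : categoriesOf(entry).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    private Map<String, Integer> categoriesOf(String entry) {
        return counts.getOrDefault(entry, Collections.<String, Integer>emptyMap());
    }
}
```

Storing raw counts this way means that new training data only requires further updateEntryCount calls. No probabilities need to be recomputed or rescaled.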