public abstract class AbstractLexicon extends IsCloseableObject implements Lexicon, UsesLogger
Each line in the main lexicon file takes the following form:
spelling countspelling pos1 countpos1 pos2 countpos2 ...
where spelling is the spelling for a word, countspelling is the number of times the spelling appeared in the training data, pos1 is the tag corresponding to the most commonly occurring part of speech for this spelling, countpos1 is the number of times the pos1 tag appeared, and pos2, countpos2, etc. are the other possible parts of speech and their counts.
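The line format above can be read with a few lines of code. The following is a minimal sketch, not the class's actual loader: `LexiconLineSketch` and `ParsedLine` are hypothetical names, the fields are assumed to be separated by whitespace, and the tags in the sample line are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the documented API. */
final class LexiconLineSketch {

    /** Parsed form of one lexicon line: spelling, total count, per-tag counts. */
    static final class ParsedLine {
        final String spelling;
        final int spellingCount;
        // LinkedHashMap keeps file order, so the first key is the most frequent tag.
        final Map<String, Integer> tagCounts = new LinkedHashMap<>();

        ParsedLine(String spelling, int spellingCount) {
            this.spelling = spelling;
            this.spellingCount = spellingCount;
        }
    }

    /** Parses "spelling countspelling pos1 countpos1 pos2 countpos2 ...". */
    static ParsedLine parse(String line) {
        // Assumption: fields are separated by runs of whitespace.
        String[] fields = line.trim().split("\\s+");
        // Need the spelling, its count, and at least one (tag, count) pair.
        if (fields.length < 4 || fields.length % 2 != 0) {
            throw new IllegalArgumentException("Malformed lexicon line: " + line);
        }
        ParsedLine parsed = new ParsedLine(fields[0], Integer.parseInt(fields[1]));
        for (int i = 2; i < fields.length; i += 2) {
            parsed.tagCounts.put(fields[i], Integer.parseInt(fields[i + 1]));
        }
        return parsed;
    }

    public static void main(String[] args) {
        ParsedLine p = parse("walk 12 vvi 9 n1 3");
        System.out.println(p.spelling + " " + p.spellingCount + " " + p.tagCounts);
    }
}
```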
The raw counts are stored rather than probabilities so that new training data can be used to update the lexicon easily, and so that individual part of speech taggers can apply different methods of count smoothing.
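To illustrate why raw counts are convenient, the sketch below adds new training counts to an existing tally and then derives a smoothed probability on demand. Add-one (Laplace) smoothing is used here only as one possible method; it is not stated to be what any particular tagger uses. `SmoothingSketch`, its method names, and the tag set size of 4 are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration; not part of the documented API. */
final class SmoothingSketch {

    /** Folds new training data into the raw counts for one spelling. */
    static void addCounts(Map<String, Integer> tagCounts, String tag, int count) {
        tagCounts.merge(tag, count, Integer::sum);
    }

    /** Add-one estimate of P(tag | spelling) computed from raw counts. */
    static double addOneProbability(Map<String, Integer> tagCounts, String tag, int tagSetSize) {
        int total = 0;
        for (int count : tagCounts.values()) {
            total += count;
        }
        int tagCount = tagCounts.getOrDefault(tag, 0);
        return (tagCount + 1.0) / (total + tagSetSize);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        addCounts(counts, "vvi", 9);
        addCounts(counts, "n1", 3);
        System.out.println(addOneProbability(counts, "vvi", 4));

        // New training data only changes the stored counts; probabilities are recomputed.
        addCounts(counts, "n1", 4);
        System.out.println(addOneProbability(counts, "n1", 4));
    }
}
```

Because only counts are stored, a different tagger could apply another smoothing method to the same map without reloading or rewriting the lexicon.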
Modifier and Type | Field and Description |
---|---|
protected java.util.Map<java.lang.String,MutableInteger> | categoryCountsMap: Map from part of speech tags to their frequency in the lexicon. |
protected java.util.Map<java.lang.String,LexiconEntry> | lexiconMap: Map in which to store lexicon entries. |
protected Logger | logger: Logger used for output. |
protected int | longestEntryLength: Length (in characters) of the longest entry in the lexicon. |
protected PartOfSpeechTags | partOfSpeechTags: Part of Speech tag set used by lexicon. |
protected int | shortestEntryLength: Length (in characters) of the shortest entry in the lexicon. |
protected java.util.Map<java.lang.String,MutableInteger> | uniqueEntryCountForCategoryMap: Map from part of speech tags to frequency of unique word entries in the lexicon with each tag. |
Constructor and Description |
---|
AbstractLexicon(): Create an empty lexicon. |
Modifier and Type | Method and Description |
---|---|
protected boolean | checkCategoriesList(): Check that all the tags in the lexicon appear in the designated part of speech tags list. |
protected void | computeUniqueEntryCountsForCategories(): Compute number of lexicon entries for each category. |
boolean | containsEntry(java.lang.String entry): Checks if lexicon contains an entry. |
java.lang.String[] | getCategories(): Get the categories, sorted in ascending order. |
java.util.Set<java.lang.String> | getCategoriesForEntry(java.util.List<java.lang.String> sentence, int entryIndex): Get categories for an entry in a sentence. |
java.util.Set<java.lang.String> | getCategoriesForEntry(java.lang.String entry): Get categories for an entry in the lexicon. |
java.util.Set<java.lang.String> | getCategoriesForEntry(java.lang.String entry, boolean isFirstEntry): Get categories for an entry. |
int | getCategoryCount(java.lang.String category): Get category count. |
int | getCategoryCount(java.lang.String entry, java.lang.String category): Get count for an entry in a specific category. |
java.util.Map<java.lang.String,MutableInteger> | getCategoryCounts(): Get category counts. |
java.util.Map<java.lang.String,MutableInteger> | getCategoryCountsForEntry(java.lang.String entry): Get category counts for an entry. |
java.lang.String[] | getEntries(): Get the entries, sorted in ascending order. |
int | getEntryCount(java.lang.String entry): Get total count for an entry. |
java.lang.String | getLargestCategory(java.lang.String entry): Get category with largest count for an entry. |
java.lang.String | getLemma(java.lang.String entry): Get lemma for an entry. |
java.lang.String | getLemma(java.lang.String entry, java.lang.String category): Get lemma for an entry in a specific category. |
java.lang.String[] | getLemmata(java.lang.String entry): Get all lemmata for an entry. |
LexiconEntry | getLexiconEntry(java.lang.String entry): Get a lexicon entry. |
int | getLexiconSize(): Get number of entries in Lexicon. |
Logger | getLogger(): Get the logger. |
int | getLongestEntryLength(): Get the longest entry length in the lexicon. |
int | getNumberOfCategories(): Get number of categories. |
int | getNumberOfCategoriesForEntry(java.lang.String entry): Get number of categories for an entry. |
PartOfSpeechTags | getPartOfSpeechTags(): Get the part of speech tags list used by the lexicon. |
int | getShortestEntryLength(): Get the shortest entry length in the lexicon. |
int | getUniqueEntryCountForCategory(java.lang.String category): Get unique entry count for a category. |
protected void | incrementUniqueEntryCountForCategory(java.lang.String category): Increment number of unique entries for a category. |
void | loadLexicon(java.net.URL lexiconURL, boolean compressed, java.lang.String encoding): Load entries into a lexicon. |
void | loadLexicon(java.net.URL lexiconURL, java.lang.String encoding): Load entries into a lexicon. |
void | removeEntry(java.lang.String entry): Remove entry. |
void | removeEntryCategory(java.lang.String entry, java.lang.String category): Remove given category for an entry. |
void | saveLexiconToTextFile(java.lang.String lexiconFileName, java.lang.String encoding): Save lexicon to a file. |
LexiconEntry | setLexiconEntry(java.lang.String entry, LexiconEntry entryData): Set a lexicon entry. |
void | setLogger(Logger logger): Set the logger. |
boolean | setPartOfSpeechTags(PartOfSpeechTags partOfSpeechTags): Set the part of speech tags list used by the lexicon. |
protected void | updateCategoryCount(java.lang.String category, int count): Add or update category counts map. |
void | updateEntryCount(java.lang.String entry, java.lang.String category, java.lang.String lemma, int entryCount): Update entry count in lexicon for a given category. |
close
protected java.util.Map<java.lang.String,LexiconEntry> lexiconMap
An entry (e.g., word spelling) is the key, and a LexiconEntry is the value.
protected java.util.Map<java.lang.String,MutableInteger> categoryCountsMap
protected java.util.Map<java.lang.String,MutableInteger> uniqueEntryCountForCategoryMap
protected int longestEntryLength
protected int shortestEntryLength
protected PartOfSpeechTags partOfSpeechTags
Note: all tags in the lexicon must appear in this list!
protected Logger logger
public Logger getLogger()
getLogger
in interface UsesLogger
public void setLogger(Logger logger)
setLogger
in interface UsesLogger
logger
- The logger.
protected void updateCategoryCount(java.lang.String category, int count)
category
- Category for which to add/update count.
count
- Category count to add to entry. May be negative.
protected void incrementUniqueEntryCountForCategory(java.lang.String category)
category
- Category for which to increment count.
public void updateEntryCount(java.lang.String entry, java.lang.String category, java.lang.String lemma, int entryCount)
updateEntryCount
in interface Lexicon
entry
- The entry.
category
- The category.
lemma
- The lemma.
entryCount
- The entry count to add to the current count. Must be positive.
public void removeEntryCategory(java.lang.String entry, java.lang.String category)
removeEntryCategory
in interface Lexicon
entry
- The entry.
category
- The category to remove.
If the entry has no remaining categories, the entry is removed from the lexicon.
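The removal rule above can be sketched with nested maps. This is only an illustration of the described behavior, not the class's implementation: `RemoveCategorySketch` is a hypothetical name, and a plain `Map<String, Integer>` stands in for `LexiconEntry`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; a nested map stands in for LexiconEntry. */
final class RemoveCategorySketch {
    // entry (spelling) -> category (tag) -> count
    private final Map<String, Map<String, Integer>> entries = new HashMap<>();

    void add(String entry, String category, int count) {
        entries.computeIfAbsent(entry, key -> new HashMap<>())
               .merge(category, count, Integer::sum);
    }

    /** Removes one category; drops the entry once no categories remain. */
    void removeEntryCategory(String entry, String category) {
        Map<String, Integer> categories = entries.get(entry);
        if (categories == null) {
            return; // Assumption: removing from an unknown entry is a no-op.
        }
        categories.remove(category);
        if (categories.isEmpty()) {
            entries.remove(entry);
        }
    }

    boolean containsEntry(String entry) {
        return entries.containsKey(entry);
    }

    public static void main(String[] args) {
        RemoveCategorySketch lexicon = new RemoveCategorySketch();
        lexicon.add("walk", "vvi", 9);
        lexicon.add("walk", "n1", 3);
        lexicon.removeEntryCategory("walk", "n1");
        System.out.println(lexicon.containsEntry("walk")); // true: "vvi" remains
        lexicon.removeEntryCategory("walk", "vvi");
        System.out.println(lexicon.containsEntry("walk")); // false: no categories left
    }
}
```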
public void removeEntry(java.lang.String entry)
removeEntry
in interface Lexicon
entry
- The entry to remove.
public void loadLexicon(java.net.URL lexiconURL, java.lang.String encoding) throws java.io.IOException
loadLexicon
in interface Lexicon
lexiconURL
- URL for the file containing the lexicon.
encoding
- Character encoding of lexicon file text.
Throws: java.io.IOException
public void loadLexicon(java.net.URL lexiconURL, boolean compressed, java.lang.String encoding) throws java.io.IOException
loadLexicon
in interface Lexicon
lexiconURL
- URL for the file containing the lexicon.
compressed
- true if lexicon is gzip compressed.
encoding
- Character encoding of lexicon file text.
Throws: java.io.IOException
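As a rough sketch of how the three parameters might combine, the helper below opens a line reader over a URL, optionally unwrapping gzip compression and decoding with the given character encoding. It uses only standard JDK classes; `LexiconReaderSketch` and `open` are hypothetical names and do not describe how `loadLexicon` is actually implemented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper for illustration; not part of the documented API. */
final class LexiconReaderSketch {

    /** Opens a reader over the lexicon text; the caller closes the reader. */
    static BufferedReader open(URL lexiconURL, boolean compressed, String encoding)
            throws IOException {
        InputStream in = lexiconURL.openStream();
        try {
            if (compressed) {
                in = new GZIPInputStream(in); // Decompress on the fly.
            }
            return new BufferedReader(new InputStreamReader(in, encoding));
        } catch (IOException e) {
            in.close(); // Do not leak the stream if wrapping fails.
            throw e;
        }
    }
}
```

A caller would typically use it in a try-with-resources statement and read the lexicon one line at a time with `readLine()`.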
protected void computeUniqueEntryCountsForCategories()
public int getLexiconSize()
getLexiconSize
in interface Lexicon
Returns number of fixed entries
public java.lang.String[] getEntries()
getEntries
in interface Lexicon
public java.lang.String[] getCategories()
getCategories
in interface Lexicon
public boolean containsEntry(java.lang.String entry)
containsEntry
in interface Lexicon
entry
- Entry to look up.
public LexiconEntry getLexiconEntry(java.lang.String entry)
getLexiconEntry
in interface Lexicon
entry
- Entry for which to get lexicon information.
Note: this does NOT call the part of speech guesser.
public LexiconEntry setLexiconEntry(java.lang.String entry, LexiconEntry entryData)
setLexiconEntry
in interface Lexicon
entry
- Entry for which to set lexicon information.
entryData
- The lexicon entry data.
public java.util.Set<java.lang.String> getCategoriesForEntry(java.lang.String entry)
getCategoriesForEntry
in interface Lexicon
entry
- Entry to look up.
public java.util.Set<java.lang.String> getCategoriesForEntry(java.util.List<java.lang.String> sentence, int entryIndex)
getCategoriesForEntry
in interface Lexicon
sentence
- List of entries in sentence.
entryIndex
- Index within sentence (0-based) of entry.
public java.util.Set<java.lang.String> getCategoriesForEntry(java.lang.String entry, boolean isFirstEntry)
getCategoriesForEntry
in interface Lexicon
entry
- Entry to look up.
isFirstEntry
- True if entry is first in sentence.
public int getNumberOfCategoriesForEntry(java.lang.String entry)
getNumberOfCategoriesForEntry
in interface Lexicon
entry
- Entry for which to find number of categories.
public java.lang.String getLargestCategory(java.lang.String entry)
getLargestCategory
in interface Lexicon
entry
- Entry to look up.
public int getCategoryCount(java.lang.String category)
getCategoryCount
in interface Lexicon
category
- Category for which to get the number of times it appears in the lexicon.
public int getUniqueEntryCountForCategory(java.lang.String category)
category
- Category.
public int getCategoryCount(java.lang.String entry, java.lang.String category)
getCategoryCount
in interface Lexicon
entry
- Entry to look up.
category
- Category for which to retrieve count.
public java.lang.String getLemma(java.lang.String entry)
getLemma
in interface Lexicon
entry
- Entry to look up.
Some spellings may have multiple lemmata depending upon the part of speech. This method returns the lemma associated with the most frequently occurring part of speech.
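The selection rule described here can be sketched as a scan for the category with the largest count. This is an illustration under assumptions, not the class's code: `LemmaSketch` and `CategoryData` are hypothetical stand-ins for the per-category data held in a `LexiconEntry`, the sample tags and counts are invented, and the behavior for ties (first category wins) and for an empty map (returns null) is chosen for the example only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration; not part of the documented API. */
final class LemmaSketch {

    /** Lemma and raw count recorded for one part of speech of a spelling. */
    static final class CategoryData {
        final String lemma;
        final int count;

        CategoryData(String lemma, int count) {
            this.lemma = lemma;
            this.count = count;
        }
    }

    /** Returns the lemma of the most frequent category, or null if there is none. */
    static String lemmaForLargestCategory(Map<String, CategoryData> categories) {
        String lemma = null;
        int best = -1;
        for (CategoryData data : categories.values()) {
            if (data.count > best) { // Strict comparison: the first category wins ties.
                best = data.count;
                lemma = data.lemma;
            }
        }
        return lemma;
    }

    public static void main(String[] args) {
        // "leaves" as a plural noun (lemma "leaf") or a verb form (lemma "leave").
        Map<String, CategoryData> leaves = new LinkedHashMap<>();
        leaves.put("n2", new CategoryData("leaf", 7));
        leaves.put("vvz", new CategoryData("leave", 4));
        System.out.println(lemmaForLargestCategory(leaves)); // prints "leaf"
    }
}
```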
public java.lang.String[] getLemmata(java.lang.String entry)
getLemmata
in interface Lexicon
entry
- Entry to look up.
public java.lang.String getLemma(java.lang.String entry, java.lang.String category)
public java.util.Map<java.lang.String,MutableInteger> getCategoryCounts()
getCategoryCounts
in interface Lexicon
public int getNumberOfCategories()
getNumberOfCategories
in interface Lexicon
public java.util.Map<java.lang.String,MutableInteger> getCategoryCountsForEntry(java.lang.String entry)
getCategoryCountsForEntry
in interface Lexicon
entry
- Entry to look up.
public int getEntryCount(java.lang.String entry)
getEntryCount
in interface Lexicon
entry
- Entry to look up.
public void saveLexiconToTextFile(java.lang.String lexiconFileName, java.lang.String encoding) throws java.io.IOException
saveLexiconToTextFile
in interface Lexicon
lexiconFileName
- File containing the lexicon.
encoding
- Character encoding of lexicon file text.
Throws: java.io.IOException
public int getLongestEntryLength()
getLongestEntryLength
in interface Lexicon
public int getShortestEntryLength()
getShortestEntryLength
in interface Lexicon
protected boolean checkCategoriesList()
public PartOfSpeechTags getPartOfSpeechTags()
getPartOfSpeechTags
in interface Lexicon
public boolean setPartOfSpeechTags(PartOfSpeechTags partOfSpeechTags)
setPartOfSpeechTags
in interface Lexicon
partOfSpeechTags
- Part of speech tags list.
For the check to work, the part of speech tags list should be set after the lexicon is loaded.
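The ordering note can be illustrated with a small sketch. It assumes, without confirmation from this page, that the check compares the tags currently present in the lexicon against the tag list at the moment the list is set; under that assumption an empty (not yet loaded) lexicon passes vacuously. `TagCheckSketch` and `unknownTags` are hypothetical names, and plain string sets stand in for the lexicon's categories and for `PartOfSpeechTags`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration; plain sets stand in for the real tag types. */
final class TagCheckSketch {

    /** Returns the lexicon tags that are missing from the tag set, sorted. */
    static Set<String> unknownTags(Set<String> lexiconTags, Set<String> tagSet) {
        Set<String> unknown = new TreeSet<>(lexiconTags);
        unknown.removeAll(tagSet);
        return unknown;
    }

    public static void main(String[] args) {
        Set<String> tagSet = new HashSet<>(Arrays.asList("vvi", "n1"));

        // Before loading: no tags to compare, so nothing is reported.
        System.out.println(unknownTags(Collections.<String>emptySet(), tagSet)); // []

        // After loading: a tag outside the tag set is detected.
        Set<String> loaded = new HashSet<>(Arrays.asList("vvi", "n1", "zz"));
        System.out.println(unknownTags(loaded, tagSet)); // [zz]
    }
}
```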