edu.northwestern.at.utils.corpuslinguistics.lexicon
Interface Lexicon

All Known Implementing Classes:
AbstractLexicon, BaseLexicon, DefaultLexicon, DefaultSuffixLexicon, DefaultWordLexicon

public interface Lexicon

Lexicon: stores spellings and their possible lemmata and parts of speech.

Each line in the main lexicon file takes the following form:

spelling countspelling pos1 lemma1 countpos1 pos2 lemma2 countpos2 ...

where spelling is the spelling of a word, countspelling is the number of times the spelling appeared in the training data, pos1 is the tag for the most frequently occurring part of speech for this spelling, lemma1 is the lemma form for this spelling, and countpos1 is the number of times the pos1 tag appeared for this spelling. The remaining triples (pos2, lemma2, countpos2, and so on) give the other possible parts of speech with their lemmata and counts.

The raw counts are stored rather than probabilities so that new training data can be used to update the lexicon easily, and so that individual part of speech taggers can apply different methods of count smoothing.
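
Because only raw counts are stored, a tagger can derive probabilities in whatever way suits it. The following is a minimal sketch, assuming add-one (Laplace) smoothing, which is just one common choice and not necessarily what any tagger in this package does. The class name, method name, and sample tags are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of the Lexicon interface.
class CountSmoothingSketch {
    /**
     * Converts raw per-tag counts for one spelling into add-one smoothed
     * probabilities, given the size of the full tag set.
     */
    static Map<String, Double> addOneSmooth(Map<String, Integer> tagCounts, int tagSetSize) {
        int total = 0;
        for (int count : tagCounts.values()) {
            total += count;
        }
        Map<String, Double> probabilities = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : tagCounts.entrySet()) {
            // Each observed tag gets (count + 1) / (total + tagSetSize).
            probabilities.put(e.getKey(), (e.getValue() + 1.0) / (total + tagSetSize));
        }
        return probabilities;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("VERB", 6);   // made-up tags and counts
        counts.put("NOUN", 2);
        System.out.println(addOneSmooth(counts, 2));
    }
}
```

A tag that was never observed for the spelling would receive 1 / (total + tagSetSize) under this scheme; a different tagger could apply another smoothing method to the same stored counts.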

If lemmata are not available, an "*" should appear in the lemma field.
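
The line format described above can be read with a small parser. This is a sketch only: it assumes the fields are separated by whitespace (the separator is not stated here), and the class, the Reading type, and the sample line are hypothetical rather than taken from the actual loader.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser sketch; the real loader's behavior is not documented here.
class LexiconLineSketch {
    /** One (pos, lemma, count) triple from a lexicon line. */
    static final class Reading {
        final String pos;
        final String lemma;   // "*" means no lemma is available
        final int count;

        Reading(String pos, String lemma, int count) {
            this.pos = pos;
            this.lemma = lemma;
            this.count = count;
        }
    }

    /** Parses the triples that follow "spelling countspelling" on one line. */
    static List<Reading> parseReadings(String line) {
        // Assumption: fields are whitespace-separated and contain no blanks.
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 5 || (fields.length - 2) % 3 != 0) {
            throw new IllegalArgumentException("Malformed lexicon line: " + line);
        }
        List<Reading> readings = new ArrayList<>();
        for (int i = 2; i < fields.length; i += 3) {
            readings.add(new Reading(fields[i], fields[i + 1], Integer.parseInt(fields[i + 2])));
        }
        return readings;
    }

    public static void main(String[] args) {
        // Made-up sample line: spelling, total count, then pos/lemma/count triples.
        for (Reading r : parseReadings("walks 8 VERB walk 6 NOUN * 2")) {
            System.out.println(r.pos + " " + r.lemma + " " + r.count);
        }
    }
}
```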


Method Summary
 boolean containsEntry(java.lang.String entry)
          Checks if lexicon contains an entry.
 java.lang.String[] getCategories()
          Get the categories, sorted in ascending order.
 java.util.Set<java.lang.String> getCategoriesForEntry(java.util.List<java.lang.String> sentence, int entryIndex)
          Get categories for an entry in a sentence.
 java.util.Set<java.lang.String> getCategoriesForEntry(java.lang.String entry)
          Get categories for an entry in the lexicon.
 java.util.Set<java.lang.String> getCategoriesForEntry(java.lang.String entry, boolean isFirstEntry)
          Get categories for an entry.
 int getCategoryCount(java.lang.String category)
          Get category count.
 int getCategoryCount(java.lang.String entry, java.lang.String category)
          Get count for an entry in a specific category.
 java.util.Map<java.lang.String,MutableInteger> getCategoryCounts()
          Get category counts.
 java.util.Map<java.lang.String,MutableInteger> getCategoryCountsForEntry(java.lang.String entry)
          Get category counts for an entry.
 java.lang.String[] getEntries()
          Get the entries, sorted in ascending order.
 int getEntryCount(java.lang.String entry)
          Get total count for an entry.
 java.lang.String getLargestCategory(java.lang.String entry)
          Get category with largest count for an entry.
 java.lang.String getLemma(java.lang.String entry)
          Get lemma for an entry.
 java.lang.String getLemma(java.lang.String entry, java.lang.String category)
          Get lemma for an entry in a specific category.
 java.lang.String[] getLemmata(java.lang.String entry)
          Get all lemmata for an entry.
 LexiconEntry getLexiconEntry(java.lang.String entry)
          Get a lexicon entry.
 int getLexiconSize()
          Get number of entries in Lexicon.
 int getLongestEntryLength()
          Get the longest entry length in the lexicon.
 int getNumberOfCategories()
          Get number of categories.
 int getNumberOfCategoriesForEntry(java.lang.String entry)
          Get number of categories for an entry.
 PartOfSpeechTags getPartOfSpeechTags()
          Get the part of speech tags list used by the lexicon.
 int getShortestEntryLength()
          Get the shortest entry length in the lexicon.
 void loadLexicon(java.net.URL lexiconURL, java.lang.String encoding)
          Load entries into a lexicon.
 void removeEntry(java.lang.String entry)
          Remove entry.
 void removeEntryCategory(java.lang.String entry, java.lang.String category)
          Remove given category for an entry.
 void saveLexiconToTextFile(java.lang.String lexiconFileName, java.lang.String encoding)
          Save lexicon to a file.
 LexiconEntry setLexiconEntry(java.lang.String entry, LexiconEntry entryData)
          Set a lexicon entry.
 boolean setPartOfSpeechTags(PartOfSpeechTags partOfSpeechTags)
          Set the part of speech tags list used by the lexicon.
 void updateEntryCount(java.lang.String entry, java.lang.String category, java.lang.String lemma, int entryCount)
          Update entry count in lexicon for a given category.
 

Method Detail

loadLexicon

void loadLexicon(java.net.URL lexiconURL,
                 java.lang.String encoding)
                 throws java.io.IOException
Load entries into a lexicon.

Parameters:
lexiconURL - URL for the file containing the lexicon.
encoding - Character encoding of lexicon file text.
Throws:
java.io.IOException

updateEntryCount

void updateEntryCount(java.lang.String entry,
                      java.lang.String category,
                      java.lang.String lemma,
                      int entryCount)
Update entry count in lexicon for a given category.

Parameters:
entry - The entry.
category - The category.
lemma - The lemma.
entryCount - The entry count to add to the current count. Must be positive.
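
The additive behavior of entryCount can be illustrated with a small in-memory stand-in. This is a sketch under stated assumptions, not one of the implementing classes: the class name is hypothetical, rejecting a non-positive count with an exception is an assumption (the documentation only says the value must be positive), and returning 0 for an unknown entry or category is likewise assumed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in; not one of the implementing classes.
class CountUpdateSketch {
    // entry -> (category -> count)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // entry -> (category -> lemma)
    private final Map<String, Map<String, String>> lemmata = new HashMap<>();

    void updateEntryCount(String entry, String category, String lemma, int entryCount) {
        if (entryCount <= 0) {
            // Assumption: invalid counts are rejected rather than ignored.
            throw new IllegalArgumentException("entryCount must be positive");
        }
        // Add entryCount to the current count for this entry and category.
        counts.computeIfAbsent(entry, k -> new HashMap<>())
              .merge(category, entryCount, Integer::sum);
        lemmata.computeIfAbsent(entry, k -> new HashMap<>())
               .put(category, lemma);
    }

    int getCategoryCount(String entry, String category) {
        Map<String, Integer> byCategory = counts.get(entry);
        // Assumption: unknown entries and categories report a count of 0.
        return byCategory == null ? 0 : byCategory.getOrDefault(category, 0);
    }

    public static void main(String[] args) {
        CountUpdateSketch lexicon = new CountUpdateSketch();
        lexicon.updateEntryCount("walks", "VERB", "walk", 4);
        lexicon.updateEntryCount("walks", "VERB", "walk", 2);
        System.out.println(lexicon.getCategoryCount("walks", "VERB"));
    }
}
```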

removeEntryCategory

void removeEntryCategory(java.lang.String entry,
                         java.lang.String category)
Remove given category for an entry.

Parameters:
entry - The entry.
category - The category to remove.

removeEntry

void removeEntry(java.lang.String entry)
Remove entry.

Parameters:
entry - The entry to remove.

getLexiconEntry

LexiconEntry getLexiconEntry(java.lang.String entry)
Get a lexicon entry.

Parameters:
entry - Entry for which to get lexicon information.
Returns:
LexiconEntry for entry, or null if not found.

Note: this does NOT call the part of speech guesser.


setLexiconEntry

LexiconEntry setLexiconEntry(java.lang.String entry,
                             LexiconEntry entryData)
Set a lexicon entry.

Parameters:
entry - Entry for which to set lexicon information.
entryData - The lexicon entry data.
Returns:
Previous lexicon data for entry, if any.

getLexiconSize

int getLexiconSize()
Get number of entries in Lexicon.

Returns:
Number of entries in Lexicon.

getEntries

java.lang.String[] getEntries()
Get the entries, sorted in ascending order.

Returns:
The sorted entry strings as an array of strings.

getCategories

java.lang.String[] getCategories()
Get the categories, sorted in ascending order.

Returns:
The sorted category strings as an array of strings.

containsEntry

boolean containsEntry(java.lang.String entry)
Checks if lexicon contains an entry.

Parameters:
entry - Entry to look up.
Returns:
true if lexicon contains entry. Only an exact match is considered.

getCategoriesForEntry

java.util.Set<java.lang.String> getCategoriesForEntry(java.lang.String entry)
Get categories for an entry in the lexicon.

Parameters:
entry - Entry to look up.
Returns:
Set of categories. Null if entry not found in lexicon.

getCategoriesForEntry

java.util.Set<java.lang.String> getCategoriesForEntry(java.lang.String entry,
                                                      boolean isFirstEntry)
Get categories for an entry.

Parameters:
entry - Entry to look up.
isFirstEntry - True if entry is first in sentence.
Returns:
Set of categories. Null if entry not found in lexicon.

getCategoriesForEntry

java.util.Set<java.lang.String> getCategoriesForEntry(java.util.List<java.lang.String> sentence,
                                                      int entryIndex)
Get categories for an entry in a sentence.

Parameters:
sentence - List of entries in sentence.
entryIndex - Index within sentence (0-based) of entry.
Returns:
Set of categories. Null if entry not found in lexicon.

getNumberOfCategoriesForEntry

int getNumberOfCategoriesForEntry(java.lang.String entry)
Get number of categories for an entry.

Parameters:
entry - Entry for which to find number of categories.
Returns:
Number of categories for entry.

getCategoryCountsForEntry

java.util.Map<java.lang.String,MutableInteger> getCategoryCountsForEntry(java.lang.String entry)
Get category counts for an entry.

Parameters:
entry - Entry to look up.
Returns:
Map of counts for each category. String keys are tags, MutableInteger counts are values. Null if entry not found in lexicon.

getLargestCategory

java.lang.String getLargestCategory(java.lang.String entry)
Get category with largest count for an entry.

Parameters:
entry - Entry to look up.
Returns:
Category with largest count. Null if entry not found in lexicon.
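
Selecting the largest category from a per-entry counts map might look like the sketch below. It uses plain Integer values in place of MutableInteger so that it stays self-contained, and it assumes that ties go to the first category encountered; neither detail is confirmed by this documentation, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; uses Integer in place of MutableInteger.
class LargestCategorySketch {
    /** Returns the category with the largest count, or null if there is no map. */
    static String largestCategory(Map<String, Integer> categoryCounts) {
        if (categoryCounts == null) {
            return null;   // mirrors "Null if entry not found in lexicon"
        }
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : categoryCounts.entrySet()) {
            // Strict comparison: on a tie the earlier category is kept (assumption).
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("NOUN", 2);   // made-up tags and counts
        counts.put("VERB", 6);
        System.out.println(largestCategory(counts));
    }
}
```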

getCategoryCount

int getCategoryCount(java.lang.String entry,
                     java.lang.String category)
Get count for an entry in a specific category.

Parameters:
entry - Entry to look up.
category - Category for which to retrieve count.
Returns:
Number of occurrences of entry in category.

getLemma

java.lang.String getLemma(java.lang.String entry)
Get lemma for an entry.

Parameters:
entry - Entry to look up.
Returns:
Lemma form of entry.

getLemmata

java.lang.String[] getLemmata(java.lang.String entry)
Get all lemmata for an entry.

Parameters:
entry - Entry to look up.
Returns:
All lemma forms of entry.

getLemma

java.lang.String getLemma(java.lang.String entry,
                          java.lang.String category)
Get lemma for an entry in a specific category.

Parameters:
entry - Entry to look up.
category - Category for which to retrieve lemma.
Returns:
Lemma form of entry.

getEntryCount

int getEntryCount(java.lang.String entry)
Get total count for an entry.

Parameters:
entry - Entry to look up.
Returns:
Count of occurrences of entry.

getCategoryCount

int getCategoryCount(java.lang.String category)
Get category count.

Parameters:
category - Category for which to retrieve count.
Returns:
Number of times category appears in lexicon.

getCategoryCounts

java.util.Map<java.lang.String,MutableInteger> getCategoryCounts()
Get category counts.

Returns:
Category counts map.

getNumberOfCategories

int getNumberOfCategories()
Get number of categories.

Returns:
Number of categories.

saveLexiconToTextFile

void saveLexiconToTextFile(java.lang.String lexiconFileName,
                           java.lang.String encoding)
                           throws java.io.IOException
Save lexicon to a file.

Parameters:
lexiconFileName - Name of the file to which to save the lexicon.
encoding - Character encoding of lexicon file text.
Throws:
java.io.IOException
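
A line in the main lexicon file format could be produced from stored counts roughly as follows. This sketch makes several assumptions that the documentation does not confirm: fields are joined with a single space, countspelling is the sum of the per-category counts, and categories are written in descending order of count so that the most frequent comes first. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical formatter sketch; not the actual saveLexiconToTextFile logic.
class LexiconLineWriterSketch {
    static String formatLine(String spelling,
                             Map<String, Integer> counts,
                             Map<String, String> lemmata) {
        // Order categories by descending count so pos1 is the most frequent.
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed());

        // Assumption: the spelling count is the sum of its category counts.
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }

        StringBuilder line = new StringBuilder(spelling).append(' ').append(total);
        for (Map.Entry<String, Integer> e : sorted) {
            // "*" marks a missing lemma, as in the file format description.
            String lemma = lemmata.getOrDefault(e.getKey(), "*");
            line.append(' ').append(e.getKey())
                .append(' ').append(lemma)
                .append(' ').append(e.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("NOUN", 2);   // made-up tags and counts
        counts.put("VERB", 6);
        Map<String, String> lemmata = new LinkedHashMap<>();
        lemmata.put("VERB", "walk");
        System.out.println(formatLine("walks", counts, lemmata));
    }
}
```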

getPartOfSpeechTags

PartOfSpeechTags getPartOfSpeechTags()
Get the part of speech tags list used by the lexicon.

Returns:
Part of speech tags list.

setPartOfSpeechTags

boolean setPartOfSpeechTags(PartOfSpeechTags partOfSpeechTags)
Set the part of speech tags list used by the lexicon.

Parameters:
partOfSpeechTags - Part of speech tags list.

getLongestEntryLength

int getLongestEntryLength()
Get the longest entry length in the lexicon.

Returns:
The longest entry length in the lexicon.

getShortestEntryLength

int getShortestEntryLength()
Get the shortest entry length in the lexicon.

Returns:
The shortest entry length in the lexicon.