public abstract class AbstractPartOfSpeechGuesser extends IsCloseableObject implements PartOfSpeechGuesser, UsesLogger
A part of speech guesser "guesses" the probable part(s) of speech for a word that does not appear in the main lexicon. Alternate spellings, lexical rules based upon word prefixes or suffixes, and many other approaches may be used to find potential parts of speech. This AbstractPartOfSpeechGuesser holds references to a word lexicon, a suffix lexicon, and a spelling standardizer. Subclasses must override the abstract method guessPartsOfSpeech.
Some of the heuristics here only work reliably for English-language text.
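The idea of trying several heuristics in turn can be illustrated with a minimal, self-contained sketch. It does not use the library's classes: CascadeGuesserSketch, Check, digitsOnly, allCaps, and the tag names "NUMBER" and "PROPER-NOUN" are all hypothetical, plain Integer counts stand in for MutableInteger, and the convention that a check returns null when it does not apply is an assumption of the sketch, not something this page states.

```java
// Hypothetical illustration only; none of these names belong to the library.
class CascadeGuesserSketch {
    /** One heuristic: returns a tag-to-count map, or null if it does not apply. */
    interface Check {
        java.util.Map<String, Integer> apply(String word);
    }

    private final java.util.List<Check> checks = new java.util.ArrayList<>();

    /** Register a heuristic; heuristics are tried in registration order. */
    CascadeGuesserSketch add(Check check) {
        checks.add(check);
        return this;
    }

    /** Return the result of the first heuristic that recognizes the word. */
    java.util.Map<String, Integer> guess(String word) {
        for (Check check : checks) {
            java.util.Map<String, Integer> tags = check.apply(word);
            if (tags != null && !tags.isEmpty()) {
                return tags;
            }
        }
        // No heuristic applied: report an empty set of candidate tags.
        return java.util.Collections.emptyMap();
    }

    /** Toy heuristic: a run of ASCII digits is treated as a number. */
    static java.util.Map<String, Integer> digitsOnly(String word) {
        return word.matches("[0-9]+")
            ? java.util.Collections.singletonMap("NUMBER", 1)
            : null;
    }

    /** Toy heuristic: a word with letters, all uppercase, gets a made-up tag. */
    static java.util.Map<String, Integer> allCaps(String word) {
        boolean upper = word.equals(word.toUpperCase())
            && !word.equals(word.toLowerCase());
        return upper
            ? java.util.Collections.singletonMap("PROPER-NOUN", 1)
            : null;
    }
}
```

A concrete subclass of AbstractPartOfSpeechGuesser would presumably make a similar decision inside guessPartsOfSpeech, choosing which of the check methods to call and in what order.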
Modifier and Type | Field and Description |
---|---|
protected Abbreviations | abbreviations - Abbreviations. |
protected SpellingStandardizer | auxiliarySpellingStandardizer - The auxiliary spelling standardizer. |
protected java.util.List<TaggedStrings> | auxiliaryWordLists - Auxiliary word lists. |
protected java.util.Map<java.lang.String,Lexicon> | cachedLexicons - Cache lexicon for unknown words. |
protected Cache<java.lang.String,java.util.Map<java.lang.String,MutableInteger>> | cachedWords - Cache parts of speech for unknown words. |
protected boolean | checkPossessives - Check for possessives of known nouns when guessing parts of speech. |
protected boolean | debug - True to enable debugging output. |
protected Logger | logger - Logger used for output. |
protected Names | names - Proper names. |
protected SpellingStandardizer | spellingStandardizer - The principal spelling standardizer. |
protected Lexicon | suffixLexicon - The affix/suffix lexicon. |
protected boolean | tryStandardSpellings - Try standardized spellings when guessing parts of speech. |
protected Lexicon | wordLexicon - The word lexicon. |
Constructor and Description |
---|
AbstractPartOfSpeechGuesser() |
Modifier and Type | Method and Description |
---|---|
void | addAuxiliaryWordList(TaggedStrings wordList) - Add an auxiliary word list. |
protected void | addCachedWord(java.lang.String word, java.util.Map<java.lang.String,MutableInteger> tagMap) - Add word to cache. |
java.util.Map<java.lang.String,MutableInteger> | checkAbbreviation(java.lang.String word) - Check if word is abbreviation. |
java.util.Map<java.lang.String,MutableInteger> | checkAllUpperCase(java.lang.String word) - See if word is all uppercase. |
java.util.Map<java.lang.String,MutableInteger> | checkAuxiliaryWordLists(java.lang.String word) - See if word is defined in an auxiliary word list. |
java.util.Map<java.lang.String,MutableInteger> | checkCachedWord(java.lang.String word) - See if we have part of speech for a cached word. |
java.util.Map<java.lang.String,MutableInteger> | checkCurrency(java.lang.String word) - See if word is currency value. |
java.util.Map<java.lang.String,MutableInteger> | checkHyphenatedWord(java.lang.String word) - Check if word contains a hyphen. |
java.util.Map<java.lang.String,MutableInteger> | checkName(java.lang.String word) - See if word is a name. |
java.util.Map<java.lang.String,MutableInteger> | checkNumber(java.lang.String word) - See if word is a number. |
java.util.Map<java.lang.String,MutableInteger> | checkPossessiveNoun(java.lang.String word) - See if word is a possessive noun. |
java.util.Map<java.lang.String,MutableInteger> | checkPunctuation(java.lang.String word) - Check if word is punctuation. |
java.util.Map<java.lang.String,MutableInteger> | checkRomanNumeral(java.lang.String word) - See if word is Roman numeral. |
protected java.util.Map<java.lang.String,MutableInteger> | checkStandardSpellings(java.lang.String word, java.lang.String[] standardSpellings) - Try to get tags using standardized spellings. |
java.util.Map<java.lang.String,MutableInteger> | checkSuffixes(java.lang.String word) - Try to get tags using suffix analysis. |
java.util.Map<java.lang.String,MutableInteger> | checkSuffixes(java.lang.String word, java.lang.String[] standardSpellings) - Try to get tags using suffix analysis. |
java.util.Map<java.lang.String,MutableInteger> | checkSymbol(java.lang.String word) - Check if word is symbol. |
protected java.util.Map<java.lang.String,MutableInteger> | clonePosTagMap(java.util.Map<java.lang.String,MutableInteger> posTagMap) - Clone pos tag map. |
java.util.List<TaggedStrings> | getAuxiliaryWordLists() - Get auxiliary word lists. |
Lexicon | getCachedLexiconForWord(java.lang.String word) - Get cached lexicon for a word. |
Logger | getLogger() - Get the logger. |
java.util.Map<java.lang.String,MutableInteger> | getNoun(java.lang.String word) - Get tags for a noun. |
SpellingStandardizer | getSpellingStandardizer() - Get spelling standardizer. |
protected java.lang.String[] | getStandardizedSpellings(java.lang.String word) - Get standardized spellings for a word. |
Lexicon | getSuffixLexicon() - Get the suffix lexicon. |
Lexicon | getWordLexicon() - Get the word lexicon. |
abstract java.util.Map<java.lang.String,MutableInteger> | guessPartsOfSpeech(java.lang.String word) - Guesses part of speech for a word. |
java.util.Map<java.lang.String,MutableInteger> | guessPartsOfSpeech(java.lang.String word, boolean isFirstWord) - Guesses part of speech for a word. |
protected java.util.Map<java.lang.String,MutableInteger> | posTagsToMap(java.lang.String[] posTags) - Create map from array of part of speech tags. |
protected java.util.Map<java.lang.String,MutableInteger> | posTagToMap(java.lang.String posTag) - Create map with one (pos, count) entry. |
void | removeCompoundTags(java.util.Map<java.lang.String,MutableInteger> posTagsMap) - Remove compound part of speech tags from tag map. |
void | removeProperNounTags(java.util.Map<java.lang.String,MutableInteger> posTagsMap) - Remove proper noun and proper adjective tags from tag map. |
void | setAbbreviations(Abbreviations abbreviations) - Set abbreviations. |
void | setCheckPossessives(boolean checkPossessives) - Check for possessives of known nouns when guessing parts of speech. |
void | setLogger(Logger logger) - Set the logger. |
void | setSpellingStandardizer(SpellingStandardizer spellingStandardizer) - Set spelling standardizer. |
void | setSuffixLexicon(Lexicon suffixLexicon) - Set the suffix lexicon. |
void | setTryStandardSpellings(boolean tryStandardSpellings) - Try using standardized spellings when guessing parts of speech. |
void | setWordLexicon(Lexicon wordLexicon) - Set the word lexicon. |
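Methods inherited from class IsCloseableObject: close
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface PartOfSpeechGuesser: guessPartsOfSpeech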
protected boolean debug
protected Logger logger
protected Abbreviations abbreviations
protected Cache<java.lang.String,java.util.Map<java.lang.String,MutableInteger>> cachedWords
The key is the word spelling, the value is a map of parts of speech and associated counts for the spelling.
protected java.util.Map<java.lang.String,Lexicon> cachedLexicons
The key is the word spelling, the value is the lexicon to use to retrieve counts for probability calculations. Normally a cache entry is only created when the lexicon is the suffix lexicon. The word lexicon is assumed by default otherwise.
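The fallback described here (a cache entry only for suffix-derived words, the word lexicon otherwise) can be sketched as follows. This is an assumption-laden illustration rather than the class's actual code: LexiconCacheSketch, its nested Lexicon stand-in, rememberSuffixGuess, and lexiconFor are hypothetical names.

```java
// Hypothetical illustration of a "cache entry or default" lookup.
class LexiconCacheSketch {
    /** Minimal stand-in for the library's Lexicon type: just a label. */
    static final class Lexicon {
        final String name;

        Lexicon(String name) {
            this.name = name;
        }
    }

    private final Lexicon wordLexicon = new Lexicon("word");
    private final Lexicon suffixLexicon = new Lexicon("suffix");

    // Key: word spelling. Value: lexicon to use for that word's counts.
    private final java.util.Map<String, Lexicon> cachedLexicons =
        new java.util.HashMap<>();

    /** Record that this word's tags came from suffix analysis. */
    void rememberSuffixGuess(String word) {
        cachedLexicons.put(word, suffixLexicon);
    }

    /** Cached lexicon if present; otherwise the word lexicon by default. */
    Lexicon lexiconFor(String word) {
        return cachedLexicons.getOrDefault(word, wordLexicon);
    }
}
```

Storing entries only for the exceptional case keeps the map small, since most words fall through to the default.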
protected Lexicon wordLexicon
protected Lexicon suffixLexicon
protected SpellingStandardizer spellingStandardizer
protected SpellingStandardizer auxiliarySpellingStandardizer
protected java.util.List<TaggedStrings> auxiliaryWordLists
protected Names names
protected boolean tryStandardSpellings
protected boolean checkPossessives
public Logger getLogger()
getLogger in interface UsesLogger
public void setLogger(Logger logger)
setLogger in interface UsesLogger
logger - The logger.
public SpellingStandardizer getSpellingStandardizer()
getSpellingStandardizer in interface PartOfSpeechGuesser
public void setSpellingStandardizer(SpellingStandardizer spellingStandardizer)
setSpellingStandardizer in interface PartOfSpeechGuesser
spellingStandardizer - The spelling standardizer.
public void setAbbreviations(Abbreviations abbreviations)
setAbbreviations in interface PartOfSpeechGuesser
abbreviations - Abbreviations.
public Lexicon getWordLexicon()
getWordLexicon in interface PartOfSpeechGuesser
public void setWordLexicon(Lexicon wordLexicon)
setWordLexicon in interface PartOfSpeechGuesser
wordLexicon - The word lexicon.
public Lexicon getSuffixLexicon()
getSuffixLexicon in interface PartOfSpeechGuesser
public void setSuffixLexicon(Lexicon suffixLexicon)
setSuffixLexicon in interface PartOfSpeechGuesser
suffixLexicon - The suffix lexicon.
protected void addCachedWord(java.lang.String word, java.util.Map<java.lang.String,MutableInteger> tagMap)
word - The word.
tagMap - Tag map for the word.
protected java.util.Map<java.lang.String,MutableInteger> posTagToMap(java.lang.String posTag)
posTag - The part of speech tag.
The count associated with the part of speech tag is the count of that tag in the word lexicon.
protected java.util.Map<java.lang.String,MutableInteger> posTagsToMap(java.lang.String[] posTags)
posTags - The part of speech tags.
The count associated with each part of speech tag is the count of that tag in the word lexicon. This is probably not the best choice, but absent any other information it is at least consistent.
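As a rough sketch of this tag-to-count mapping, the snippet below pairs each tag with its lexicon-wide count. It is not the library's code: PosTagMapSketch is a hypothetical name, the lexicon is reduced to a plain map of tag totals, Integer replaces MutableInteger, and giving an unknown tag a count of 0 is an assumption.

```java
// Hypothetical illustration; the real method reads counts from the word lexicon.
class PosTagMapSketch {
    /**
     * Build a map from each tag to its overall count.
     *
     * @param posTags          The part of speech tags.
     * @param lexiconTagCounts Stand-in for per-tag totals in the word lexicon.
     */
    static java.util.Map<String, Integer> posTagsToMap(
            String[] posTags,
            java.util.Map<String, Integer> lexiconTagCounts) {
        java.util.Map<String, Integer> result = new java.util.TreeMap<>();
        for (String tag : posTags) {
            // Assumption: a tag missing from the lexicon counts as zero.
            result.put(tag, lexiconTagCounts.getOrDefault(tag, 0));
        }
        return result;
    }
}
```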
protected java.util.Map<java.lang.String,MutableInteger> clonePosTagMap(java.util.Map<java.lang.String,MutableInteger> posTagMap)
posTagMap - The pos tag map to clone.
public void addAuxiliaryWordList(TaggedStrings wordList)
addAuxiliaryWordList in interface PartOfSpeechGuesser
public java.util.List<TaggedStrings> getAuxiliaryWordLists()
getAuxiliaryWordLists in interface PartOfSpeechGuesser
public Lexicon getCachedLexiconForWord(java.lang.String word)
getCachedLexiconForWord in interface PartOfSpeechGuesser
word - The word.
Most words do not have an associated cached lexicon, so the word lexicon is returned. Words whose category counts result from a suffix analysis will have a cached entry pointing to the suffix lexicon.
public java.util.Map<java.lang.String,MutableInteger> checkCachedWord(java.lang.String word)
word - The word.
public java.util.Map<java.lang.String,MutableInteger> checkName(java.lang.String word)
word - The word.
Note: Only capitalized versions of names are considered.
public java.util.Map<java.lang.String,MutableInteger> checkPossessiveNoun(java.lang.String word)
word - The word.
public java.util.Map<java.lang.String,MutableInteger> checkAllUpperCase(java.lang.String word)
word - The word.
public java.util.Map<java.lang.String,MutableInteger> checkNumber(java.lang.String word)
word - The word.
public java.util.Map<java.lang.String,MutableInteger> checkCurrency(java.lang.String word)
word - The word.
public java.util.Map<java.lang.String,MutableInteger> checkRomanNumeral(java.lang.String word)
word - The word.
public java.util.Map<java.lang.String,MutableInteger> checkAuxiliaryWordLists(java.lang.String word)
word - The word.
public java.util.Map<java.lang.String,MutableInteger> checkPunctuation(java.lang.String word)
word - The word.
public java.util.Map<java.lang.String,MutableInteger> checkSymbol(java.lang.String word)
word - The word.
public java.util.Map<java.lang.String,MutableInteger> checkAbbreviation(java.lang.String word)
word - The word.
A proper noun tag is emitted when the abbreviation begins with a capital letter.
public java.util.Map<java.lang.String,MutableInteger> checkHyphenatedWord(java.lang.String word)
word - The word.
If word contains a dash, extract the part after the last dash. If that is a word in the lexicon, use its part of speech. Otherwise return with no part of speech assigned and let the subsequent suffix analysis deal with the word.
The following cases are treated specially.
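The general last-dash rule (not the special cases) can be sketched as below. This is an illustration under stated assumptions: HyphenatedWordSketch and checkHyphenated are hypothetical names, the lexicon is modeled as a plain map from spelling to tag counts, only the ASCII hyphen is treated as a dash, and a null return stands for "no part of speech assigned".

```java
// Hypothetical illustration of the "part after the last dash" rule.
class HyphenatedWordSketch {
    /**
     * Look up the part of the word after its last dash.
     *
     * @param word    The word.
     * @param lexicon Stand-in lexicon: spelling to tag-count map.
     * @return Tags of the final component, or null if nothing was found.
     */
    static java.util.Map<String, Integer> checkHyphenated(
            String word,
            java.util.Map<String, java.util.Map<String, Integer>> lexicon) {
        int lastDash = word.lastIndexOf('-');
        // No dash, or nothing after the dash: this rule does not apply.
        if (lastDash < 0 || lastDash == word.length() - 1) {
            return null;
        }
        String tail = word.substring(lastDash + 1);
        // Map.get returns null when the tail is not in the lexicon, which
        // leaves the word to a later step such as suffix analysis.
        return lexicon.get(tail);
    }
}
```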
protected java.lang.String[] getStandardizedSpellings(java.lang.String word)
word - The word.
protected java.util.Map<java.lang.String,MutableInteger> checkStandardSpellings(java.lang.String word, java.lang.String[] standardSpellings)
word - The word.
standardSpellings - Standard spellings on output, or null if none.
public void removeProperNounTags(java.util.Map<java.lang.String,MutableInteger> posTagsMap)
posTagsMap - Map of potential tags.
Removes proper noun or proper adjective tags from tag map. Map is set to null if it becomes empty.
public void removeCompoundTags(java.util.Map<java.lang.String,MutableInteger> posTagsMap)
posTagsMap - Map of potential tags.
Removes compound part of speech tags from tag map. Map is set to null if it becomes empty.
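Both removal methods follow the same pattern: drop the unwanted tags, then treat an emptied map as "no tags". A minimal sketch of that pattern follows. TagFilterSketch and removeTags are hypothetical names, and the predicate deciding which tags count as proper noun or compound tags is left to the caller because this page does not say how such tags are recognized. The page also does not show how a void method signals the empty case, so the sketch returns the result instead.

```java
// Hypothetical illustration of filtering a tag map in place.
class TagFilterSketch {
    /**
     * Remove every tag matching the predicate.
     *
     * @return The same map, or null if no tags remain.
     */
    static java.util.Map<String, Integer> removeTags(
            java.util.Map<String, Integer> posTagsMap,
            java.util.function.Predicate<String> unwanted) {
        // Removing from the key set also removes the map entries.
        posTagsMap.keySet().removeIf(unwanted);
        return posTagsMap.isEmpty() ? null : posTagsMap;
    }
}
```

A caller might pass something like `tag -> tag.startsWith("NP")`, where the "NP" prefix is only a placeholder for whatever the tag set in use defines as a proper noun tag.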
public java.util.Map<java.lang.String,MutableInteger> checkSuffixes(java.lang.String word)
word - The word.
public java.util.Map<java.lang.String,MutableInteger> checkSuffixes(java.lang.String word, java.lang.String[] standardSpellings)
word - The word.
standardSpellings - List of standard spellings, or null if none.
public java.util.Map<java.lang.String,MutableInteger> getNoun(java.lang.String word)
word - The word.
A proper noun tag is emitted when the word begins with a capital letter. A plural noun tag is emitted when the word ends with an "s".
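The two surface cues can be sketched as a small decision function. This is only an illustration: NounTagSketch and nounTag are hypothetical names, the Penn Treebank style tags (NN, NNS, NNP, NNPS) are placeholders for whatever tag set the lexicon actually uses, and combining both cues into a plural proper noun tag is an assumption of the sketch.

```java
// Hypothetical illustration; tag names are placeholders, not the library's.
class NounTagSketch {
    /** Choose a noun tag from capitalization and a trailing "s". */
    static String nounTag(String word) {
        boolean proper = !word.isEmpty()
            && Character.isUpperCase(word.charAt(0));
        boolean plural = word.endsWith("s");
        if (proper) {
            return plural ? "NNPS" : "NNP";
        }
        return plural ? "NNS" : "NN";
    }
}
```

A trailing "s" is a crude cue (it misfires on words such as "glass"), which is consistent with the earlier caveat that some heuristics are only reliable for English text.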
public java.util.Map<java.lang.String,MutableInteger> guessPartsOfSpeech(java.lang.String word, boolean isFirstWord)
guessPartsOfSpeech in interface PartOfSpeechGuesser
word - The word.
isFirstWord - If word is first word in a sentence.
public void setTryStandardSpellings(boolean tryStandardSpellings)
setTryStandardSpellings in interface PartOfSpeechGuesser
public void setCheckPossessives(boolean checkPossessives)
setCheckPossessives in interface PartOfSpeechGuesser
public abstract java.util.Map<java.lang.String,MutableInteger> guessPartsOfSpeech(java.lang.String word)
guessPartsOfSpeech in interface PartOfSpeechGuesser
word - The word.