edu.northwestern.at.utils.corpuslinguistics.postagger
Class AbstractPartOfSpeechTagger

java.lang.Object
  extended by edu.northwestern.at.utils.IsCloseableObject
      extended by edu.northwestern.at.utils.corpuslinguistics.postagger.AbstractPartOfSpeechTagger
All Implemented Interfaces:
UsesLexicon, PartOfSpeechTagger, IsCloseable, UsesLogger
Direct Known Subclasses:
BigramTagger, HeppleTagger, SimpleTagger, TrigramTagger, UnigramTagger

public abstract class AbstractPartOfSpeechTagger
extends IsCloseableObject
implements PartOfSpeechTagger, IsCloseable, UsesLexicon, UsesLogger

Abstract Part of Speech tagger.

Provides default implementations for most of the PartOfSpeechTagger interface methods. To create a new part of speech tagger, extend this class and override methods as needed. At a minimum, you must implement the abstract tagAdornedWordList method.
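The Method Summary lists tagAdornedWordList as the only abstract method, so a minimal subclass might look roughly like the sketch below. The types SketchAdornedWord, SketchAbstractTagger, ConstantTagger and SketchWord, and the getTag/setTag method names, are hypothetical stand-ins so the example compiles on its own; they are not the library's classes or signatures.

```java
import java.util.List;

// Hypothetical stand-in for the library's AdornedWord; the method names
// here are assumptions made for this sketch only.
interface SketchAdornedWord {
    String getSpelling();
    String getTag();
    void setTag(String tag);
}

// Hypothetical stand-in for AbstractPartOfSpeechTagger, reduced to the
// single method the Method Summary marks as abstract.
abstract class SketchAbstractTagger {
    public abstract <T extends SketchAdornedWord> List<T> tagAdornedWordList(List<T> sentence);
}

// Toy subclass: assigns the same tag to every word, in place.
class ConstantTagger extends SketchAbstractTagger {
    private final String tag;

    ConstantTagger(String tag) {
        this.tag = tag;
    }

    @Override
    public <T extends SketchAdornedWord> List<T> tagAdornedWordList(List<T> sentence) {
        for (T word : sentence) {
            word.setTag(tag);
        }
        return sentence; // same list, with tags filled in
    }
}

// Simple mutable word used to exercise the sketch.
class SketchWord implements SketchAdornedWord {
    private final String spelling;
    private String tag;

    SketchWord(String spelling) {
        this.spelling = spelling;
    }

    @Override public String getSpelling() { return spelling; }
    @Override public String getTag() { return tag; }
    @Override public void setTag(String tag) { this.tag = tag; }
}
```

A real subclass would extend AbstractPartOfSpeechTagger itself and could rely on the inherited defaults for the remaining interface methods.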


Field Summary
protected  java.lang.String[] contextRules
          Context rules.
protected  ContextualSmoother contextualSmoother
          Contextual smoother.
protected  Lexicon dynamicLexicon
          Dynamic lexicon built on-the-fly for words not in static lexicon.
protected  java.lang.String[] lexicalRules
          Lexical rules.
protected  LexicalSmoother lexicalSmoother
          Lexical smoother.
protected  Lexicon lexicon
          Static lexicon used by tagger.
protected  Logger logger
          Logger used for output.
protected  PartOfSpeechGuesser partOfSpeechGuesser
          Part of speech guesser for words not in lexicon.
protected  PostTokenizer postTokenizer
          PostTokenizer for mapping raw tokens to initial spellings.
protected  PartOfSpeechRetagger retagger
          Fixup retagger.
protected  int ruleCorrections
          Number of corrections applied by rules.
protected  TransitionMatrix transitionMatrix
          Transition matrix used by tagger.
 
Constructor Summary
AbstractPartOfSpeechTagger()
          Create tagger.
 
Method Summary
 void clearRuleCorrections()
          Clear count of successful rule applications.
protected  void createPartOfSpeechGuesser()
          Create a part of speech guesser.
 Lexicon getDynamicLexicon()
          Get the dynamic word lexicon.
 Lexicon getLexicon()
          Get the static word lexicon.
 Lexicon getLexicon(java.lang.String word)
          Get the lexicon associated with a specific word.
 Logger getLogger()
          Get the logger.
 java.lang.String getMostCommonTag(java.lang.String word)
          Get the most common tag for a word.
 PartOfSpeechGuesser getPartOfSpeechGuesser()
          Get part of speech guesser.
 PartOfSpeechRetagger getRetagger()
          Get part of speech retagger.
 int getRuleCorrections()
          Get count of successful rule applications.
 int getTagCount(java.lang.String word, java.lang.String tag)
          Get count of times a word appears with a given tag.
 java.util.List<java.lang.String> getTagsForWord(java.lang.String word)
          Get potential part of speech tags for a word.
 TransitionMatrix getTransitionMatrix()
          Get tag transition probabilities matrix.
 void incrementRuleCorrections()
          Increment count of successful rule applications.
<T extends AdornedWord>
java.util.List<T>
retagWords(java.util.List<T> taggedSentence)
          Retag words in a tagged sentence.
 void setContextRules(java.lang.String[] contextRules)
          Set context rules for tagging.
 void setLexicalRules(java.lang.String[] lexicalRules)
          Set lexical rules for tagging.
 void setLexicon(Lexicon lexicon)
          Set the lexicon.
 void setLogger(Logger logger)
          Set the logger.
 void setPartOfSpeechGuesser(PartOfSpeechGuesser partOfSpeechGuesser)
          Set part of speech guesser.
 void setRetagger(PartOfSpeechRetagger retagger)
          Set part of speech retagger.
 void setTransitionMatrix(TransitionMatrix transitionMatrix)
          Set tag transition probabilities matrix.
abstract
<T extends AdornedWord>
java.util.List<T>
tagAdornedWordList(java.util.List<T> sentence)
          Tag a list of adorned words.
<T extends AdornedWord>
java.util.List<T>
tagAdornedWordSentence(java.util.List<T> sentence)
          Tag a sentence.
<T extends AdornedWord>
java.util.List<java.util.List<T>>
tagAdornedWordSentences(java.util.List<java.util.List<T>> sentences)
          Tag a list of sentences.
 java.util.List<AdornedWord> tagSentence(java.util.List<java.lang.String> sentence)
          Tag a sentence.
 java.util.List<java.util.List<AdornedWord>> tagSentences(java.util.List<java.util.List<java.lang.String>> sentences)
          Tag a list of sentences.
 boolean usesContextRules()
          See if tagger uses context rules.
 boolean usesLexicalRules()
          See if tagger uses lexical rules.
 boolean usesTransitionProbabilities()
          See if tagger uses a probability transition matrix.
 
Methods inherited from class edu.northwestern.at.utils.IsCloseableObject
close
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface edu.northwestern.at.utils.IsCloseable
close
 

Field Detail

lexicon

protected Lexicon lexicon
Static lexicon used by tagger.


dynamicLexicon

protected Lexicon dynamicLexicon
Dynamic lexicon built on-the-fly for words not in static lexicon.


transitionMatrix

protected TransitionMatrix transitionMatrix
Transition matrix used by tagger.


contextRules

protected java.lang.String[] contextRules
Context rules.


lexicalRules

protected java.lang.String[] lexicalRules
Lexical rules.


lexicalSmoother

protected LexicalSmoother lexicalSmoother
Lexical smoother.


contextualSmoother

protected ContextualSmoother contextualSmoother
Contextual smoother.


retagger

protected PartOfSpeechRetagger retagger
Fixup retagger.


partOfSpeechGuesser

protected PartOfSpeechGuesser partOfSpeechGuesser
Part of speech guesser for words not in lexicon.


postTokenizer

protected PostTokenizer postTokenizer
PostTokenizer for mapping raw tokens to initial spellings.


ruleCorrections

protected int ruleCorrections
Number of corrections applied by rules.


logger

protected Logger logger
Logger used for output.

Constructor Detail

AbstractPartOfSpeechTagger

public AbstractPartOfSpeechTagger()
Create tagger.

Method Detail

getLogger

public Logger getLogger()
Get the logger.

Specified by:
getLogger in interface UsesLogger
Returns:
The logger.

setLogger

public void setLogger(Logger logger)
Set the logger.

Specified by:
setLogger in interface UsesLogger
Parameters:
logger - The logger.

usesContextRules

public boolean usesContextRules()
See if tagger uses context rules.

Specified by:
usesContextRules in interface PartOfSpeechTagger
Returns:
True if tagger uses context rules.

usesLexicalRules

public boolean usesLexicalRules()
See if tagger uses lexical rules.

Specified by:
usesLexicalRules in interface PartOfSpeechTagger
Returns:
True if tagger uses lexical rules.

usesTransitionProbabilities

public boolean usesTransitionProbabilities()
See if tagger uses a probability transition matrix.

Specified by:
usesTransitionProbabilities in interface PartOfSpeechTagger
Returns:
True if tagger uses a probability transition matrix.

setContextRules

public void setContextRules(java.lang.String[] contextRules)
                     throws InvalidRuleException
Set context rules for tagging.

Specified by:
setContextRules in interface PartOfSpeechTagger
Parameters:
contextRules - String array of context rules.
Throws:
InvalidRuleException - if a rule is bad.

For taggers which do not use context rules, this is a no-op.


setLexicalRules

public void setLexicalRules(java.lang.String[] lexicalRules)
                     throws InvalidRuleException
Set lexical rules for tagging.

Specified by:
setLexicalRules in interface PartOfSpeechTagger
Parameters:
lexicalRules - String array of lexical rules.
Throws:
InvalidRuleException - if a rule is bad.

For taggers which do not use lexical rules, this is a no-op.


getLexicon

public Lexicon getLexicon()
Get the static word lexicon.

Specified by:
getLexicon in interface UsesLexicon
Specified by:
getLexicon in interface PartOfSpeechTagger
Returns:
The static word lexicon.

getDynamicLexicon

public Lexicon getDynamicLexicon()
Get the dynamic word lexicon.

Returns:
The dynamic lexicon.

getLexicon

public Lexicon getLexicon(java.lang.String word)
Get the lexicon associated with a specific word.

Specified by:
getLexicon in interface PartOfSpeechTagger
Parameters:
word - The word whose source lexicon is sought.
Returns:
The lexicon.

Most words do not have a source lexicon defined, in which case they come from the main static word lexicon. Usually only words derived by a suffix analysis have a source lexicon defined, which will of course be the suffix lexicon.
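The fallback described above can be pictured as a lookup with a default. This is only a sketch under assumptions: the nested Lexicon class, the sourceLexicons map and the recordSource helper are hypothetical, and the library may track source lexicons differently.

```java
import java.util.HashMap;
import java.util.Map;

class SourceLexiconSketch {
    // Hypothetical stand-in for the library's Lexicon type.
    static final class Lexicon {
        final String name;

        Lexicon(String name) {
            this.name = name;
        }
    }

    private final Lexicon staticLexicon;

    // Assumed bookkeeping: only words with an explicit source lexicon appear here.
    private final Map<String, Lexicon> sourceLexicons = new HashMap<>();

    SourceLexiconSketch(Lexicon staticLexicon) {
        this.staticLexicon = staticLexicon;
    }

    // Hypothetical helper: remember that a word came from a specific lexicon.
    void recordSource(String word, Lexicon source) {
        sourceLexicons.put(word, source);
    }

    // Words without a recorded source fall back to the static lexicon.
    Lexicon getLexicon(String word) {
        return sourceLexicons.getOrDefault(word, staticLexicon);
    }
}
```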


setLexicon

public void setLexicon(Lexicon lexicon)
Set the lexicon.

Specified by:
setLexicon in interface UsesLexicon
Specified by:
setLexicon in interface PartOfSpeechTagger
Parameters:
lexicon - Lexicon used for tagging.

getTransitionMatrix

public TransitionMatrix getTransitionMatrix()
Get tag transition probabilities matrix.

Specified by:
getTransitionMatrix in interface PartOfSpeechTagger
Returns:
Tag probabilities transition matrix. May be null for taggers which do not use a transition matrix.

setTransitionMatrix

public void setTransitionMatrix(TransitionMatrix transitionMatrix)
Set tag transition probabilities matrix.

Specified by:
setTransitionMatrix in interface PartOfSpeechTagger
Parameters:
transitionMatrix - Tag probabilities transition matrix.

For taggers which do not use transition matrices, this is a no-op.


getPartOfSpeechGuesser

public PartOfSpeechGuesser getPartOfSpeechGuesser()
Get part of speech guesser.

Specified by:
getPartOfSpeechGuesser in interface PartOfSpeechTagger
Returns:
The part of speech guesser.

setPartOfSpeechGuesser

public void setPartOfSpeechGuesser(PartOfSpeechGuesser partOfSpeechGuesser)
Set part of speech guesser.

Specified by:
setPartOfSpeechGuesser in interface PartOfSpeechTagger
Parameters:
partOfSpeechGuesser - The part of speech guesser.

getRetagger

public PartOfSpeechRetagger getRetagger()
Get part of speech retagger.

Specified by:
getRetagger in interface PartOfSpeechTagger
Returns:
The part of speech retagger. May be null.

setRetagger

public void setRetagger(PartOfSpeechRetagger retagger)
Set part of speech retagger.

Specified by:
setRetagger in interface PartOfSpeechTagger
Parameters:
retagger - The part of speech retagger.

getTagsForWord

public java.util.List<java.lang.String> getTagsForWord(java.lang.String word)
Get potential part of speech tags for a word.

Specified by:
getTagsForWord in interface PartOfSpeechTagger
Parameters:
word - The word whose part of speech tags we want.
Returns:
List of part of speech tags. May be null or empty.

When the word does not appear in the lexicon, the part of speech guesser is used to determine the tags based upon features of the word (suffix analysis, etc.).
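As an illustration of the lexicon-then-guesser order, and of why callers should allow for a null or empty result, here is a small sketch. The map-backed lexicon, the Function-based guesser, the hasTags helper and the tag names are all assumptions for this example, not the library's types or tag set.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class TagLookupSketch {
    // Assumed representation: word -> known tags.
    private final Map<String, List<String>> lexiconTags;

    // Assumed guesser: derives tags from features of the word; may return null.
    private final Function<String, List<String>> guesser;

    TagLookupSketch(Map<String, List<String>> lexiconTags,
                    Function<String, List<String>> guesser) {
        this.lexiconTags = lexiconTags;
        this.guesser = guesser;
    }

    // Consult the lexicon first; fall back to the guesser for unknown words.
    List<String> getTagsForWord(String word) {
        List<String> tags = lexiconTags.get(word);
        if (tags == null) {
            tags = guesser.apply(word);
        }
        return tags;
    }

    // Caller-side guard, since the result may be null or empty.
    static boolean hasTags(List<String> tags) {
        return tags != null && !tags.isEmpty();
    }
}
```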


getTagCount

public int getTagCount(java.lang.String word,
                       java.lang.String tag)
Get count of times a word appears with a given tag.

Specified by:
getTagCount in interface PartOfSpeechTagger
Parameters:
word - The word.
tag - The part of speech tag.
Returns:
The number of times the word appears with the given tag.

When the word does not appear in the lexicon, the part of speech guesser is used to compute a count based upon features of the word (suffix analysis, etc.).


getMostCommonTag

public java.lang.String getMostCommonTag(java.lang.String word)
Get the most common tag for a word.

Parameters:
word - The word.
Returns:
The most common part of speech tag for the word.

tagSentences

public java.util.List<java.util.List<AdornedWord>> tagSentences(java.util.List<java.util.List<java.lang.String>> sentences)
Tag a list of sentences.

Specified by:
tagSentences in interface PartOfSpeechTagger
Parameters:
sentences - The list of sentences.
Returns:
The sentences with words adorned with parts of speech.

The input is a List of sentences, each represented as a List of the words to be tagged. The output is a List of sentences, each represented as a List of AdornedWords.
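To make the nested list shapes concrete, the sketch below tags each sentence in turn and collects the results. That tagSentences delegates to a per-sentence tagger is an assumption of this example, not something this page states; the type parameter W stands in for AdornedWord, and the tagSentence function argument is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class TagSentencesSketch {
    // W stands in for AdornedWord; tagSentence is any per-sentence tagger.
    static <W> List<List<W>> tagSentences(
            List<List<String>> sentences,
            Function<List<String>, List<W>> tagSentence) {
        List<List<W>> result = new ArrayList<>(sentences.size());
        for (List<String> sentence : sentences) {
            // One output list per input sentence, in the same order.
            result.add(tagSentence.apply(sentence));
        }
        return result;
    }
}
```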


tagAdornedWordSentences

public <T extends AdornedWord> java.util.List<java.util.List<T>> tagAdornedWordSentences(java.util.List<java.util.List<T>> sentences)
Tag a list of sentences.

Specified by:
tagAdornedWordSentences in interface PartOfSpeechTagger
Parameters:
sentences - The list of sentences.
Returns:
The sentences with words adorned with parts of speech.

The input is a List of sentences, each represented as a List of the adorned words to be tagged. The output is a List of sentences, each represented as a List of AdornedWords.


retagWords

public <T extends AdornedWord> java.util.List<T> retagWords(java.util.List<T> taggedSentence)
Retag words in a tagged sentence.

Specified by:
retagWords in interface PartOfSpeechTagger
Parameters:
taggedSentence - The tagged sentence.
Returns:
The retagged sentence.

This method calls the retagger, if any. If no retagger is defined, the input tagged sentence is returned unchanged. Override this method to add custom retagging without the use of a retagger.
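The null-retagger behavior described above amounts to a guard before delegating. In this sketch the Retagger interface and its retag method are hypothetical stand-ins for PartOfSpeechRetagger, whose actual method names are not shown on this page.

```java
import java.util.List;

class RetagSketch {
    // Hypothetical stand-in for PartOfSpeechRetagger.
    interface Retagger<T> {
        List<T> retag(List<T> taggedSentence);
    }

    // With no retagger, hand back the input untouched; otherwise delegate.
    static <T> List<T> retagWords(List<T> taggedSentence, Retagger<T> retagger) {
        if (retagger == null) {
            return taggedSentence;
        }
        return retagger.retag(taggedSentence);
    }
}
```

A subclass that overrides retagWords directly can skip the retagger object and apply its own fixups in the method body.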


clearRuleCorrections

public void clearRuleCorrections()
Clear count of successful rule applications.

Specified by:
clearRuleCorrections in interface PartOfSpeechTagger

incrementRuleCorrections

public void incrementRuleCorrections()
Increment count of successful rule applications.

Specified by:
incrementRuleCorrections in interface PartOfSpeechTagger

getRuleCorrections

public int getRuleCorrections()
Get count of successful rule applications.

Specified by:
getRuleCorrections in interface PartOfSpeechTagger
Returns:
The count of successful rule applications.

createPartOfSpeechGuesser

protected void createPartOfSpeechGuesser()
Create a part of speech guesser.


tagSentence

public java.util.List<AdornedWord> tagSentence(java.util.List<java.lang.String> sentence)
Tag a sentence.

Specified by:
tagSentence in interface PartOfSpeechTagger
Parameters:
sentence - The sentence as a list of string words.
Returns:
A list of AdornedWords, one per word in the sentence, tagged with parts of speech.

The input sentence is a List of string words to be tagged. The output is a List of AdornedWords holding the words with parts of speech added.


tagAdornedWordSentence

public <T extends AdornedWord> java.util.List<T> tagAdornedWordSentence(java.util.List<T> sentence)
Tag a sentence.

Specified by:
tagAdornedWordSentence in interface PartOfSpeechTagger
Parameters:
sentence - The sentence as a list of adorned words.
Returns:
The list of adorned words in the sentence, tagged with parts of speech.

The input sentence is a List of adorned words to be tagged. The output is the same list with parts of speech added/modified.


tagAdornedWordList

public abstract <T extends AdornedWord> java.util.List<T> tagAdornedWordList(java.util.List<T> sentence)
Tag a list of adorned words.

Specified by:
tagAdornedWordList in interface PartOfSpeechTagger
Parameters:
sentence - The sentence as a list of adorned words.
Returns:
The tagged sentence (same as input with parts of speech added).