edu.northwestern.at.utils.corpuslinguistics.postagger.regexp
Class RegexpTagger

java.lang.Object
  extended by edu.northwestern.at.utils.IsCloseableObject
      extended by edu.northwestern.at.utils.corpuslinguistics.postagger.AbstractPartOfSpeechTagger
          extended by edu.northwestern.at.utils.corpuslinguistics.postagger.unigram.UnigramTagger
              extended by edu.northwestern.at.utils.corpuslinguistics.postagger.regexp.RegexpTagger
All Implemented Interfaces:
UsesLexicon, CanTagOneWord, PartOfSpeechTagger, IsCloseable, UsesLogger

public class RegexpTagger
extends UnigramTagger
implements PartOfSpeechTagger, CanTagOneWord

Regular Expression Part of Speech tagger.

The regular expression part of speech tagger uses regular expressions to assign a part of speech tag to a spelling.


Field Summary
protected  java.util.regex.Matcher[] regexpMatchers
           
protected  java.util.regex.Pattern[] regexpPatterns
          Regular expression patterns, one for each lexical rule.
protected  java.lang.String[] regexpTags
           
 
Fields inherited from class edu.northwestern.at.utils.corpuslinguistics.postagger.AbstractPartOfSpeechTagger
contextRules, contextualSmoother, dynamicLexicon, lexicalRules, lexicalSmoother, lexicon, logger, partOfSpeechGuesser, postTokenizer, retagger, ruleCorrections, transitionMatrix
 
Constructor Summary
RegexpTagger()
          Create a regular expression tagger.
 
Method Summary
 void setLexicalRules(java.lang.String[] lexicalRules)
          Set lexical rules for tagging.
 java.lang.String tagWord(java.lang.String word)
          Tag a single word.
 java.lang.String toString()
          Return tagger description.
 boolean usesLexicalRules()
          See if tagger uses lexical rules.
 
Methods inherited from class edu.northwestern.at.utils.corpuslinguistics.postagger.unigram.UnigramTagger
tagAdornedWordList, tagWord
 
Methods inherited from class edu.northwestern.at.utils.corpuslinguistics.postagger.AbstractPartOfSpeechTagger
clearRuleCorrections, createPartOfSpeechGuesser, getDynamicLexicon, getLexicon, getLexicon, getLogger, getMostCommonTag, getPartOfSpeechGuesser, getRetagger, getRuleCorrections, getTagCount, getTagsForWord, getTransitionMatrix, incrementRuleCorrections, retagWords, setContextRules, setLexicon, setLogger, setPartOfSpeechGuesser, setRetagger, setTransitionMatrix, tagAdornedWordSentence, tagAdornedWordSentences, tagSentence, tagSentences, usesContextRules, usesTransitionProbabilities
 
Methods inherited from class edu.northwestern.at.utils.IsCloseableObject
close
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface edu.northwestern.at.utils.corpuslinguistics.postagger.PartOfSpeechTagger
clearRuleCorrections, getLexicon, getLexicon, getPartOfSpeechGuesser, getRetagger, getRuleCorrections, getTagCount, getTagsForWord, getTransitionMatrix, incrementRuleCorrections, retagWords, setContextRules, setLexicon, setPartOfSpeechGuesser, setRetagger, setTransitionMatrix, tagAdornedWordList, tagAdornedWordSentence, tagAdornedWordSentences, tagSentence, tagSentences, usesContextRules, usesTransitionProbabilities
 
Methods inherited from interface edu.northwestern.at.utils.corpuslinguistics.postagger.CanTagOneWord
tagWord
 
Methods inherited from interface edu.northwestern.at.utils.IsCloseable
close
 

Field Detail

regexpPatterns

protected java.util.regex.Pattern[] regexpPatterns
Regular expression patterns, one for each lexical rule.


regexpMatchers

protected java.util.regex.Matcher[] regexpMatchers

regexpTags

protected java.lang.String[] regexpTags
Constructor Detail

RegexpTagger

public RegexpTagger()
Create a regular expression tagger.

Method Detail

usesLexicalRules

public boolean usesLexicalRules()
See if tagger uses lexical rules.

Specified by:
usesLexicalRules in interface PartOfSpeechTagger
Overrides:
usesLexicalRules in class AbstractPartOfSpeechTagger
Returns:
true, since this tagger uses lexical rules based on regular expressions.

setLexicalRules

public void setLexicalRules(java.lang.String[] lexicalRules)
                     throws InvalidRuleException
Set lexical rules for tagging.

Specified by:
setLexicalRules in interface PartOfSpeechTagger
Overrides:
setLexicalRules in class AbstractPartOfSpeechTagger
Parameters:
lexicalRules - String array of lexical rules.
Throws:
InvalidRuleException - if a rule is bad.

For the regular expression tagger, each rule takes the form:

regular-expression \t part-of-speech-tag

where "regular-expression" is the regular expression to match and "part-of-speech-tag" is the part of speech tag to assign to a spelling matched by that regular expression. An ASCII tab character (\t) separates the pattern from the tag.
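
The rule format above can be illustrated with a minimal sketch that uses only java.util.regex. The class RuleFormatSketch, the method splitRule, and the sample tags are hypothetical and are not part of this package. Trimming whitespace around the tab and throwing IllegalArgumentException (instead of InvalidRuleException) are assumptions made for the illustration.

```java
import java.util.regex.Pattern;

class RuleFormatSketch {
    /**
     * Splits one rule at its first tab into { regular expression, tag }.
     * Hypothetical helper; it only illustrates the documented rule format.
     */
    static String[] splitRule(String rule) {
        int tab = rule.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("No tab separator in rule: " + rule);
        }
        // Assumption: whitespace around the tab is not significant.
        String regexp = rule.substring(0, tab).trim();
        String tag = rule.substring(tab + 1).trim();
        if (regexp.isEmpty() || tag.isEmpty()) {
            throw new IllegalArgumentException("Empty pattern or tag in rule: " + rule);
        }
        // Compiling checks the syntax; an invalid pattern throws
        // PatternSyntaxException, a subclass of IllegalArgumentException.
        Pattern.compile(regexp);
        return new String[] { regexp, tag };
    }

    public static void main(String[] args) {
        // Sample rules with made-up tags; "\t" is the tab separator.
        String[] rules = { ".*ing$\tvvg", "^[0-9]+$\tcrd" };
        for (String rule : rules) {
            String[] parts = splitRule(rule);
            System.out.println(parts[0] + " -> " + parts[1]);
        }
    }
}
```

An array of strings shaped like the rules array in main is the kind of argument setLexicalRules expects, with tags taken from the tag set actually in use.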


tagWord

public java.lang.String tagWord(java.lang.String word)
Tag a single word.

Specified by:
tagWord in interface CanTagOneWord
Overrides:
tagWord in class UnigramTagger
Parameters:
word - The word.
Returns:
The part of speech for the word.

Applies each of the regular expressions defined by the lexical rules, in order, and returns the tag associated with the first matching regular expression.
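
The first-match behavior described above might look roughly like the following sketch. It is not the actual RegexpTagger implementation: the class FirstMatchSketch, its patterns, and its tags are hypothetical, the use of Matcher.find() is an assumption, and returning null when nothing matches is a placeholder, since the fallback behavior is not documented here.

```java
import java.util.regex.Pattern;

class FirstMatchSketch {
    // Parallel arrays, loosely mirroring the regexpPatterns and regexpTags fields.
    static final Pattern[] PATTERNS = {
        Pattern.compile("^[0-9]+$"),
        Pattern.compile(".*ly$"),
        Pattern.compile(".*y$")
    };
    static final String[] TAGS = { "NUM", "ADV", "ADJ" };

    /** Returns the tag of the first pattern that matches the word, else null. */
    static String tagWord(String word) {
        for (int i = 0; i < PATTERNS.length; i++) {
            // Assumption: a rule applies when its pattern is found in the word.
            if (PATTERNS[i].matcher(word).find()) {
                return TAGS[i];
            }
        }
        return null; // placeholder for whatever fallback the real tagger uses
    }

    public static void main(String[] args) {
        System.out.println(tagWord("42"));      // NUM
        System.out.println(tagWord("quickly")); // ADV (the ".*y$" rule also matches, but comes later)
        System.out.println(tagWord("happy"));   // ADJ
        System.out.println(tagWord("cat"));     // null
    }
}
```

Because the first matching rule wins, rule order matters: more specific patterns should come before more general ones.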


toString

public java.lang.String toString()
Return tagger description.

Overrides:
toString in class UnigramTagger
Returns:
Tagger description.