public class HeppleTagger extends AbstractPartOfSpeechTagger implements PartOfSpeechTagger, PartOfSpeechRetagger
Copyright (c) 2001-2005, The University of Sheffield.
This file is part of GATE (see http://gate.ac.uk/), and is free software, licenced under the GNU Library General Public License, Version 2, June 1991 (in the distribution as file licence.html, and also available at http://gate.ac.uk/gate/licence.html).
HeppleTagger was originally written by Mark Hepple. The GATE version contains modifications by Valentin Tablan and Niraj Aswani.
This version also contains many modifications made at Northwestern University for use in the WordHoard project.
Comments:
Implements a version of the decision list based tagging method described in:
M. Hepple. 2000. Independence and Commitment: Assumptions for Rapid Training and Execution of Rule-based Part-of-Speech Taggers. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000). Hong Kong, October 2000.
Modified by Philip R. Burns at Northwestern University to remove dependencies upon the Penn Treebank tag set, to allow pluggable handling of unknown words, to move all input/output for tagged text and rules into the calling classes, and to allow the Hepple tagger to be used as a retagger.
Modifier and Type | Field and Description |
---|---|
protected boolean | debug - Debug flag. |
java.lang.String[][] | lexBuff - Sliding parts of speech buffer. |
protected java.util.Map<java.lang.String,java.util.List<Rule>> | rules - Tagging rules. |
protected static java.lang.String | staart - Marks unused positions in sliding word buffer. |
protected static java.lang.String[] | staartLex |
protected static AdornedWord | staartWordAndTag |
java.lang.String[] | tagBuff - Sliding tag buffer. |
java.lang.String[] | wordBuff - Sliding word buffer. |
contextRules, contextualSmoother, dynamicLexicon, lexicalRules, lexicalSmoother, lexicon, logger, partOfSpeechGuesser, postTokenizer, retagger, ruleCorrections, transitionMatrix
Constructor and Description |
---|
HeppleTagger() - Construct a Hepple POS tagger. |
Modifier and Type | Method and Description |
---|---|
protected Rule | createNewRule(java.lang.String ruleId) - Creates a new rule of the required type according to the provided ID. |
boolean | getCanAddOrDeleteWords() - Can retagger add or delete words in the original sentence? |
protected java.lang.String[] | getPartsOfSpeech(java.lang.String word, boolean isFirstWord) - Get parts of speech for a word. |
protected <T extends AdornedWord> boolean | oneRetagStep(T adornedWord, boolean isFirstWord, java.util.List<T> taggedSentence) - Adds a new word to the current retagging window. |
protected boolean | oneStep(AdornedWord word, boolean isFirstWord, java.util.List taggedSentence) - Adds a new word to the current tagging window. |
<T extends AdornedWord> java.util.List<T> | retagSentence(java.util.List<T> sentence) - Retag one sentence. |
void | setCanAddOrDeleteWords(boolean canAddOrDeleteWords) - Can retagger add or delete words in the original sentence? |
void | setContextRules(java.lang.String[] contextRules) - Set context rules for tagging. |
<T extends AdornedWord> java.util.List<T> | tagAdornedWordList(java.util.List<T> sentence) - Tag an adorned word list. |
java.lang.String | toString() - Return tagger description. |
boolean | usesContextRules() - See if tagger uses context rules. |
clearRuleCorrections, createPartOfSpeechGuesser, getContextualSmoother, getDynamicLexicon, getLexicalSmoother, getLexicon, getLexicon, getLogger, getMostCommonTag, getPartOfSpeechGuesser, getPostTokenizer, getRetagger, getRuleCorrections, getTagCount, getTagsForWord, getTransitionMatrix, incrementRuleCorrections, retagWords, setContextualSmoother, setLexicalRules, setLexicalSmoother, setLexicon, setLogger, setPartOfSpeechGuesser, setPostTokenizer, setRetagger, setTransitionMatrix, tagAdornedWordSentence, tagAdornedWordSentences, tagSentence, tagSentences, usesLexicalRules, usesTransitionProbabilities
close
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
clearRuleCorrections, getContextualSmoother, getLexicalSmoother, getLexicon, getLexicon, getPartOfSpeechGuesser, getPostTokenizer, getRetagger, getRuleCorrections, getTagCount, getTagsForWord, getTransitionMatrix, incrementRuleCorrections, retagWords, setContextualSmoother, setLexicalRules, setLexicalSmoother, setLexicon, setPartOfSpeechGuesser, setPostTokenizer, setRetagger, setTransitionMatrix, tagAdornedWordSentence, tagAdornedWordSentences, tagSentence, tagSentences, usesLexicalRules, usesTransitionProbabilities
close
protected java.util.Map<java.lang.String,java.util.List<Rule>> rules
The tagging rules are stored in a map. The map keys are parts of speech. The value for each part-of-speech key is a list of rules which apply to that part of speech.
Tagging rules are specified using the syntax proposed by Eric Brill in his dissertation. Rules take the general form:
fromtag totag condition param1 param2
where "fromtag" is the current tag for a word, "totag" is the new tag that replaces the current tag if the "condition" is met, and "param1" and "param2" are optional values for the condition test. Each rule must specify at least the fromtag, totag, and condition. The fromtag values are the keys for the rules map.
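The rule layout described above can be illustrated with a minimal sketch. It assumes rules arrive as whitespace-separated lines and groups them by fromtag, as the rules map does. ParsedRule, parse, and index are hypothetical names introduced here and are not part of this class; the real tagger uses its own Rule type and reports bad rules with InvalidRuleException. The condition names in the sample lines are illustrative, written in the style of Brill's templates, since this page does not list the conditions the tagger accepts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: ParsedRule is a hypothetical stand-in, not the library's Rule class.
class RuleMapSketch {

    /** One rule line of the form: fromtag totag condition [param1 [param2]]. */
    static final class ParsedRule {
        final String fromTag;
        final String toTag;
        final String condition;
        final List<String> params;

        ParsedRule(String fromTag, String toTag, String condition, List<String> params) {
            this.fromTag = fromTag;
            this.toTag = toTag;
            this.condition = condition;
            this.params = params;
        }
    }

    /** Splits a rule line on whitespace; the first three fields are mandatory. */
    static ParsedRule parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 3) {
            // The real tagger signals this with InvalidRuleException (assumption: not modelled here).
            throw new IllegalArgumentException("Rule needs fromtag, totag, and condition: " + line);
        }
        List<String> params = new ArrayList<>(Arrays.asList(fields).subList(3, fields.length));
        return new ParsedRule(fields[0], fields[1], fields[2], params);
    }

    /** Groups parsed rules by fromtag, keeping the order in which the rules were listed. */
    static Map<String, List<ParsedRule>> index(String[] lines) {
        Map<String, List<ParsedRule>> rules = new LinkedHashMap<>();
        for (String line : lines) {
            ParsedRule rule = parse(line);
            rules.computeIfAbsent(rule.fromTag, key -> new ArrayList<>()).add(rule);
        }
        return rules;
    }

    public static void main(String[] args) {
        // Sample rule lines with illustrative condition names.
        String[] lines = { "NN VB PREVTAG TO", "NN JJ NEXTTAG NN", "VBD VBN PREV1OR2TAG VBZ" };
        for (Map.Entry<String, List<ParsedRule>> entry : index(lines).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().size() + " rule(s)");
        }
    }
}
```

Keying the map by fromtag means that only the rules applicable to a word's current tag need to be examined when that word is tagged.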
protected static final java.lang.String staart
protected static final java.lang.String[] staartLex
protected static final AdornedWord staartWordAndTag
public java.lang.String[] wordBuff
public java.lang.String[] tagBuff
public java.lang.String[][] lexBuff
protected boolean debug
public boolean usesContextRules()
usesContextRules in interface PartOfSpeechTagger
usesContextRules in class AbstractPartOfSpeechTagger
public void setContextRules(java.lang.String[] contextRules) throws InvalidRuleException
setContextRules in interface PartOfSpeechTagger
setContextRules in class AbstractPartOfSpeechTagger
Parameters:
contextRules - String array of context rules.
Throws:
InvalidRuleException - if a rule is bad.
protected Rule createNewRule(java.lang.String ruleId) throws InvalidRuleException
Parameters:
ruleId - The ID for the rule to be created.
Throws:
InvalidRuleException
public <T extends AdornedWord> java.util.List<T> tagAdornedWordList(java.util.List<T> sentence)
tagAdornedWordList in interface PartOfSpeechTagger
tagAdornedWordList in class AbstractPartOfSpeechTagger
Parameters:
sentence - The sentence as a List of AdornedWord.
Returns:
A List of AdornedWord holding the words in the sentence tagged with parts of speech.
The input sentence is a List of AdornedWord entries to be tagged. The output is the same list of words with parts of speech added.
protected boolean oneStep(AdornedWord word, boolean isFirstWord, java.util.List taggedSentence)
Parameters:
word - The new word to add.
isFirstWord - True if word is first in sentence.
taggedSentence - A List of adorned words representing the results of tagging the current sentence so far.
Adds a new word to the last position of the current window of 7 words and tags the word currently in the middle (position 3, counting from 0). This method also reads the word in the first position and adds its tag to the taggedSentence structure, because this word would be lost at the next advance. Returns true if this word completes a sentence, and false otherwise.
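The window mechanics described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the tagger's actual code: PAD, step, and tagSentence are hypothetical names, initial tags are supplied by the caller instead of coming from a lexicon, and a single hard-coded toy rule (change NN to VB after TO) stands in for the rules map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: a 7-slot sliding window that tags the middle slot.
class SlidingWindowSketch {
    static final String PAD = "STAART"; // hypothetical marker for unused positions
    static final int WINDOW = 7;
    static final int MIDDLE = 3;        // zero-based middle of a 7-slot window

    private final String[] wordBuff = new String[WINDOW];
    private final String[] tagBuff = new String[WINDOW];

    SlidingWindowSketch() {
        Arrays.fill(wordBuff, PAD);
        Arrays.fill(tagBuff, PAD);
    }

    /** Advances the window by one word, retags the middle slot, and emits the first slot. */
    void step(String word, String initialTag, List<String> tagged) {
        // Shift everything one slot to the left and place the new word in the last slot.
        System.arraycopy(wordBuff, 1, wordBuff, 0, WINDOW - 1);
        System.arraycopy(tagBuff, 1, tagBuff, 0, WINDOW - 1);
        wordBuff[WINDOW - 1] = word;
        tagBuff[WINDOW - 1] = initialTag;

        // Toy rule (assumption): NN becomes VB when the previous tag is TO.
        if ("NN".equals(tagBuff[MIDDLE]) && "TO".equals(tagBuff[MIDDLE - 1])) {
            tagBuff[MIDDLE] = "VB";
        }

        // The first slot is overwritten by the next shift, so record it now.
        if (!PAD.equals(wordBuff[0])) {
            tagged.add(wordBuff[0] + "/" + tagBuff[0]);
        }
    }

    /** Feeds a sentence through the window, then pads until every word has been emitted. */
    static List<String> tagSentence(String[] words, String[] initialTags) {
        SlidingWindowSketch window = new SlidingWindowSketch();
        List<String> tagged = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            window.step(words[i], initialTags[i], tagged);
        }
        for (int i = 0; i < WINDOW - 1; i++) {
            window.step(PAD, PAD, tagged);
        }
        return tagged;
    }

    public static void main(String[] args) {
        String[] words = { "I", "want", "to", "run" };
        String[] tags = { "PRP", "VBP", "TO", "NN" };
        System.out.println(tagSentence(words, tags));
    }
}
```

With the sample input in main, the sketch yields I/PRP, want/VBP, to/TO, run/VB. Each word is retagged when it reaches the middle slot, where three slots of context are visible on either side, and it is written to the output only when it reaches the first slot.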
public <T extends AdornedWord> java.util.List<T> retagSentence(java.util.List<T> sentence)
retagSentence in interface PartOfSpeechRetagger
Parameters:
sentence - List of adorned words to retag.
protected <T extends AdornedWord> boolean oneRetagStep(T adornedWord, boolean isFirstWord, java.util.List<T> taggedSentence)
Parameters:
adornedWord - The new word and its tag.
isFirstWord - True if word is first in sentence.
taggedSentence - A List of adorned words representing the results of tagging the current sentence so far.
Adds a new word to the last position of the current window of 7 words and tags the word currently in the middle (position 3, counting from 0). This method also reads the word in the first position and adds its tag to the taggedSentence structure, because this word would be lost at the next advance. Returns true if this word completes a sentence, and false otherwise.
protected java.lang.String[] getPartsOfSpeech(java.lang.String word, boolean isFirstWord)
Parameters:
word - The word to be classified.
isFirstWord - True if word is first word in sentence.
The lexicon must always return one or more parts of speech. In addition, for this tagger, the most frequently occurring tag must be the first one in the returned string array.
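The ordering requirement above can be illustrated with a small sketch that turns per-word tag counts into an array with the most frequent tag first. The names tagsByFrequency and fallbackTag are hypothetical, and the count map is an assumed input format; the actual lexicon and part of speech guesser interfaces are not shown on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: orders a word's candidate tags by descending frequency.
class TagOrderSketch {

    /**
     * Returns the tags sorted by descending count, breaking ties alphabetically.
     * When no counts are known, returns a one-element array holding fallbackTag,
     * so that the result always contains at least one part of speech.
     */
    static String[] tagsByFrequency(Map<String, Integer> tagCounts, String fallbackTag) {
        if (tagCounts == null || tagCounts.isEmpty()) {
            return new String[] { fallbackTag };
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(tagCounts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        String[] tags = new String[entries.size()];
        for (int i = 0; i < tags.length; i++) {
            tags[i] = entries.get(i).getKey();
        }
        return tags;
    }
}
```

For counts of NN=3, VB=10, and JJ=3 the sketch returns VB, JJ, NN, so a caller that takes element 0 as the initial tag starts from the most frequent reading.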
public boolean getCanAddOrDeleteWords()
getCanAddOrDeleteWords in interface PartOfSpeechRetagger
public void setCanAddOrDeleteWords(boolean canAddOrDeleteWords)
setCanAddOrDeleteWords in interface PartOfSpeechRetagger
Parameters:
canAddOrDeleteWords - true if retagger can add or delete words. Ignored here.
public java.lang.String toString()
toString in class java.lang.Object