edu.northwestern.at.morphadorner.tools.relemmatize
Class RelemmatizeFilter

java.lang.Object
  extended by org.xml.sax.helpers.XMLFilterImpl
      extended by edu.northwestern.at.utils.xml.ExtendedXMLFilterImpl
          extended by edu.northwestern.at.morphadorner.tools.relemmatize.RelemmatizeFilter
All Implemented Interfaces:
org.xml.sax.ContentHandler, org.xml.sax.DTDHandler, org.xml.sax.EntityResolver, org.xml.sax.ErrorHandler, org.xml.sax.XMLFilter, org.xml.sax.XMLReader

public class RelemmatizeFilter
extends ExtendedXMLFilterImpl

Filter that updates standard spellings and lemmata in an adorned file.


Field Summary
protected  java.lang.String lemmaSeparator
          Lemma separator.
protected  int lemmataChanged
          Number of lemmata changed.
protected  Lemmatizer lemmatizer
          Lemmatizer.
protected  NameStandardizer nameStandardizer
          Name standardizer.
protected  PartOfSpeechTags partOfSpeechTags
          Part of speech tags.
protected  SpellingMapper spellingMapper
          Spelling mapper.
protected  WordTokenizer spellingTokenizer
          Spelling tokenizer.
protected  int standardChanged
          Number of standard spellings changed.
protected  SpellingStandardizer standardizer
          Spelling standardizer.
protected  Lexicon wordLexicon
          Word lexicon.
protected  int wordsProcessed
          Number of words processed.
 
Constructor Summary
RelemmatizeFilter(org.xml.sax.XMLReader reader, Lexicon wordLexicon, Lemmatizer lemmatizer, NameStandardizer nameStandardizer, SpellingStandardizer standardizer, SpellingMapper spellingMapper)
          Create a relemmatize filter.
 
Method Summary
 java.lang.String getLemma(java.lang.String spelling, java.lang.String partOfSpeech)
          Get lemma for a word.
 int getLemmataChanged()
          Return number of lemmata changed.
 int getStandardChanged()
          Return number of standard spellings changed.
protected  java.lang.String getStandardizedSpelling(java.lang.String correctedSpelling, java.lang.String partOfSpeech)
          Get standardized spelling.
 int getWordsProcessed()
          Return number of words processed.
 void startElement(java.lang.String uri, java.lang.String localName, java.lang.String qName, org.xml.sax.Attributes atts)
          Handle start of an XML element.
 
Methods inherited from class edu.northwestern.at.utils.xml.ExtendedXMLFilterImpl
removeAttribute, setAttributeValue, setAttributeValue, setAttributeValue
 
Methods inherited from class org.xml.sax.helpers.XMLFilterImpl
characters, endDocument, endElement, endPrefixMapping, error, fatalError, getContentHandler, getDTDHandler, getEntityResolver, getErrorHandler, getFeature, getParent, getProperty, ignorableWhitespace, notationDecl, parse, parse, processingInstruction, resolveEntity, setContentHandler, setDocumentLocator, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setParent, setProperty, skippedEntity, startDocument, startPrefixMapping, unparsedEntityDecl, warning
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

wordLexicon

protected Lexicon wordLexicon
Word lexicon.


lemmatizer

protected Lemmatizer lemmatizer
Lemmatizer.


nameStandardizer

protected NameStandardizer nameStandardizer
Name standardizer.


standardizer

protected SpellingStandardizer standardizer
Spelling standardizer.


spellingMapper

protected SpellingMapper spellingMapper
Spelling mapper.


partOfSpeechTags

protected PartOfSpeechTags partOfSpeechTags
Part of speech tags.


spellingTokenizer

protected WordTokenizer spellingTokenizer
Spelling tokenizer.


lemmaSeparator

protected java.lang.String lemmaSeparator
Lemma separator.


lemmataChanged

protected int lemmataChanged
Number of lemmata changed.


standardChanged

protected int standardChanged
Number of standard spellings changed.


wordsProcessed

protected int wordsProcessed
Number of words processed.

Constructor Detail

RelemmatizeFilter

public RelemmatizeFilter(org.xml.sax.XMLReader reader,
                         Lexicon wordLexicon,
                         Lemmatizer lemmatizer,
                         NameStandardizer nameStandardizer,
                         SpellingStandardizer standardizer,
                         SpellingMapper spellingMapper)
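Create a relemmatize filter.

Parameters:
reader - XML input reader to which this filter applies.
wordLexicon - Word lexicon.
lemmatizer - Lemmatizer.
nameStandardizer - Name standardizer.
standardizer - Spelling standardizer.
spellingMapper - Spelling mapper.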
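
The reader parameter is the upstream XMLReader whose events the filter receives. The following is a minimal sketch of how a SAX filter is typically chained onto a reader and its output serialized, using only JDK classes. A plain XMLFilterImpl stands in for RelemmatizeFilter here, since building the lexicon, lemmatizer, and standardizer arguments is not covered on this page; the class name FilterPipelineSketch and the method runThroughFilter are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLFilter;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

class FilterPipelineSketch {
    /** Parses xml through a pass-through filter and returns the serialized result. */
    static String runThroughFilter(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            // Stand-in for RelemmatizeFilter: the filter wraps the reader
            // and forwards every SAX event unchanged.
            XMLFilter filter = new XMLFilterImpl(reader);

            // An identity transform pulls events from the filter and
            // writes them back out as XML text.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            identity.transform(
                new SAXSource(filter, new InputSource(new StringReader(xml))),
                new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runThroughFilter("<text><w lem=\"be\">is</w></text>"));
    }
}
```

In a real run, the XMLFilterImpl instance would be replaced by a RelemmatizeFilter built with the six constructor arguments listed above.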

Method Detail

startElement

public void startElement(java.lang.String uri,
                         java.lang.String localName,
                         java.lang.String qName,
                         org.xml.sax.Attributes atts)
                  throws org.xml.sax.SAXException
Handle start of an XML element.

Specified by:
startElement in interface org.xml.sax.ContentHandler
Overrides:
startElement in class org.xml.sax.helpers.XMLFilterImpl
Parameters:
uri - The XML element's URI.
localName - The XML element's local name.
qName - The XML element's qualified name.
atts - The XML element's attributes.
Throws:
org.xml.sax.SAXException
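
As an illustration of the override pattern, here is a minimal sketch of an XMLFilterImpl subclass that rewrites an attribute in startElement and keeps counters similar to wordsProcessed and lemmataChanged. It is not the RelemmatizeFilter implementation: the element name "w", the attribute name "lem", and the lowercasing rule are assumptions chosen for the example, and LowercaseLemmaFilter is a hypothetical class.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

class StartElementSketch {
    /** Hypothetical filter: lowercases the "lem" attribute of each "w" element. */
    static class LowercaseLemmaFilter extends XMLFilterImpl {
        int wordsProcessed = 0;
        int lemmataChanged = 0;

        LowercaseLemmaFilter(XMLReader parent) {
            super(parent);
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            if ("w".equals(qName)) {
                wordsProcessed++;
                // Attributes is read-only, so work on a mutable copy.
                AttributesImpl copy = new AttributesImpl(atts);
                int index = copy.getIndex("lem");
                if (index >= 0) {
                    String oldLemma = copy.getValue(index);
                    String newLemma = oldLemma.toLowerCase(Locale.ROOT);
                    if (!newLemma.equals(oldLemma)) {
                        copy.setValue(index, newLemma);
                        lemmataChanged++;
                    }
                }
                atts = copy;
            }
            // Pass the (possibly modified) element on to the next handler.
            super.startElement(uri, localName, qName, atts);
        }
    }

    /** Parses xml through the filter; returns {wordsProcessed, lemmataChanged}. */
    static int[] run(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            LowercaseLemmaFilter filter = new LowercaseLemmaFilter(reader);
            filter.parse(new InputSource(new StringReader(xml)));
            return new int[] { filter.wordsProcessed, filter.lemmataChanged };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = run("<text><w lem=\"Be\">is</w><w lem=\"cat\">cat</w></text>");
        System.out.println("words=" + counts[0] + " changed=" + counts[1]);
    }
}
```

Copying the attributes into an AttributesImpl before editing is the usual approach, because the Attributes object passed by the parser must not be modified in place.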

getLemma

public java.lang.String getLemma(java.lang.String spelling,
                                 java.lang.String partOfSpeech)
Get lemma for a word.

On output, sets the lemma field of the adorned word. The word lexicon is consulted first for the lemma. If the lexicon does not contain the lemma, the lemmatizer is used.

Parameters:
spelling - The word spelling.
partOfSpeech - The part of speech.
Returns:
The lemma for the word.
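
The lexicon-first lookup with a lemmatizer fallback can be sketched as follows. This is an illustration under stated assumptions, not the MorphAdorner code: the lexicon is modelled as a Map keyed by "spelling/partOfSpeech", the lemmatizer as a UnaryOperator, and the class LemmaLookupSketch, its key format, and the trailing-"s" rule are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

class LemmaLookupSketch {
    /**
     * Returns the lexicon's lemma for the spelling and part of speech if one
     * is recorded; otherwise falls back to the rule-based lemmatizer.
     */
    static String getLemma(Map<String, String> lexicon,
                           UnaryOperator<String> lemmatizer,
                           String spelling,
                           String partOfSpeech) {
        String fromLexicon = lexicon.get(spelling + "/" + partOfSpeech);
        if (fromLexicon != null) {
            return fromLexicon;
        }
        return lemmatizer.apply(spelling);
    }

    /** Toy stand-in for a lemmatizer: strips a single trailing "s". */
    static String stripPluralS(String spelling) {
        return spelling.endsWith("s") && spelling.length() > 1
            ? spelling.substring(0, spelling.length() - 1)
            : spelling;
    }

    public static void main(String[] args) {
        Map<String, String> lexicon = new HashMap<>();
        lexicon.put("mice/n2", "mouse");

        // Found in the lexicon, so the lemmatizer is never called.
        System.out.println(getLemma(lexicon, LemmaLookupSketch::stripPluralS, "mice", "n2"));
        // Not in the lexicon, so the fallback rule is applied.
        System.out.println(getLemma(lexicon, LemmaLookupSketch::stripPluralS, "cats", "n2"));
    }
}
```

Checking the lexicon first lets irregular forms be resolved from recorded data, leaving the rule-based path for words the lexicon has not seen.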


getStandardizedSpelling

protected java.lang.String getStandardizedSpelling(java.lang.String correctedSpelling,
                                                   java.lang.String partOfSpeech)
Get standardized spelling.

Parameters:
correctedSpelling - The spelling.
partOfSpeech - The part of speech tag.
Returns:
Standardized spelling.

getLemmataChanged

public int getLemmataChanged()
Return number of lemmata changed.

Returns:
Number of lemmata changed.

getStandardChanged

public int getStandardChanged()
Return number of standard spellings changed.

Returns:
Number of standard spellings changed.

getWordsProcessed

public int getWordsProcessed()
Return number of words processed.

Returns:
Number of words processed.