edu.northwestern.at.utils.corpuslinguistics.spellingstandardizer
Class AbstractSpellingStandardizer

java.lang.Object
  extended by edu.northwestern.at.utils.IsCloseableObject
      extended by edu.northwestern.at.utils.corpuslinguistics.spellingstandardizer.AbstractSpellingStandardizer
All Implemented Interfaces:
SpellingStandardizer, UsesLogger
Direct Known Subclasses:
DecruftifyingSpellingStandardizer, NoopSpellingStandardizer, SimpleSpellingStandardizer

public abstract class AbstractSpellingStandardizer
extends IsCloseableObject
implements SpellingStandardizer, UsesLogger

Abstract base class for spelling standardizers.


Field Summary
protected  java.util.Set<java.lang.String> alternateSpellingsWordClasses
          Word classes of alternate spellings.
protected static java.lang.String defaultSpellingsByWordClassFileName
          Path to list of irregular word forms.
protected  Lexicon lexicon
          Lexicon associated with this standardizer.
protected  Logger logger
          Logger used for output.
protected  TaggedStrings mappedSpellings
          The map with alternate spellings as keys and standard spellings as values.
protected  Map2D<java.lang.String,java.lang.String,java.lang.String> spellingsByWordClass
          Irregular forms.
protected  java.util.Set<java.lang.String> standardSpellingSet
          The set of standard spellings.
 
Constructor Summary
AbstractSpellingStandardizer()
          Create abstract spelling standardizer.
 
Method Summary
 void addCachedSpelling(java.lang.String alternateSpelling, java.lang.String standardSpelling)
          Cache a generated mapped spelling.
 void addMappedSpelling(java.lang.String alternateSpelling, java.lang.String standardSpelling)
          Add a mapped spelling.
 void addStandardSpelling(java.lang.String standardSpelling)
          Add a standard spelling.
 void addStandardSpellings(java.util.Collection<java.lang.String> standardSpellings)
          Add standard spellings from a collection.
 java.lang.String fixCapitalization(java.lang.String spelling, java.lang.String standardSpelling)
          Fix capitalization of standardized spelling.
 Lexicon getLexicon()
          Get the word lexicon.
 Logger getLogger()
          Get the logger.
 TaggedStrings getMappedSpellings()
          Return the mapped spellings.
 int getNumberOfAlternateSpellings()
          Returns number of alternate spellings.
 int[] getNumberOfAlternateSpellingsByWordClass()
          Returns number of alternate spellings by word class.
 int getNumberOfStandardSpellings()
          Returns number of standard spellings.
 java.util.Set<java.lang.String> getStandardSpellings()
          Return the standard spellings.
 void loadAlternativeSpellings(java.io.Reader reader, java.lang.String delimChars)
          Loads alternative spellings from a reader.
 void loadAlternativeSpellings(java.net.URL url, java.lang.String encoding, java.lang.String delimChars)
          Loads alternate spellings from a URL.
 void loadAlternativeSpellingsByWordClass(java.net.URL spellingsURL, java.lang.String encoding)
          Load alternate to standard spellings by word class.
 void loadStandardSpellings(java.io.Reader reader)
          Loads standard spellings from a reader.
 void loadStandardSpellings(java.net.URL url, java.lang.String encoding)
          Loads standard spellings from a URL.
 java.lang.String preprocessSpelling(java.lang.String spelling)
          Preprocess spelling.
 void setLexicon(Lexicon lexicon)
          Set the lexicon.
 void setLogger(Logger logger)
          Set the logger.
 void setMappedSpellings(TaggedStrings mappedSpellings)
          Sets map which maps alternate spellings to standard spellings.
 void setStandardSpellings(java.util.Set<java.lang.String> standardSpellings)
          Sets standard spellings.
 java.lang.String[] standardizeSpelling(java.lang.String spelling)
          Returns standard spellings given a spelling.
 java.lang.String standardizeSpelling(java.lang.String spelling, java.lang.String wordClass)
          Returns a standard spelling given a standard or alternate spelling.
 
Methods inherited from class edu.northwestern.at.utils.IsCloseableObject
close
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

mappedSpellings

protected TaggedStrings mappedSpellings
The map with alternate spellings as keys and standard spellings as values.


standardSpellingSet

protected java.util.Set<java.lang.String> standardSpellingSet
The set of standard spellings.


spellingsByWordClass

protected Map2D<java.lang.String,java.lang.String,java.lang.String> spellingsByWordClass
Irregular forms.

Spellings disambiguated by word class are stored in a HashMap2D. The compound key consists of the word class and alternate spelling, and the value is the standardized spelling.
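
The compound-key lookup described above can be pictured with a small sketch. It uses a nested java.util.Map as a stand-in for Map2D; the class name SpellingsByWordClassSketch, its put and get methods, and the sample word classes are hypothetical and are not the library's actual HashMap2D API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Map2D<String,String,String>; not the library's API.
final class SpellingsByWordClassSketch {
    // Outer key: word class. Inner key: alternate spelling. Value: standard spelling.
    private final Map<String, Map<String, String>> table = new HashMap<>();

    void put(String wordClass, String alternateSpelling, String standardSpelling) {
        table.computeIfAbsent(wordClass, k -> new HashMap<>())
             .put(alternateSpelling, standardSpelling);
    }

    // Returns the standard spelling, or null when the (word class, spelling) pair is unknown.
    String get(String wordClass, String alternateSpelling) {
        Map<String, String> byAlternate = table.get(wordClass);
        return byAlternate == null ? null : byAlternate.get(alternateSpelling);
    }

    public static void main(String[] args) {
        SpellingsByWordClassSketch spellings = new SpellingsByWordClassSketch();
        // The same alternate spelling can map differently depending on word class.
        spellings.put("verb", "doe", "do");
        spellings.put("noun", "doe", "doe");
        System.out.println(spellings.get("verb", "doe"));
        System.out.println(spellings.get("noun", "doe"));
    }
}
```

Keying on the word class first keeps all spellings for one class together, which also makes per-class counts cheap to compute.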


alternateSpellingsWordClasses

protected java.util.Set<java.lang.String> alternateSpellingsWordClasses
Word classes of alternate spellings.


defaultSpellingsByWordClassFileName

protected static java.lang.String defaultSpellingsByWordClassFileName
Path to list of irregular word forms.


logger

protected Logger logger
Logger used for output.


lexicon

protected Lexicon lexicon
Lexicon associated with this standardizer. May be null.

Constructor Detail

AbstractSpellingStandardizer

public AbstractSpellingStandardizer()
Create abstract spelling standardizer.

Method Detail

loadAlternativeSpellingsByWordClass

public void loadAlternativeSpellingsByWordClass(java.net.URL spellingsURL,
                                                java.lang.String encoding)
                                         throws java.io.IOException
Load alternate to standard spellings by word class.

Specified by:
loadAlternativeSpellingsByWordClass in interface SpellingStandardizer
Parameters:
spellingsURL - URL of alternative spellings by word class.
encoding - Character set encoding for spellings
Throws:
java.io.IOException

loadAlternativeSpellings

public void loadAlternativeSpellings(java.net.URL url,
                                     java.lang.String encoding,
                                     java.lang.String delimChars)
                              throws java.io.IOException
Loads alternate spellings from a URL.

Specified by:
loadAlternativeSpellings in interface SpellingStandardizer
Parameters:
url - URL containing alternate spellings to standard spellings mappings.
encoding - Text encoding (utf-8, 8859_1, etc.).
delimChars - Delimiter characters separating spelling pairs.
Throws:
java.io.IOException

loadAlternativeSpellings

public void loadAlternativeSpellings(java.io.Reader reader,
                                     java.lang.String delimChars)
                              throws java.io.IOException
Loads alternative spellings from a reader.

Specified by:
loadAlternativeSpellings in interface SpellingStandardizer
Parameters:
reader - The reader.
delimChars - Delimiter characters separating spelling pairs.
Throws:
java.io.IOException
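
As an illustration of how delimChars might be used, the following sketch reads spelling pairs from a reader. The input layout is an assumption (one pair per line, alternate spelling first, standard spelling second), and the class AlternateSpellingsLoaderSketch and its load method are hypothetical; the actual file format and parsing in this class may differ.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical loader; assumes one "alternate<delim>standard" pair per line.
final class AlternateSpellingsLoaderSketch {
    static Map<String, String> load(Reader reader, String delimChars) throws IOException {
        Map<String, String> mapped = new LinkedHashMap<>();
        BufferedReader in = new BufferedReader(reader);
        String line;
        while ((line = in.readLine()) != null) {
            // Every character in delimChars is treated as a separator.
            StringTokenizer tokens = new StringTokenizer(line, delimChars);
            if (tokens.countTokens() < 2) {
                continue; // skip blank or malformed lines
            }
            String alternate = tokens.nextToken().trim();
            String standard = tokens.nextToken().trim();
            mapped.put(alternate, standard);
        }
        return mapped;
    }

    public static void main(String[] args) throws IOException {
        Reader reader = new StringReader("vertue\tvirtue\nloue\tlove\n");
        Map<String, String> mapped = load(reader, "\t");
        System.out.println(mapped);
    }
}
```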

loadStandardSpellings

public void loadStandardSpellings(java.net.URL url,
                                  java.lang.String encoding)
                           throws java.io.IOException
Loads standard spellings from a URL.

Specified by:
loadStandardSpellings in interface SpellingStandardizer
Parameters:
url - URL containing standard spellings
encoding - Character set encoding for spellings
Throws:
java.io.IOException

loadStandardSpellings

public void loadStandardSpellings(java.io.Reader reader)
                           throws java.io.IOException
Loads standard spellings from a reader.

Specified by:
loadStandardSpellings in interface SpellingStandardizer
Parameters:
reader - The reader.
Throws:
java.io.IOException

addMappedSpelling

public void addMappedSpelling(java.lang.String alternateSpelling,
                              java.lang.String standardSpelling)
Add a mapped spelling.

Specified by:
addMappedSpelling in interface SpellingStandardizer
Parameters:
alternateSpelling - The alternate spelling.
standardSpelling - The corresponding standard spelling.

addStandardSpelling

public void addStandardSpelling(java.lang.String standardSpelling)
Add a standard spelling.

Specified by:
addStandardSpelling in interface SpellingStandardizer
Parameters:
standardSpelling - A standard spelling.

addStandardSpellings

public void addStandardSpellings(java.util.Collection<java.lang.String> standardSpellings)
Add standard spellings from a collection.

Specified by:
addStandardSpellings in interface SpellingStandardizer
Parameters:
standardSpellings - A collection of standard spellings.

addCachedSpelling

public void addCachedSpelling(java.lang.String alternateSpelling,
                              java.lang.String standardSpelling)
Cache a generated mapped spelling.

Parameters:
alternateSpelling - The alternate spelling.
standardSpelling - The corresponding standard spelling.

setMappedSpellings

public void setMappedSpellings(TaggedStrings mappedSpellings)
Sets map which maps alternate spellings to standard spellings.

Specified by:
setMappedSpellings in interface SpellingStandardizer
Parameters:
mappedSpellings - Map with alternate spellings as keys and standard spellings as values.

setStandardSpellings

public void setStandardSpellings(java.util.Set<java.lang.String> standardSpellings)
Sets standard spellings.

Specified by:
setStandardSpellings in interface SpellingStandardizer
Parameters:
standardSpellings - Set of standard spellings.

standardizeSpelling

public java.lang.String[] standardizeSpelling(java.lang.String spelling)
Returns standard spellings given a spelling.

Specified by:
standardizeSpelling in interface SpellingStandardizer
Parameters:
spelling - The spelling.
Returns:
The standard spellings as an array of String.

If no spelling map is defined, the spelling is returned unchanged.
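
The fallback described here can be sketched as follows. A plain java.util.Map stands in for TaggedStrings, and the class StandardizeSpellingSketch is hypothetical. Returning the input unchanged when the map has no entry for it is an assumption; the source only states the behavior for a missing map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; a Map<String,String> stands in for TaggedStrings.
final class StandardizeSpellingSketch {
    static String[] standardize(Map<String, String> mappedSpellings, String spelling) {
        // No spelling map defined: return the spelling unchanged.
        if (mappedSpellings == null) {
            return new String[] { spelling };
        }
        String standard = mappedSpellings.get(spelling);
        // Assumption: a spelling with no mapping is also returned unchanged.
        return new String[] { standard != null ? standard : spelling };
    }

    public static void main(String[] args) {
        Map<String, String> mapped = new HashMap<>();
        mapped.put("vertue", "virtue");
        System.out.println(standardize(mapped, "vertue")[0]);
        System.out.println(standardize(null, "vertue")[0]);
    }
}
```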


standardizeSpelling

public java.lang.String standardizeSpelling(java.lang.String spelling,
                                            java.lang.String wordClass)
Returns a standard spelling given a standard or alternate spelling.

Specified by:
standardizeSpelling in interface SpellingStandardizer
Parameters:
spelling - The spelling.
wordClass - The major word class.
Returns:
The standard spelling.

getNumberOfAlternateSpellings

public int getNumberOfAlternateSpellings()
Returns number of alternate spellings.

Specified by:
getNumberOfAlternateSpellings in interface SpellingStandardizer
Returns:
The number of alternate spellings.

getNumberOfAlternateSpellingsByWordClass

public int[] getNumberOfAlternateSpellingsByWordClass()
Returns number of alternate spellings by word class.

Specified by:
getNumberOfAlternateSpellingsByWordClass in interface SpellingStandardizer
Returns:
int array with two entries. [0] = The number of alternate spellings word classes. [1] = The number of alternate spellings in the word classes.
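
A minimal sketch of how the two entries could be derived, assuming the word-class spellings are held in a nested map keyed first by word class. The class WordClassCountsSketch and its counts method are hypothetical and only illustrate the layout of the returned array.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two-entry result; not the library's implementation.
final class WordClassCountsSketch {
    static int[] counts(Map<String, Map<String, String>> spellingsByWordClass) {
        int totalAlternateSpellings = 0;
        for (Map<String, String> byAlternate : spellingsByWordClass.values()) {
            totalAlternateSpellings += byAlternate.size();
        }
        // [0] = number of word classes, [1] = alternate spellings across all classes.
        return new int[] { spellingsByWordClass.size(), totalAlternateSpellings };
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> table = new HashMap<>();
        table.computeIfAbsent("verb", k -> new HashMap<>()).put("doe", "do");
        table.computeIfAbsent("verb", k -> new HashMap<>()).put("bee", "be");
        table.computeIfAbsent("noun", k -> new HashMap<>()).put("bee", "bee");
        int[] result = counts(table);
        System.out.println(result[0] + " word classes, " + result[1] + " alternate spellings");
    }
}
```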

getNumberOfStandardSpellings

public int getNumberOfStandardSpellings()
Returns number of standard spellings.

Specified by:
getNumberOfStandardSpellings in interface SpellingStandardizer
Returns:
The number of standard spellings.

getMappedSpellings

public TaggedStrings getMappedSpellings()
Return the mapped spellings.

Specified by:
getMappedSpellings in interface SpellingStandardizer
Returns:
The spelling tagged strings with (alternate spelling, standard spelling) pairs. May be null if this standardizer does not use such a map.

getStandardSpellings

public java.util.Set<java.lang.String> getStandardSpellings()
Return the standard spellings.

Specified by:
getStandardSpellings in interface SpellingStandardizer
Returns:
The standard spellings as a Set. May be null.

preprocessSpelling

public java.lang.String preprocessSpelling(java.lang.String spelling)
Preprocess spelling.

Specified by:
preprocessSpelling in interface SpellingStandardizer
Parameters:
spelling - Spelling to preprocess.
Returns:
Preprocessed spelling.

By default, no preprocessing is applied; the original spelling is returned unchanged.


fixCapitalization

public java.lang.String fixCapitalization(java.lang.String spelling,
                                          java.lang.String standardSpelling)
Fix capitalization of standardized spelling.

Specified by:
fixCapitalization in interface SpellingStandardizer
Parameters:
spelling - The original spelling.
standardSpelling - The candidate standard spelling.
Returns:
Standard spelling with initial capitalization matching original spelling.
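
One way to picture "initial capitalization matching original spelling" is the sketch below. It assumes that only the first character matters: if the original starts with an uppercase letter, the first character of the standard spelling is uppercased; otherwise the standard spelling is returned as is. The class FixCapitalizationSketch is hypothetical, and the real method may handle more cases (for example, all-caps words).

```java
// Hypothetical sketch; considers only the first character of each spelling.
final class FixCapitalizationSketch {
    static String fixCapitalization(String spelling, String standardSpelling) {
        if (spelling == null || spelling.isEmpty()
                || standardSpelling == null || standardSpelling.isEmpty()) {
            return standardSpelling; // nothing to compare or adjust
        }
        if (Character.isUpperCase(spelling.charAt(0))) {
            // Carry the original's initial capital over to the standard spelling.
            return Character.toUpperCase(standardSpelling.charAt(0))
                    + standardSpelling.substring(1);
        }
        return standardSpelling;
    }

    public static void main(String[] args) {
        System.out.println(fixCapitalization("Loue", "love"));
        System.out.println(fixCapitalization("loue", "love"));
    }
}
```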

getLogger

public Logger getLogger()
Get the logger.

Specified by:
getLogger in interface UsesLogger
Returns:
The logger.

setLogger

public void setLogger(Logger logger)
Set the logger.

Specified by:
setLogger in interface UsesLogger
Parameters:
logger - The logger.

getLexicon

public Lexicon getLexicon()
Get the word lexicon.

Returns:
The lexicon associated with this standardizer. May be null.

setLexicon

public void setLexicon(Lexicon lexicon)
Set the lexicon.

Parameters:
lexicon - Lexicon used for tagging.