public class RuleBasedLemmatizer extends AbstractLemmatizer implements Lemmatizer
Modifier and Type | Field and Description |
---|---|
protected Map2D<java.lang.String,java.lang.String,java.lang.String> | irregularForms - Irregular forms. |
protected java.util.Set<java.lang.String> | irregularFormsWordClasses - Word classes of irregular forms. |
protected java.util.Map<java.lang.String,java.util.List<LemmatizerRule>> | rules - Lemmatizing rules. |
protected java.util.Set<java.lang.String> | rulesWordClasses - Word classes covered by rules. |
Fields inherited from class AbstractLemmatizer: dictionary, lemmaSeparator, lemmaSeparatorString, lexicon, logger
Constructor and Description |
---|
RuleBasedLemmatizer() - Create a rule-based lemmatizer. |
Modifier and Type | Method and Description |
---|---|
java.lang.String | cleanUpLemma(java.lang.String lemma) - Clean up lemma. |
java.lang.String | lemmatize(java.lang.String spelling) - Returns a lemma given a spelling. |
java.lang.String | lemmatize(java.lang.String spelling, java.lang.String wordClass) - Returns a lemma given a spelling and a word class. |
void | loadIrregularForms(java.net.URL url, java.lang.String encoding) - Loads irregular forms from a URL. |
void | loadRules(java.net.URL url, java.lang.String encoding) - Loads lemmatization rules from a URL. |
Methods inherited from class AbstractLemmatizer: cantLemmatize, countLemmata, getLemmaSeparator, getLogger, isCompoundLemma, joinLemmata, joinLemmata, setDictionary, setLexicon, setLogger, splitLemma
close
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Lemmatizer: cantLemmatize, countLemmata, getLemmaSeparator, isCompoundLemma, joinLemmata, joinLemmata, setDictionary, setLexicon, splitLemma
protected Map2D<java.lang.String,java.lang.String,java.lang.String> irregularForms
Irregular forms are stored in a HashMap2D. The compound key consists of the word class and irregular word form, and the value is the lemma.
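The compound-key layout described above can be pictured with a small nested map. This is only a sketch of the idea, not the Map2D or HashMap2D implementation: the class `IrregularFormsSketch`, its `put` and `get` methods, and the word class names are hypothetical stand-ins chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a two-key map: word class first, then irregular form.
class IrregularFormsSketch {
    private final Map<String, Map<String, String>> forms = new HashMap<>();

    // Store a lemma under the compound key (wordClass, irregularForm).
    void put(String wordClass, String irregularForm, String lemma) {
        forms.computeIfAbsent(wordClass, k -> new HashMap<>()).put(irregularForm, lemma);
    }

    // Return the lemma for the compound key, or null when there is no entry.
    String get(String wordClass, String irregularForm) {
        Map<String, String> byForm = forms.get(wordClass);
        return byForm == null ? null : byForm.get(irregularForm);
    }

    public static void main(String[] args) {
        IrregularFormsSketch irregular = new IrregularFormsSketch();
        irregular.put("verb", "went", "go");      // illustrative word class names
        irregular.put("noun", "mice", "mouse");
        System.out.println(irregular.get("verb", "went"));
        System.out.println(irregular.get("noun", "went"));
    }
}
```

Keying on the word class first means the same spelling can map to different lemmata in different word classes.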
protected java.util.Set<java.lang.String> irregularFormsWordClasses
protected java.util.Map<java.lang.String,java.util.List<LemmatizerRule>> rules
The rules are stored in a map with the word class as a key and a list of LemmatizerRule entries as the value.
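A rough sketch of such a rule table follows. The real LemmatizerRule interface is not shown on this page, so `SuffixRule` below is a hypothetical stand-in that simply swaps one suffix for another, and `applyFirstMatch` assumes that the first matching rule for a word class wins.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RuleTableSketch {
    // Hypothetical stand-in for LemmatizerRule: replace one suffix with another.
    static final class SuffixRule {
        final String suffix;
        final String replacement;

        SuffixRule(String suffix, String replacement) {
            this.suffix = suffix;
            this.replacement = replacement;
        }

        // Returns the rewritten spelling, or null when the suffix does not match.
        String apply(String spelling) {
            if (!spelling.endsWith(suffix)) {
                return null;
            }
            return spelling.substring(0, spelling.length() - suffix.length()) + replacement;
        }
    }

    // Word class -> ordered list of rules, mirroring the shape of the rules field.
    final Map<String, List<SuffixRule>> rules = new HashMap<>();

    void addRule(String wordClass, SuffixRule rule) {
        rules.computeIfAbsent(wordClass, k -> new ArrayList<>()).add(rule);
    }

    // ASSUMPTION: rules are tried in list order and the first match is used.
    String applyFirstMatch(String spelling, String wordClass) {
        for (SuffixRule rule : rules.getOrDefault(wordClass, Collections.emptyList())) {
            String lemma = rule.apply(spelling);
            if (lemma != null) {
                return lemma;
            }
        }
        return null;
    }
}
```

Because each list is ordered, more specific rules (for example an "ies" rule) would need to be added before more general ones (a bare "s" rule) under this first-match assumption.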
protected java.util.Set<java.lang.String> rulesWordClasses
public RuleBasedLemmatizer() throws java.lang.Exception
Throws:
java.lang.Exception
public void loadRules(java.net.URL url, java.lang.String encoding) throws java.io.IOException
Parameters:
url - URL containing lemmatization rules.
encoding - Character set encoding for rules.
Throws:
java.io.IOException
public void loadIrregularForms(java.net.URL url, java.lang.String encoding) throws java.io.IOException
Parameters:
url - URL containing irregular forms.
encoding - Character set encoding for irregular forms.
Throws:
java.io.IOException
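To illustrate what loading from a URL with an explicit character encoding might involve, here is a standalone sketch. It does not reproduce loadIrregularForms itself: the class name `LoadIrregularFormsSketch` is hypothetical, and the tab-separated line layout (word class, irregular form, lemma) is purely an assumption, since this page does not describe the file format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

class LoadIrregularFormsSketch {
    // ASSUMPTION: each line is "wordClass<TAB>irregularForm<TAB>lemma".
    // The actual file format used by RuleBasedLemmatizer is not documented here.
    static Map<String, Map<String, String>> load(URL url, String encoding) throws IOException {
        Map<String, Map<String, String>> forms = new HashMap<>();
        // Decode the stream with the caller-supplied character set name.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(url.openStream(), encoding))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length != 3) {
                    continue; // skip blank or malformed lines
                }
                forms.computeIfAbsent(fields[0], k -> new HashMap<>()).put(fields[1], fields[2]);
            }
        }
        return forms;
    }
}
```

Passing the encoding explicitly, as both load methods do, avoids depending on the platform default character set when the data contains non-ASCII spellings.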
public java.lang.String lemmatize(java.lang.String spelling, java.lang.String wordClass)
lemmatize in interface Lemmatizer
lemmatize in class AbstractLemmatizer
Parameters:
spelling - The spelling.
wordClass - The word class. Ignored if null or empty. May contain more than one word class separated by commas, in which case the lemma rules for each class are applied in order.
public java.lang.String cleanUpLemma(java.lang.String lemma)
Parameters:
lemma - The lemma to clean.
A lemma may contain extraneous "!" characters added to ensure a specific ending is retained. The "!" marks are removed here.
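The cleanup step described above amounts to deleting every "!" from the lemma. A minimal sketch, assuming nothing beyond that description (the class name `CleanUpLemmaSketch` is hypothetical and the real method may do more):

```java
class CleanUpLemmaSketch {
    // Remove every "!" marker from the lemma; other characters are left untouched.
    static String cleanUpLemma(String lemma) {
        return lemma.replace("!", "");
    }

    public static void main(String[] args) {
        System.out.println(cleanUpLemma("be!"));
        System.out.println(cleanUpLemma("go"));
    }
}
```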
public java.lang.String lemmatize(java.lang.String spelling)
lemmatize in interface Lemmatizer
lemmatize in class AbstractLemmatizer
Parameters:
spelling - The spelling.
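The two-argument lemmatize accepts a comma-separated list of word classes and ignores a null or empty value. The sketch below shows one way that handling could look. It is an interpretation, not the library's code: `WordClassListSketch`, `splitWordClasses`, and the `lookup` callback are hypothetical, and both "the first class that yields a lemma wins" and "fall back to the spelling itself" are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

class WordClassListSketch {
    // Split "noun, verb" into ["noun", "verb"]; null or empty input yields an empty list.
    static List<String> splitWordClasses(String wordClass) {
        List<String> result = new ArrayList<>();
        if (wordClass == null) {
            return result;
        }
        for (String part : wordClass.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // lookup is a hypothetical per-class lemmatizer: (spelling, wordClass) -> lemma or null.
    // ASSUMPTION: classes are tried in order and the first non-null lemma is returned;
    // when nothing matches, the spelling is returned unchanged.
    static String lemmatize(String spelling, String wordClass,
                            BiFunction<String, String, String> lookup) {
        for (String oneClass : splitWordClasses(wordClass)) {
            String lemma = lookup.apply(spelling, oneClass);
            if (lemma != null) {
                return lemma;
            }
        }
        return spelling;
    }
}
```

A caller would pass a single string such as "noun,verb" as the word class, and the classes would be consulted left to right.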