public class RuleBasedLemmatizer extends AbstractLemmatizer implements Lemmatizer
| Modifier and Type | Field and Description |
|---|---|
| protected Map2D<java.lang.String,java.lang.String,java.lang.String> | irregularForms - Irregular forms. |
| protected java.util.Set<java.lang.String> | irregularFormsWordClasses - Word classes of irregular forms. |
| protected java.util.Map<java.lang.String,java.util.List<LemmatizerRule>> | rules - Lemmatizing rules. |
| protected java.util.Set<java.lang.String> | rulesWordClasses - Word classes covered by rules. |

Fields inherited from class AbstractLemmatizer: dictionary, lemmaSeparator, lemmaSeparatorString, lexicon, logger

| Constructor and Description |
|---|
| RuleBasedLemmatizer() - Create a rule-based lemmatizer. |

| Modifier and Type | Method and Description |
|---|---|
| java.lang.String | cleanUpLemma(java.lang.String lemma) - Clean up lemma. |
| java.lang.String | lemmatize(java.lang.String spelling) - Returns a lemma given a spelling. |
| java.lang.String | lemmatize(java.lang.String spelling, java.lang.String wordClass) - Returns a lemma given a word and a word class. |
| void | loadIrregularForms(java.net.URL url, java.lang.String encoding) - Loads irregular forms from a URL. |
| void | loadRules(java.net.URL url, java.lang.String encoding) - Loads lemmatization rules from a URL. |

Methods inherited from class AbstractLemmatizer: cantLemmatize, countLemmata, getLemmaSeparator, getLogger, isCompoundLemma, joinLemmata, joinLemmata, setDictionary, setLexicon, setLogger, splitLemma
Other inherited methods: close
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Lemmatizer: cantLemmatize, countLemmata, getLemmaSeparator, isCompoundLemma, joinLemmata, joinLemmata, setDictionary, setLexicon, splitLemma

protected Map2D<java.lang.String,java.lang.String,java.lang.String> irregularForms
Irregular forms are stored in a HashMap2D. The compound key consists of the word class and irregular word form, and the value is the lemma.
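
The two-key lookup described above can be illustrated with a nested map. This is only a sketch: the class `IrregularFormsSketch`, its `put` and `get` methods, and the sample entries are hypothetical, and it does not use or reproduce the actual Map2D or HashMap2D API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a two-key map; NOT the library's Map2D/HashMap2D.
final class IrregularFormsSketch {
    // Outer key: word class. Inner key: irregular word form. Value: lemma.
    private final Map<String, Map<String, String>> forms = new HashMap<>();

    void put(String wordClass, String form, String lemma) {
        forms.computeIfAbsent(wordClass, k -> new HashMap<>()).put(form, lemma);
    }

    // Returns the lemma, or null when the (word class, form) pair is not listed.
    String get(String wordClass, String form) {
        Map<String, String> byForm = forms.get(wordClass);
        return byForm == null ? null : byForm.get(form);
    }

    public static void main(String[] args) {
        IrregularFormsSketch irregular = new IrregularFormsSketch();
        irregular.put("verb", "went", "go");     // sample data, assumed
        irregular.put("noun", "mice", "mouse");  // sample data, assumed
        System.out.println(irregular.get("verb", "went")); // prints "go"
        System.out.println(irregular.get("noun", "went")); // prints "null"
    }
}
```

Keying on the word class first means the same spelling can map to different lemmata in different word classes.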
protected java.util.Set<java.lang.String> irregularFormsWordClasses
protected java.util.Map<java.lang.String,java.util.List<LemmatizerRule>> rules
The rules are stored in a map with the word class as a key and a list of LemmatizerRule entries as the value.
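
As a rough illustration of a word class mapped to a list of rules, the sketch below uses a hypothetical `SuffixRule` type in place of LemmatizerRule, whose real interface is not shown here. It also assumes that the first matching rule wins, which the source does not state.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RulesByWordClassSketch {
    // Hypothetical rule type: replace a suffix. The real LemmatizerRule may differ.
    static final class SuffixRule {
        final String suffix;
        final String replacement;

        SuffixRule(String suffix, String replacement) {
            this.suffix = suffix;
            this.replacement = replacement;
        }

        // Returns the rewritten word, or null if the rule does not apply.
        String apply(String word) {
            if (!word.endsWith(suffix)) {
                return null;
            }
            return word.substring(0, word.length() - suffix.length()) + replacement;
        }
    }

    // Word class -> ordered list of rules for that class.
    private final Map<String, List<SuffixRule>> rules = new HashMap<>();

    void addRule(String wordClass, SuffixRule rule) {
        rules.computeIfAbsent(wordClass, k -> new ArrayList<>()).add(rule);
    }

    // Assumption: the first matching rule wins; otherwise the spelling is returned unchanged.
    String lemmatize(String spelling, String wordClass) {
        List<SuffixRule> candidates =
            rules.getOrDefault(wordClass, Collections.<SuffixRule>emptyList());
        for (SuffixRule rule : candidates) {
            String lemma = rule.apply(spelling);
            if (lemma != null) {
                return lemma;
            }
        }
        return spelling;
    }

    public static void main(String[] args) {
        RulesByWordClassSketch sketch = new RulesByWordClassSketch();
        sketch.addRule("verb", new SuffixRule("ies", "y")); // sample rules, assumed
        sketch.addRule("verb", new SuffixRule("s", ""));
        System.out.println(sketch.lemmatize("carries", "verb")); // prints "carry"
        System.out.println(sketch.lemmatize("walks", "noun"));   // prints "walks"
    }
}
```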
protected java.util.Set<java.lang.String> rulesWordClasses

public RuleBasedLemmatizer()
throws java.lang.Exception
Throws:
java.lang.Exception

public void loadRules(java.net.URL url,
java.lang.String encoding)
throws java.io.IOException
Parameters:
url - URL containing lemmatization rules.
encoding - Character set encoding for rules.
Throws:
java.io.IOException

public void loadIrregularForms(java.net.URL url,
java.lang.String encoding)
throws java.io.IOException
Parameters:
url - URL containing irregular forms.
encoding - Character set encoding for irregular forms.
Throws:
java.io.IOException

public java.lang.String lemmatize(java.lang.String spelling,
java.lang.String wordClass)
lemmatize in interface Lemmatizer
lemmatize in class AbstractLemmatizer
Parameters:
spelling - The spelling.
wordClass - The word class. Ignored if null or empty. May contain more than one word class separated by commas, in which case the lemma rules for each class are applied in order.

public java.lang.String cleanUpLemma(java.lang.String lemma)
Parameters:
lemma - The lemma to clean.
A lemma may contain extraneous "!" characters added to ensure a specific ending is retained. The "!" marks are removed here.
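
The clean-up step can be pictured as a plain character removal. The following is a minimal sketch under that reading, not the library's implementation; the class name `CleanUpLemmaSketch` and the sample input are assumptions, and the real method may handle null or other cases differently.

```java
final class CleanUpLemmaSketch {
    // Removes every "!" marker from the lemma (sketch; null input is not handled).
    static String cleanUpLemma(String lemma) {
        return lemma.replace("!", "");
    }

    public static void main(String[] args) {
        System.out.println(cleanUpLemma("bus!")); // prints "bus"
        System.out.println(cleanUpLemma("go"));   // prints "go"
    }
}
```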
public java.lang.String lemmatize(java.lang.String spelling)
lemmatize in interface Lemmatizer
lemmatize in class AbstractLemmatizer
Parameters:
spelling - The spelling.
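
For the two-argument lemmatize described earlier, the handling of a comma-separated word class could look roughly like the sketch below. Everything here is illustrative: `MultiWordClassSketch`, `splitWordClasses`, and the `perClass` callback are hypothetical names, trimming whitespace around commas is an assumption, and so is the choice that the first class yielding a lemma wins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

final class MultiWordClassSketch {
    // Splits a comma-separated word class string; null or empty yields no classes.
    static List<String> splitWordClasses(String wordClass) {
        List<String> result = new ArrayList<>();
        if (wordClass == null || wordClass.trim().isEmpty()) {
            return result;
        }
        for (String part : wordClass.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Tries each word class in order. Assumption: the first class that yields
    // a non-null lemma wins; with no classes the spelling is returned as-is.
    static String lemmatize(String spelling, String wordClass,
                            BiFunction<String, String, String> perClass) {
        for (String wc : splitWordClasses(wordClass)) {
            String lemma = perClass.apply(spelling, wc);
            if (lemma != null) {
                return lemma;
            }
        }
        return spelling;
    }

    public static void main(String[] args) {
        // Hypothetical per-class rule: strip "ing" for verbs only.
        BiFunction<String, String, String> demo = (s, wc) ->
            wc.equals("verb") && s.endsWith("ing")
                ? s.substring(0, s.length() - 3)
                : null;
        System.out.println(lemmatize("walking", "noun, verb", demo)); // prints "walk"
        System.out.println(lemmatize("walking", null, demo));         // prints "walking"
    }
}
```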