Using a lemma from the word lexicon
Given a (spelling, NUPOS part of speech) pair, MorphAdorner
first checks if a lemma appears for that combination in the currently
active word lexicon. If so, MorphAdorner returns the lemma
specified by the lexicon.
Consider the spelling pair (striking, vvg). MorphAdorner's
19th-century English lexicon defines the lemma strike for this
combination of spelling and NUPOS part of speech.
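The lookup described above can be sketched as a dictionary keyed by the (spelling, part of speech) pair. This is only an illustration: the names LEXICON and lexicon_lemma are hypothetical, and the table holds just the one entry mentioned in the text rather than MorphAdorner's actual lexicon data.

```python
from typing import Optional

# Hypothetical stand-in for the active word lexicon. It maps a
# (spelling, NUPOS tag) pair to a lemma and contains only the
# single entry discussed in the text.
LEXICON = {("striking", "vvg"): "strike"}


def lexicon_lemma(spelling: str, pos: str) -> Optional[str]:
    """Return the lexicon's lemma for the pair, or None if it is absent."""
    return LEXICON.get((spelling, pos))


print(lexicon_lemma("striking", "vvg"))   # strike
print(lexicon_lemma("unlisted", "vvg"))   # None
```

A None result corresponds to the case where the pair is not in the lexicon and the general lemmatizer has to take over.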
Word classes for lemmatization
When the (spelling, part of speech) combination is not found in the
current word lexicon, MorphAdorner uses its general English
lemmatizer, which is based on a list of irregular forms and
grammar rules. The lemmatizer is not tied to a specific
part of speech set. Instead, the lemmatizer categorizes irregular forms
and rules using the following major part of speech classes.
- adjective
- adverb
- compound
- conjunction
- infinitive-to
- noun, plural
- noun, possessive
- preposition
- pronoun
- verb
The NUPOS (or other) part of speech is converted to one of these
major word classes for the purposes of lemmatization. In our example
above, the NUPOS gerund tag vvg maps to the verb class. The
lemmatizer then processes the spelling pair (striking, verb) by
first checking the list of irregular forms, and second applying
rules of detachment if needed.
Irregular forms
When the spelling pair appears in the irregular forms list,
the lemmatizer returns the lemma specified in that list.
In our example, striking does not appear on the irregular forms
list.
On the other hand, the spelling pair (mice, noun) does
appear on the irregular forms list, which specifies that
mouse is the lemma form for mice.
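As a rough sketch of these two steps, the code below converts a tag to a major word class and then consults an irregular forms table. The table contents and the names WORD_CLASS_BY_TAG, IRREGULAR_FORMS, and irregular_lemma are assumptions made for illustration; only the vvg-to-verb mapping and the mice/mouse entry come from the text.

```python
from typing import Optional

# Illustrative subset only: the full tag-to-class table is not
# reproduced in the text. The vvg -> verb mapping is stated there.
WORD_CLASS_BY_TAG = {"vvg": "verb"}

# Hypothetical irregular forms list keyed by (spelling, word class).
IRREGULAR_FORMS = {("mice", "noun"): "mouse"}


def irregular_lemma(spelling: str, word_class: str) -> Optional[str]:
    """Return the irregular lemma, or None so the caller can try the rules."""
    return IRREGULAR_FORMS.get((spelling, word_class))


print(irregular_lemma("mice", "noun"))                          # mouse
print(irregular_lemma("striking", WORD_CLASS_BY_TAG["vvg"]))    # None
```

Returning None for (striking, verb) signals that the lemmatizer should move on to its rules of detachment.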
Rules of detachment
When the spelling pair does not appear in the irregular forms
list, the lemmatizer begins a series of rule matches
for the major word class. Each rule specifies an affix
pattern to match and a replacement pattern which generates
the lemma form. Once a replacement has been effected, the
lemmatization process is complete. These rules are often called
rules of detachment because the affixes are detached
from the inflected word form to produce the lemma form.
In the case of striking, the first match occurs against the rule:
CVCing CVCe
which says "match a consonant, followed by a vowel,
followed by a consonant, followed by ing
at the end of the word." The replacement string says to
keep the consonant followed by the vowel followed by the
consonant, but replace ing with
e. The result is that striking
is lemmatized to strike.
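One way to picture the CVCing rule is as a regular expression anchored at the end of the word. The sketch below is an approximation, not MorphAdorner's rule engine: the character classes (including the choice to treat y as a consonant) and the function name apply_cvcing are assumptions.

```python
import re
from typing import Optional

# Assumed character classes; the text does not define them.
CONSONANT = "[b-df-hj-np-tv-z]"
VOWEL = "[aeiou]"

# "CVCing CVCe": keep the consonant-vowel-consonant group and
# replace the trailing "ing" with "e".
CVC_ING = re.compile(rf"({CONSONANT}{VOWEL}{CONSONANT})ing$")


def apply_cvcing(spelling: str) -> Optional[str]:
    """Apply the rule, or return None when the pattern does not match."""
    result, count = CVC_ING.subn(r"\1e", spelling)
    return result if count else None


print(apply_cvcing("striking"))     # strike
print(apply_cvcing("astounding"))   # None
```

In striking the group "rik" precedes "ing", so the rule fires. In astounding the three letters before "ing" are "und", which is not consonant-vowel-consonant, so this particular rule does not apply.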
Some words require the application of multiple sets of
detachment rules. For example, the word "astoundingly" is
an adverb formed from a present participle. The lemmatizer
first applies the adverb rules to remove the "ly" producing
"astounding", then applies the verb rules to produce
"astound" as the lemma form.
Once a successful substitution occurs, the lemmatization
process stops.
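The astoundingly example can be sketched as two passes over simple suffix rules. The rule lists here are hypothetical one-rule stand-ins chosen to reproduce the example, and the text does not say how the lemmatizer decides when a second rule set is needed, so the unconditional second pass is an assumption.

```python
# Hypothetical rule sets as (suffix, replacement) pairs. The real
# adverb and verb rule lists are much longer.
ADVERB_RULES = [("ly", "")]
VERB_RULES = [("ing", "")]


def detach(spelling, rules):
    """Apply the first matching rule and stop; return the input if none match."""
    for suffix, replacement in rules:
        if spelling.endswith(suffix):
            return spelling[: len(spelling) - len(suffix)] + replacement
    return spelling


def lemmatize_adverb(spelling):
    """Strip the adverb ending, then run the verb rules on the result."""
    stem = detach(spelling, ADVERB_RULES)
    return detach(stem, VERB_RULES)


print(lemmatize_adverb("astoundingly"))   # astound
```

Each call to detach stops at its first successful substitution, which mirrors the stopping behavior described above for a single rule set.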
Ambiguous endings
The reduced form for some endings is ambiguous.
For example, the lemma for the past tense of a verb
ending in "ored" may end in "ore" (e.g., implored -> implore)
or in "or" (e.g., colored -> color). To help disambiguate
such cases, a lemmatization rule can specify that the
resulting candidate lemma formed by applying the rule
must appear in a known word list. MorphAdorner uses a large list
of standard word forms taken from the 1911 Webster's Dictionary
and other sources.
For example, consider the rule sequence:
+ ored ore
ored or
The first rule says to replace "ored" with "ore" and check that
the result is a known word (that's what the "+" denotes).
When the result is not a known word, the rule is bypassed, and the following
rule which replaces "ored" with "or" is used instead.
Examples:
- recolored -> recolore : recolore not in dictionary, go to next rule.
- recolored -> recolor : recolor in dictionary, accept this form as the lemma.
- implored -> implore : implore in dictionary, accept this form as the lemma.
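The rule sequence and the three examples can be traced with a short sketch. The tiny KNOWN_WORDS set stands in for the large word list, and the tuple layout and function name are assumptions for illustration; the third tuple field plays the role of the "+" marker.

```python
# Small stand-in for the known word list.
KNOWN_WORDS = {"implore", "recolor", "color"}

# (suffix, replacement, must_be_known). True corresponds to the
# "+" prefix in the rule sequence shown above.
ORED_RULES = [
    ("ored", "ore", True),
    ("ored", "or", False),
]


def lemmatize_ored(spelling):
    """Try each rule in order, skipping checked rules whose result is unknown."""
    for suffix, replacement, must_be_known in ORED_RULES:
        if not spelling.endswith(suffix):
            continue
        candidate = spelling[: len(spelling) - len(suffix)] + replacement
        if must_be_known and candidate not in KNOWN_WORDS:
            continue  # bypass this rule and try the next one
        return candidate
    return spelling


print(lemmatize_ored("recolored"))   # recolor
print(lemmatize_ored("implored"))    # implore
```

Ordering matters: the dictionary-checked rule comes first so that the unchecked rule acts as the fallback.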
Words containing multiple parts of speech
Words containing more than one part of speech require special
handling. MorphAdorner attempts to split such words at a logical
point and assign a separate lemma using the process above to
each word part. For example, the spelling I'm, with the compound
NUPOS part of speech pns11|vam (the vertical bar separates the
parts of speech), is split into two pairs:
- (I, pns11)
- ('m, vam)
The first pair lemmatizes to i and the second pair to be,
giving the compound lemma form i|be.
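A minimal sketch of the splitting step follows. Splitting at the apostrophe is an assumption that happens to suit I'm; the text only says the word is split "at a logical point". PART_LEMMAS is a hypothetical lookup that stands in for the full lemmatization of each part.

```python
# Hypothetical per-part lemmas; in practice each pair would go
# through the lexicon, irregular forms, and rules described above.
PART_LEMMAS = {("I", "pns11"): "i", ("'m", "vam"): "be"}


def split_compound(spelling, tags):
    """Split a two-part contraction at its apostrophe (assumed split point)."""
    tag_list = tags.split("|")
    if len(tag_list) == 1:
        return [(spelling, tags)]
    head, apostrophe, tail = spelling.partition("'")
    if len(tag_list) != 2 or not apostrophe:
        raise ValueError("this sketch only handles two-part contractions")
    return [(head, tag_list[0]), (apostrophe + tail, tag_list[1])]


def compound_lemma(spelling, tags):
    """Lemmatize each part and join the lemmas with a vertical bar."""
    parts = split_compound(spelling, tags)
    return "|".join(PART_LEMMAS[part] for part in parts)


print(split_compound("I'm", "pns11|vam"))   # [('I', 'pns11'), ("'m", 'vam')]
print(compound_lemma("I'm", "pns11|vam"))   # i|be
```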
Certain irregular compound forms such as gimme, a
contraction of "give me", appear under the compound
entry in the irregular forms list. The lemma form for gimme is
give|i.
Punctuation and symbols
Punctuation and symbols "lemmatize" to themselves.
Foreign words (marked by one of the foreign part of speech tags)
and singular nouns are left untouched by MorphAdorner's lemmatizer --
the original spelling is considered the lemma form.
Ambiguous lemmata
The lemma form for some words is ambiguous. For example, "axes"
is the plural form of both "axe" and "axis". MorphAdorner returns one of the possible
forms (e.g., "axe" for "axes"). This may not be the correct
form in some cases.