NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
English Lemmatization Process

Using a lemma from the word lexicon

Given a (spelling, NUPOS part of speech) pair, MorphAdorner first checks whether a lemma appears for that combination in the currently active word lexicon. If so, MorphAdorner returns the lemma specified by the lexicon.

Consider the spelling pair (striking, vvg). MorphAdorner's 19th-century English lexicon defines the lemma strike for this combination of spelling and NUPOS part of speech.
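
As a minimal sketch of this lookup (written in Python for illustration, not taken from MorphAdorner's own code), the word lexicon can be modelled as a table keyed by (spelling, part of speech) pairs. The names WORD_LEXICON and lookup_lemma are hypothetical, and the only entry is the one from the example above.

```python
from typing import Optional

# Hypothetical stand-in for the active word lexicon: maps
# (spelling, NUPOS tag) pairs to lemmata. Only the entry from the
# example in the text is included.
WORD_LEXICON = {
    ("striking", "vvg"): "strike",
}


def lookup_lemma(spelling: str, pos: str) -> Optional[str]:
    """Return the lexicon lemma for the pair, or None if it is absent."""
    return WORD_LEXICON.get((spelling, pos))


print(lookup_lemma("striking", "vvg"))  # strike
print(lookup_lemma("glorping", "vvg"))  # None (made-up spelling, not in the lexicon)
```

A None result corresponds to the case where the pair is not found in the lexicon and the general lemmatizer has to take over.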

Word classes for lemmatization

When the (spelling, part of speech) combination is not found in the current word lexicon, MorphAdorner uses its general English lemmatizer, which is based upon a list of irregular forms and a set of grammar rules. The lemmatizer is not tied to a specific part of speech set. Instead, it categorizes irregular forms and rules using the following major part of speech classes.

  • adjective
  • adverb
  • compound
  • conjunction
  • infinitive-to
  • noun, plural
  • noun, possessive
  • preposition
  • pronoun
  • verb

The NUPOS (or other) part of speech is converted to one of these major word classes for the purposes of lemmatization. In our example above, the NUPOS gerund tag vvg maps to the verb class. The lemmatizer then processes the spelling pair (striking, verb) by first checking the list of irregular forms, and second applying rules of detachment if needed.

Irregular forms

When the spelling pair appears in the irregular forms list, the lemmatizer returns the lemma specified in that list.

In our example, striking does not appear on the irregular forms list.

On the other hand, the spelling pair (mice, noun) does appear on the irregular forms list, which specifies that mouse is the lemma form for mice.
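
The two steps described so far, converting the tag to a major word class and then consulting the irregular forms list, might be sketched as follows. This is an illustration under stated assumptions: the table names and the function are hypothetical, and only the vvg -> verb mapping and the mice -> mouse entry come from the text.

```python
from typing import Optional

# Hypothetical tag-to-class table. Only the vvg -> verb mapping is
# taken from the text; a real table would cover the whole tag set.
TAG_TO_WORD_CLASS = {
    "vvg": "verb",
}

# Hypothetical irregular forms list keyed by (spelling, word class).
# Only the mice -> mouse entry is taken from the text.
IRREGULAR_FORMS = {
    ("mice", "noun"): "mouse",
}


def irregular_lemma(spelling: str, word_class: str) -> Optional[str]:
    """Return the irregular lemma for the pair, or None if it is not listed."""
    return IRREGULAR_FORMS.get((spelling, word_class))


word_class = TAG_TO_WORD_CLASS["vvg"]            # "verb"
print(irregular_lemma("striking", word_class))   # None
print(irregular_lemma("mice", "noun"))           # mouse
```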

Rules of detachment

When the spelling pair does not appear in the irregular forms list, the lemmatizer begins a series of rule matches for the major word class. Each rule specifies an affix pattern to match and a replacement pattern which generates the lemma form. Once a replacement has been effected, the lemmatization process is complete. These rules are often called rules of detachment because the affixes are detached from the inflected word form to produce the lemma form.

In the case of striking, the first match occurs against the rule:

CVCing CVCe

which says "match a consonant, followed by a vowel, followed by a consonant, followed by ing at the end of the word." The replacement string says to keep the consonant, the vowel, and the consonant, but to replace ing with e. The result is that striking is lemmatized to strike.
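
One way to read the rule notation is as a compact form of a regular expression. The sketch below translates a pattern such as CVCing into a regex and its replacement CVCe into a substitution template. It is an illustration only: the helper names are hypothetical, and the assumption that the vowels are a, e, i, o, u (with y counted as a consonant) is not stated in the text.

```python
import re
from typing import Optional, Pattern, Tuple

# Assumption: vowels are a, e, i, o, u; every other lowercase letter,
# including y, counts as a consonant.
VOWEL = "[aeiou]"
CONSONANT = "[bcdfghjklmnpqrstvwxyz]"


def compile_rule(pattern: str, replacement: str) -> Tuple[Pattern[str], str]:
    """Turn a rule such as ("CVCing", "CVCe") into (regex, template)."""
    regex = ""
    for ch in pattern:
        if ch == "C":
            regex += f"({CONSONANT})"
        elif ch == "V":
            regex += f"({VOWEL})"
        else:
            regex += re.escape(ch)
    regex += "$"  # rules match at the end of the word

    template = ""
    group = 0
    for ch in replacement:
        if ch in "CV":
            group += 1
            template += f"\\g<{group}>"  # reuse the matched letter
        else:
            template += ch
    return re.compile(regex), template


def apply_rule(word: str, rule: Tuple[Pattern[str], str]) -> Optional[str]:
    """Return the rewritten word, or None when the rule does not match."""
    regex, template = rule
    match = regex.search(word)
    if match is None:
        return None
    return word[: match.start()] + match.expand(template)


CVC_ING = compile_rule("CVCing", "CVCe")
print(apply_rule("striking", CVC_ING))    # strike
print(apply_rule("astounding", CVC_ING))  # None: "und" is not consonant-vowel-consonant
```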

Some words require the application of multiple sets of detachment rules. For example, the word "astoundingly" is an adverb formed from a present participle. The lemmatizer first applies the adverb rules to remove the "ly" producing "astounding", then applies the verb rules to produce "astound" as the lemma form.

Once a successful substitution occurs, the lemmatization process stops.
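
The two-stage handling of a word like "astoundingly" can be sketched as below. The rule sets here are hypothetical and far smaller than a real rule list. The sketch also assumes that the verb rules are simply tried on whatever the adverb rules leave behind, which may not be how MorphAdorner decides that an adverb derives from a participle.

```python
import re
from typing import List, Optional, Pattern, Tuple

Rule = Tuple[Pattern[str], str]

# Hypothetical rule sets: (regex for the ending, replacement).
# Within a class, rules are tried in order and the first match wins.
ADVERB_RULES: List[Rule] = [
    (re.compile(r"ly$"), ""),
]
VERB_RULES: List[Rule] = [
    # CVCing -> CVCe, as in striking -> strike. [^aeiou] is a loose
    # stand-in for "consonant" that is adequate for lowercase words.
    (re.compile(r"([^aeiou])([aeiou])([^aeiou])ing$"), r"\1\2\3e"),
    # Fallback: drop a bare "ing".
    (re.compile(r"ing$"), ""),
]


def detach(word: str, rules: List[Rule]) -> Optional[str]:
    """Apply the first matching rule and stop; None if no rule matches."""
    for regex, replacement in rules:
        if regex.search(word):
            return regex.sub(replacement, word, count=1)
    return None


def lemmatize_adverb(word: str) -> str:
    """Strip the adverb ending, then try the verb rules on the result."""
    stem = detach(word, ADVERB_RULES)
    if stem is None:
        return word
    verb_lemma = detach(stem, VERB_RULES)
    return verb_lemma if verb_lemma is not None else stem


print(lemmatize_adverb("astoundingly"))  # astound
print(lemmatize_adverb("strikingly"))    # strike
```

Within each rule set, detach returns as soon as one substitution succeeds, which mirrors the statement that processing stops after a successful substitution.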

Ambiguous endings

The reduced form for some endings is ambiguous. For example, the lemma for the past tense of a verb ending in "ored" may end in "ore" (e.g., implored -> implore) or in "or" (e.g., colored -> color). To help disambiguate such cases, a lemmatization rule can specify that the candidate lemma formed by applying the rule must appear in a known word list. MorphAdorner uses a large list of standard word forms taken from the 1911 Webster's Dictionary and other sources.

For example, consider the rule sequence:

+ ored ore
ored or

The first rule says to replace "ored" with "ore" and check that the result is a known word (that's what the "+" denotes). When the result is not a known word, the rule is bypassed, and the following rule which replaces "ored" with "or" is used instead.

Examples:

  • recolored -> recolore : recolore not in dictionary, go to next rule.
  • recolored -> recolor : recolor in dictionary, accept this form as the lemma.
  • implored -> implore : implore in dictionary, accept this form as the lemma.
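
The rule sequence and the examples above can be sketched as follows. The parser, the function names, and the three-word KNOWN_WORDS set are hypothetical stand-ins; the real known word list is far larger.

```python
from typing import Iterable, List, Optional, Set, Tuple

# (must_be_known, ending, replacement)
Rule = Tuple[bool, str, str]

# Tiny hypothetical stand-in for the known word list.
KNOWN_WORDS: Set[str] = {"implore", "color", "recolor"}


def parse_rule(line: str) -> Rule:
    """Parse a rule such as "+ ored ore" or "ored or"."""
    fields = line.split()
    must_be_known = fields[0] == "+"
    if must_be_known:
        fields = fields[1:]
    ending, replacement = fields
    return must_be_known, ending, replacement


def apply_rules(word: str, rules: Iterable[Rule], known: Set[str]) -> Optional[str]:
    """Return the first acceptable candidate lemma, or None if no rule applies."""
    for must_be_known, ending, replacement in rules:
        if not word.endswith(ending):
            continue
        candidate = word[: len(word) - len(ending)] + replacement
        if must_be_known and candidate not in known:
            continue  # "+" rule bypassed: candidate is not a known word
        return candidate
    return None


RULES: List[Rule] = [parse_rule("+ ored ore"), parse_rule("ored or")]

print(apply_rules("recolored", RULES, KNOWN_WORDS))  # recolor
print(apply_rules("implored", RULES, KNOWN_WORDS))   # implore
```

Rule order matters in this design: the dictionary-checked rule has to come first, with the unchecked rule after it as the fallback.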

Words containing multiple parts of speech

Words containing more than one part of speech require special handling. MorphAdorner attempts to split such words at a logical point and assigns a separate lemma to each word part using the process above. For example, the spelling I'm, with the compound NUPOS part of speech pns11|vam (the vertical bar separates the parts of speech), is split into two pairs:

  • (I,pns11)
  • ('m,vam)

The first pair lemmatizes to i and the second pair to be, giving the compound lemma form i|be.

Certain irregular compound forms such as gimme, a contraction of "give me", appear under the compound entry in the irregular forms list. The lemma form for gimme is give|i.
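
A rough sketch of the compound handling is given below. It assumes that the only split point is an apostrophe, which is a simplification of "a logical point", and it replaces the full per-part lemmatization with a small hypothetical lookup table. All names are illustrative.

```python
from typing import List, Optional

# Hypothetical per-part results standing in for the full lemmatization
# process; the two entries come from the I'm example in the text.
PART_LEMMATA = {
    ("I", "pns11"): "i",
    ("'m", "vam"): "be",
}

# Hypothetical compound section of the irregular forms list.
IRREGULAR_COMPOUNDS = {
    "gimme": "give|i",
}


def split_spelling(spelling: str, n_parts: int) -> Optional[List[str]]:
    """Split a two-part word at its apostrophe; None if that is not possible."""
    idx = spelling.find("'")
    if n_parts == 2 and idx > 0:
        return [spelling[:idx], spelling[idx:]]
    return None


def lemmatize_compound(spelling: str, pos: str) -> str:
    """Return a compound lemma such as i|be for (I'm, pns11|vam)."""
    # Irregular compounds are looked up by spelling alone in this sketch.
    if spelling in IRREGULAR_COMPOUNDS:
        return IRREGULAR_COMPOUNDS[spelling]
    tags = pos.split("|")
    parts = split_spelling(spelling, len(tags))
    if parts is None:
        return spelling  # no usable split point: leave the word as is
    # Fall back to the part itself when no lemma is known for it.
    return "|".join(PART_LEMMATA.get((p, t), p) for p, t in zip(parts, tags))


print(lemmatize_compound("I'm", "pns11|vam"))  # i|be
```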

Punctuation and symbols

Punctuation and symbols "lemmatize" to themselves. Foreign words (marked by one of the foreign part of speech tags) and singular nouns are left untouched by MorphAdorner's lemmatizer -- the original spelling is considered the lemma form.

Ambiguous lemmata

The lemma form for some words is ambiguous. For example, "axes" is the plural form of both "axe" and "axis". MorphAdorner returns one of the possible forms (e.g., "axe" for "axes"). This may not be the correct form in some cases.
