Northwestern University Information Technology
MorphAdorner V2.0
Given a (spelling, NUPOS part of speech) pair, MorphAdorner first checks whether the currently active word lexicon contains a lemma for that combination. If so, MorphAdorner returns the lemma specified by the lexicon.
Consider the spelling pair (striking, vvg). MorphAdorner's 19th century English lexicon defines the lemma strike for this combination of spelling and NUPOS part of speech.
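The lexicon check can be pictured as a plain table lookup. The following is a minimal sketch in Python, not MorphAdorner's own code: the name lexicon_lemma and the one-entry table are hypothetical and stand in for the full word lexicon.

```python
# Hypothetical miniature lexicon: (spelling, NUPOS tag) -> lemma.
# The real word lexicon is far larger; only the example pair is listed.
LEXICON = {
    ("striking", "vvg"): "strike",
}


def lexicon_lemma(spelling, pos, lexicon=LEXICON):
    """Return the lexicon's lemma for the pair, or None when it has no entry."""
    return lexicon.get((spelling, pos))
```

A None result corresponds to the case where the pair is not in the lexicon and the general lemmatizer has to take over.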
When the (spelling, part of speech) combination is not found in the current word lexicon, MorphAdorner uses its general English lemmatizer, which is based on a list of irregular forms and a set of grammar rules. The lemmatizer is not tied to a specific part of speech set. Instead, it categorizes irregular forms and rules by major part of speech class, such as noun, verb, and adverb.
The NUPOS (or other) part of speech is converted to one of these major word classes for the purposes of lemmatization. In our example above, the NUPOS gerund tag vvg maps to the verb class. The lemmatizer then processes the spelling pair (striking, verb) by first checking the list of irregular forms, and second applying rules of detachment if needed.
When the spelling pair appears in the irregular forms list, the lemmatizer returns the lemma specified in that list.
In our example, striking does not appear on the irregular forms list.
On the other hand, the spelling pair (mice, noun) does appear on the irregular forms list, which specifies that mouse is the lemma form for mice.
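The two steps just described, mapping the tag to a major word class and then consulting the irregular forms list, might look like the sketch below. The table contents and the function name irregular_lemma are assumptions for illustration; in particular, treating n2 as the plural noun tag is an assumption, and only the tags needed for the examples are mapped.

```python
# Assumed mapping from part of speech tags to major word classes.
# Only the tags used in the examples appear here.
MAJOR_CLASS = {"vvg": "verb", "n2": "noun"}

# Hypothetical excerpt of the irregular forms list: (spelling, class) -> lemma.
IRREGULAR_FORMS = {("mice", "noun"): "mouse"}


def irregular_lemma(spelling, pos):
    """Look the spelling up in the irregular forms list by major word class.

    Returns None when the pair is not listed, which signals that the
    rules of detachment should be tried next.
    """
    word_class = MAJOR_CLASS.get(pos)
    return IRREGULAR_FORMS.get((spelling, word_class))
```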
When the spelling pair does not appear in the irregular forms list, the lemmatizer begins a series of rule matches for the major word class. Each rule specifies an affix pattern to match and a replacement pattern which generates the lemma form. Once a replacement has been effected, the lemmatization process is complete. These rules are often called rules of detachment because the affixes are detached from the inflected word form to produce the lemma form.
In the case of striking, the first match occurs against the rule:
CVCing CVCe
which says "match a consonant, followed by a vowel, followed by a consonant, followed by ing at the end of the word." The replacement string says to keep the consonant, the vowel, and the consonant, but to replace ing with e. The result is that striking is lemmatized to strike.
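One way to read this rule is as a regular expression anchored at the end of the word. The sketch below is an illustration only: the function name detach_cvc_ing is hypothetical, and treating a, e, i, o, u as the vowels and every other letter as a consonant is a simplifying assumption rather than MorphAdorner's documented definition.

```python
import re

# Simplifying assumption: vowels are a, e, i, o, u; all other letters
# (including y) count as consonants.
CONSONANT = "[bcdfghjklmnpqrstvwxyz]"
VOWEL = "[aeiou]"

# "CVCing" at the end of the word; the CVC part is captured so it can be kept.
CVC_ING = re.compile(f"({CONSONANT}{VOWEL}{CONSONANT})ing$")


def detach_cvc_ing(word):
    """Apply 'CVCing -> CVCe'; return None when the pattern does not match."""
    if CVC_ING.search(word) is None:
        return None
    # Keep the captured consonant-vowel-consonant and replace "ing" with "e".
    return CVC_ING.sub(r"\1e", word)
```

For striking, the pattern matches the final "riking", so the captured "rik" is kept and "ing" becomes "e".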
Some words require the application of multiple sets of detachment rules. For example, the word "astoundingly" is an adverb formed from a present participle. The lemmatizer first applies the adverb rules to remove the "ly" producing "astounding", then applies the verb rules to produce "astound" as the lemma form.
Once a successful substitution occurs, the lemmatization process stops.
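The astoundingly example can be sketched as one pass per word class, where each pass stops at its first successful substitution. Everything in this block is hypothetical: the rule tables hold only the two rules needed for the example, and the caller supplies the order of word classes.

```python
# Hypothetical rule tables: word class -> ordered (suffix, replacement) rules.
RULES = {
    "adverb": [("ly", "")],
    "verb": [("ing", "")],
}


def apply_rules(word, word_class):
    """Apply the first matching rule for the class, then stop."""
    for suffix, replacement in RULES[word_class]:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule matched; the word is returned unchanged


def lemmatize_chain(word, classes):
    """Run one rule set per word class, feeding each result to the next."""
    for word_class in classes:
        word = apply_rules(word, word_class)
    return word
```

Calling lemmatize_chain("astoundingly", ["adverb", "verb"]) first removes "ly" and then removes "ing".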
The reduced form for some endings is ambiguous. For example, the lemma for the past tense of a verb ending in "ored" may end in "ore" (e.g., implored -> implore) or in "or" (e.g., colored -> color). To help disambiguate such cases, a lemmatization rule can specify that the candidate lemma formed by applying the rule must appear in a known word list. MorphAdorner uses a large list of standard word forms taken from the 1911 Webster's Dictionary and other sources.
For example, consider the rule sequence:
+ ored ore
ored or
The first rule says to replace "ored" with "ore" and check that the result is a known word (that's what the "+" denotes). When the result is not a known word, the rule is bypassed, and the following rule which replaces "ored" with "or" is used instead.
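The rule sequence can be sketched as an ordered list in which some rules carry a "must be a known word" flag. The known word set below is a two-word stand-in for the real list, and the names ORED_RULES and lemmatize_ored are hypothetical.

```python
# Tiny stand-in for the known word list (the real list is far larger).
KNOWN_WORDS = {"implore", "color"}

# (must_be_known, suffix, replacement), in the order given in the text.
ORED_RULES = [
    (True, "ored", "ore"),   # "+ ored ore": result must be a known word
    (False, "ored", "or"),   # "ored or": applied unconditionally
]


def lemmatize_ored(word, known_words=KNOWN_WORDS):
    """Try each rule in order; skip a '+' rule whose result is not a known word."""
    for must_be_known, suffix, replacement in ORED_RULES:
        if not word.endswith(suffix):
            continue
        candidate = word[: len(word) - len(suffix)] + replacement
        if must_be_known and candidate not in known_words:
            continue  # rule bypassed; fall through to the next rule
        return candidate
    return word
```

With this stand-in list, implored is resolved by the first rule, while colored falls through to the second because "colore" is not in the set.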
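Examples: implored lemmatizes to implore because "implore" is a known word, while colored lemmatizes to color because the candidate "colore" is not.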
Words containing more than one part of speech require special handling. MorphAdorner attempts to split such words at a logical point and to assign a separate lemma to each word part using the process above. For example, the spelling I'm with the compound NUPOS part of speech pns11|vam (the vertical bar separates the parts of speech) is split into two pairs, one for I with pns11 and one for 'm with vam.
The first pair lemmatizes to i and the second pair to be, giving the compound lemma form i|be.
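A rough sketch of the compound case follows. It assumes the split point is the apostrophe, with the apostrophe kept on the second part; the text above does not spell out how MorphAdorner chooses the split, so that choice, the per-part lemma table, and both function names are hypothetical.

```python
# Hypothetical per-part lemma table standing in for the full lookup process.
PART_LEMMAS = {("I", "pns11"): "i", ("'m", "vam"): "be"}


def split_compound(spelling, compound_pos):
    """Pair each word part with its part of speech tag."""
    tags = compound_pos.split("|")
    # Assumption: split at the apostrophe and keep it with the second part.
    head, sep, tail = spelling.partition("'")
    parts = [head, sep + tail]
    return list(zip(parts, tags))


def compound_lemma(spelling, compound_pos):
    """Lemmatize each part and join the lemmas with a vertical bar."""
    pairs = split_compound(spelling, compound_pos)
    return "|".join(PART_LEMMAS[pair] for pair in pairs)
```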
Certain irregular compound forms such as gimme, a contraction of "give me", appear under the compound entry in the irregular forms list. The lemma form for gimme is give|i.
Punctuation and symbols "lemmatize" to themselves. Foreign words (marked by one of the foreign part of speech tags) and singular nouns are left untouched by MorphAdorner's lemmatizer -- the original spelling is considered the lemma form.
The lemma form for some words is ambiguous. For example, "axes" is the plural form of both "axe" and "axis". MorphAdorner returns only one of the possible forms (e.g., "axe" for "axes"), which may not be the correct form in a given context.
Academic Technologies and Research Services,
NU Library 2East, 1970 Campus Drive, Evanston, IL 60208.