Package edu.northwestern.at.morphadorner.corpuslinguistics.lemmatizer

Lemmatization.

Lemmatization is the process of reducing an inflected spelling to its lexical root or lemma form. The lemma form is the base form or head word form you would find in a dictionary. The combination of the lemma form with its word class (noun, verb, etc.) is called the lexeme.

In English, the base form for a verb is the simple infinitive. For example, the gerund "striking" and the past form "struck" are both forms of the lemma "(to) strike". The base form for a noun is the singular form. For example, the plural "mice" is a form of the lemma "mouse."

Most English spellings can be lemmatized using regular rules of English grammar, as long as the word class is known. MorphAdorner uses a list of about 150 such rules. Some spellings require special handling because they don't follow the general rules. These irregular forms include verbs like "to catch" (past form "caught") and nouns like "mouse" (plural "mice"). MorphAdorner includes a list of about 1,500 irregular forms.
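The lookup order described above (irregular forms first, then regular rules chosen by word class) can be sketched as follows. This is a minimal illustration only: the class name, the tiny rule tables, and the word class labels "noun" and "verb" are assumptions made for the example and are not MorphAdorner's actual rules or API. A realistic rule set would also need rules that restore a final "e" (as in "striking") or undo consonant doubling.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; not MorphAdorner's rule set or API. */
class RuleLemmatizerSketch {
    // Irregular forms keyed by "wordClass:spelling"; consulted before any rule.
    private static final Map<String, String> IRREGULAR = Map.of(
        "noun:mice", "mouse",
        "verb:caught", "catch",
        "verb:struck", "strike");

    // Suffix rules per word class (suffix -> replacement), tried in insertion order.
    private static final Map<String, String> NOUN_RULES = new LinkedHashMap<>();
    private static final Map<String, String> VERB_RULES = new LinkedHashMap<>();
    static {
        NOUN_RULES.put("ies", "y"); // ponies -> pony
        NOUN_RULES.put("s", "");    // horses -> horse
        VERB_RULES.put("ing", "");  // walking -> walk
        VERB_RULES.put("ed", "");   // walked -> walk
        VERB_RULES.put("s", "");    // walks -> walk
    }

    static String lemmatize(String spelling, String wordClass) {
        String word = spelling.toLowerCase(Locale.ROOT);

        // 1. Irregular forms take precedence over the regular rules.
        String irregular = IRREGULAR.get(wordClass + ":" + word);
        if (irregular != null) {
            return irregular;
        }

        // 2. Pick the rule table for the word class; unknown classes pass through.
        Map<String, String> rules;
        if ("noun".equals(wordClass)) {
            rules = NOUN_RULES;
        } else if ("verb".equals(wordClass)) {
            rules = VERB_RULES;
        } else {
            return word;
        }

        // 3. Apply the first rule whose suffix matches and leaves a stem of 2+ letters.
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            String suffix = rule.getKey();
            if (word.endsWith(suffix) && word.length() > suffix.length() + 1) {
                return word.substring(0, word.length() - suffix.length()) + rule.getValue();
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(lemmatize("mice", "noun"));
        System.out.println(lemmatize("walking", "verb"));
    }
}
```

Keeping the irregular list separate from the rules means an exception can be added without reordering or rewriting any rule.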

The lemma form of a spelling depends upon its word class. Thus the noun "bee" has "bee" as a lemma form, while "bee" as a verb (an old spelling of "be") has "(to) be" as a lemma form. Such ambiguity is a bigger problem in Early Modern English than in contemporary English because spelling was not reasonably standardized until the late eighteenth century. Using a standard spelling helps in finding the lemma form. For example, the gerund "strykynge" is an old spelling for "striking." By transforming the old spelling to a standardized (usually modern) spelling, we can apply the standard lemmatization rules and obtain "(to) strike" as the lemma. MorphAdorner's English lemmatizer works best with standardized spellings.
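The two-step idea above (standardize the spelling first, then lemmatize by word class) can be sketched with two small lookup tables. Everything here is a hypothetical illustration: the class name, the table contents, and the "wordClass:spelling" key format are assumptions for the example, not MorphAdorner's data or API.

```java
import java.util.Map;

/** Hypothetical sketch; not MorphAdorner's data or API. */
class StandardizeThenLemmatizeSketch {
    // Old spelling -> standardized spelling, keyed by "wordClass:spelling".
    private static final Map<String, String> STANDARD = Map.of(
        "verb:strykynge", "striking",
        "verb:bee", "be");

    // Standardized spelling -> lemma, keyed by "wordClass:spelling".
    private static final Map<String, String> LEMMA = Map.of(
        "verb:striking", "strike",
        "verb:be", "be",
        "noun:bee", "bee");

    // Returns the standardized spelling, or the input unchanged if none is known.
    static String standardize(String spelling, String wordClass) {
        return STANDARD.getOrDefault(wordClass + ":" + spelling, spelling);
    }

    // Standardizes first, then looks up the lemma for the standardized form.
    static String lemmatize(String spelling, String wordClass) {
        String standard = standardize(spelling, wordClass);
        return LEMMA.getOrDefault(wordClass + ":" + standard, standard);
    }

    public static void main(String[] args) {
        System.out.println(lemmatize("strykynge", "verb"));
        System.out.println(lemmatize("bee", "noun"));
        System.out.println(lemmatize("bee", "verb"));
    }
}
```

Because both tables are keyed by word class, the same spelling "bee" resolves to different lemmata depending on whether it was tagged as a noun or a verb.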

Another problem area is the use of "'s" to mark the possessive. Sixteenth- and seventeenth-century English texts generally did not use "'s" for the possessive form. Thus a phrase like "his majesty's horses" might appear as "his majesties horses." Handling this problem requires part-of-speech tagging in tandem with spelling standardization.
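As a rough illustration of why the tagger and the standardizer must cooperate, the sketch below rewrites a spelling only when a tagger has already marked it as possessive. The class name, the boolean flag standing in for a part-of-speech tag, and the two suffix heuristics are assumptions made for this example; they are not MorphAdorner's method.

```java
/** Hypothetical sketch; not MorphAdorner's method or API. */
class PossessiveSketch {
    // Restores a modern "'s" possessive, but only for tokens tagged as possessive.
    static String standardize(String spelling, boolean taggedPossessive) {
        // Plurals and already-standard possessives are left untouched.
        if (!taggedPossessive || spelling.contains("'")) {
            return spelling;
        }
        // "majesties" (possessive) -> "majesty's"
        if (spelling.endsWith("ies") && spelling.length() > 3) {
            return spelling.substring(0, spelling.length() - 3) + "y's";
        }
        // "kings" (possessive) -> "king's"
        if (spelling.endsWith("s") && spelling.length() > 1) {
            return spelling.substring(0, spelling.length() - 1) + "'s";
        }
        return spelling;
    }

    public static void main(String[] args) {
        System.out.println(standardize("majesties", true));
        System.out.println(standardize("majesties", false));
    }
}
```

Without the tag, "majesties" is indistinguishable from an ordinary plural, which is why spelling standardization alone cannot resolve it.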

Harder still is the disambiguation of homonyms like "lie" or "bark". There are a few hundred (at most) such pairs in English. In the future we may be able to distinguish which homonym is meant in some situations using methods collectively called word sense disambiguation. That would allow more accurate lemmatization for homonyms.

All MorphAdorner lemmatizers must implement the Lemmatizer interface. The LemmatizerFactory provides the mechanism for instantiating a default or specified instance of a lemmatizer implementation.
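The general shape of an interface plus a factory that returns either a default or a named implementation can be sketched as below. All names here (SketchLemmatizer, SketchLemmatizerFactory, the two toy implementations, and the registry keys) are hypothetical and chosen for illustration; consult the Lemmatizer and LemmatizerFactory documentation for the actual method signatures.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for a lemmatizer contract; not MorphAdorner's interface. */
interface SketchLemmatizer {
    String lemmatize(String spelling, String wordClass);
}

/** Toy implementation: returns the spelling unchanged. */
class IdentitySketchLemmatizer implements SketchLemmatizer {
    public String lemmatize(String spelling, String wordClass) {
        return spelling;
    }
}

/** Toy implementation: only lowercases the spelling. */
class LowercaseSketchLemmatizer implements SketchLemmatizer {
    public String lemmatize(String spelling, String wordClass) {
        return spelling.toLowerCase(Locale.ROOT);
    }
}

/** Hypothetical factory: hands out a default or a named implementation. */
class SketchLemmatizerFactory {
    private static final String DEFAULT_NAME = "lowercase";
    private static final Map<String, Supplier<SketchLemmatizer>> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("identity", IdentitySketchLemmatizer::new);
        REGISTRY.put("lowercase", LowercaseSketchLemmatizer::new);
    }

    // Returns a new instance of the default implementation.
    static SketchLemmatizer newLemmatizer() {
        return newLemmatizer(DEFAULT_NAME);
    }

    // Returns a new instance of the named implementation, or fails loudly.
    static SketchLemmatizer newLemmatizer(String name) {
        Supplier<SketchLemmatizer> supplier = REGISTRY.get(name);
        if (supplier == null) {
            throw new IllegalArgumentException("Unknown lemmatizer: " + name);
        }
        return supplier.get();
    }
}
```

Callers depend only on the interface, so an implementation can be swapped by changing the name passed to the factory rather than the calling code.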