Northwestern University Information Technology
MorphAdorner V2.0
A program like MorphAdorner assigns a part of speech tag to each token in an input text, e.g., this word token is a noun or this token is a period. This task is difficult since many words can take on more than one part of speech. Determining which part of speech applies to a particular word occurrence depends upon the context in which the word appears.
A set of training data specifies a large number of words along with their potential parts of speech in actual reading contexts. This combination of known words and parts of speech, along with statistical methods and/or context rules, allows a program like MorphAdorner to assign correct parts of speech to words in new texts about 97% of the time, as long as all the words in the new texts are known. A word is known if it has been encountered in the training data with all its possible parts of speech, or if it appears in a supplemental dictionary along with its parts of speech.
Unfortunately, many words in new texts will not have been seen in the training data and will not occur in a supplemental dictionary. For such an unknown word, a program like MorphAdorner must "guess" the relevant possible parts of speech before it can assign a proper part of speech tag in context.
MorphAdorner uses a variety of techniques to guess the possible parts of speech for an unknown word. The default MorphAdorner guesser applies the following methods, in order, until at least one potential part of speech is identified. A programmer can modify or replace this default guesser, and several MorphAdorner configuration settings allow you to modify the guessing process as well.
Is the word punctuation?
Examples: period, quote mark, question mark, sequence of periods
Assign the punctuation or punctuation class as the part of speech.
Is the word a symbol?
Example: a paragraph mark (¶)
Assign the symbol class as the part of speech.
Is the word a cardinal number?
Examples: 12, 12.5
Assign the cardinal number class as the part of speech.
Is the word an ordinal number?
Examples: 1st, 12th
Assign the ordinal number class as the part of speech.
Is the word a currency amount?
Examples: $12.50, 1L, 1£, £10
Assign the cardinal number class as the part of speech.
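The three numeric checks above can be sketched with regular expressions. This is a minimal illustration and not MorphAdorner's actual code: the patterns, the function name guess_numeric_class, and the descriptive class labels are all assumptions chosen to match the examples given in the text.

```python
import re

# Hypothetical patterns, written only to cover the examples in the text.
CARDINAL = re.compile(r"\d+(?:\.\d+)?")
ORDINAL = re.compile(r"\d+(?:st|nd|rd|th)", re.IGNORECASE)
# A currency sign before the amount, or a sign (or "L" for pounds) after it.
CURRENCY = re.compile(r"[$£]\d+(?:\.\d+)?|\d+(?:\.\d+)?[L£]")


def guess_numeric_class(token):
    """Return a descriptive class label, or None if no pattern matches."""
    if CARDINAL.fullmatch(token):
        return "cardinal number"
    if ORDINAL.fullmatch(token):
        return "ordinal number"
    if CURRENCY.fullmatch(token):
        # The text assigns currency amounts the cardinal number class.
        return "cardinal number"
    return None


for token in ["12", "12.5", "1st", "12th", "$12.50", "1L", "1£", "£10", "head"]:
    print(token, "->", guess_numeric_class(token))
```

The checks run in the same order as the steps in the text, so a token is classified by the first pattern it satisfies.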
Is the word a Roman numeral?
Examples: I, V, IX, .IX., .IX, MMM, IIIJ
Assign the cardinal number class as the part of speech. For Roman numerals that can also be initials (I, V) or English pronouns (I), add the proper noun and appropriate pronoun classes as well.
Note that older texts use a much looser definition of a Roman numeral than contemporary usage does.
Is the word an ordinal Roman numeral?
Examples: xviith
Assign the ordinal number part of speech class.
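A loose Roman numeral test of the kind described above might look like the following sketch. The patterns are assumptions built only from the listed examples (optional surrounding periods, a trailing "j" standing in for a final "i", an ordinal ending), and the function name and class labels are hypothetical; the real MorphAdorner rules may differ.

```python
import re

# Assumed loose patterns: any run of Roman numeral letters is accepted,
# without checking that the run forms a well-formed modern numeral.
ROMAN = re.compile(r"\.?[ivxlcdm]+j?\.?", re.IGNORECASE)
ROMAN_ORDINAL = re.compile(r"[ivxlcdm]+(?:st|nd|rd|th)", re.IGNORECASE)


def guess_roman_classes(token):
    """Return a list of descriptive class labels, empty if not a numeral."""
    if ROMAN_ORDINAL.fullmatch(token):
        return ["ordinal number"]
    if ROMAN.fullmatch(token):
        classes = ["cardinal number"]
        bare = token.strip(".")
        if bare in ("I", "V"):
            classes.append("proper noun")  # may also be an initial
        if bare == "I":
            classes.append("pronoun")      # may also be the English pronoun
        return classes
    return []


for token in ["I", "V", "IX", ".IX.", ".IX", "MMM", "IIIJ", "xviith", "sea"]:
    print(token, "->", guess_roman_classes(token))
```

A pattern this loose also accepts ordinary words made only of these letters, such as "mix" or "did". Because the guesser runs only on unknown words, such words would normally have been resolved from the training data or dictionaries before this step is reached.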
Is the word hyphenated?
Examples: head-master, sea-serpent
MorphAdorner extracts the part of the word after the last hyphen. If that is a known word, assign its part of speech classes.
The following cases are treated specially.
Is a spelling standardizer defined?
If so, get the parts of speech for the standardized spelling.
Example: "vniversitie" regularizes to "university"
Assign the part of speech classes for "university" if known.
Is the word a proper name?
MorphAdorner defines some auxiliary word lists of proper names for people and places. If the word appears on one of these "name" lists, assign the proper noun class.
Is the word defined by an auxiliary word list?
MorphAdorner also defines auxiliary word lists that pair words with their possible part of speech classes. If the word appears on one of these lists, assign the part of speech classes the list associates with it.
Is the word an abbreviation?
Examples: U.S., p.m.
If the word appears to be an abbreviation, assign a proper noun class if it begins with a capital letter, or a common noun class if it does not begin with a capital letter.
Is a suffix lexicon defined?
If so, perform the following suffix analysis.
For each successively shorter ending substring of the word, look up that substring in the suffix lexicon. If the substring exists in the suffix lexicon, assign its part of speech classes as those of the unknown word.
Example: reputedly
Look up the successively shorter terminal strings:
reputedly
eputedly
putedly
utedly
tedly
edly
dly
ly
y
and stop at the first of those suffix strings which appears in the suffix lexicon, and use the associated part of speech classes.
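The suffix lookup described above can be sketched as a loop over successively shorter endings. This is an illustration only: the function name guess_from_suffix, the dictionary representation of the suffix lexicon, and the toy entries are assumptions, not MorphAdorner's actual data structures.

```python
def guess_from_suffix(word, suffix_lexicon):
    """Return (suffix, classes) for the longest ending found in the lexicon.

    suffix_lexicon is assumed to be a dict mapping an ending string to a
    list of part of speech classes. Returns (None, []) if nothing matches.
    """
    for start in range(len(word)):
        suffix = word[start:]  # "reputedly", "eputedly", ..., "ly", "y"
        classes = suffix_lexicon.get(suffix)
        if classes:
            return suffix, classes
    return None, []


# Toy lexicon with made-up entries, for demonstration only.
toy_lexicon = {
    "edly": ["adverb"],
    "ly": ["adverb", "adjective"],
}

print(guess_from_suffix("reputedly", toy_lexicon))
print(guess_from_suffix("xyz", toy_lexicon))
```

Because the loop starts with the whole word and shortens it from the left, the first hit is always the longest matching ending. With the toy lexicon, "reputedly" stops at "edly" and never reaches "ly".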
Is the word entirely in upper case?
Example: MCDOODLE
Assign the singular proper noun part of speech class.
If all else fails, assume the word is a noun.
If the word begins with a capital letter and ends with "s", assume it is a plural proper noun.
If the word begins with a capital letter and does not end with "s", assume it is a singular proper noun.
If the word does not begin with a capital letter and ends with "s", assume it is a plural common noun.
If the word does not begin with a capital letter and does not end with "s", assume it is a singular common noun.
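The overall control flow, trying each method in order and falling back to the noun rules above, can be sketched as follows. The function names, the list-of-functions design, and the descriptive class labels are assumptions made for illustration; they do not reflect MorphAdorner's actual API or tag names.

```python
def guess_noun_class(word):
    """Last-resort rule: pick a noun class from capitalization and a final 's'."""
    proper = word[:1].isupper()
    plural = word.endswith("s")
    number = "plural" if plural else "singular"
    kind = "proper" if proper else "common"
    return f"{number} {kind} noun"


def guess_all_upper(word):
    """Hypothetical guesser for the 'entirely in upper case' step."""
    return ["singular proper noun"] if word.isupper() else []


def first_guess(word, guessers):
    """Apply guessers in order and return the first non-empty result."""
    for guesser in guessers:
        classes = guesser(word)
        if classes:
            return classes
    # If all else fails, assume the word is a noun.
    return [guess_noun_class(word)]


for word in ["MCDOODLE", "Smiths", "Evanston", "widgets", "widget"]:
    print(word, "->", first_guess(word, [guess_all_upper]))
```

Keeping each method as a separate function in an ordered list makes it straightforward to add, remove, or reorder steps, which mirrors the text's statement that the default guesser can be modified or replaced.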
Academic Technologies and Research Services,
NU Library 2East, 1970 Campus Drive, Evanston, IL 60208.