Northwestern University Information Technology
Guessing Parts Of Speech For Unknown Words

A program like MorphAdorner assigns a part of speech tag to each token in an input text, marking, for example, one token as a noun and another as a period. This task is difficult because many words can take on more than one part of speech. Determining which part of speech applies to a particular word occurrence depends upon the context in which the word appears.

A set of training data specifies a large number of words along with their potential parts of speech in actual reading contexts. This combination of known words and parts of speech, along with statistical methods and/or context rules, allows a program like MorphAdorner to assign correct parts of speech to words in new texts about 97% of the time, as long as all the words in the new texts are known. A word is "known" if it has been encountered in the training data with all its possible parts of speech, or if it appears in a supplemental dictionary along with its parts of speech.

Unfortunately, many words in new texts will not have been seen in the training data and will not occur in a supplemental dictionary. This means a program like MorphAdorner must "guess" the possible parts of speech for an unknown word before it can assign a proper part of speech tag in context.

MorphAdorner uses a variety of techniques to guess the possible parts of speech for an unknown word. The default MorphAdorner guesser applies the following methods, in order, until at least one potential part of speech is identified. A programmer can modify or replace this default guesser, and several MorphAdorner configuration settings allow you to modify the guessing process as well.

  1. Is the word punctuation?

    Examples: period, quote mark, question mark, sequence of periods

    Assign the punctuation or punctuation class as the part of speech.

  2. Is the word a symbol?

    Example: a paragraph mark (¶)

    Assign the symbol class as the part of speech.

  3. Is the word a cardinal number?

    Examples: 12, 12.5

    Assign the cardinal number class as the part of speech.

  4. Is the word an ordinal number?

    Examples: 1st, 12th

    Assign the ordinal number class as the part of speech.

  5. Is the word a currency amount?

    Examples: $12.50, 1L, 1£, £10

    Assign the cardinal number class as the part of speech.
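Steps 3 through 5 can be sketched in Python with a few regular expressions. This is a minimal illustration, not MorphAdorner's actual code: the patterns, the function name `guess_number_class`, the class labels, and the set of currency markers handled (a leading `$` or `£`, a trailing `L` or `£`) are all assumptions made for the example.

```python
import re

# Illustrative patterns only; MorphAdorner's real rules may differ.
CARDINAL = re.compile(r"[+-]?\d+(\.\d+)?")
ORDINAL = re.compile(r"\d+(st|nd|rd|th)", re.IGNORECASE)
CURRENCY = re.compile(r"[$£]\d+(\.\d+)?|\d+(\.\d+)?[L£]")


def guess_number_class(token):
    """Return a placeholder class label for number-like tokens, else None."""
    if CARDINAL.fullmatch(token):
        return "cardinal-number"
    if ORDINAL.fullmatch(token):
        return "ordinal-number"
    if CURRENCY.fullmatch(token):
        # Step 5 assigns currency amounts the cardinal number class.
        return "cardinal-number"
    return None
```

The checks run in the same order as the steps, so a plain number is classified before the ordinal and currency patterns are tried.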

  6. Is the word a Roman numeral?

    Examples: I, V, IX, .IX., .IX, MMM, IIIJ

    Assign the cardinal number class as the part of speech. For Roman numerals that can also be initials (I, V) or English pronouns (I), add the proper noun and appropriate pronoun classes as well.

    Note that older texts use a much looser definition of a Roman numeral than contemporary usage does.

  7. Is the word an ordinal Roman numeral?

    Examples: xviith

    Assign the ordinal number part of speech class.
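A loose Roman numeral test for steps 6 and 7 might look like the sketch below. It assumes that any run of Roman numeral letters counts, that "j" may stand in for "i", and that periods may wrap the numeral, which covers the examples given above. The patterns, the function name `guess_roman_classes`, and the class labels are hypothetical and are not taken from MorphAdorner.

```python
import re

# Loose patterns: any run of Roman numeral letters (with "j" as a
# variant of "i"), optionally wrapped in periods. No check is made
# that the letters form a well-formed numeral.
ROMAN = re.compile(r"\.?[ivxlcdmj]+\.?", re.IGNORECASE)
ROMAN_ORDINAL = re.compile(r"\.?[ivxlcdmj]+(st|nd|rd|th)\.?", re.IGNORECASE)


def guess_roman_classes(token):
    """Return placeholder class labels for Roman numerals, else an empty list."""
    if ROMAN_ORDINAL.fullmatch(token):
        return ["ordinal-number"]
    if ROMAN.fullmatch(token):
        classes = ["cardinal-number"]
        if token in ("I", "V"):
            # These numerals can also be initials.
            classes.append("proper-noun")
        if token == "I":
            # "I" is also an English pronoun.
            classes.append("pronoun")
        return classes
    return []
```

A pattern this loose also accepts ordinary words built from the same letters, such as "mild". That matters less here because only words missing from the training data and dictionaries reach the guesser.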

  8. Is the word hyphenated?

    Examples: head-master, sea-serpent

    MorphAdorner extracts the part of the word after the last hyphen. If that is a known word, assign its part of speech classes.
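The basic lookup on the final component can be sketched as follows. The `lexicon` argument is a hypothetical dictionary from known words to class labels, standing in for MorphAdorner's word lexicon, and the sketch does not cover the special cases for dashes.

```python
def guess_hyphenated(word, lexicon):
    """Guess classes for a hyphenated word from its final component.

    `lexicon` is a hypothetical dict mapping known words to lists of
    part of speech class labels.
    """
    if "-" not in word:
        return []
    # Keep only the text after the last hyphen.
    last_part = word.rsplit("-", 1)[1]
    return list(lexicon.get(last_part, []))
```

For example, with a lexicon containing "master" as a noun, `guess_hyphenated("head-master", lexicon)` returns the classes stored for "master". An unknown final component yields an empty list, so the next guessing method can be tried.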

    The following cases are treated specially.

    • a letter followed by ---'s is considered a possessive noun.
    • ---'s or ---'S is considered a possessive noun.
    • a letter followed by --- is considered a proper or common possessive noun, or an exclamation.

  9. Is a spelling standardizer defined?

    If so, get the parts of speech for the standardized spelling.

    Example: "vniversitie" regularizes to "university"

    Assign the part of speech classes for "university" if known.

  10. Is the word a proper name?

    MorphAdorner defines some auxiliary word lists of proper names for people and places. If the word appears on one of these "name" lists, assign the proper noun class.

  11. Is the word defined by an auxiliary word list?

    MorphAdorner also defines auxiliary word lists that pair words with their possible part of speech classes. If the word appears on one of these lists, assign the part of speech classes the list associates with it.

  12. Is the word an abbreviation?

    Examples: U.S., p.m.

    If the word appears to be an abbreviation, assign a proper noun class if it begins with a capital letter, or a common noun class if it does not begin with a capital letter.

  13. Is a suffix lexicon defined?

    If so, perform the following suffix analysis.

    For each successively shorter ending substring of the word, look up that substring in the suffix lexicon. If the substring exists in the suffix lexicon, assign its part of speech classes as those of the unknown word.

    Example: reputedly

    Look up the successively shorter terminal strings:

    reputedly
    eputedly
    putedly
    utedly
    tedly
    edly
    dly
    ly
    y

    and stop at the first of those suffix strings which appears in the suffix lexicon, and use the associated part of speech classes.
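The suffix analysis amounts to a longest-match search over the word's endings. In the sketch below, `suffix_lexicon` is a hypothetical dictionary from suffix strings to class labels; its contents and the labels in the usage note are invented for illustration.

```python
def guess_by_suffix(word, suffix_lexicon):
    """Return (matched_suffix, classes) for the longest known ending.

    `suffix_lexicon` is a hypothetical dict mapping suffix strings to
    lists of part of speech class labels. Returns (None, []) when no
    ending of the word is in the lexicon.
    """
    # Start with the whole word, then drop one leading character at a time.
    for start in range(len(word)):
        suffix = word[start:]
        if suffix in suffix_lexicon:
            return suffix, list(suffix_lexicon[suffix])
    return None, []
```

With a lexicon containing only "ly" and "y", the search on "reputedly" tries the seven longer endings first, finds none of them, and stops at "ly" without ever consulting "y".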

  14. Is the word entirely in upper case?

    Example: MCDOODLE

    Assign the singular proper noun part of speech class.

  15. If all else fails, assume the word is a noun.

    If the word begins with a capital letter and ends with "s", assume it is a plural proper noun.

    If the word begins with a capital letter and does not end with "s", assume it is a singular proper noun.

    If the word does not begin with a capital letter and ends with "s", assume it is a plural common noun.

    If the word does not begin with a capital letter and does not end with "s", assume it is a singular common noun.
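Taken together, the steps form a cascade: each method is tried in order, the first one that yields at least one part of speech wins, and step 15 guarantees an answer. The sketch below shows that control flow along with steps 14 and 15. The function names and class labels are hypothetical, and the sketch is not MorphAdorner's own guesser.

```python
def guess_all_upper(word):
    """Step 14: an all upper case word is taken as a singular proper noun."""
    return ["singular-proper-noun"] if word.isupper() else []


def guess_default_noun(word):
    """Step 15: last-resort noun guess from capitalization and a final "s"."""
    proper = word[:1].isupper()
    plural = word.endswith("s")
    if proper:
        return ["plural-proper-noun"] if plural else ["singular-proper-noun"]
    return ["plural-common-noun"] if plural else ["singular-common-noun"]


def guess_classes(word, guessers):
    """Apply each guesser in order and return the first non-empty result.

    `guessers` is a list of functions that take a word and return a
    (possibly empty) list of class labels.
    """
    for guesser in guessers:
        classes = guesser(word)
        if classes:
            return classes
    # Nothing matched, so fall back to the noun heuristics.
    return guess_default_noun(word)
```

A full cascade would pass one guesser per step in the order listed above; here `guess_classes("MCDOODLE", [guess_all_upper])` is resolved by step 14, while a word such as "gadgets" falls through to the step 15 defaults.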
