NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
Part of Speech Tagger

Part of speech tagging is the process of adorning or "tagging" words in a text with each word's corresponding part of speech. Part of speech tagging is based both on the meaning of the word and its positional relationship with adjacent words. A simple list of the parts of speech for English includes adjective, adverb, conjunction, noun, preposition, pronoun, and verb. For computational purposes, however, each of these major word classes is usually subdivided to reflect more granular syntactic and morphological structure.

MorphAdorner can adorn each spelling in a text with a part of speech. To do this, MorphAdorner requires a definition of the part of speech tag set and a training corpus: a large swath of text whose spellings have already been correctly adorned with their parts of speech. From this training data MorphAdorner can generate tagging rules, tag probability matrices, and a lexicon of known words.

MorphAdorner provides several different part of speech taggers. We expect only two will be widely used.

  • The MorphAdorner trigram tagger uses a hidden Markov model and a beam-search variant of the Viterbi algorithm. We expect this will be the primary tagger. You can read a brief description of the mathematical basis of the trigram tagger.

  • The MorphAdorner rule-based tagger is a modified version of Mark Hepple's rule-based tagger. Hepple's tagger is a variant of Eric Brill's tagger but disallows interaction between rules. We expect the Hepple tagger to be used as a secondary tagger to correct the output of the trigram tagger.
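To make the trigram approach concrete, here is a minimal sketch of Viterbi decoding over a trigram hidden Markov model with beam pruning. It is not MorphAdorner's code: the function name `beam_viterbi`, the toy probability tables, the tag names, and the fixed `floor` used in place of real smoothing are all assumptions made for illustration.

```python
import math

def beam_viterbi(words, tags, trans, emit, beam_width=3, floor=1e-6):
    """Tag `words` with a trigram HMM, keeping only the best states per word.

    trans[(t1, t2, t3)] is P(t3 | t1, t2); emit[(word, tag)] is P(word | tag).
    Missing entries fall back to `floor`, a crude stand-in for real smoothing.
    """
    start = "<s>"
    # A state is the pair of the two most recent tags.
    # Each state maps to (log probability, tag path so far).
    beam = {(start, start): (0.0, [])}
    for word in words:
        candidates = {}
        for (t1, t2), (score, path) in beam.items():
            for t3 in tags:
                p = trans.get((t1, t2, t3), floor) * emit.get((word, t3), floor)
                new_score = score + math.log(p)
                state = (t2, t3)
                # Viterbi step: keep only the best path into each state.
                if state not in candidates or new_score > candidates[state][0]:
                    candidates[state] = (new_score, path + [t3])
        # Beam step: prune all but the highest-scoring states.
        best = sorted(candidates.items(), key=lambda kv: kv[1][0], reverse=True)
        beam = dict(best[:beam_width])
    return max(beam.values(), key=lambda v: v[0])[1]

# Toy model (hypothetical numbers, not trained on any corpus).
TAGS = ["DET", "NOUN", "VERB"]
trans = {
    ("<s>", "<s>", "DET"): 0.8,
    ("<s>", "DET", "NOUN"): 0.9,
    ("<s>", "DET", "VERB"): 0.05,
    ("DET", "NOUN", "VERB"): 0.8,
    ("DET", "NOUN", "NOUN"): 0.1,
    ("DET", "VERB", "NOUN"): 0.3,
}
emit = {
    ("the", "DET"): 0.9,
    ("can", "NOUN"): 0.3,
    ("can", "VERB"): 0.6,
    ("rusts", "VERB"): 0.5,
    ("rusts", "NOUN"): 0.1,
}

words = ["the", "can", "rusts"]
tagged = beam_viterbi(words, TAGS, trans, emit)
print(tagged)  # ['DET', 'NOUN', 'VERB']
```

Although "can" is more likely to be emitted by VERB than by NOUN in this toy table, the trigram context after a determiner favors NOUN. A narrow beam trades a small risk of pruning the true best path for speed, which matters with a large tag set.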

The MorphAdorner part of speech taggers assign tags to unknown words using pattern recognition for items such as numbers and Roman numerals, and suffix analysis with successive abstraction when the pattern recognition methods fail. For example, the suffix "-ly" in English often indicates the word is an adverb, while the suffix "-ing" often indicates the word is a present participle or gerund (an obvious counterexample is "spring"). By looking at the statistical distribution of endings and part of speech tags in the training data, along with the sequence of previous parts of speech, MorphAdorner can often correctly guess the part of speech for a word it doesn't know. When all of these methods fail, the word is assumed to be a noun.
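The fallback order described above can be sketched as follows. This is an illustration under stated assumptions, not MorphAdorner's implementation: the regular expressions, the tag labels ("number", "roman-numeral", "adverb", and so on), the fixed weight `theta`, and the tiny training list are all hypothetical. The suffix step uses a simple form of successive abstraction in which the estimate for a longer suffix is blended with the estimate for the next shorter one. The sketch ignores the preceding tag sequence.

```python
import re
from collections import Counter, defaultdict

# Hypothetical patterns; real pattern recognition would cover many more cases.
NUMBER = re.compile(r"^[+-]?\d+([.,]\d+)*$")
ROMAN = re.compile(
    r"^(?=[MDCLXVI])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)

def train_suffixes(tagged_words, max_len=3):
    """Count tags for every word-final suffix of length 0..max_len."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for n in range(0, min(max_len, len(word)) + 1):
            counts[word[len(word) - n:]][tag] += 1  # n == 0 gives the empty suffix
    return counts

def guess_tag(word, counts, max_len=3, theta=0.5, default="noun"):
    """Guess a tag for an unknown word: patterns first, then suffixes, then noun."""
    if NUMBER.match(word):
        return "number"
    if ROMAN.match(word):  # uppercase only, so "mix" is not read as 1009
        return "roman-numeral"
    overall = counts.get("")
    if not overall:
        return default
    total = sum(overall.values())
    # Start from the unconditional tag distribution (empty suffix).
    probs = {tag: overall[tag] / total for tag in overall}
    used_suffix = False
    for n in range(1, min(max_len, len(word)) + 1):
        seen = counts.get(word[len(word) - n:])
        if not seen:
            break
        used_suffix = True
        n_suffix = sum(seen.values())
        # Blend the estimate for this suffix with the shorter-suffix estimate.
        probs = {
            tag: (seen[tag] / n_suffix + theta * probs[tag]) / (1 + theta)
            for tag in probs
        }
    if not used_suffix:
        return default  # every method failed: assume a noun
    return max(probs, key=probs.get)

# Tiny hypothetical training sample.
training = [
    ("quickly", "adverb"), ("slowly", "adverb"),
    ("running", "participle"), ("singing", "participle"),
    ("spring", "noun"), ("table", "noun"), ("family", "noun"),
]
counts = train_suffixes(training)

print(guess_tag("1642", counts))     # number
print(guess_tag("XIV", counts))      # roman-numeral
print(guess_tag("boldly", counts))   # adverb
print(guess_tag("jumping", counts))  # participle
print(guess_tag("zzz", counts))      # noun
```

Note that "spring" and "family" in the training sample pull some probability toward the noun reading for "-ing" and "-ly" words, but the majority evidence for each suffix still wins.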

You can see a detailed list of the pattern recognition methods MorphAdorner uses to assign parts of speech to unknown words.

Part of speech tagging of English texts from the Early Modern English period to the present raises several problems. Most part of speech tag sets for English were devised for use with modern texts. These tag sets lack the necessary tags to represent English usage that was either current at an earlier time or was archaic at its time of origin but remained current in restricted discursive environments, such as religion or poetry. The second person singular of pronouns and verb forms is the clearest example. An -n form that marks a plural present is much rarer but not uncommon as a deliberate archaism in Shakespeare's time.

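Modern taggers rely on 's or s' to identify the possessive case. They also rely on sentence-medial capitalization to extract names. These procedures don't work once you move back to the 18th century, because earlier texts use the possessive apostrophe inconsistently and often capitalize common nouns.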

By default MorphAdorner uses NUPOS, a part of speech tag set designed by Martin Mueller. NUPOS differs from modern tag sets in recognizing all morphological forms that are found in written English from Chaucer to the present. Like the tag set used for the Brown corpus, but unlike the Penn Treebank or CLAWS tag sets, NUPOS does not split off the possessive marker as a separate token, and it uses compound tags for contracted forms.

Part of speech tags tend to be somewhat inconsistent compounds of syntactic and morphological information. In NUPOS the components of each tag are kept separate, so the grammatical description of each word can be easily identified at a minimal level of granularity (~20 tags) or at a maximal level (~230 tags).

We have used the Penn Treebank tag set with MorphAdorner for a project involving modern news and journal text. MorphAdorner can use any arbitrary tag set given appropriate training data and a proper definition of the word class and major word class of each tag.
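As a rough illustration of what such a tag set definition involves, the sketch below maps a few Penn Treebank tags to a word class and a major word class, then projects a tagged sentence onto the coarser level. The tag names are genuine Penn Treebank tags, but the `TagInfo` structure, the class labels, and the `coarsen` helper are hypothetical and do not reflect MorphAdorner's own file formats or API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TagInfo:
    tag: str               # the tag as it appears in the training data
    word_class: str        # finer-grained class (hypothetical labels)
    major_word_class: str  # coarsest class (hypothetical labels)

# A small excerpt; a full definition would list every tag in the tag set.
TAG_SET = {
    info.tag: info
    for info in [
        TagInfo("NN", "common noun", "noun"),
        TagInfo("NNS", "common noun", "noun"),
        TagInfo("NNP", "proper noun", "noun"),
        TagInfo("VBD", "verb", "verb"),
        TagInfo("VBG", "verb", "verb"),
        TagInfo("JJ", "adjective", "adjective"),
        TagInfo("DT", "determiner", "determiner"),
    ]
}

def coarsen(tagged, level="major_word_class"):
    """Replace each fine-grained tag with its class at the requested level."""
    return [(word, getattr(TAG_SET[tag], level)) for word, tag in tagged]

sentence = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD")]
print(coarsen(sentence))
# [('The', 'determiner'), ('dogs', 'noun'), ('barked', 'verb')]
```

Keeping the coarse classes in the definition, rather than deriving them from tag spellings, is what lets the same tagger work with an arbitrary tag set.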

The trigram tagger assigns the correct part of speech tag about 96% to 97% of the time. The accuracy can be expected to improve as the training lexicon grows.

You can try MorphAdorner's trigram part of speech tagger online. This example only accepts plain text as input.
