Part of speech tagging is the process of adorning
or "tagging" words in a text with each word's corresponding
part of speech. Part of speech tagging is based on both the meaning
of a word and its positional relationship to adjacent words.
A simple list of the parts of speech for English includes adjective,
adverb, conjunction, noun, preposition, pronoun, and verb. For
computational purposes, however, each of these major word classes
is usually subdivided to reflect more granular syntactic and
morphological structure.

MorphAdorner can adorn each spelling in a text with a part of speech.
To do this, MorphAdorner requires a definition of the part of
speech tag set and a training corpus: a large swath of text
whose spellings are already correctly adorned with their parts of speech.
From this training data MorphAdorner can generate tagging rules,
tag probability matrices, and a lexicon of known words.
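As a rough illustration of what "generating a lexicon and tag probability data" from a training corpus can involve, the sketch below counts which tags each spelling receives and how often each sequence of three tags occurs. It is a minimal sketch, not MorphAdorner's code: the function name `train`, the `<s>` padding symbol, and the toy tags are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Collect a lexicon and trigram tag counts from (word, tag) sentences."""
    lexicon = defaultdict(Counter)   # spelling -> Counter of tags seen with it
    trigram_counts = Counter()       # (tag[i-2], tag[i-1], tag[i]) -> count
    for sentence in tagged_sentences:
        tags = ["<s>", "<s>"]        # hypothetical sentence-start padding
        for word, tag in sentence:
            lexicon[word.lower()][tag] += 1
            tags.append(tag)
        for i in range(2, len(tags)):
            trigram_counts[tuple(tags[i - 2:i + 1])] += 1
    return lexicon, trigram_counts

# Toy training corpus with illustrative tags (not an actual tag set).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("barks", "NOUN"), ("echo", "VERB")],
]
lexicon, trigram_counts = train(corpus)
```

Dividing these counts by the appropriate totals would give the probability estimates a statistical tagger needs; smoothing for unseen events is left out here.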
MorphAdorner provides several different part of speech taggers.
We expect only two will be widely used.
The MorphAdorner trigram tagger uses a hidden Markov model and
a beam-search variant of the Viterbi algorithm. We expect this
will be the primary tagger. You can read a brief description of
the mathematical basis of the trigram tagger.
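The general idea of a beam-search Viterbi decoder over a trigram hidden Markov model can be sketched as follows. This is a generic illustration under stated assumptions, not MorphAdorner's implementation: `beam_viterbi`, `trans_logp`, `emit_logp`, the beam width, and the toy scores are all invented for the example, and the toy scores are not normalized probabilities.

```python
import math

def beam_viterbi(words, tags, trans_logp, emit_logp, beam_width=3):
    """Tag `words` with a trigram HMM, keeping only the best `beam_width` states.

    A state is the pair (previous tag, current tag). `trans_logp(t2, t1, t)`
    and `emit_logp(word, t)` are caller-supplied log-score functions.
    """
    # Each beam entry maps state -> (log score, tag sequence so far).
    beam = {("<s>", "<s>"): (0.0, [])}
    for word in words:
        candidates = {}
        for (t2, t1), (score, path) in beam.items():
            for t in tags:
                new_score = score + trans_logp(t2, t1, t) + emit_logp(word, t)
                state = (t1, t)
                # Viterbi step: keep only the best path into each state.
                if state not in candidates or new_score > candidates[state][0]:
                    candidates[state] = (new_score, path + [t])
        # Beam step: prune to the highest-scoring states.
        best = sorted(candidates.items(), key=lambda kv: kv[1][0], reverse=True)
        beam = dict(best[:beam_width])
    return max(beam.values(), key=lambda entry: entry[0])[1]

# Toy model: hypothetical scores chosen only to make the example run.
TAGS = ["DET", "NOUN", "VERB"]
EMIT = {("the", "DET"): 0.9, ("dog", "NOUN"): 0.8, ("dog", "VERB"): 0.1,
        ("barks", "VERB"): 0.6, ("barks", "NOUN"): 0.3}
FAVOURED = {("<s>", "<s>", "DET"), ("<s>", "DET", "NOUN"), ("DET", "NOUN", "VERB")}

def emit_logp(word, tag):
    return math.log(EMIT.get((word, tag), 1e-6))

def trans_logp(t2, t1, t):
    return math.log(0.8 if (t2, t1, t) in FAVOURED else 0.1)

result = beam_viterbi(["the", "dog", "barks"], TAGS, trans_logp, emit_logp)
```

Plain Viterbi would keep every (previous tag, current tag) state at each word. The beam trades a small risk of discarding the best path for less work per word, which matters more as the tag set grows.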
The MorphAdorner rule-based tagger is a modified version
of Mark Hepple's rule-based tagger. Hepple's tagger is a
variant of Eric Brill's tagger but disallows interaction
between rules. We expect the Hepple tagger to be used as
a secondary tagger to correct the output of the trigram tagger.
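One way to picture "no interaction between rules" is to match every correction rule against the initial tagging only, so that a change made by one rule can never trigger another. The sketch below illustrates that reading. The `Rule` format, the single previous-tag context, and the toy tags are assumptions made for the example and are not the rule format used by Hepple's tagger or by MorphAdorner.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    from_tag: str   # tag to be replaced
    to_tag: str     # replacement tag
    prev_tag: str   # required tag of the preceding word

def apply_rules(tags, rules):
    """Apply context rules against the initial tagging only.

    Every rule is matched against the original `tags`, never against the
    output of another rule, and each position changes at most once.
    """
    corrected = list(tags)
    for i in range(1, len(tags)):
        for rule in rules:
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                corrected[i] = rule.to_tag
                break   # first matching rule wins; no further changes here
    return corrected

rules = [Rule("VERB", "NOUN", "DET"), Rule("NOUN", "VERB", "NOUN")]
initial = ["DET", "VERB", "NOUN"]
corrected = apply_rules(initial, rules)
```

In this toy run the first rule rewrites the second tag to NOUN. The second rule does not then fire on the third tag, because its context is checked against the initial tagging, where the preceding tag is still VERB. A cascading Brill-style application would have let the first change feed the second rule.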
The MorphAdorner part of speech taggers assign tags to unknown words
using pattern recognition for items
such as numbers and Roman numerals, and suffix analysis with successive
abstraction when the pattern recognition methods fail. For example,
the suffix -ly in English often indicates that the
word is an adverb, while the suffix -ing often
indicates that the word is a present participle or gerund (an obvious counterexample is
"spring"). By looking at the statistical distribution of endings
and part of speech tags in the training data, along with the sequence
of previous parts of speech, MorphAdorner can often guess correctly
the part of speech for a word it doesn't know. When all the pattern
recognition methods fail, the word is assumed to be a noun.
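The fallback chain described above (patterns first, then suffix statistics, then a default of noun) can be sketched roughly as follows. This is a simplified illustration, not MorphAdorner's code: the regular expressions, the tag names, the weight `theta`, and the particular way shorter-suffix estimates are blended into longer ones are all assumptions, and the sketch ignores the preceding parts of speech that the text also mentions.

```python
import re
from collections import Counter, defaultdict

# Hypothetical patterns; upper-case Roman numerals only, to avoid matching words like "mix".
NUMBER = re.compile(r"^[+-]?\d+([.,]\d+)*$")
ROMAN = re.compile(r"^(?=[MDCLXVI])M*(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")

def suffix_table(tagged_words, max_len=3):
    """Count tags for every word-final suffix of length 1..max_len."""
    table = defaultdict(Counter)
    for word, tag in tagged_words:
        for n in range(1, min(max_len, len(word) - 1) + 1):
            table[word[-n:].lower()][tag] += 1
    return table

def guess_tag(word, table, max_len=3, theta=0.5):
    """Guess a tag for an unknown word: patterns, then suffixes, then noun."""
    if NUMBER.match(word) or ROMAN.match(word):
        return "NUM"
    # Start from the shortest suffix and blend in the estimate for each
    # longer (more specific) suffix that was seen in training.
    scores = Counter()
    for n in range(1, max_len + 1):
        counts = table.get(word[-n:].lower())
        if len(word) <= n or not counts:
            break
        total = sum(counts.values())
        for tag in set(scores) | set(counts):
            scores[tag] = (counts[tag] / total + theta * scores[tag]) / (1 + theta)
    if not scores:
        return "NOUN"   # default when nothing else applies
    return max(scores, key=scores.get)

training = [("quickly", "ADV"), ("slowly", "ADV"), ("family", "NOUN"),
            ("running", "VERB"), ("singing", "VERB"), ("spring", "NOUN")]
table = suffix_table(training)
guesses = {w: guess_tag(w, table) for w in ["boldly", "jumping", "1642", "XIV", "xyzzx"]}
```

Note that "family" and "spring" in the toy data pull some weight toward NOUN for -ly and -ing, which is why the guess rests on the distribution of tags per ending and not on a fixed suffix-to-tag rule.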
You can see a detailed list of the pattern recognition methods
MorphAdorner uses to assign parts of speech to unknown words.

Part of speech tagging of English texts from the Early Modern English
period to the present raises several problems. Most part of speech
tag sets for English were devised for use with modern texts. These
tag sets lack the necessary tags to represent English usage that was
either current at an earlier time or was archaic at its time of origin but
remained current in restricted discursive environments, such as religion or
poetry. The second person singular of pronouns and verb forms is the
clearest example. An -n form that marks a plural present
is much rarer but not uncommon as a deliberate archaism in Shakespeare's
time.

Modern taggers rely on 's or s'
to identify the possessive case. They also rely on sentence medial
capitalization to extract names. These procedures don't work once you move
back to the 18th century.
By default MorphAdorner uses the NUPOS part of speech tag set.
NUPOS differs from modern tag sets in
recognizing all morphological forms that are found in written English
from Chaucer to the present. Like the tag set used for the Brown corpus
but unlike the Penn Treebank or CLAWS tag sets, NUPOS does not
split off the possessive ending as a separate token, and it uses compound tags
for contracted forms.
Part of speech tags tend to be somewhat inconsistent compounds of syntactic
and morphological information. In NUPOS the components of each tag are kept
separate, and the grammatical description of each word can be easily
identified at a minimal level of granularity (~20 tags) or at a maximum
level (~230 tags).
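To make the idea of viewing the same tagging at different levels of granularity concrete, here is a small sketch in which each fine-grained tag carries a word class and a major word class. The tag names and groupings are hypothetical and invented for illustration; they are not actual NUPOS tags, and `TagInfo` and `describe` are not part of MorphAdorner.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TagInfo:
    tag: str                # full, fine-grained tag
    word_class: str         # intermediate grouping
    major_word_class: str   # coarsest grouping

# Hypothetical entries for illustration only; these are not actual NUPOS tags.
TAG_SET = {
    info.tag: info
    for info in [
        TagInfo("verb-pres-2sg", "lexical-verb", "verb"),
        TagInfo("verb-pres-3sg", "lexical-verb", "verb"),
        TagInfo("noun-sg-poss", "common-noun", "noun"),
        TagInfo("pron-2sg-subj", "personal-pronoun", "pronoun"),
    ]
}

def describe(tag, level="fine"):
    """Return the description of `tag` at the requested granularity."""
    info = TAG_SET[tag]
    views = {"fine": info.tag, "class": info.word_class, "coarse": info.major_word_class}
    return views[level]

tagged = [("thou", "pron-2sg-subj"), ("knowest", "verb-pres-2sg")]
coarse = [(word, describe(tag, "coarse")) for word, tag in tagged]
```

Because the coarse view is derived from the fine tag by lookup, a text only needs to be tagged once at the finest level.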
We have used the Penn Treebank tag set with MorphAdorner for a
project involving modern news and journal text. MorphAdorner
can use any arbitrary tag set given appropriate training data
and a proper definition of the word class and major word class of
each tag.

The trigram tagger assigns the part of speech tag correctly about 96% to 97%
of the time. The accuracy can be expected to improve as the training
data improves.
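An accuracy figure of this kind is normally the fraction of tokens whose assigned tag matches a hand-corrected reference tagging. The sketch below shows that calculation on made-up data; the function name and the toy tags are assumptions, and the numbers have nothing to do with MorphAdorner's reported results.

```python
def tagging_accuracy(predicted, gold):
    """Return the fraction of positions where the predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Toy data: four of the five tags agree.
gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
predicted = ["DET", "NOUN", "VERB", "DET", "VERB"]
accuracy = tagging_accuracy(predicted, gold)
```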
You can try MorphAdorner's
trigram part of speech tagger online. This
example only accepts plain text as input.