Northwestern University Information Technology
MorphAdorner Northwestern
Trigram Tagger Mathematical Background

Assume that the part of speech tag for a word depends only upon the previous one or two tags, and that the probability of this tag does not depend upon the probabilities of subsequent tags. How do we find the most probable sequence of tags corresponding to a particular sequence of words? We can look at the sequence of part of speech tags for words as an instance of a Hidden Markov Model HMM). Finding the most probable part of speech tag sequence amounts to finding the most probable sequence of states which the Hidden Markov Model traverses for a particular sequence of words.

The Trigram Tagger in MorphAdorner seeks the most likely part of speech tag for a word given information about the previous two words. More precisely, the tag sequence t1, t2, ..., tn -- corresponding to the word sequence w1, w2, ..., wn -- is sought which maximizes the following expression:

P(t1)P(t2|t1)PRODUCT(i=3 to n) P(ti|ti-2,ti-1)PRODUCT(i=1 to n) P(wi|ti)


P(t) Contextual (tag transition) probability
P(w|t) Lexical (word emission) probability

Each P(wi|ti) is estimated using the maximum likelihood estimator:

PMLE(ti|ti-2,ti-1) = C(ti-2,ti-1,ti) / C(ti-2,ti-1)

where C(x) is the observed single or joint frequency for the words or tags. To account for "holes" in the frequencies, where some possible combinations are not observed, we can compute smoothed probabilities which reduce the maximum likelihood estimates a little bit to allow a bit of the overall probability to be assigned to unobserved combinations. By default MorphAdorner uses a simple additive smoother which adds small positive values to the numerator and denominator of the probability estimate above. The numerator value is a small constant such as 0.05, while the denominator value is the numerator constant multiplied by the size of the word lexicon L.

Psmoothed(ti|ti-2,ti-1) = (C(ti-2,ti-1,ti) + 0.05) / (C(ti-2,ti-1) + (0.05 * L))

This works well when the training data size is large (as it is for MorphAdorner). For smaller training sets, deleted interpolation may prove more effective. MorphAdorner provides a deleted interpolation smoother as an option.

Each P(wi|ti) is estimated using the maximum likelihood estimator:

PMLE(wi|ti) = C(wi , ti) / C(ti)

Again, MorphAdorner smooths the maximum likelihood estimates using additive smoothing. This time 0.5 is used as the additive numerator value, and the denominator multiplier is K, the number of potential part of speech tags for the word wi. This is commonly called Lidstone smoothing.

Psmoothed(wi|ti) = (C(wi , ti) + 0.5) / (C(ti) + (0.5 * K))

It is possible but inefficient to calculate the probabilities using complete enumeration of all possible values of t1, t2, ..., tn. In the worst case the time complexity is nT where T is the number of part of speech tags in the tag set. MorphAdorner applies the Viterbi algorithm, a dynamic programming algorithm to determine the optimal subpaths for each state in the HMM model as the algorithm traverses the model. Subpaths other than the most probable are discarded. MorphAdorner also restricts the search path even further using a beam search which discards paths whose probabilities are too small compared to a specified tolerance value to contribute significantly to the joint probability.

Announcements and News
Download MorphAdorner
Helpful References
Tech Talk