NU
IT
Northwestern University Information Technology 
MorphAdorner V2.0  Site Map 
Assume that the part of speech tag for a word depends only upon the previous one or two tags, and that the probability of this tag does not depend upon the probabilities of subsequent tags. How do we find the most probable sequence of tags corresponding to a particular sequence of words? We can look at the sequence of part of speech tags for words as an instance of a Hidden Markov Model HMM). Finding the most probable part of speech tag sequence amounts to finding the most probable sequence of states which the Hidden Markov Model traverses for a particular sequence of words.
The Trigram Tagger in MorphAdorner seeks the most likely part of speech tag for a word given information about the previous two words. More precisely, the tag sequence t_{1}, t_{2}, ..., t_{n}  corresponding to the word sequence w_{1}, w_{2}, ..., w_{n}  is sought which maximizes the following expression:
P(t_{1})P(t_{2}t_{1})PRODUCT(i=3 to n) P(t_{i}t_{i2},t_{i1})PRODUCT(i=1 to n) P(w_{i}t_{i})
where
P(t) Contextual (tag transition) probability P(wt) Lexical (word emission) probability
Each P(w_{i}t_{i}) is estimated using the maximum likelihood estimator:
P_{MLE}(t_{i}t_{i2},t_{i1}) = C(t_{i2},t_{i1},t_{i}) / C(t_{i2},t_{i1})
where C(x) is the observed single or joint frequency for the words or tags. To account for "holes" in the frequencies, where some possible combinations are not observed, we can compute smoothed probabilities which reduce the maximum likelihood estimates a little bit to allow a bit of the overall probability to be assigned to unobserved combinations. By default MorphAdorner uses a simple additive smoother which adds small positive values to the numerator and denominator of the probability estimate above. The numerator value is a small constant such as 0.05, while the denominator value is the numerator constant multiplied by the size of the word lexicon L.
P_{smoothed}(t_{i}t_{i2},t_{i1}) = (C(t_{i2},t_{i1},t_{i}) + 0.05) / (C(t_{i2},t_{i1}) + (0.05 * L))
This works well when the training data size is large (as it is for MorphAdorner). For smaller training sets, deleted interpolation may prove more effective. MorphAdorner provides a deleted interpolation smoother as an option.
Each P(w_{i}t_{i}) is estimated using the maximum likelihood estimator:
P_{MLE}(w_{i}t_{i}) = C(w_{i} , t_{i}) / C(t_{i})
Again, MorphAdorner smooths the maximum likelihood estimates using additive smoothing. This time 0.5 is used as the additive numerator value, and the denominator multiplier is K, the number of potential part of speech tags for the word w_{i}. This is commonly called Lidstone smoothing.
P_{smoothed}(w_{i}t_{i}) = (C(w_{i} , t_{i}) + 0.5) / (C(t_{i}) + (0.5 * K))
It is possible but inefficient to calculate the probabilities using complete enumeration of all possible values of t_{1}, t_{2}, ..., t_{n}. In the worst case the time complexity is n^{T} where T is the number of part of speech tags in the tag set. MorphAdorner applies the Viterbi algorithm, a dynamic programming algorithm to determine the optimal subpaths for each state in the HMM model as the algorithm traverses the model. Subpaths other than the most probable are discarded. MorphAdorner also restricts the search path even further using a beam search which discards paths whose probabilities are too small compared to a specified tolerance value to contribute significantly to the joint probability.
Home  
Welcome  
Announcements and News  
Announcements and news about changes to MorphAdorner  
Documentation  
Documentation for using MorphAdorner  
Download MorphAdorner  
Downloading and installing the MorphAdorner client and server software  
Glossary  
Glossary of MorphAdorner terms  
Helpful References  
Natural language processing references  
Licenses  
Licenses for MorphAdorner and Associated Software  
Server  
Online examples of MorphAdorner Server facilities.  
Talks  
Slides from talks about MorphAdorner.  
Tech Talk  
Technical information for programmers using MorphAdorner 
Academic Technologies and Research Services,
NU Library 2East, 1970 Campus Drive Evanston, IL 60208. 
Contact Us.
