Poets that lasting marble seek,
Must carve in Latin or in Greek.
We write in sand, our language grows,
And like the tide, our work o'erflows.

-- Edmund Waller



Northwestern
MorphAdorner
    INFORMATION TECHNOLOGY  
    MorphAdorner Site Map  
MorphAdorner > Part of Speech Tagger > Trigram Tagger Mathematical Background
 
Home
 
Announcements and News
 
Download MorphAdorner
 
Documentation
 
Licenses
 
Glossary
 
Helpful References
 
Tech Talk
 

Language Recognizer
 
Lemmatizer
 
Lexicon Lookup
 
Name Recognizer
 
Parser
 
Part of Speech Tagger
 
Pluralizer
 
Sentence Splitter
 
Spelling Standardizer
 
Text Segmenter
 
Verb Conjugator
 
Word Tokenizer
 
  Trigram Tagger Mathematical Background
 
 

Assume that the part of speech tag for a word depends only upon the previous one or two tags, and that the probability of this tag does not depend upon the probabilities of subsequent tags. How do we find the most probable sequence of tags corresponding to a particular sequence of words? We can look at the sequence of part of speech tags for words as an instance of a Hidden Markov Model HMM). Finding the most probable part of speech tag sequence amounts to finding the most probable sequence of states which the Hidden Markov Model traverses for a particular sequence of words.

The Trigram Tagger in MorphAdorner seeks the most likely part of speech tag for a word given information about the previous two words. More precisely, the tag sequence t1, t2, ..., tn -- corresponding to the word sequence w1, w2, ..., wn -- is sought which maximizes the following expression:

P(t1)P(t2|t1)PRODUCT(i=3 to n) P(ti|ti-2,ti-1)PRODUCT(i=1 to n) P(wi|ti)

where

P(t) Contextual (tag transition) probability
P(w|t) Lexical (word emission) probability

Each P(wi|ti) is estimated using the maximum likelihood estimator:

PMLE(ti|ti-2,ti-1) = C(ti-2,ti-1,ti) / C(ti-2,ti-1)

where C(x) is the observed single or joint frequency for the words or tags. To account for "holes" in the frequencies, where some possible combinations are not observed, we can compute smoothed probabilities which reduce the maximum likelihood estimates a little bit to allow a bit of the overall probability to be assigned to unobserved combinations. By default MorphAdorner uses a simple additive smoother which adds small positive values to the numerator and denominator of the probability estimate above. The numerator value is a small constant such as 0.05, while the denominator value is the numerator constant multiplied by the size of the word lexicon L.

Psmoothed(ti|ti-2,ti-1) = (C(ti-2,ti-1,ti) + 0.05) / (C(ti-2,ti-1) + (0.05 * L))

This works well when the training data size is large (as it is for MorphAdorner). For smaller training sets, deleted interpolation may prove more effective. MorphAdorner provides a deleted interpolation smoother as an option.

Each P(wi|ti) is estimated using the maximum likelihood estimator:

PMLE(wi|ti) = C(wi , ti) / C(ti)

Again, MorphAdorner smooths the maximum likelihood estimates using additive smoothing. This time 0.5 is used as the additive numerator value, and the denominator multiplier is K, the number of potential part of speech tags for the word wi. This is commonly called Lidstone smoothing.

Psmoothed(wi|ti) = (C(wi , ti) + 0.5) / (C(ti) + (0.5 * K))

It is possible but inefficient to calculate the probabilities using complete enumeration of all possible values of t1, t2, ..., tn. In the worst case the time complexity is nT where T is the number of part of speech tags in the tag set. MorphAdorner applies the Viterbi algorithm, a dynamic programming algorithm to determine the optimal subpaths for each state in the HMM model as the algorithm traverses the model. Subpaths other than the most probable are discarded. MorphAdorner also restricts the search path even further using a beam search which discards paths whose probabilities are too small compared to a specified tolerance value to contribute significantly to the joint probability.

 

Information Technology | Academic Technologies | Scholarly Technologies 2East Resource Center |
Northwestern Home | Calendar: Plan-It Purple | Sites A-Z | Search
Academic Technologies  NU Library 2East  1970 Campus Drive  Evanston, IL 60208
E-mail: pib@northwestern.edu
Last updated Sun Mar 15 05:52:54 2009   World Wide Web Disclaimer and University Policy Statements   © 2007, 2008 Northwestern University