MorphAdorner: Trigram Tagger Mathematical Background

Trigram Tagger Mathematical Background

Assume that the part of speech tag for a word depends only upon the previous one or two tags, and that the probability of this tag does not depend upon the probabilities of subsequent tags. How do we find the most probable sequence of tags corresponding to a particular sequence of words? We can look at the sequence of part of speech tags for words as an instance of a Hidden Markov Model HMM). Finding the most probable part of speech tag sequence amounts to finding the most probable sequence of states which the Hidden Markov Model traverses for a particular sequence of words.

The Trigram Tagger in MorphAdorner seeks the most likely part of speech tag for a word given information about the previous two words. More precisely, the tag sequence t₁, t₂, ..., t_n -- corresponding to the word sequence w₁, w₂, ..., w_n -- is sought which maximizes the following expression:

P(t₁)P(t₂|t₁)PRODUCT(i=3 to n) P(t_i|t_i-2,t_i-1)PRODUCT(i=1 to n) P(w_i|t_i)

where

P(t) Contextual (tag transition) probability

P(w|t) Lexical (word emission) probability

Each P(w_i|t_i) is estimated using the maximum likelihood estimator:

P_MLE(t_i|t_i-2,t_i-1) = C(t_i-2,t_i-1,t_i) / C(t_i-2,t_i-1)

where C(x) is the observed single or joint frequency for the words or tags. To account for "holes" in the frequencies, where some possible combinations are not observed, we can compute smoothed probabilities which reduce the maximum likelihood estimates a little bit to allow a bit of the overall probability to be assigned to unobserved combinations. By default MorphAdorner uses a simple additive smoother which adds small positive values to the numerator and denominator of the probability estimate above. The numerator value is a small constant such as 0.05, while the denominator value is the numerator constant multiplied by the size of the word lexicon L.

P_smoothed(t_i|t_i-2,t_i-1) = (C(t_i-2,t_i-1,t_i) + 0.05) / (C(t_i-2,t_i-1) + (0.05 * L))

This works well when the training data size is large (as it is for MorphAdorner). For smaller training sets, deleted interpolation may prove more effective. MorphAdorner provides a deleted interpolation smoother as an option.

Each P(w_i|t_i) is estimated using the maximum likelihood estimator:

P_MLE(w_i|t_i) = C(w_i , t_i) / C(t_i)

Again, MorphAdorner smooths the maximum likelihood estimates using additive smoothing. This time 0.5 is used as the additive numerator value, and the denominator multiplier is K, the number of potential part of speech tags for the word w_i. This is commonly called Lidstone smoothing.

P_smoothed(w_i|t_i) = (C(w_i , t_i) + 0.5) / (C(t_i) + (0.5 * K))

It is possible but inefficient to calculate the probabilities using complete enumeration of all possible values of t₁, t₂, ..., t_n. In the worst case the time complexity is n^T where T is the number of part of speech tags in the tag set. MorphAdorner applies the Viterbi algorithm, a dynamic programming algorithm to determine the optimal subpaths for each state in the HMM model as the algorithm traverses the model. Subpaths other than the most probable are discarded. MorphAdorner also restricts the search path even further using a beam search which discards paths whose probabilities are too small compared to a specified tolerance value to contribute significantly to the joint probability.

	Home
	Welcome
	Announcements and News
	Announcements and news about changes to MorphAdorner
	Documentation
	Documentation for using MorphAdorner
	Download MorphAdorner
	Downloading and installing the MorphAdorner client and server software
	Glossary
	Glossary of MorphAdorner terms
	Helpful References
	Natural language processing references
	Licenses
	Licenses for MorphAdorner and Associated Software
	Server
	Online examples of MorphAdorner Server facilities.
	Talks
	Slides from talks about MorphAdorner.
	Tech Talk
	Technical information for programmers using MorphAdorner

P(t)	Contextual (tag transition) probability
P(w\|t)	Lexical (word emission) probability