MorphAdorner Northwestern
 
Generating Tag Transition Probabilities

NGramTaggerTrainer generates tag transition probabilities from part of speech training data and a word lexicon. It writes the result to a tag transition matrix file used by the MorphAdorner bigram and trigram taggers.

Usage:

ngramtaggertrainer trainingdata.tab wordlexicon.lex transitionmatrix.mat

where

  • trainingdata.tab -- input training data file.
  • wordlexicon.lex -- input MorphAdorner lexicon.
  • transitionmatrix.mat -- output tag transition matrix file.

The training data file is a tab-separated UTF-8 file containing the part of speech training data generated from the training texts. Only the first two columns of the training data are used:

  1. The original token (spelling).
  2. The NUPOS part of speech.
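
As a rough illustration of the two-column layout described above, the following Python sketch reads token and tag pairs from tab-separated lines and ignores any further columns. It is not MorphAdorner code: the function name, the sample rows, the third column, and the choice to skip lines with fewer than two columns are all assumptions made for the example.

```python
import io


def read_token_tag_pairs(lines):
    """Yield (token, tag) pairs from tab-separated training lines.

    Only the first two columns are read; later columns are ignored.
    Lines with fewer than two columns (such as blank lines) are skipped,
    which is an assumption of this sketch, not documented behavior.
    """
    for line in lines:
        columns = line.rstrip("\r\n").split("\t")
        if len(columns) < 2:
            continue
        yield columns[0], columns[1]


# Hypothetical sample data: token, part of speech tag, and one extra
# column that the sketch does not use.
sample = io.StringIO(
    "The\tdt\tthe\n"
    "king\tn1\tking\n"
    "\n"
    "sleeps\tvvz\tsleep\n"
)
pairs = list(read_token_tag_pairs(sample))
print(pairs)
```

To read a real file instead of the in-memory sample, the same function could be given a file opened with `open(path, encoding="utf-8")`.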

The word lexicon is a MorphAdorner format word lexicon.

The output tag transition file is a UTF-8 file containing the data needed by the MorphAdorner bigram and trigram taggers.
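
To give a sense of what tag transition probabilities are, the sketch below estimates them from a tag sequence by plain relative frequency: the count of each tag n-gram divided by the count of its preceding context. This is only a minimal illustration. The function name and sample tags are hypothetical, and the sketch does not reproduce how NGramTaggerTrainer uses the lexicon, handles sentence boundaries or smoothing, or lays out the transition matrix file, none of which are described here.

```python
from collections import Counter


def ngram_transition_probabilities(tags, n=2):
    """Estimate P(last tag | previous n-1 tags) by relative frequency.

    Returns a dict mapping each observed tag n-gram (a tuple) to its
    estimated probability. No smoothing is applied.
    """
    positions = range(len(tags) - n + 1)
    ngram_counts = Counter(tuple(tags[i:i + n]) for i in positions)
    # Count each context only where it is followed by another tag.
    context_counts = Counter(tuple(tags[i:i + n - 1]) for i in positions)
    return {
        ngram: count / context_counts[ngram[:-1]]
        for ngram, count in ngram_counts.items()
    }


# Hypothetical tag sequence, as might be taken from the second column
# of the training data.
tags = ["dt", "n1", "vvz", "dt", "j", "n1"]

bigram_probs = ngram_transition_probabilities(tags, n=2)
trigram_probs = ngram_transition_probabilities(tags, n=3)

for ngram, prob in sorted(bigram_probs.items()):
    print(ngram, prob)
```

In this sample, "dt" is followed once by "n1" and once by "j", so each of those two bigram transitions is estimated at 0.5. Every other observed transition occurs after a context seen only once and is estimated at 1.0.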
