Northwestern University Information Technology
MorphAdorner Northwestern
 
Comparing Tabular Files

TagDiff compares two tabular files containing spellings and part of speech tags.

Usage:

tagdiff input1.tab postagcol1 input2.tab postagcol2

where

  • input1.tab is a tab-separated input file with spellings in the first column and part of speech tags in the column given by postagcol1 (typically the second). Usually this is a reference (training) file in which the part of speech assignments are known to be correct.
  • postagcol1 is the column number (starting at 1) which contains the part of speech tags in the first file.
  • input2.tab is a tab-separated input file with spellings in the first column and part of speech tags in the column given by postagcol2 (typically the second). Usually this is a file produced by MorphAdorner or some other part of speech tagger.
  • postagcol2 is the column number (starting at 1) which contains the part of speech tags in the second file.

Blank lines are ignored in both files. Apart from blank lines, the two files must have the same number of lines and identical spellings, in the same order, in column one.
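The alignment requirement can be checked before running a comparison. The following Python sketch is not part of MorphAdorner; the function names `read_rows` and `check_alignment`, and the sample spellings and tags, are hypothetical and chosen only for illustration. It assumes tab-separated input, a 1-based tag column, and that lines containing only whitespace count as blank.

```python
import io


def read_rows(stream, tag_col):
    """Return (spelling, tag) pairs from a tab-separated stream.

    Blank lines are skipped. tag_col is 1-based, like postagcol1/postagcol2.
    """
    rows = []
    for line in stream:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # blank lines are ignored
        fields = line.split("\t")
        if len(fields) < tag_col:
            raise ValueError(f"line has no column {tag_col}: {line!r}")
        rows.append((fields[0], fields[tag_col - 1]))
    return rows


def check_alignment(rows1, rows2):
    """Raise ValueError unless both files list the same spellings in order."""
    if len(rows1) != len(rows2):
        raise ValueError(
            f"non-blank line counts differ: {len(rows1)} vs {len(rows2)}"
        )
    for number, ((spelling1, _), (spelling2, _)) in enumerate(
        zip(rows1, rows2), start=1
    ):
        if spelling1 != spelling2:
            raise ValueError(
                f"spelling mismatch at non-blank line {number}: "
                f"{spelling1!r} vs {spelling2!r}"
            )


# Illustrative data only; blank lines fall in different places in each file.
reference_text = "The\tdt\n\ndog\tn1\nbarks\tvvz\n"
tagged_text = "The\tdt\ndog\tn1\n\n\nbarks\tn2\n"

reference_rows = read_rows(io.StringIO(reference_text), 2)
tagged_rows = read_rows(io.StringIO(tagged_text), 2)
check_alignment(reference_rows, tagged_rows)  # passes: spellings line up
```

In this sketch the two inputs pass the check even though their blank lines differ, because only the non-blank lines are compared.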

TagDiff writes a report to standard output that tallies the number and types of differences between the part of speech assignments in the two files. If the first file is a reference file, the report shows how well the part of speech tagger reproduced the reference tagging. A good part of speech tagger for English normally gets at least 96% of the tags correct.
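To make the kind of tally described above concrete, here is a minimal Python sketch. It does not reproduce TagDiff's actual report format, which this page does not specify; the function name `tally_differences` and the sample tags are hypothetical. It assumes the two tag lists are already aligned, one entry per non-blank line.

```python
from collections import Counter


def tally_differences(reference_tags, tagger_tags):
    """Return (accuracy, confusions) for two aligned lists of tags.

    confusions maps (reference_tag, tagger_tag) to the number of times
    the tagger assigned tagger_tag where the reference has reference_tag.
    """
    if len(reference_tags) != len(tagger_tags):
        raise ValueError("tag lists must have the same length")
    confusions = Counter()
    matches = 0
    for reference_tag, tagger_tag in zip(reference_tags, tagger_tags):
        if reference_tag == tagger_tag:
            matches += 1
        else:
            confusions[(reference_tag, tagger_tag)] += 1
    total = len(reference_tags)
    accuracy = matches / total if total else 0.0
    return accuracy, confusions


# Illustrative tags only, not output from a real tagger.
reference_tags = ["dt", "n1", "vvz", "dt", "n1"]
tagger_tags = ["dt", "n1", "n2", "dt", "vvz"]

accuracy, confusions = tally_differences(reference_tags, tagger_tags)
print(f"tags in agreement: {accuracy:.1%}")
for (reference_tag, tagger_tag), count in confusions.most_common():
    print(f"reference {reference_tag} tagged as {tagger_tag}: {count}")
```

Listing the disagreements by tag pair, rather than reporting only the overall percentage, shows which tags the tagger confuses most often.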
