Northwestern University Information Technology
MorphAdorner Northwestern
 
Adding Character Offsets

AddCharacterOffsets creates derived MorphAdorner files that record the character offset of each word token.

Usage:

addcharacteroffsets adornedinput.xml adornedoutput.xml unadornedoutput.xml

where

adornedinput.xml      Standard MorphAdorner adorned output file.
adornedoutput.xml     Derived adorned file with character offsets added to the <w> tags.
unadornedoutput.xml   Derived unadorned file whose word offsets are given in the adornedoutput.xml file.

The derived adorned output file adornedoutput.xml adds a cof= attribute to each <w> tag. The cof= attribute gives the character (not byte) offset of that word in the unadornedoutput.xml file. The unadorned file removes the <w> and <c> tags of the adorned input file and writes out the word and whitespace text that those tags specify. (Note that cof= is not recognized by the TEI-Analytics scheme.)
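The difference between character and byte offsets matters as soon as the text contains non-ASCII characters. The following Python sketch is only an illustration and is not part of MorphAdorner; the helper name char_and_byte_offsets and the sample sentence are made up, and UTF-8 is assumed as the file encoding.

```python
def char_and_byte_offsets(text, word):
    """Return (character offset, UTF-8 byte offset) of the first occurrence of word.

    Hypothetical helper for illustration only; UTF-8 is an assumption.
    """
    char_offset = text.index(word)
    # Bytes needed to encode everything that precedes the word.
    byte_offset = len(text[:char_offset].encode("utf-8"))
    return char_offset, byte_offset


sample = "naïve café owner"
offsets = char_and_byte_offsets(sample, "owner")
print(offsets)  # (11, 13): the two accented letters take two bytes each in UTF-8
```

A reader that seeks by byte position would land two bytes short of the word here, which is why an offset counted in characters has to be applied to decoded text rather than to the raw bytes.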

The source code for AddCharacterOffsets is a useful example of how to process an adorned file using regular expressions instead of a full XML parser.
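The regular-expression approach can be sketched in a few lines of Python. This is not the MorphAdorner source, only a minimal illustration under stated assumptions: word tokens have the form <w ...>text</w>, whitespace tokens have the form <c>...</c>, neither is nested or self-closing, and offsets count raw characters of the unadorned output (entity references are not decoded). The function name add_character_offsets and the sample input are hypothetical.

```python
import re

# Assumed token shapes: <w ...attributes...>text</w> and <c>whitespace</c>.
TOKEN = re.compile(r"<w\b([^>]*)>(.*?)</w>|<c\b[^>]*>(.*?)</c>", re.DOTALL)


def add_character_offsets(adorned):
    """Return (adorned text with cof= attributes, unadorned text)."""
    adorned_out = []
    unadorned_out = []
    offset = 0  # number of characters written to the unadorned text so far
    last = 0    # end position of the previous token in the input
    for match in TOKEN.finditer(adorned):
        # Everything between tokens (other markup, line breaks) is copied to both outputs.
        between = adorned[last:match.start()]
        adorned_out.append(between)
        unadorned_out.append(between)
        offset += len(between)
        if match.group(1) is not None:
            # Word token: record where its text starts in the unadorned output.
            attrs, text = match.group(1), match.group(2)
            adorned_out.append(f'<w{attrs} cof="{offset}">{text}</w>')
        else:
            # Whitespace token: kept unchanged in the adorned output.
            text = match.group(3)
            adorned_out.append(match.group(0))
        unadorned_out.append(text)
        offset += len(text)
        last = match.end()
    tail = adorned[last:]
    adorned_out.append(tail)
    unadorned_out.append(tail)
    return "".join(adorned_out), "".join(unadorned_out)


# Made-up sample input in the assumed token format.
sample = '<p><w lem="the" pos="d">The</w><c> </c><w lem="cat" pos="n1">cat</w></p>'
with_offsets, unadorned = add_character_offsets(sample)
print(with_offsets)
print(unadorned)
```

For the sample above, the unadorned text is <p>The cat</p> and the two words receive cof="3" and cof="7", so slicing the unadorned text at those positions returns each word. A regular expression works here because the tokens are flat and regular; input with nested or self-closing token elements would need a real XML parser.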
