Northwestern University Information Technology
MorphAdorner Northwestern
 
Stripping Word Attributes

StripWordAttributes creates a derived MorphAdorner XML file with word elements stripped of attributes.

Usage:

stripwordattributes input.xml output.xml output.tab [/[no]id] [/[no]trim]

where

input.xml Input MorphAdorned XML file.
output.xml Derived adorned file with word element attributes stripped.
output.tab Tab delimited file of word element attribute values.
/id or /noid Optional parameter indicating whether xml:id should be left attached to each word (<w>) element. Default is /noid, which removes the xml:id attribute and value.
/trim or /notrim Optional parameter indicating whether whitespace should be trimmed from the start and end of each XML text line. Default is /notrim, which leaves the original whitespace intact.

The derived adorned output file output.xml has all attributes stripped from each <w> tag. When /id is specified, only the xml:id attribute is kept.
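The effect on the <w> elements can be illustrated with a minimal Python sketch. This is not the StripWordAttributes implementation; the function name strip_word_attributes, its keep_id parameter (standing in for /id), and the sample markup are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def strip_word_attributes(root, keep_id=False):
    """Clear the attributes of every <w> element and return the removed values.

    Hypothetical helper: keep_id=True mimics /id, keep_id=False mimics /noid.
    """
    rows = []
    for elem in root.iter():
        # Compare local names so that namespaced <w> elements match too.
        if elem.tag.rsplit("}", 1)[-1] != "w":
            continue
        rows.append(dict(elem.attrib))
        word_id = elem.attrib.get(XML_ID)
        elem.attrib.clear()
        if keep_id and word_id is not None:
            elem.set(XML_ID, word_id)
    return rows


# Made-up sample markup, not taken from a real adorned file.
sample = ET.fromstring(
    '<s><w xml:id="a-1" lem="the" pos="dt">The</w> '
    '<w xml:id="a-2" lem="cat" pos="n1">cat</w></s>'
)
rows = strip_word_attributes(sample, keep_id=True)
print(ET.tostring(sample, encoding="unicode"))
```

The returned rows play the role of the lines written to output.tab, one per <w> element in document order.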

The attribute values for each <w> element in the input.xml file are extracted and written to the tab-separated output.tab file. The order of the attribute lines matches the order of appearance of the <w> elements in the XML output file. When /id is specified, the xml:id value in each <w> element in output.xml can be matched with the corresponding xml:id value in output.tab.
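As a rough sketch of that matching, the Python below indexes the rows of output.tab by xml:id and looks up each <w> element from output.xml. It assumes the header line names the first column xml:id and that the file uses no quoting; the helper names load_attributes and words_with_attributes and the in-memory sample data are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ElementTree expands the xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def load_attributes(tab_file):
    """Index tab-file rows by their xml:id column (assumed header name)."""
    reader = csv.DictReader(tab_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["xml:id"]: row for row in reader}


def words_with_attributes(root, attributes):
    """Yield (word text, attribute row) for each <w> that carries an xml:id."""
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == "w" and XML_ID in elem.attrib:
            row = attributes.get(elem.attrib[XML_ID])  # None if no match
            yield "".join(elem.itertext()), row


# Made-up stand-ins for output.tab and output.xml (produced with /id).
sample_tab = io.StringIO("xml:id\teos\tlem\na-1\t0\tthe\na-2\t1\tcat\n")
sample_xml = ET.fromstring(
    '<s><w xml:id="a-1">The</w> <w xml:id="a-2">cat</w></s>'
)

attributes = load_attributes(sample_tab)
matched = list(words_with_attributes(sample_xml, attributes))
for text, row in matched:
    print(text, row["lem"])
```

For real files, pass open("output.tab", newline="", encoding="utf-8") to load_attributes and ET.parse("output.xml").getroot() to words_with_attributes; the UTF-8 encoding is likewise an assumption.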

The first line in output.tab contains the attribute names for each column. Each subsequent line in the output.tab file contains at least the following information corresponding to a single word "<w>" element. Some adorned files may add extra word attributes, resulting in more columns.

  1. xml:id -- the permanent word ID.
  2. eos -- the end of sentence flag (1 if the word ends a sentence, 0 otherwise).
  3. lem -- the lemma.
  4. ord -- the word ordinal within the text (starts at 1).
  5. part -- the word part flag. "N" for a word which is not split; "I" for the first part of a split word; "M" for the middle parts of a split word; and "F" for the final part of a split word.
  6. pos -- the part of speech.
  7. reg -- the standard spelling.
  8. spe -- the corrected original spelling.
  9. tok -- the original token.
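The columns above can be read by name with a standard tab-delimited parser. The Python sketch below groups the tok values into sentences using the eos flag. The function name sentences and the sample rows are invented for illustration, and the code assumes the header line uses exactly the column names listed above.

```python
import csv
import io


def sentences(tab_file):
    """Group tok values into sentences, closing one whenever eos is 1."""
    reader = csv.DictReader(tab_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    current = []
    for row in reader:
        current.append(row["tok"])
        if row["eos"] == "1":
            yield current
            current = []
    if current:  # trailing words that never reached an end-of-sentence flag
        yield current


# Made-up rows in the nine-column layout; the values are placeholders.
SAMPLE = (
    "xml:id\teos\tlem\tord\tpart\tpos\treg\tspe\ttok\n"
    "w1\t0\tthe\t1\tN\tdt\tThe\tThe\tThe\n"
    "w2\t1\tcat\t2\tN\tn1\tcat\tcat\tcat\n"
    "w3\t1\tyes\t3\tN\tuh\tYes\tYes\tYes\n"
)

result = list(sentences(io.StringIO(SAMPLE)))
print(result)
```

Because rows are looked up by column name rather than position, extra attribute columns added by some adorned files do not affect the result.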