Create derived MorphAdorner file with word elements stripped of attributes.

Package Description


java input.xml output.xml [/[no]id] [/[no]trim]
input.xml Input MorphAdorned XML file.
output.xml Derived adorned file with word element attributes stripped. The word element attribute values are written to a separate tab-delimited file.
/id or /noid Optional parameter indicating whether the xml:id attribute should be left attached to each word (<w>) element. Default is /noid, which removes the xml:id attribute and its value.
/trim or /notrim Optional parameter indicating whether whitespace should be trimmed from the start and end of each XML text line. Default is /notrim, which leaves the original whitespace intact.

The derived adorned output file "output.xml" has all attributes stripped from each <w> tag, except for xml:id when /id is specified.
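
The stripping step described above can be illustrated with a minimal sketch. This is not the tool's actual implementation: the class name StripSketch, the method strip, and the use of the standard DOM parser are all assumptions made for illustration, and the sketch works on an in-memory string rather than on files.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of MorphAdorner.
public class StripSketch {

    /**
     * Removes the attributes of every <w> element in the given XML text.
     * When keepId is true (the /id behaviour), xml:id is left in place.
     */
    public static String strip(String xml, boolean keepId) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            NodeList words = doc.getElementsByTagName("w");
            for (int i = 0; i < words.getLength(); i++) {
                Element w = (Element) words.item(i);
                NamedNodeMap attrs = w.getAttributes();
                // Walk backwards: the attribute map is live and shrinks on removal.
                for (int j = attrs.getLength() - 1; j >= 0; j--) {
                    String name = attrs.item(j).getNodeName();
                    if (keepId && name.equals("xml:id")) {
                        continue;
                    }
                    w.removeAttribute(name);
                }
            }

            // Serialize the modified tree back to text without an XML declaration.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not strip word attributes", e);
        }
    }
}
```

Calling StripSketch.strip(xml, false) corresponds to the default /noid behaviour, and StripSketch.strip(xml, true) to /id. A DOM tree is used here only for brevity; it may not be suitable for very large adorned files.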

The attribute values for each <w> element in the input.xml file are extracted and written to the tab-separated values file. The order of the attribute lines matches the order of appearance of the <w> elements in the XML output file. When /id is used, the xml:id value of each <w> element in output.xml can be matched with the corresponding xml:id value in the tab-separated values file.

The first line of the tab-separated values file contains the attribute names for each column. Each subsequent line contains at least the following information for a single word (<w>) element. Some adorned files may carry extra word attributes, resulting in more columns.

  1. xml:id -- the permanent word ID.
  2. eos -- the end of sentence flag (1 if the word ends a sentence, 0 otherwise).
  3. lem -- the lemma.
  4. ord -- the word ordinal within the text (starts at 1).
  5. part -- the word part flag: "N" for a word which is not split; "I" for the first part of a split word; "M" for the middle parts of a split word; and "F" for the final part of a split word.
  6. pos -- the part of speech.
  7. reg -- the standard spelling.
  8. spe -- the corrected original spelling.
  9. tok -- the original token.
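
As an illustration of how the tab-separated values file might be read back and matched to word IDs, here is a minimal sketch. The class name WordAttributeTable and the method parse are hypothetical, and the sketch assumes that the header line names the ID column literally "xml:id", as in the list above. Column names are taken from the header line, so any extra attribute columns are carried along.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of MorphAdorner.
public class WordAttributeTable {

    /**
     * Parses the lines of the tab-separated file. The first line supplies
     * the attribute names; each later line describes one <w> element.
     * Returns the rows in file order, keyed by the xml:id column
     * (assumed to be named "xml:id" in the header).
     */
    public static Map<String, Map<String, String>> parse(List<String> lines) {
        Map<String, Map<String, String>> byId = new LinkedHashMap<>();
        if (lines.isEmpty()) {
            return byId;
        }

        // A limit of -1 keeps trailing empty columns.
        String[] names = lines.get(0).split("\t", -1);

        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split("\t", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < names.length && i < values.length; i++) {
                row.put(names[i], values[i]);
            }
            byId.put(row.get("xml:id"), row);
        }
        return byId;
    }
}
```

With a table loaded this way, the attributes of a word kept in output.xml under /id could be looked up with, for example, table.get(id).get("lem"). The lines themselves could be obtained with java.nio.file.Files.readAllLines.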