|
StripWordAttributes
creates a derived MorphAdorner XML file with word elements
stripped of attributes.
Usage:
stripwordattributes input.xml output.xml output.tab [/[no]id] [/[no]trim]
where
| input.xml | Input MorphAdorned xml file. |
| output.xml | Derived adorned file with word element attributes stripped. |
| output.tab | Tab delimited file of word element attribute values. |
| /id or /noid | Optional parameter indicating whether xml:id should be left attached to each word (<w>) element. Default is /noid, which removes the xml:id attribute and value. |
| /trim or /notrim | Optional parameter indicating whether whitespace should be trimmed from the start and end of each XML text line. Default is /notrim, which leaves the original whitespace intact. |
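As a rough illustration of the usage line above, the following Python sketch assembles the argument list for one run. The helper name build_command is hypothetical, and it assumes a launcher called stripwordattributes is available on the PATH, as the usage line suggests; adjust the name to match your installation.

```python
def build_command(input_xml, output_xml, output_tab, keep_id=False, trim=False):
    """Return the argument list for one stripwordattributes run.

    Hypothetical helper: the launcher name is taken from the usage line
    and is assumed to be on the PATH.
    """
    return [
        "stripwordattributes",
        input_xml,
        output_xml,
        output_tab,
        "/id" if keep_id else "/noid",    # /noid is the documented default
        "/trim" if trim else "/notrim",   # /notrim is the documented default
    ]


cmd = build_command("input.xml", "output.xml", "output.tab", keep_id=True)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the launcher is installed.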
The derived adorned output file output.xml has all attributes
stripped from each <w> tag, except xml:id when /id is specified.
The attribute values for each <w> element in input.xml are
extracted and written to the tab-separated output.tab file. The
order of the attribute lines matches the order of appearance of
the <w> elements in the XML output file. When /id is specified,
the xml:id value in each <w> element in output.xml can be matched
with the corresponding xml:id value in output.tab.
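The matching by xml:id described above could be sketched in Python roughly as follows. This is a minimal illustration and not part of MorphAdorner: the sample XML and tab data are invented stand-ins for output.xml and output.tab, the function names are hypothetical, and the header names are assumed to match the attribute names listed in this document.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Qualified name ElementTree uses for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented miniature stand-ins for output.xml (run with /id) and output.tab.
sample_xml = '<text><s><w xml:id="w1">Loue</w> <w xml:id="w2">is</w></s></text>'
sample_tab = (
    "xml:id\teos\tlem\tord\tpart\tpos\treg\tspe\ttok\n"
    "w1\t0\tlove\t1\tN\tn1\tLove\tLoue\tLoue\n"
    "w2\t1\tbe\t2\tN\tvbz\tis\tis\tis\n"
)


def attributes_by_id(tab_text):
    """Index the tab-separated rows by their xml:id column."""
    reader = csv.DictReader(
        io.StringIO(tab_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return {row["xml:id"]: row for row in reader}


def words_with_attributes(xml_text, tab_text):
    """Yield (word text, attribute row) for each <w> element, in document order."""
    rows = attributes_by_id(tab_text)
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Compare local names so the sketch works with or without a namespace.
        if elem.tag.rsplit("}", 1)[-1] == "w":
            yield elem.text, rows.get(elem.get(XML_ID))


for text, attrs in words_with_attributes(sample_xml, sample_tab):
    print(text, attrs["lem"], attrs["pos"])
```

Quoting is disabled in the reader on the assumption that tokens may contain literal quotation marks. For real files, replace the sample strings with the contents of output.xml and output.tab.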
The first line in output.tab contains the attribute names for
each column. Each subsequent line in the output.tab file describes
a single word (<w>) element. Some adorned files may add extra
word attributes, resulting in more columns, but each line contains
at least the following information.
- xml:id -- the permanent word ID.
- eos -- the end of sentence flag (1 if the word ends a sentence, 0 otherwise).
- lem -- the lemma.
- ord -- the word ordinal within the text (starts at 1).
- part -- the word part flag: "N" for a word which is not split;
"I" for the first part of a split word; "M" for the middle
parts of a split word; and "F" for the final part of a split word.
- pos -- the part of speech.
- reg -- the standard spelling.
- spe -- the corrected original spelling.
- tok -- the original token.
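As one possible use of the columns listed above, the following Python sketch reads tab-separated data with a header line and groups the standard spellings (reg) into sentences using the eos flag. The sample rows are invented for illustration, the function name sentences is hypothetical, and the header names are assumed to match the attribute names in the list.

```python
import csv
import io

# Invented sample standing in for the contents of output.tab.
sample_tab = (
    "xml:id\teos\tlem\tord\tpart\tpos\treg\tspe\ttok\n"
    "w1\t0\tlove\t1\tN\tn1\tLove\tLoue\tLoue\n"
    "w2\t0\tbe\t2\tN\tvbz\tis\tis\tis\n"
    "w3\t1\tblind\t3\tN\tj\tblind\tblinde\tblinde\n"
    "w4\t0\tgo\t4\tN\tvvb\tGo\tGoe\tGoe\n"
)


def sentences(tab_text, column="reg"):
    """Yield lists of values from one column, one list per sentence."""
    reader = csv.DictReader(
        io.StringIO(tab_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    current = []
    for row in reader:
        current.append(row[column])
        if row["eos"] == "1":      # this word ends a sentence
            yield current
            current = []
    if current:                    # trailing words with no end-of-sentence flag
        yield current


for words in sentences(sample_tab):
    print(" ".join(words))
```

Passing column="tok" or column="lem" gives the original tokens or the lemmas instead. Files with extra attribute columns should work unchanged, since the rows are read by header name and not by position.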
|