edu.northwestern.at.morphadorner.tools.stripwordattributes (MorphAdorner)

Class Summary
Class	Description
StripWordAttributes	Create derived MorphAdorner file with word elements stripped of attributes.

Package edu.northwestern.at.morphadorner.tools.stripwordattributes Description

Create derived MorphAdorner file with word elements stripped of attributes.

Usage:

java edu.northwestern.at.morphadorner.tools.stripwordattributes.StripWordAttributes input.xml output.xml output.tab [/[no]id] [/[no]trim]

input.xml	Input XML file adorned by MorphAdorner.
output.xml	Derived adorned file with word element attributes stripped.
output.tab	Tab delimited file of word element attribute values.
/id or /noid	Optional parameter indicating whether the xml:id attribute should be left attached to each word (<w>) element. Default is /noid, which removes the xml:id attribute and its value.
/trim or /notrim	Optional parameter indicating whether whitespace should be trimmed from the start and end of each XML text line. Default is /notrim, which leaves the original whitespace intact.
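When the tool has to be run over many files, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of MorphAdorner: the class name StripCommand, the method build, and the classpath value "morphadorner.jar" are hypothetical and must be adapted to the local installation. Only the main class name, the three file arguments, and the /id, /noid, /trim, and /notrim switches come from the usage line above.

```java
import java.util.ArrayList;
import java.util.List;

class StripCommand {
    // Main class name taken from the usage line of this package description.
    static final String MAIN_CLASS =
        "edu.northwestern.at.morphadorner.tools.stripwordattributes.StripWordAttributes";

    /** Builds the argument list in the order the usage line prescribes. */
    static List<String> build(String classpath, String inputXml, String outputXml,
                              String outputTab, boolean keepIds, boolean trim) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(classpath);          // hypothetical location of the MorphAdorner classes
        cmd.add(MAIN_CLASS);
        cmd.add(inputXml);
        cmd.add(outputXml);
        cmd.add(outputTab);
        cmd.add(keepIds ? "/id" : "/noid");
        cmd.add(trim ? "/trim" : "/notrim");
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("morphadorner.jar", "input.xml",
                                 "output.xml", "output.tab", true, false);
        // Print the command instead of running it, so the sketch has no side effects.
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list can be handed to java.lang.ProcessBuilder to launch the tool. Both switches are always written out explicitly here, which is equivalent to relying on the defaults when keepIds and trim are false.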

The derived adorned output file output.xml has all attributes stripped from each <w> element, except that xml:id is retained when /id is specified.

The attribute values for each <w> element in input.xml are extracted and written to the tab-separated file output.tab. The order of the attribute lines matches the order in which the <w> elements appear in output.xml. When /id is used, the xml:id value of each <w> element in output.xml can be matched with the corresponding xml:id value in output.tab.
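The xml:id matching described above can be sketched as follows. This is an illustrative example only, not code from MorphAdorner: the class RejoinById and its methods are hypothetical, the sample XML and rows are invented, and the sketch assumes the tool was run with /id and that the word elements are unprefixed <w> elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RejoinById {
    /** Maps each xml:id in output.tab to its row, with values keyed by column name. */
    static Map<String, Map<String, String>> indexRows(List<String> tabLines) {
        String[] header = tabLines.get(0).split("\t", -1); // first line: column names
        Map<String, Map<String, String>> byId = new HashMap<>();
        for (String line : tabLines.subList(1, tabLines.size())) {
            if (line.isEmpty()) continue; // tolerate a trailing blank line
            String[] fields = line.split("\t", -1);
            Map<String, String> row = new HashMap<>();
            for (int i = 0; i < header.length && i < fields.length; i++) {
                row.put(header[i], fields[i]);
            }
            byId.put(row.get("xml:id"), row);
        }
        return byId;
    }

    /** Returns "text/lemma" for every <w> in the stripped XML, in document order. */
    static List<String> textWithLemma(String xml, List<String> tabLines) {
        Map<String, Map<String, String>> byId = indexRows(tabLines);
        NodeList words;
        try {
            // The default factory is not namespace-aware, so "xml:id" is read
            // as an ordinary attribute name.
            words = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getElementsByTagName("w");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.getLength(); i++) {
            Element w = (Element) words.item(i);
            Map<String, String> row = byId.get(w.getAttribute("xml:id"));
            String lemma = (row == null) ? "?" : row.get("lem"); // "?" marks an unmatched id
            out.add(w.getTextContent() + "/" + lemma);
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented sample data; a real output.tab has more columns.
        String xml = "<text><w xml:id=\"w1\">Dogs</w> <w xml:id=\"w2\">bark</w></text>";
        List<String> tab = List.of("xml:id\tlem\ttok", "w1\tdog\tDogs", "w2\tbark\tbark");
        System.out.println(textWithLemma(xml, tab));
    }
}
```

Keying the lookup on xml:id rather than on line position keeps the join valid even if either file is later filtered or reordered. Without /id the two files can only be aligned by order of appearance.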

The first line in output.tab contains the attribute name for each column. Each subsequent line describes a single word (<w>) element and contains at least the attributes listed below. Some adorned files carry extra word attributes, which result in additional columns.

xml:id -- the permanent word ID.
eos -- the end of sentence flag (1 if the word ends a sentence, 0 otherwise).
lem -- the lemma.
ord -- the word ordinal within the text (starts at 1).
part -- the word part flag. "N" for a word which is not split; "I" for the first part of a split word; "M" for the middle parts of a split word; and "F" for the final part of a split word.
pos -- the part of speech.
reg -- the standard spelling.
spe -- the corrected original spelling.
tok -- the original token.
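As an illustration of how the header line and the eos flag can be used together, the sketch below groups tok values into sentences. It is a hypothetical example rather than part of MorphAdorner: the class TabSentences is made up, the sample rows (including the IDs and the "x" placeholders in the pos column) are invented, and columns are located by name because this description does not guarantee a column order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TabSentences {
    /**
     * Groups the "tok" values of output.tab rows into sentences, closing a
     * sentence after every row whose "eos" column is "1".
     */
    static List<List<String>> sentences(List<String> tabLines) {
        // The first line names the columns; look them up instead of assuming positions.
        List<String> header = Arrays.asList(tabLines.get(0).split("\t", -1));
        int tokCol = header.indexOf("tok");
        int eosCol = header.indexOf("eos");
        if (tokCol < 0 || eosCol < 0) {
            throw new IllegalArgumentException("header lacks tok or eos column");
        }
        List<List<String>> result = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : tabLines.subList(1, tabLines.size())) {
            if (line.isEmpty()) continue; // tolerate a trailing blank line
            String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
            current.add(fields[tokCol]);
            if ("1".equals(fields[eosCol])) {
                result.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) result.add(current); // words after the last eos flag
        return result;
    }

    public static void main(String[] args) {
        // Invented sample rows for illustration only.
        List<String> sample = List.of(
            "xml:id\teos\tlem\tord\tpart\tpos\treg\tspe\ttok",
            "w1\t0\tit\t1\tN\tx\tIt\tIt\tIt",
            "w2\t0\train\t2\tN\tx\trains\trains\trains",
            "w3\t1\t.\t3\tN\tx\t.\t.\t.",
            "w4\t0\twe\t4\tN\tx\tWe\tWe\tWe",
            "w5\t0\tstay\t5\tN\tx\tstay\tstay\tstay");
        System.out.println(sentences(sample));
    }
}
```

To process a real file, the lines could be read with java.nio.file.Files.readAllLines, assuming the file is UTF-8 encoded. Extra columns added by some adorned files are ignored, since only the named columns are consulted.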