Northwestern University Information Technology
MorphAdorner Northwestern
Verticalizing An Adorned Text

XMLToTab converts MorphAdorner XML output to tab-separated tabular form.


xmltotab input.xml


  • input.xml is the input MorphAdorned XML file.
  • The second argument, given after input.xml, names the output tab-separated values file.

The attribute values for each <w> and <pc> element in the input XML file are extracted and written to a tab-separated values text file. Each output line contains the following fields, corresponding to a single <w> (word) or <pc> (punctuation) element.

  1. The work ID.
  2. The permanent word ID.
  3. The uncorrected original spelling.
  4. The uncorrected original spelling reversed.
  5. The standard spelling.
  6. The lemma.
  7. The part of speech.
  8. An XPath-like path to this word. The leading work ID and trailing word number are removed from the path.
  9. The end of sentence flag. 1 if this word ends a sentence, 0 otherwise.
  10. The previous word's original spelling.
  11. The next word's original spelling.
  12. The previous word's parts of speech.
  13. The next word's parts of speech.
  14. Up to 80 characters of text preceding the word in the text.
  15. Up to 80 characters of text following the word in the text.
  16. The word label. May be empty.
  17. The div type of the nearest ancestor div element. May be empty.
  18. The word ordinal.
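
As a rough illustration of how the fields listed above might be consumed, the following Python sketch reads the output into dictionaries. The column names in COLUMNS are hypothetical labels chosen for this example (XMLToTab does not define them), and the sketch assumes the file has no header row and that every line carries all 18 fields in the order listed. The sample line is invented, not real XMLToTab output.

```python
import csv
import io
from typing import Dict, Iterator, TextIO

# Hypothetical column names, one per field in the list above, in the same order.
COLUMNS = [
    "work_id", "word_id", "spelling", "reversed_spelling", "standard_spelling",
    "lemma", "pos", "path", "eos", "prev_spelling", "next_spelling",
    "prev_pos", "next_pos", "left_context", "right_context",
    "label", "div_type", "ordinal",
]


def read_rows(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per output line, keyed by the hypothetical names above."""
    # QUOTE_NONE keeps quotation marks in the context fields as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, fields in enumerate(reader, start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(COLUMNS)} fields, "
                f"found {len(fields)}"
            )
        yield dict(zip(COLUMNS, fields))


# One invented line with 18 tab-separated fields, for illustration only.
SAMPLE_LINE = "\t".join([
    "sample", "sample-000010", "loue", "euol", "love", "love", "n1",
    "/text/body/p[1]", "0", "my", "is", "po", "vbz",
    "text before the word", "text after the word", "", "", "10",
]) + "\n"

for row in read_rows(io.StringIO(SAMPLE_LINE)):
    print(row["spelling"], row["standard_spelling"], row["pos"])
```

To read a real file, pass an open file handle instead of the in-memory sample, for example one opened with `newline=""` and the file's actual encoding.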

This tabular representation of an adorned XML text is useful for data checking purposes. The morphological attribute values for each word <w> element appear as columns. The 80 characters (or so) of text on either side of the word allow you to focus on particular part-of-speech tags and pinpoint errors from the automatic adornment process. The tab-separated values may also be used to construct spreadsheets or databases of the individual word information.
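
One way to carry out this kind of check is to list every word carrying a given part-of-speech tag together with its surrounding text. The Python sketch below is a minimal example, not part of MorphAdorner. It assumes the field order listed earlier (so the zero-based positions are derived from that list), and the function name `concordance` is hypothetical.

```python
from typing import Iterable, List

# Zero-based positions derived from the numbered field list:
# 2 = word ID, 3 = original spelling, 7 = part of speech,
# 14 = preceding text, 15 = following text.
WORD_ID, SPELLING, POS, LEFT, RIGHT = 1, 2, 6, 13, 14
FIELD_COUNT = 18


def concordance(lines: Iterable[str], pos_tag: str) -> List[str]:
    """Return one keyword-in-context string per word tagged with pos_tag."""
    hits = []
    for line in lines:
        # Strip only the line ending so trailing empty fields are preserved.
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < FIELD_COUNT or fields[POS] != pos_tag:
            continue
        hits.append(
            f"{fields[WORD_ID]}\t{fields[LEFT]} [{fields[SPELLING]}] {fields[RIGHT]}"
        )
    return hits
```

Passing the lines of the output file and a tag of interest returns the matching words in context, which can then be scanned by eye for mis-tagged words. Lines with fewer than 18 fields are skipped silently in this sketch. A stricter check might report them instead.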
