Northwestern University Information Technology
MorphAdorner Northwestern
Converting a base adorned file to a simple TEI P5-like format

AdornedToSimpleTEIP5 converts a base-level MorphAdorner file to a simpler, more TEI P5-like format.


adornedtosimpleteip5 outputdirectory [usereg|usechoice] interpgrp.xml goodfiles.txt badfiles.txt adorned1.xml adorned2.xml ...


  • outputdirectory specifies the directory that receives the converted simple TEI P5 XML files.
  • usereg specifies that the standardized spelling should be emitted as a reg= attribute, while usechoice specifies that the standardized spelling should be emitted using a TEI <choice> structure.
  • interpgrp.xml specifies the name of a file containing a section of TEI XML which defines an interpGrp element for the part of speech tags. This can be an empty file, in which case no interpGrp is added to the output TEI XML files.
  • goodfiles.txt specifies the name of a file to receive the names of TEI XML files successfully converted to simple TEI P5 format.
  • badfiles.txt specifies the name of a file to receive the names of TEI XML files which could not be successfully converted to simple TEI P5 format.
  • adorned1.xml adorned2.xml ... specifies the input MorphAdorned XML files from which to produce simple TEI P5 versions.

AdornedToSimpleTEIP5 converts the base form of an adorned TEI file, which adds custom attributes to word <w> elements, to a simpler, more TEI P5-compatible format as follows.

  • The pos attribute, which specifies the part of speech, is changed to the P5 standard ana attribute. The part of speech is prefixed with a "#".
  • The lem attribute, which specifies the lemma (headword) for the word, is changed to the P5 standard lemma attribute.
  • The reg attribute, which specifies the modernized spelling, is handled as described below.
  • The other non-standard attributes ord, spe, tok, etc. are dropped.
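
The attribute changes listed above can be illustrated with a minimal Python sketch. This is not the AdornedToSimpleTEIP5 implementation. The function name simplify_word, the sample attribute values, and the choice to pass reg through unchanged are assumptions made for illustration, and the sample omits the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Attributes copied through unchanged in this sketch (assumption: reg is
# simply kept here; the usereg/usechoice handling is a separate step).
KEEP = {XML_ID, "reg"}


def simplify_word(w):
    """Rewrite the attributes of one <w> element in place (hypothetical helper)."""
    new_attrs = {}
    for name, value in w.attrib.items():
        if name == "pos":
            # Part of speech becomes ana, prefixed with "#".
            new_attrs["ana"] = "#" + value
        elif name == "lem":
            # Lemma (headword) becomes the lemma attribute.
            new_attrs["lemma"] = value
        elif name in KEEP:
            new_attrs[name] = value
        # Any other attribute (ord, spe, tok, ...) is dropped.
    w.attrib.clear()
    w.attrib.update(new_attrs)
    return w


# Illustrative input only; real adorned files carry more attributes.
sample = (
    '<w xml:id="someid1" pos="vmb" lem="will" reg="will" '
    'ord="1" spe="wylle" tok="wylle">wylle</w>'
)
word = simplify_word(ET.fromstring(sample))
print(ET.tostring(word, encoding="unicode"))
```

The sketch works on a single <w> element; applying it to a whole document would mean iterating over every <w> element and accounting for the TEI namespace.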

In standard TEI P5 you cannot store a standardized spelling in a reg attribute. One approach is to use a combination of <choice>, <orig>, and <reg> elements to make each <w> element carry its part of a double stream of original and standardized spellings, as in this adorned encoding of "wylle anone" from an early 16th century text:

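   <w xml:id="someid1" lemma="will" ana="#vmb"><choice><orig>wylle</orig><reg>will</reg></choice></w>
   <w xml:id="someid2" lemma="anon" ana="#av"><choice><orig>anone</orig><reg>anon</reg></choice></w>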

Alternatively, you can customize P5 and restore a reg attribute that lets you encode the same phenomena in a manner that programmers -- and in particular programmers with limited skills -- are likely to find more intuitive and economical:

   <w xml:id="someid1" lemma="will" reg="will" ana="#vmb">wylle</w>
   <w xml:id="someid2" lemma="anon" reg="anon" ana="#av">anone</w>

For many purposes using an attribute is preferable to a choice element because the attribute leaves the token sequence undisturbed, and the added attribute value can be stored in the standard MorphAdorner change log format.

AdornedToSimpleTEIP5 allows you to use either of these two approaches.

  • Select AdornedToSimpleTEIP5's usechoice option to store the standard spelling using a <choice> structure.
  • Select AdornedToSimpleTEIP5's usereg option to store the standard spelling using a reg attribute.

Important: Many other MorphAdorner utilities do not yet work properly with simplified adorned texts created using the <choice> structure.
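
The difference between the two options can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's actual code. The function apply_reg is hypothetical, the sample omits xml:id and the TEI namespace, and the sketch always builds a <choice> when a reg value is present, which may differ from what AdornedToSimpleTEIP5 does when the original and standardized spellings match.

```python
import xml.etree.ElementTree as ET


def apply_reg(w, mode):
    """Emit the standardized spelling of a <w> element per mode (hypothetical helper)."""
    if mode not in ("usereg", "usechoice"):
        raise ValueError("mode must be 'usereg' or 'usechoice'")
    reg = w.attrib.get("reg")
    if reg is None or mode == "usereg":
        # usereg: the standardized spelling stays in the reg attribute.
        return w
    # usechoice: move the spelling pair into <choice><orig/><reg/></choice>.
    del w.attrib["reg"]
    original = w.text or ""
    w.text = None
    choice = ET.SubElement(w, "choice")
    ET.SubElement(choice, "orig").text = original
    ET.SubElement(choice, "reg").text = reg
    return w


sample = '<w lemma="will" reg="will" ana="#vmb">wylle</w>'

as_reg = apply_reg(ET.fromstring(sample), "usereg")
as_choice = apply_reg(ET.fromstring(sample), "usechoice")

print(ET.tostring(as_reg, encoding="unicode"))
print(ET.tostring(as_choice, encoding="unicode"))
```

With usereg the text content of <w> remains the original token, so the token sequence is undisturbed. With usechoice the token text moves into nested elements, so any code that reads the text of <w> directly has to be adapted.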

Defining the parts of speech using an interpGrp element

Strictly speaking, a TEI interpGrp element should be added to each TEI XML output file to specify the definitions for the parts of speech used. The MorphAdorner release materials include a nuposinterpgrp.xml file in the release data/ directory which defines an interpGrp for the NUPos tag set. This file can be specified as the value of AdornedToSimpleTEIP5's interpgrp.xml parameter.
