XML Output

Introduction

MorphAdorner can add word-level morphological adornments to XML texts encoded in two common formats, the Text Encoding Initiative (TEI) format or the Text Creation Partnership (TCP) format. Other XML formats can be accommodated using customized input methods.

MorphAdorner adds XML tags to mark words, punctuation, and whitespace. All other XML tags which appear in the input file are passed through to the output unchanged except for minor reformatting.

TEI-Analytics

For the Monk project (2007-2009), all input texts were mapped to a common subset of TEI called TEI-Analytics, using the Abbott framework developed by Brian Pytlik Zillig and Steve Ramsey at the University of Nebraska. TEI-Analytics was jointly developed by Martin Mueller at Northwestern University and Brian Pytlik Zillig and Steve Ramsey at the University of Nebraska. TEI-Analytics is the default XML input format assumed by MorphAdorner. TEI-Analytics is a minor modification of the P5 TEI-Lite schema, with additional elements from the Linguistic Segment Categories to support morphosyntactic annotation and lemmatization.

TEI-Analytics has been revised over the past few years and is now, except for the word-level adornments, a proper subset of TEI P5.

XML Tag types: Hard, Soft, and Jump Tags

In order to adorn an XML formatted text properly, MorphAdorner determines the reading context of each word in the input text by constructing the reading sequence for the text. The reading context for a word depends upon the type of XML tag in which it appears as well as the text of its neighboring words.

A hard tag is an SGML, HTML, or XML tag which starts a new text segment but does not interrupt the reading sequence of a text. Examples of hard tags include <div> and <p>.

A jump tag is an SGML, HTML, or XML tag which interrupts the reading sequence of a text and starts a new text segment. An example of a jump tag is <note>. Jump tags initiate a new reading context. The previous reading sequence continues after the end of the jump tag.

A soft tag is an SGML, HTML, or XML tag which does not interrupt the reading sequence of a text and does not start a new text segment. Some soft tags provide textual decoration such as <hi> and <em>. Others indicate textual milestones such as <milestone> or formatting such as <lb>. Still others mark higher level text segments such as <rs>.
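The effect of the three tag classes on reading order can be illustrated with a short Python sketch. The tag sets below are assumptions taken from the examples in this section, not MorphAdorner's actual configuration, and reading_contexts is a hypothetical helper. It treats every tag that is not listed as a hard tag.

```python
import xml.etree.ElementTree as ET

# Assumed classification for illustration only; any tag not listed here
# is treated as a hard tag in this sketch.
JUMP_TAGS = {"note"}
SOFT_TAGS = {"hi", "em", "milestone", "lb", "rs"}


def reading_contexts(root):
    """Return a list of reading contexts, each a list of text segments.

    Context 0 is the main reading sequence. Each jump tag opens a new
    context, and the enclosing sequence resumes after the jump tag ends.
    """
    contexts = [[""]]  # the main context starts with one open segment

    def add_text(ctx, text):
        if text:
            contexts[ctx][-1] += text

    def new_segment(ctx):
        # Only close a segment that actually holds some text.
        if contexts[ctx][-1]:
            contexts[ctx].append("")

    def walk(elem, ctx):
        is_jump = elem.tag in JUMP_TAGS
        is_soft = elem.tag in SOFT_TAGS
        if is_jump:
            contexts.append([""])
            inner = len(contexts) - 1
        else:
            inner = ctx
            if not is_soft:
                new_segment(ctx)  # hard tag: start a new text segment
        add_text(inner, elem.text)
        for child in elem:
            walk(child, inner)
            add_text(inner, child.tail)  # tail text belongs to the parent
        if not is_jump and not is_soft:
            new_segment(ctx)  # hard tag: close its segment

    walk(root, 0)
    return [[seg for seg in ctx if seg.strip()] for ctx in contexts]
```

In this sketch the word "brown" split by a soft tag stays in one segment, while the text of a note is moved into its own context and the sentence around it continues uninterrupted.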

The <w>, <pc> and <c> tags

MorphAdorner uses the <w> tag to enclose the text of a word or symbol, the <pc> tag to enclose punctuation marks, and the <c> tag to enclose whitespace.

MorphAdorner v1 used <w> for both words and punctuation as the <pc> element was not officially adopted at the time MorphAdorner was originally developed.

The text enclosed by the <w></w> tags is the original token text, which may be a complete word token, or a token fragment when the token text is split by soft or jump tags. Split words are discussed below.

MorphAdorner normalizes the whitespace in input documents, mapping all multiple blanks, tabs, and end of line characters to single blanks. The normalized whitespace is output using the <c> tag. Each <c> </c> tag pair encloses a single whitespace character.

To prevent output lines from becoming too long, MorphAdorner emits each <w></w> tag and each <c></c> tag on a separate output line. Most other XML tags are also indented and emitted on separate lines. This "pretty-printing" implies that programs which process the MorphAdorner output should ignore end of line characters and use the contents of the <c></c> tags to perform basic text spacing.

One of the early decisions we made in the Monk project was that the adorned XML files should be more or less human readable, although in practice few people other than programmers are likely to spend much time looking at the texts. That means that each line of output should fit, as much as possible, in the width of a typical computer screen. "Pretty-printing" the XML in this way, with indentation to show structure, introduces a great deal of extra whitespace. It is unreasonable to expect each and every program and programmer to determine what whitespace is part of the "pretty-printing" and what is part of the text. That is why we mark the textual whitespace using <c>, which makes it unambiguous. Whitespace which is not enclosed in <c> tags can be ignored for purposes of textual analysis or display.
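As a minimal sketch of this rule, the Python function below rebuilds the running text of an adorned fragment from its <w>, <pc>, and <c> elements and ignores all other whitespace. The function names are hypothetical. It reads elements in document order, so it does not reorder jump tag content, and it strips any XML namespace prefix on the assumption that the file may or may not declare the TEI namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def plain_text(xml_string):
    """Rebuild text from <w>, <pc>, and <c>, ignoring layout whitespace."""
    root = ET.fromstring(xml_string)
    pieces = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name in ("w", "pc"):
            pieces.append(elem.text or "")
        elif name == "c":
            # Each <c> element stands for one normalized blank.
            pieces.append(" ")
    return "".join(pieces)
```

Line breaks and indentation between elements never reach the output because only the content of the three token-level elements is collected.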

<w> tag attributes

MorphAdorner defines the following attribute fields for each <w> tag.

xml:id Provides a unique id for the token or token fragment. This should be treated as an opaque value. See the section on word IDs below.
ord Specifies the ordinal of the token, beginning at 1 for the first token. The ordinal is consecutive across all XML tags. MorphAdorner assigns the same ordinal value to all parts of a token split by soft tags since these token fragments appear consecutively in the input file. Tokens split by jump tags may receive different ordinal values for non-consecutive fragments. Emitted by default in MorphAdorner v1; optional and not emitted by default in MorphAdorner v2.
eos Marks sentence ends. Emitted by default in MorphAdorner v1, and remains an option in MorphAdorner v2. A value of "1" indicates this token ends a sentence. A value of "0" indicates this token does not end a sentence. The eos value is most accurately set for ordinary text. Tokens within cells or other abbreviated entries may not be marked correctly. MorphAdorner v2 instead marks the end of a sentence by adding a unit="sentence" attribute to the last token in the sentence, which may be either a <w> or a <pc> element. In some cases an empty <pc unit="sentence"/> is used to mark the end of a sentence. Use of the unit="sentence" attribute value is more in line with TEI P5.
lem Provides the lemma head word form(s) of the token. For punctuation and symbols this is the same as the spelling. For words, this is the base form or head word (uninflected) form you would find in a dictionary. When a word contains more than one lemma, a vertical bar separates the lemma forms.
n Provides a location ID based upon a page image identifier and column within page. Optional; mostly used when adorning Text Creation Partnership texts after conversion from SGML to TEI XML format, to maintain the tie between the original digitized page images and the text transcription.
part Indicates which part of a split token this token text provides.
  • A value of "N" means the token text is unsplit.
  • A value of "I" means the token text is the first part of a split token.
  • A value of "M" means the token text is some part after the first but before the last.
  • A value of "F" means the token text is the last part of a split token.
pos The part of speech for the token. By default, MorphAdorner uses the NU POS part of speech tag set. For symbols and punctuation the part of speech is the same as the token. For words containing more than one part of speech (e.g., contractions), a vertical bar separates the part of speech tags.
reg A standardized, usually modern, version of the spelling. For obsolete words no longer in use, a representative standard form is chosen which is usually the Oxford English Dictionary headword form.
sn The sentence number, starting at 1 and running through the text. Cognizant of sentences split by jump tags. Optional, and not emitted by default.
spe The spelling. This value combines the fragments of a split word into the complete spelling. In most cases the spe value will match the tok value. However, some corpora use special metacharacters in the tokens which are not intended to be part of a word. For example, the TCP/EEBO texts use characters such as the "+" and "|" to mark various kinds of word breaks. The tok attribute value retains those metacharacters for archival completeness, but the spe value removes them.
tok The original token text. Includes all metacharacters in the original text. The tok value may be a fragment of the complete token when the token text is split by soft or jump tags.
wn The word number within a sentence, starting at 1. Cognizant of sentences split by jump tags. Optional, and not emitted by default.

Word IDs

MorphAdorner assigns a unique word ID to each word token in an adorned file using the xml:id= attribute. The principal role of word IDs is to provide a way for different programs to refer to the same words in adorned files. Without word IDs any individual program can still generate its own IDs if needed. However these IDs will differ in each program, rendering it difficult to determine when programs are referring to the same word.

The only property required of a word ID is that it be unique for each word.

MorphAdorner generates unique word IDs that start with the work identifier, taken from the file name of the work, followed by a hyphen, followed by another value which is unique within the work. MorphAdorner can generate two types of values for the within-work part of the ID: either a "reading context order" (the default) or "word within page block".

The "reading context order" appends integer values reflecting the reading context order defined by the classification of hard, soft, and jump tags. This is the default type.

The "word within page block" appends two integer values in the form pageblocknumber-wordinblock, where pageblocknumber is the ordinal of the current <pb> (page break) entry, and wordinblock is the number of the word within the page block (starting at 1 * spacing). When the text contains no page break elements, all words appear as part of block 0.

The spacing value provides the increment from one ID value to the next. 10 is the default spacing. Setting the spacing to a value of 10 or 20 (or larger) allows editing programs to interpolate corrections between existing words when the tokenization needs correction. This allows the word IDs to be more stable while the editing process continues. When the spacing is set to 1, adding or removing a word requires a complete resequencing in the case of reading context order IDs or a resequencing of an entire block in the case of word within page block IDs. The resequencing process is not something a human being will do, but is the province of a program such as an editing program, since not only the word IDs but also the word ordinals, sentence numbers, and word numbers within sentences will require updating.
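The role of the spacing value can be sketched in a few lines of Python. Both functions are hypothetical illustrations rather than MorphAdorner code. The six-digit zero padding is an assumption read off the sample IDs on this page, and starting the sequence at 1 * spacing is assumed to apply to reading context order IDs as it does to page block IDs.

```python
def word_ids(work_id, count, spacing=10, width=6):
    """Generate `count` sequential word IDs for one work."""
    return [f"{work_id}-{(i + 1) * spacing:0{width}d}" for i in range(count)]


def id_between(before, after, width=6):
    """Interpolate an ID for a token inserted between two existing IDs.

    Raises ValueError when no unused value remains, which is the case
    where a resequencing pass would be needed.
    """
    work_id, low = before.rsplit("-", 1)
    high = after.rsplit("-", 1)[1]
    low_n, high_n = int(low), int(high)
    if high_n - low_n < 2:
        raise ValueError("no free ID value; resequencing required")
    return f"{work_id}-{(low_n + high_n) // 2:0{width}d}"


ids = word_ids("A88624", 3)            # A88624-000010, -000020, -000030
inserted = id_between(ids[0], ids[1])  # A88624-000015
```

With a spacing of 1 the same call to id_between fails immediately, which is why a larger spacing keeps IDs stable during editing.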

The advantage of the "reading context order" type is that a program can extract just the word elements to get the relative position of words and sentences. By sorting words by the word ID it is simple to extract sentences and n-grams without having to worry about hard, jump, and soft tags. The disadvantage is that any change in the tag structure or the classification of tags invalidates the reading context order property (but the word IDs are still valid as unique values).

The advantage of the "word within page block" type is that it provides a basis for displaying a citation position for words. Of course any individual program can generate citations without reference to word IDs, but it may be helpful to have a consistent basis for generating citations. The disadvantage is that each individual program must fully parse the XML and understand the soft, hard, and jump tag structure in order to determine the reading context order so that sentences and n-grams can be extracted.

Numerous tokenization errors remain in many digitized texts. Some errors come from the original digitization. Others come from mistakes introduced by MorphAdorner. Once these tokenization errors have been corrected, the word IDs can be resequenced and citations can be stabilized.

Location IDs

In addition to the xml:id, MorphAdorner can generate a location ID as the 'n' attribute of <w> and <pc> elements. The purpose of this location ID is to facilitate alignment of the transcribed text with the page image, a key requirement for many forms of work with retro-digitized documents. The location ID is based on the page number of the digital scan, typically a double page. For example, it is referenced in the Text Creation Partnership SGML source texts as the value of the REF attribute in <PB> elements and appears as the value of the 'facs' attribute in the P5 version. Page numbers of the printed source appear in the PB elements as the value of N attributes, but not all printed pages have running page numbers. The location ID uses 'a' and 'b' to distinguish the parts of a double-paged scan.

More precisely, the location ID takes the form facs-column-wordinpage where facs comes from the attributes of the enclosing <pb> element, column is a letter starting with "a" and giving the column number on the printed page, and wordinpage is the ordinal of the word within the page starting at 1 multiplied by the spacing. Subsequent location ID values have a wordinpage value incremented by the given spacing value, which is 10 by default. Optionally the work ID (usually the base file name) can be prepended to the location ID.

Here is a typical example of a location ID.

  • 2-a-0050

This refers to the first column, fifth word in page image 2 for the current work.
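The arithmetic behind this example can be written out as a small Python sketch. The two functions are hypothetical helpers, the four-digit padding is an assumption based on the sample value above, and the parser makes no attempt to separate an optional work ID from the facs value because the two cannot be told apart from the string alone.

```python
def build_location_id(facs, column, word_ordinal,
                      spacing=10, width=4, work_id=None):
    """Build a facs-column-wordinpage location ID.

    `column` and `word_ordinal` are both 1-based.
    """
    column_letter = chr(ord("a") + column - 1)
    core = f"{facs}-{column_letter}-{word_ordinal * spacing:0{width}d}"
    return f"{work_id}-{core}" if work_id else core


def parse_location_id(location_id, spacing=10):
    """Split a location ID into page prefix, column, and word ordinal.

    The page prefix is returned untouched; it may include a work ID.
    """
    prefix, column_letter, counter = location_id.rsplit("-", 2)
    return {
        "page": prefix,
        "column": ord(column_letter) - ord("a") + 1,
        "word_ordinal": int(counter) // spacing,
    }
```

For the value shown above, parse_location_id("2-a-0050") yields page "2", column 1, and word ordinal 5 under the default spacing of 10.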

These can be long identifiers, but theoretically only the page-based counter needs to be recorded as an 'n' attribute. If page-based IDs are needed, they can be constructed on the fly or in a preprocessing step by concatenating the work ID, the attribute values of the <pb> element, and the page counter. It may also be practical to construct an xml:id for each page by concatenating the work ID with attribute values, as in <pb xml:id="A05137-025-051" facs="25" n="51" />.

Marking the end of a sentence

MorphAdorner v1 used the eos= attribute on the <w> tag to mark a token which ends a sentence. We considered using <milestone> tags to mark sentences, but these presented many problems when sentences span jump tags. The same was true of seg-like markers such as <s>.

MorphAdorner v2 uses the unit= attribute with a value of "sentence" to mark the end of a sentence. This aligns with standard TEI P5 usage.

Using a word-level value, either the eos attribute or the unit= attribute, to mark the end of a sentence makes it easy to generate sentence information regardless of how one orders the text when dealing with jump tags. For example, Prior (part of Monk) and WordHoard move jump tag content to the end of the work part. That enormously simplifies text display and operations such as collocate extraction. MorphAdorner, when requested to extract sentences, tries to leave the sentences in jump tags as close as possible to their original location in the text. The same word-level flag supports either approach (or other approaches).

Using sn= to add sentence numbers is another approach.
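A minimal Python sketch of sentence grouping from the word-level flag follows. It is an illustration under stated assumptions, not MorphAdorner's sentence extractor: it accepts either the v1 eos="1" flag or the v2 unit="sentence" flag, walks tokens in document order without relocating jump tag content, and closes a sentence only on an unsplit token or on the final fragment of a split token, since every fragment of a split token can carry the same eos value.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def ends_sentence(elem):
    """True when this token (or final token fragment) closes a sentence."""
    flagged = elem.get("eos") == "1" or elem.get("unit") == "sentence"
    return flagged and elem.get("part", "N") in ("N", "F")


def sentences(root):
    """Group the text of <w> and <pc> elements into sentences."""
    result, current = [], []
    for elem in root.iter():
        if local_name(elem.tag) not in ("w", "pc"):
            continue
        if elem.text:  # an empty <pc unit="sentence"/> adds no text
            current.append(elem.text)
        if ends_sentence(elem):
            result.append(current)
            current = []
    if current:
        result.append(current)  # trailing tokens with no end marker
    return result
```

Because the flag lives on the token itself, the same loop works whatever order the caller chooses to visit the tokens in.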

Abbreviated attribute output

By default MorphAdorner outputs the full set of <w> attributes. MorphAdorner can also output an abbreviated attribute set, in which only non-redundant attribute values appear in the <w> tag. This produces smaller output files with no loss of information, since the omitted attribute field values can be restored from those of the other attributes or the token text.

MorphAdorner uses the following algorithm to generate the abbreviated set of <w> tag attributes.

  1. Let the token-text be the text enclosed within the <w></w> tag pair.
  2. When tok has the same value as the token-text, omit the tok attribute.
  3. When spe has the same value as tok, omit the spe attribute.
  4. When reg has the same value as spe, omit the reg attribute.
  5. When pos has the same value as tok, omit the pos attribute.
  6. When lem has the same value as spe, omit the lem attribute.
  7. When eos has the value "0", omit the eos attribute.
  8. When part has the value "N", omit the part attribute.

The following algorithm can be used to reconstruct the full set of <w> attributes from the abbreviated set.

  1. When tok is missing, set its value to the text enclosed by the <w></w> tags.
  2. When spe is missing, set its value to the value of tok.
  3. When reg is missing, set its value to the value of spe.
  4. When pos is missing, set its value to the value of tok.
  5. When lem is missing, set its value to the value of spe.
  6. When eos is missing, set its value to "0" (zero).
  7. When part is missing, set its value to "N".
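Both algorithms translate directly into Python. The sketch below uses plain dictionaries for the attribute sets, and the function names are hypothetical. The abbreviating function assumes it is given the full attribute set, so tok and spe must be present.

```python
def expand_attributes(attrs, token_text):
    """Restore the full <w> attribute set from an abbreviated one."""
    full = dict(attrs)
    # The order matters: later defaults depend on earlier ones.
    full.setdefault("tok", token_text)
    full.setdefault("spe", full["tok"])
    full.setdefault("reg", full["spe"])
    full.setdefault("pos", full["tok"])
    full.setdefault("lem", full["spe"])
    full.setdefault("eos", "0")
    full.setdefault("part", "N")
    return full


def abbreviate_attributes(full, token_text):
    """Drop every attribute whose value can be restored from the others."""
    redundant = {
        "tok": token_text,
        "spe": full["tok"],
        "reg": full["spe"],
        "pos": full["tok"],
        "lem": full["spe"],
        "eos": "0",
        "part": "N",
    }
    # Attributes with no rule (such as xml:id) are always kept.
    return {k: v for k, v in full.items() if redundant.get(k) != v}
```

Applying abbreviate_attributes and then expand_attributes with the same token text returns the original full set, which is the sense in which no information is lost.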

The attribute values for xml:id and ord are always present in either abbreviated or verbose output files.

Split tokens

Individual tokens in XML texts may be split by soft tags, and occasionally by jump tags. MorphAdorner assembles the fragments of a split token into a complete token and sets the tok and spe attributes of the <w> tag for the token fragment to contain the complete token.

The xml:id field for a split word adds "dot partnumber" to the end of the <w> tag's xml:id value. The xml:id can still be treated as an opaque object, but the part number can be extracted from the end if desired. In many cases the part number is not needed, and the value of the part attribute of the <w> tag suffices.

  • part="N" means the token is unsplit (complete).
  • part="I" means the token is the first part of a split token.
  • part="M" means the token is some part after the first but before the last.
  • part="F" means the token is the last part of a split token.

Here is an example of a split word from Austen's Lady Susan (ancf0207.xml). The original XML text is:

<p rend="align(r)">Edward S<hi rend="sup(1)">t</hi>.</p>

The "St." token is split into three pieces by soft tags. The corresponding adorned text is:

<p rend="align(r)">
  <w eos="0" lem="Edward" pos="np1" reg="Edward"
     spe="Edward" tok="Edward" xml:id="ancf0207-050740" part="N"
    >Edward</w>
  <c> </c>
  <w eos="1" lem="saint" pos="n1" reg="St." spe="St." tok="St."
     xml:id="ancf0207-050750.1" part="I">S</w>
  <hi rend="sup(1)">
    <w eos="1" lem="saint" pos="n1" reg="St." spe="St." tok="St."
       xml:id="ancf0207-050750.2" part="M">t</w>
  </hi>
  <w eos="1" lem="saint" pos="n1" reg="St." spe="St." tok="St."
     xml:id="ancf0207-050750.3" part="F">.</w>
</p>

When an ord attribute appears, its value is the same for all three fragments of "St.". This holds for any word split solely by soft tags. The optional ord attribute values will not be the same for words split by jump tags, as the individual word fragments can be separated by hundreds or even thousands of other words.
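A program that wants the fragments of a split token back together can group them by the xml:id value with the part number removed. The Python sketch below is a hypothetical illustration that assumes fragments appear in part order within the file, which holds for tokens split by soft tags. It strips the ".partnumber" suffix only when the part attribute marks the element as a fragment, so a work identifier that itself contains a dot is left alone.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def join_split_tokens(root):
    """Map each base word ID to the text of its joined fragments."""
    tokens = {}
    for elem in root.iter():
        if local_name(elem.tag) != "w":
            continue
        word_id = elem.get(XML_ID, "")
        if elem.get("part", "N") != "N":
            # Fragment IDs end in ".partnumber"; drop that suffix.
            word_id = word_id.rsplit(".", 1)[0]
        tokens.setdefault(word_id, []).append(elem.text or "")
    return {word_id: "".join(parts) for word_id, parts in tokens.items()}
```

Run on the Lady Susan example above, this gives "Edward" for ancf0207-050740 and "St." for ancf0207-050750.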

Simplified TEI P5-like output

MorphAdorner v2 provides the AdornedToSimpleTEIP5 utility which converts the non-standard word attribute values of adorned files to a simpler and more nearly standard TEI P5 format.

The simplified format emits the lemmata, the parts of speech, and the standard spelling for each token. The attribute names are changed to be compatible with TEI P5: lem is renamed to lemma, and pos is mapped to ana with a "#" prepended to the part of speech value. The non-standard reg attribute can be retained or changed to a standard TEI P5 choice structure. The corrected spelling (spe), original token (tok), and word ordinal (ord), if any, are removed.

Here is a sample snippet showing the new adorned file format.

<w lemma="in" ana="#p-acp" reg="in" xml:id="A88624-000740">in</w>
<c> </c>
<w lemma="love" ana="#n1" reg="love" xml:id="A88624-000750">love</w>
<c> </c>
<w lemma="with" ana="#p-acp" reg="with" xml:id="A88624-000760">with</w>
<c> </c>
<w lemma="Ismenia" ana="#np1" reg="Ismenia" xml:id="A88624-000770">Ismenia</w>
<pc unit="sentence" xml:id="A88624-000780">.</pc>
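The attribute renaming described above can be sketched as a small Python function over a dictionary of attributes. This is a hypothetical illustration of the mapping only, not the AdornedToSimpleTEIP5 utility itself. It retains reg as is, passes any other attribute through unchanged, and prepends a single "#" to the whole pos value, which is an assumption for compound parts of speech.

```python
def simplify_attributes(attrs):
    """Map adorned <w> attributes to the simplified TEI P5-like names."""
    simple = {}
    for name, value in attrs.items():
        if name in ("spe", "tok", "ord"):
            continue  # removed in the simplified format
        if name == "lem":
            simple["lemma"] = value
        elif name == "pos":
            simple["ana"] = "#" + value
        else:
            simple[name] = value  # reg, xml:id, and others pass through
    return simple
```

Given the full attributes of the word "love" from the snippet above, the result contains only lemma, ana, reg, and xml:id.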

Named Entities

MorphAdorner contains an experimental procedure which extends the Gate v4.0 facility for adding named entity tags to input texts. Each named entity is enclosed by <rs type="named entity type" ></rs> tags. The type= attribute value specifies the type of the named entity, which may be one of the following.

type="date" A date reference (e.g., March 12).
type="location" A geographical location (e.g., England).
type="money" An amount of money (e.g., 1 shilling).
type="organization" An organization name (e.g., Bank of England).
type="person" A person's name (e.g., Emma Woodhouse).
type="time" A time reference (e.g., 12 midnight).
type="literary" A literary reference (e.g., Ivanhoe).
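Programs that consume adorned files can collect the tagged entities by walking the <rs> elements. The Python sketch below is a hypothetical reader, not part of MorphAdorner or Gate. It assumes the entity text inside each <rs> element has already been adorned with <w>, <pc>, and <c> elements.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def named_entities(root):
    """Return (type, text) pairs for every <rs> element, in document order."""
    found = []
    for rs in root.iter():
        if local_name(rs.tag) != "rs":
            continue
        parts = []
        for elem in rs.iter():
            name = local_name(elem.tag)
            if name in ("w", "pc"):
                parts.append(elem.text or "")
            elif name == "c":
                parts.append(" ")  # textual whitespace is marked by <c>
        found.append((rs.get("type"), "".join(parts)))
    return found
```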