Introduction
MorphAdorner can add word-level morphological adornments
to XML texts encoded in two common formats, the Text Encoding
Initiative (TEI) format or the Text Creation Partnership (TCP) format.
Other XML formats can be accommodated using customized input methods.
MorphAdorner adds XML tags to mark words, punctuation, and whitespace.
All other XML tags which appear in the input file are passed through to the
output unchanged except for minor reformatting.
TEI Analytics
For the Monk project, all input texts were mapped to a common
subset of TEI called TEI Analytics, using the Abbot
framework developed by Brian Pytlik Zillig and
Steve Ramsay at the University of Nebraska. TEI Analytics was
jointly developed by Martin Mueller at Northwestern University
and Brian Pytlik Zillig and Steve Ramsay at the University of Nebraska.
TEI Analytics is the default XML input format assumed by MorphAdorner.
TEI Analytics is a minor modification of the P5 TEI-Lite schema,
with additional elements from the Linguistic Segment Categories
to support morphosyntactic annotation and lemmatization.
XML Tag types: Hard, Soft, and Jump Tags
In order to adorn an XML formatted text properly, MorphAdorner determines the reading context of each word in the input text by constructing the reading sequence for the text. The reading context for a word depends upon the type of XML tag in which it appears as well as the text of its neighboring words.
A hard tag is an SGML, HTML, or XML tag which starts a new
text segment but does not interrupt the reading sequence of a text.
Examples of hard tags include <div> and <p>.
A jump tag is an SGML, HTML, or XML tag which interrupts the reading
sequence of a text and starts a new text segment. An example of a jump tag
is <note>. Jump tags initiate a new reading context.
The previous reading sequence continues after the end of the jump tag.
A soft tag is an SGML, HTML, or XML tag which does not interrupt the
reading sequence of a text and does not start a new text segment. Some
soft tags provide textual decoration such as <hi> and <em>. Others
indicate textual milestones such as <milestone> or formatting such as <lb>.
Still others mark higher level text segments such as <rs>.
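The effect of the three tag classes on the reading sequence can be illustrated with a minimal Python sketch. The tag sets contain only the examples named above, the function names are hypothetical, and treating unlisted tags as soft is an assumption of the sketch; it does not reproduce MorphAdorner's own implementation.

```python
import xml.etree.ElementTree as ET

# Tag classes taken only from the examples named in this section.
HARD_TAGS = {"div", "p"}
JUMP_TAGS = {"note"}

def classify(tag):
    """Return 'hard', 'jump', or 'soft' for a tag name."""
    if tag in JUMP_TAGS:
        return "jump"
    if tag in HARD_TAGS:
        return "hard"
    # Assumption of this sketch: unlisted tags (hi, em, lb, ...) are soft.
    return "soft"

def reading_sequences(root):
    """Return (main, jumps): the text chunks of the main reading sequence
    and a list of separate sequences, one per jump tag."""
    main, jumps = [], []

    def walk(elem, current):
        if classify(elem.tag) == "jump":
            # A jump tag initiates its own reading context.
            target = []
            jumps.append(target)
        else:
            target = current
        if elem.text and elem.text.strip():
            target.append(elem.text.strip())
        for child in elem:
            walk(child, target)
            # Tail text belongs to the enclosing context, which resumes
            # once the child (possibly a jump tag) has ended.
            if child.tail and child.tail.strip():
                target.append(child.tail.strip())

    walk(root, main)
    return main, jumps

sample = "<div><p>The quick <note>See <hi>Aesop</hi>.</note> brown fox.</p></div>"
main, jumps = reading_sequences(ET.fromstring(sample))
print(main)   # ['The quick', 'brown fox.']
print(jumps)  # [['See', 'Aesop', '.']]
```

The note content is collected separately, so "The quick" and "brown fox." remain neighbors in the main sequence, while the soft <hi> tag leaves the sequence inside the note unbroken.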
The <w> and <c> tags
MorphAdorner uses the <w> tag to enclose the text of a word,
a symbol, or a punctuation mark, and the <c> tag to enclose
whitespace.
This may strike some TEI aficionados as an odd idea. However,
treating punctuation and words the same way simplifies
processing. Promoting the punctuation "meta-data" added by authors
(or editors) to the same level as the words allows a consistent
treatment of token transition probabilities for adornment processes such as
part of speech tagging.
One alternative might be to drop <w> and use
specialized <seg> tags, with types indicating the nature
of the enclosed token. But in a sense that is what <w>
already does. We could use a sequence such as
<w><c>punc</c></w>, as some others do, but
that seems unnecessarily redundant. We could use <c> with
an ID, which requires more complicated programming to ensure the
sequence of derived words/tokens is correct, since we could no
longer pick up non-whitespace tokens by just looking for
<w> elements. We would need to extract punctuation from
<c> elements as well, and make sure these appear in the
right sequence with sibling <w> elements. Whitespace
<c> elements would be those without an ID. On balance we
believe it is helpful to reserve the <c> element only for
whitespace so that the tokens and whitespace are clearly
separated by using different tags.
The text enclosed by the <w></w> tags is the original token text,
which may be a complete word token, or a token fragment when the
token text is split by soft or jump tags. Split words are discussed below.
MorphAdorner normalizes the whitespace in input documents, mapping
all multiple blanks, tabs, and end of line characters to single blanks.
The normalized whitespace is output using the <c> tag. Each <c> </c>
tag pair encloses a single whitespace character.
To prevent output lines from becoming too long, MorphAdorner emits
each <w></w> tag and each <c></c> tag on a separate output line.
Most other XML tags are also indented and emitted on separate lines. This
"pretty-printing" implies that programs which process the
MorphAdorner output should ignore end of line characters and use the
contents of the <c></c> tags to perform basic text spacing.
One of the early decisions we made in the Monk project was that the adorned XML files should be more or less human readable, although in practice few people other than programmers are likely to spend much time looking at the texts. That means that each line of output should fit, as far as possible, within the width of a typical computer screen. "Pretty-printing" the XML in this way, with indentation to show structure, introduces a great deal of extra whitespace. It is unreasonable to expect every program and programmer to determine which whitespace is part of the "pretty-printing" and which is part of the text. That is why we mark the textual whitespace unambiguously using <c>. Whitespace which is not enclosed in <c> tags can be ignored for purposes of textual analysis or display.
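As a minimal sketch of what this means for a consuming program, the following Python fragment rebuilds the running text from the <w> and <c> elements alone and ignores all pretty-printing whitespace. The function name is hypothetical, and the sample is a simplified fragment with the <w> attributes left out.

```python
import xml.etree.ElementTree as ET

# Pretty-printed fragment; indentation and line ends are not part of the text.
adorned = """
<p>
  <w>Edward</w>
  <c> </c>
  <w>S</w>
  <hi>
    <w>t</w>
  </hi>
  <w>.</w>
</p>
"""

def plain_text(root):
    """Concatenate <w> and <c> contents in document order.

    Whitespace outside these elements (text and tail of other nodes)
    is never consulted, so the pretty-printing has no effect.
    """
    return "".join(e.text or "" for e in root.iter() if e.tag in ("w", "c"))

print(plain_text(ET.fromstring(adorned.strip())))  # Edward St.
```

The sketch follows document order, so the content of jump tags appears wherever it occurs in the file; a program that needs the reading sequence must handle jump tags separately.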
<w> tag attributes
MorphAdorner defines the following attribute fields for each <w> tag.
| xml:id |
Provides a unique id for the token or token fragment. This
should be treated as an opaque value. See the section on word IDs below.
|
| ord |
Specifies the ordinal of the token, beginning at 1 for the first
token. The ordinal is consecutive across all XML tags.
MorphAdorner assigns the same ordinal value to all parts of a token
split by soft tags since these token fragments appear consecutively
in the input file. Tokens split by jump tags may receive different
ordinal values for non-consecutive fragments. |
| eos |
A value of "1" indicates this token ends a sentence.
A value of "0" indicates this token does not end a sentence.
The eos value is most accurately set for ordinary text. Tokens
within cells or other abbreviated entries may not be marked
correctly. See below for an explanation of why we mark end of sentences this way.
|
| lem |
Provides the lemma head word form(s) of the token. For punctuation and
symbols this is the same as the spelling. For words, this is the base
form or head word (uninflected) form you would find in a dictionary.
When a word contains more than one lemma, a vertical bar
separates the lemma forms.
|
| part |
Indicates which part of a split token this token text provides.
- A value of "N" means the token text is unsplit.
- A value of "I" means the token text is the first part of a split token.
- A value of "M" means the token text is some part after the first but
before the last.
- A value of "F" means the token text is the last part of a split token.
|
| pos |
The part of speech for the token. By default, MorphAdorner
uses the NUPOS part of speech tag set. For symbols and punctuation
the part of speech is the same as the token. For words containing
more than one part of speech (e.g., contractions), a vertical bar
separates the part of speech tags.
|
| reg |
A standardized, usually modern, version of the spelling.
For obsolete words, a representative standard form
is chosen which is usually the Oxford English Dictionary headword form. |
| sn |
The sentence number, starting at 1 and running through the text. The numbering takes account of sentences split by jump tags. Optional, and not used in the Monk project. |
| spe |
The spelling. This value combines the fragments
of a split word into the complete spelling. In most cases the
spe value will match the tok value. However, some
corpora use special metacharacters in the tokens which are
not intended to be part of a word. For example, the TCP/EEBO
texts use characters such as the "+" and "|" to mark various
kinds of word breaks. The tok attribute value retains those
metacharacters for archival completeness, but the spe value
removes them. |
| tok |
The original token text. Includes all metacharacters
in the original text. The tok value may be a fragment of
the complete token when the token text is split by soft or jump tags. |
| wn |
The word number within a sentence, starting at 1. The numbering takes account of sentences split by jump tags. Optional, and not used in the Monk project. |
Word IDs
MorphAdorner assigns a unique word ID to each word token in an adorned file using the xml:id= attribute. The principal role of word IDs is to provide a way for different programs to refer to the same words in adorned files. Without word IDs, any individual program can still generate its own IDs if needed. However, these IDs will differ from program to program, making it difficult to determine when programs are referring to the same word.
The only property required of a word ID is that it be unique for each word.
MorphAdorner generates unique word IDs that start with the work identifier, taken from the file name of the work, followed by a hyphen, followed by another value which is unique within the work. MorphAdorner can generate two types of values for the within-work part of the ID: either a "reading context order" (the default) or "word within page block".
The "reading context order" appends integer values reflecting the reading context order defined by the classification of hard, soft, and jump tags. This is the default type.
The "word within page block" type appends two integer values in the form pageblocknumber-wordinblock, where pageblocknumber is the ordinal of the current <pb> (page break) entry, and wordinblock is the number of the word within the page block (starting at 1 * spacing). When the text contains no page break elements, all words appear as part of block 0.
The spacing value provides the increment from one ID value to the next. 10 is the default spacing. Setting the spacing to a value of 10 or 20 (or larger) allows editing programs to interpolate corrections between existing words when the tokenization needs correction. This allows the word IDs to be more stable while the editing process continues. When the spacing is set to 1, adding or removing a word requires a complete resequencing in the case of reading context order IDs or a resequencing of an entire block in the case of word within page block IDs. The resequencing process is not something a human being will do, but is the province of a program such as an editing program, since not only the word IDs but the word ordinals, sentence numbers, and word numbers within sentences will require updating.
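The role of the spacing value can be illustrated with a short Python sketch of "word within page block" numbering. The function names are hypothetical, and the exact ID layout (hyphen separators, no zero padding) is an assumption made for illustration; MorphAdorner's actual formatting may differ.

```python
def page_block_ids(work_id, words_per_block, spacing=10):
    """Build IDs of the form work-pageblocknumber-wordinblock.

    words_per_block lists the number of words in block 0, block 1, ...
    The in-block number starts at 1 * spacing and grows by spacing.
    """
    ids = []
    for block, count in enumerate(words_per_block):
        for n in range(1, count + 1):
            ids.append(f"{work_id}-{block}-{n * spacing}")
    return ids

def interpolate(left, right):
    """Return an unused number strictly between two neighboring in-block
    numbers, or None when no gap is left and resequencing is needed."""
    mid = (left + right) // 2
    return mid if left < mid < right else None

print(page_block_ids("ancf0207", [2, 1]))
# ['ancf0207-0-10', 'ancf0207-0-20', 'ancf0207-1-10']
print(interpolate(10, 20))  # 15
print(interpolate(1, 2))    # None
```

With a spacing of 10, a corrected token can be inserted between words 10 and 20 without touching any existing ID. With a spacing of 1 there is no gap, so the block must be resequenced.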
The advantage of the "reading context order" type is that a program can extract just the word elements to get the relative position of words and sentences. By sorting words by the word ID it is simple to extract sentences and n-grams without having to worry about hard, jump, and soft tags. The disadvantage is that any change in the tag structure or the classification of tags invalidates the reading context order property (but the word IDs are still valid as unique values).
The advantage of the "word within page block" type is that it provides a basis for displaying a citation position for words. Of course any individual program can generate citations without reference to word IDs, but it may be helpful to have a consistent basis for generating citations. The disadvantage is that each individual program must fully parse the XML and understand the soft, hard, and jump tag structure in order to determine the reading context order so that sentences and n-grams can be extracted.
Numerous tokenization errors remain in many digitized texts. Some errors come from the original digitization. Others come from mistakes introduced by MorphAdorner. Once these tokenization errors have been corrected, the word IDs can be resequenced and citations can be stabilized.
Marking the end of a sentence with the eos= attribute
MorphAdorner uses the eos= attribute on the <w> tag for marking a token which ends a sentence. We considered using <milestone> tags to mark sentences, but these present many problems when sentences span jump tags. The same is true of seg-like markers such as <s>.
Using a word-level marker for end of sentence makes it easy to generate sentence information regardless of how one orders the text when dealing with jump tags. For example, Prior and WordHoard move jump tag content to the end of the work part. That enormously simplifies text display and operations such as collocate extraction. MorphAdorner, when requested to extract sentences, tries to leave the sentences in jump tags as close as possible to their original location in the text. The same word-level flag supports either approach (or other approaches).
Using sn= to add sentence numbers is another approach.
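As a minimal sketch of how a program might use the word-level flag, the following Python fragment groups tokens into sentences by reading eos= in document order. The function name and sample text are hypothetical. The sketch does not relocate jump tag content, and it takes the spelling from the first fragment of a split token, on the assumption that spe= holds the complete spelling.

```python
import xml.etree.ElementTree as ET

def sentences(root):
    """Group token spellings into sentences using the eos flag."""
    result, current = [], []
    for w in root.iter("w"):
        # Later fragments of a split token repeat the first fragment's
        # attributes, so only parts "N" and "I" are counted.
        if w.get("part", "N") in ("M", "F"):
            continue
        # Fall back from spe to tok to the element text, which also
        # covers abbreviated attribute output.
        current.append(w.get("spe") or w.get("tok") or w.text or "")
        if w.get("eos", "0") == "1":
            result.append(current)
            current = []
    if current:
        result.append(current)  # trailing tokens with no eos="1"
    return result

sample = (
    '<p><w eos="0">It</w><c> </c><w eos="0">rained</w><w eos="1">.</w>'
    '<c> </c><w eos="0">We</w><c> </c><w eos="0">stayed</w><w eos="1">.</w></p>'
)
print(sentences(ET.fromstring(sample)))
# [['It', 'rained', '.'], ['We', 'stayed', '.']]
```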
Abbreviated attribute output
By default MorphAdorner outputs the full set of <w> attributes.
MorphAdorner can also output an abbreviated attribute set, in which
only non-redundant attribute values appear in the <w> tag. This produces
smaller output files with no loss of information, since the omitted attribute
field values can be restored from those of the other attributes or the
token text.
MorphAdorner uses the following algorithm to generate the abbreviated
set of <w> tag attributes.
- Let the token-text be the text enclosed within the <w></w> tag pair.
- When tok has the same value as the token-text, omit the tok attribute.
- When spe has the same value as tok, omit the spe attribute.
- When reg has the same value as spe, omit the reg attribute.
- When pos has the same value as tok, omit the pos attribute.
- When lem has the same value as spe, omit the lem attribute.
- When eos has the value "0", omit the eos attribute.
- When part has the value "N", omit the part attribute.
The following algorithm can be used to reconstruct the full set of
<w> attributes from the abbreviated set.
- When tok is missing, set its value to the text enclosed by the <w></w> tags.
- When spe is missing, set its value to the value of tok.
- When reg is missing, set its value to the value of spe.
- When pos is missing, set its value to the value of tok.
- When lem is missing, set its value to the value of spe.
- When eos is missing, set its value to "0" (zero).
- When part is missing, set its value to "N".
The attribute values for xml:id and ord are always present in either
abbreviated or verbose output files.
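The two algorithms above can be sketched in Python as a pair of functions operating on a dictionary of attribute values. The function names are hypothetical, and the sketch assumes the full attribute set is present on input to the abbreviation step.

```python
def abbreviate(full, text):
    """Drop each attribute whose value can be restored from the token
    text or from another attribute, following the rules listed above."""
    implied = [
        ("tok", text),
        ("spe", full["tok"]),
        ("reg", full["spe"]),
        ("pos", full["tok"]),
        ("lem", full["spe"]),
        ("eos", "0"),
        ("part", "N"),
    ]
    short = dict(full)
    for name, value in implied:
        if full[name] == value:
            del short[name]
    return short

def expand(short, text):
    """Restore the full attribute set; the order matters because later
    defaults depend on values restored earlier."""
    full = dict(short)
    full.setdefault("tok", text)
    full.setdefault("spe", full["tok"])
    full.setdefault("reg", full["spe"])
    full.setdefault("pos", full["tok"])
    full.setdefault("lem", full["spe"])
    full.setdefault("eos", "0")
    full.setdefault("part", "N")
    return full

edward = {
    "eos": "0", "lem": "Edward", "pos": "np1", "reg": "Edward",
    "spe": "Edward", "tok": "Edward", "xml:id": "ancf0207-050740",
    "part": "N", "ord": "4958",
}
short = abbreviate(edward, "Edward")
print(short)
# {'pos': 'np1', 'xml:id': 'ancf0207-050740', 'ord': '4958'}
print(expand(short, "Edward") == edward)  # True
```

For a fragment of a split token, tok differs from the enclosed text, so tok is kept and the round trip still restores every value.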
Split tokens
Individual tokens in XML texts may be split by soft tags, and occasionally
by jump tags. MorphAdorner assembles the fragments of a split token into
a complete token and sets the tok and spe attributes of the
<w> tag for the token fragment to contain the complete token.
The xml:id field for a split word adds a dot and the part number (for example, ".1")
to the end of the <w> tag's xml:id value. The xml:id can still be treated as an opaque
object, but the part number can be extracted from the end if desired.
In many cases the part number is not needed, and the value of the
part attribute of the <w> tag suffices.
- part="N" means the token is unsplit (complete).
- part="I" means the token is the first part of a split token.
- part="M" means the token is some part after the first but before the last.
- part="F" means the token is the last part of a split token.
Here is an example of a split word from Austen's Lady Susan
(ancf0207.xml). The original XML text is:
<p rend="align(r)">Edward S<hi rend="sup(1)">t</hi>.</p>
The "St." token is split into three pieces by soft tags.
The corresponding adorned text is:
<p rend="align(r)">
<w eos="0" lem="Edward" pos="np1" reg="Edward"
spe="Edward" tok="Edward" xml:id="ancf0207-050740" part="N"
ord="4958">Edward</w>
<c> </c>
<w eos="1" lem="saint" pos="n1" reg="St." spe="St." tok="St."
xml:id="ancf0207-050750.1" part="I" ord="4959">S</w>
<hi rend="sup(1)">
<w eos="1" lem="saint" pos="n1" reg="St." spe="St." tok="St."
xml:id="ancf0207-050750.2" part="M" ord="4959">t</w>
</hi>
<w eos="1" lem="saint" pos="n1" reg="St." spe="St." tok="St."
xml:id="ancf0207-050750.3" part="F" ord="4959">.</w>
</p>
The ord attribute value is the same for all three fragments
of "St.". This is always the case for words split solely by soft tags.
The ord attribute values will not be the same for words split by
jump tags, as the individual word fragments can be separated by
hundreds or even thousands of other words.
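As a minimal sketch of how a program might reassemble split tokens, the following Python fragment groups fragments by the xml:id value with the trailing part number removed and joins their text. The function name is hypothetical, the sample omits most attributes for brevity, and joining fragments in document order is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

adorned = """
<p rend="align(r)">
<w xml:id="ancf0207-050740" part="N" ord="4958">Edward</w>
<c> </c>
<w xml:id="ancf0207-050750.1" part="I" ord="4959">S</w>
<hi rend="sup(1)">
<w xml:id="ancf0207-050750.2" part="M" ord="4959">t</w>
</hi>
<w xml:id="ancf0207-050750.3" part="F" ord="4959">.</w>
</p>
"""

def whole_tokens(root):
    """Return one (id, text) pair per token, joining split fragments."""
    tokens = {}
    for w in root.iter("w"):
        wid = w.get(XML_ID)
        if w.get("part", "N") != "N":
            # Strip the trailing ".partnumber" to get the shared token ID.
            wid = wid.rsplit(".", 1)[0]
        tokens[wid] = tokens.get(wid, "") + (w.text or "")
    return list(tokens.items())

print(whole_tokens(ET.fromstring(adorned.strip())))
# [('ancf0207-050740', 'Edward'), ('ancf0207-050750', 'St.')]
```

Grouping by ID rather than by adjacency means that fragments separated by a jump tag are still joined into one token.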
Named Entities
MorphAdorner contains an experimental procedure which extends the GATE facility for adding named entity tags
to input texts.
Each named entity is enclosed by <rs type="named entity type"></rs> tags.
The type= attribute value specifies the type of the named entity, which
may be one of the following.
| type="date" | A date reference (e.g., March 12). |
| type="location" | A geographical location (e.g., England). |
| type="money" | An amount of money (e.g., 1 shilling). |
| type="organization" | An organization name (e.g., Bank of England). |
| type="person" | A person's name (e.g., Emma Woodhouse). |
| type="time" | A time reference (e.g., 12 midnight). |
| type="literary" | A literary reference (e.g., Ivanhoe). |
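As a minimal sketch of how a program might collect the tagged entities, the following Python fragment lists each <rs> element with its type and text. The function name and sample are hypothetical, and the sketch assumes that each <rs> element wraps the <w> and <c> elements of the entity in the adorned output.

```python
import xml.etree.ElementTree as ET

def named_entities(root):
    """Return a (type, text) pair for each <rs> element, building the
    text from the <w> and <c> elements it encloses."""
    found = []
    for rs in root.iter("rs"):
        text = "".join(e.text or "" for e in rs.iter() if e.tag in ("w", "c"))
        found.append((rs.get("type"), text))
    return found

sample = (
    '<p><rs type="person"><w>Emma</w><c> </c><w>Woodhouse</w></rs>'
    '<c> </c><w>lived</w><c> </c><w>in</w><c> </c>'
    '<rs type="location"><w>England</w></rs><w>.</w></p>'
)
print(named_entities(ET.fromstring(sample)))
# [('person', 'Emma Woodhouse'), ('location', 'England')]
```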