Northwestern University Information Technology
 
TCP Corpora Processing    

Introduction

This page describes the sequence of steps that begin with an SGML encoded Text Creation Partnership file and transform it into a linguistically adorned file processed with Abbot and MorphAdorner.

The MorphAdorner v2.0 project seeks to capture the orthographic and morphological variety of Early Modern printed books and make their texts available in formats that both articulate and erase difference. Once Abbot has transformed the original SGML transcriptions into TEI XML, MorphAdorner tokenizes each word occurrence in a text and maps its surface form to the combination of a lemma and part of speech. A surface form like 'louyth' is mapped to the combination of the lemma 'love' and the POS tag 'vvz'. A 'lempos' or combination of lemma and POS tag can be used as the basis for a standardized spelling. On this view, linguistic adornment provides a virtual erasure of difference, which is useful for some purposes. Alternately, a lempos can also be used to look for the different surface forms in which that particular lexical and morphological phenomenon is realized. On that view, useful for other purposes, linguistic adornment provides a procedure for discovering and analyzing difference.

It is a major goal of the Abbot and EEBO MorphAdorner collaboration to turn the TCP texts into the foundation for a "Book of English," defined as:

  • a large, growing, collaboratively curated, and public domain corpus of written English since its earliest modern form
  • with full bibliographical detail
  • and light but consistent structural and linguistic annotation.

Texts in the adorned TCP corpus will exist in more than one format so as to facilitate different uses to which they are likely to be put. In a first step, Abbot transforms the SGML source text into a TEI P5 XML format. Abbot, a software program designed by Brian Pytlik Zillig and Stephen Ramsay, can read arbitrary XML files and convert them into other XML formats or a shared format. Abbot generates its own set of conversion routines at runtime by reading an XML schema file and programmatically effecting the desired transformations.

MorphAdorner can output its results in a variety of tabular or XML based formats. Our goal is to provide output formats that can be successfully managed by scholars with moderate programming skills. We also believe that scholars working with the files will discover many instances of incompletely or incorrectly transcribed words and phrases. We want to make it easy to transmit completions or corrections back to the source files. Thus various "bread crumbs" are built into the design of MorphAdorner's routines and output formats. Linguistic adornment, coupled with appropriate analytical tools, opens up many new forms of analysis. But you should not underestimate the cumulative power of the quite humble task of discovering and fixing errors along the way.

The SGML source files

Origin and nature of the source files

The source files come from three Text Creation Partnership archives:

  1. Early English Books Online files from Proquest, representing English books printed before 1700 (~45,000 files)
  2. Eighteenth Century Collections Online files from Cengage, representing books printed in the 18th century (~2,500 files)
  3. The Evans collection of Early American imprints from Readex, representing books printed in America before 1800 (~5,000 files)

Bibliographical data for all these files are contained in the English Short Title Catalog.

The TCP files were transcribed by various commercial vendors through a double keyboarding method. The transcriptions are based on digital scans of microfilms created in the mid- and late twentieth century. The quality of the microfilms and digital scans is variable. So is the quality of the printed original. Problems of transcription are overwhelmingly a function of what the transcriber was able to see on the digital copy of a microfilm image of a printed page.

The texts were encoded in SGML using a DTD that is a modification of the P3 TEI Guidelines. The files were encoded in ASCII and employ about 1,500 character entities to represent characters and symbols not found in the lower ASCII set of characters.

Typographical changes

The printed sources of the TCP texts use a great variety of typefaces and mix them in various ways. The TCP transcriptions ignore most of this, but use the <hi> tag to mark a change of type. An <hi> element means that the text enclosed by it is set in a different type from the text that surrounds it. This use of the <hi> tag does not provide any information about the type of the surrounding or enclosed text. In practice, text enclosed by <hi> usually means text in italics surrounded by plain text, but often this is not the case. You cannot reconstruct the "look and feel" of the printed page from the transcription alone.

Idiosyncratic features of the source files

The SGML transcriptions use some project-specific tricks to capture various features of the source files.

Line breaks

The TCP transcriptions do not record line breaks in the printed originals. They do, however, record "soft" hyphens where a word straddles two lines. The pipe character or vertical bar is used to mark such line breaks, as in "wind|ing".

Word breaks at line endings are not always marked with a hyphen in the printed originals. Transcribers were asked to supply missing soft hyphens with a '+' sign. Sometimes they did, sometimes they didn't. Unmarked word breaks, especially in marginal notes, are a very common feature of the TCP texts.

Superscripts and subscripts

Superscripted and subscripted alphanumerical characters are marked in the SGML transcription with a single or double 'caret', e.g. "S^t^.", "2^^3".

Decorated initial characters

Initial decorated characters in the printed texts are marked in the SGML transcriptions with a preceding underscore, as in "_T".

The interim P5 version of each file

In a first (and reversible) step we use Abbot to transform the P3 SGML version into an XML version that parses under a slightly modified version of TEI P5. The goal here is not to create the perfect P5 version but to express the structure of the SGML files in P5-like XML with minimal changes. However, Abbot is able to generate TEI XML P5 compatible versions of about 99% of the TCP SGML files.

Abbot closes unclosed tags as required by XML, maps the TEI tags to their XML "camel case" versions, changes some tag attributes to their XML format, and replaces the temporary header with the actual TEI header. The header is also converted to XML format. Abbot performs a few other changes as noted below.

Conversion of character entities

Character entities with established utf-8 code points are converted to those code points. This includes the long 's', by far the most common character entity.

Character entities with no corresponding utf-8 code points are preserved using the ad hoc devices of the TCP XML version. Thus the character entity "&abque;", which marks a printer's abbreviation for 'que', is represented by {que}; in similar cases the content of the character entity is likewise wrapped in curly braces.

Line-breaking hyphens

The pipe character used for line-breaking hyphens in the SGML texts is maintained in the XML. The transcriber-supplied hyphen marked with '+' is replaced with the Unicode soft hyphen \u2011.

Superscripts and subscripts

The SGML notation for superscripts and subscripts is maintained in the intermediate P5 version. A post-processing program replaces the SGML notation with XML tags.

Decorated initial characters

The SGML notation for decorated initial characters is maintained in the intermediate P5 version. A post-processing program run after initial tokenization adds a "rend=" attribute to a token containing decorated initial characters.

Gaps

The SGML notation for gaps is modified in the intermediate P5 version. Letter-, word-, and span-based gap extents are changed to a sequence of gap marker characters.

  • The Unicode black circle ● (Unicode u25CF) replaces missing letters.
  • The sequence of Unicode left-angle bracket, lozenge, right-angle bracket 〈◊〉 (\u3008\u25CA\u3009) replaces each missing word.
  • The sequence of Unicode left-angle bracket, horizontal ellipsis, right-angle bracket 〈…〉 (\u3008\u2026\u3009) replaces a span of missing text.
  • Simple foreign gaps are replaced by <seg xml:lang="unknown">〈◊〉 〈◊〉</seg>.
  • Foreign gap lines (enclosed by <l> tags) are replaced by a sequence of seven 〈◊〉 missing-word markers enclosed in an <l xml:lang="unknown"> tag.
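The substitutions above can be sketched as a small helper. This is an illustration only, not MorphAdorner's actual code; the constant and function names are invented for this sketch.

```python
# Marker characters for TCP gaps, as described above.
MISSING_LETTER = "\u25CF"            # one black circle per missing letter
MISSING_WORD = "\u3008\u25CA\u3009"  # one angle-bracketed lozenge per missing word
MISSING_SPAN = "\u3008\u2026\u3009"  # one angle-bracketed ellipsis per missing span

def gap_markers(unit, extent=1):
    """Return the marker string for a gap of `extent` letters or words."""
    if unit == "letter":
        return MISSING_LETTER * extent
    if unit == "word":
        return " ".join([MISSING_WORD] * extent)
    if unit == "span":
        return MISSING_SPAN
    raise ValueError("unknown gap unit: %r" % unit)
```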

Post-processing the Abbot TEI files

The Abbotized TEI files are modified slightly before they are tokenized. The changes consist primarily of converting the TCP-style superscripts to XML tag format.

Converting ^d to elements

Tokens which end in ^d, where "d" is a single digit, are converted to the token followed by the digit marked up as a separate element. This allows inserting the missing targets of these apparent note references at a later manual editing stage. Some of these may actually be incorrectly transcribed British monetary markers where the digit "1" was encoded instead of the letter "l".

Superscripts and subscripts

The ~44,000 EEBO texts include ~7,500 distinct superscript patterns with ~625,000 occurrences. All but 113 patterns (with ~700 occurrences in ~90 files) can be satisfactorily presented with current utf-8 characters for superscripts. Subscripts are much less common and mostly numerical. Some subscripts may be wrongly transcribed superscripts, e.g. "S^^r".

Although most superscripts and subscripts can be expressed literally through utf-8, it may be that for most analytical purposes superscripts add a level of complexity without corresponding benefit. There is much to be said for replacing them with plain characters wherever this can be done without creating ambiguity. Getting rid of super- and subscripts in those cases removes 98% or more of all instances.

yᵉ, yᵗ, and yᵘ

The spellings yᵉ, yᵗ, and yᵘ are best seen as single brevigraphs representing a whole word. The nature of 'y' in these cases is determined by the following letter: it represents the thorn or 'th' rather than 'y'. Replacing these brevigraphs with 'the', 'that', and 'thou' probably makes more sense for a linguistically adorned text than keeping the original spellings, which would require special filtering if a researcher wanted to do some analysis of the distribution of 'y' and 'i' spellings in 16th century texts.

Texts that have 'yᵉ', 'yᵗ', and 'yᵘ' are also likely to have the brevigraphs 'wᶜ' and 'wᵗ' for 'which' and 'with'. These are different from the 'y' cases in the sense that the first letter stands for itself. They do not resolve comfortably to the plain spellings 'wc' or 'wt', and it seems preferable to replace them with the words they stand for.
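The replacements discussed above amount to a small lookup table. The sketch below is illustrative only (the mapping and function name are assumptions, not MorphAdorner's API); the superscript code points are U+1D49 (ᵉ), U+1D57 (ᵗ), U+1D58 (ᵘ), and U+1D9C (ᶜ).

```python
# Hypothetical brevigraph expansion table for the cases discussed above.
BREVIGRAPHS = {
    "y\u1d49": "the",    # y + superscript e
    "y\u1d57": "that",   # y + superscript t
    "y\u1d58": "thou",   # y + superscript u
    "w\u1d9c": "which",  # w + superscript c
    "w\u1d57": "with",   # w + superscript t
}

def expand_brevigraphs(tokens):
    """Replace whole-word brevigraphs; leave every other token unchanged."""
    return [BREVIGRAPHS.get(t, t) for t in tokens]
```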

Common superscripts

The most common superscripts in later texts are strings like Mʳ, Mʳˢ, Dʳ, 2ᵈ, and the like, which are unambiguous and intelligible in their plain spellings Mr, Mrs, Dr, 2d.

Problematic superscripts

Some superscripts produce ambiguous or illegible words when written in plain type: 'Maᵗⁱᵉ' and other abbreviations for 'Majesty' are the most common examples. In these cases one can fall back on using superscripts.

This fallback position cannot be used for cases where there are no appropriate utf-8 code points. There is no lower case superscript 'q', and only a limited number of upper case characters. The problem is rare: there are 150 types with 700 occurrences across 90 files. In these cases one could wrap the superscripted characters in a <hi rend="sup"> element.

For the sake of simplicity, it may be preferable to extend this practice to all cases where super- or subscripts cannot be unambiguously represented as plain letters. An additional argument in favour of doing this is the problematical nature of displaying utf-8 superscripted characters. They come from different Unicode ranges and do not form a coherent character family. In some typefaces these differences are leveled out. In others they are not. So superscript characters are a little like long 's': not fully at home in the world of utf-8.

Converting superscripts to tag form

Because of all the difficulties noted above, we decided to convert all superscript sequences given in the ^c^d^e form to <hi rend="superscript">cde</hi>. The intermediate XML file contains some private XML tag sequences to ensure proper spacing is maintained when tokenizing the superscript sequences, and to allow proper recognition of printer's brevigraphs.
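One reading of the ^c^d^e convention can be sketched with a regular expression: a run of caret-prefixed characters becomes a single <hi rend="superscript"> element. This is an illustration under that assumption, not MorphAdorner's converter, which also handles double carets, spacing tags, and brevigraph recognition.

```python
import re

# A run of one or more caret-prefixed alphanumeric characters, e.g. "^t^i^e".
SUP_RUN = re.compile(r"(?:\^[A-Za-z0-9])+")

def carets_to_tags(text):
    """Rewrite ^c^d^e superscript runs as a <hi rend="superscript"> element."""
    return SUP_RUN.sub(
        lambda m: '<hi rend="superscript">%s</hi>' % m.group(0).replace("^", ""),
        text,
    )
```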

The tokenized version

Tokenization consists of mapping the boundaries of "words" and "sentences." From a theoretical perspective, both "words" and "sentences" are highly problematic constructs. In practical terms, the consistent application of heuristics will produce results you can work with in a dependable fashion. But it is not an unambiguous matter of "carving nature at its joints" (Plato, Phaedrus 265e), and there are plenty of edge cases.

About tokenization

A tokenized text is a nested structure in which the text consists of an ordered sequence of sentences, and each sentence consists of an ordered sequence of words. In the XML representation of the text, each word is contained by a <w> element, and punctuation is contained by a <pc> element. These <w> and <pc> elements are the "leaf nodes" or lowest points of a hierarchical or "tree" structure that ascends on a "path" through a series of nestings. A word in a play may sit at the bottom of the path "TEI/text/div/div/sp/l/w". In the MorphAdorned text, sentences are not enclosed in <s> tags that are stages in the XPath, because sentences often cross the discursive boundaries established by elements, especially in verse. Sentence boundaries are instead marked by a unit="sentence" attribute on a <pc> element, attached either to sentence-terminating punctuation or to an empty punctuation mark. Sentences can be identified and retrieved as the sequence of words between two <pc unit="sentence"> elements, except where text contained by certain "jump" tags such as <note> intrudes. Sentences can still be extracted by either physically or virtually (programmatically) relocating the text contained by the "jump" tags so that it no longer intrudes. MorphAdorner includes programs which do this in extracting plain (untagged) versions of the adorned texts for use by other non-XML-aware programs.

The xml:id and its complementary location id

MorphAdorner separates the act of tokenization from the act of linguistic adornment. A tokenized text (or part of it) can be re-adorned without affecting the original tokenization. Each <w> and <pc> element has an xml:id that is composed of a work ID and a running word counter that increments by 10 so that minor corrections can be accommodated without disrupting the sequence of IDs. For instance, falsely joined words are a very common occurrence in the TCP texts. The correction of such a phenomenon (e.g. 'beginwith') involves the division of one token into two. If the original token count goes like "10, 20", splitting "beginwith" into two words with the IDs 10 and 15 does not affect other IDs or their sequence.
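The gap left by the increment of 10 makes such corrections a simple midpoint calculation. A minimal sketch of the idea (the function name is illustrative, not part of MorphAdorner):

```python
def split_token_id(prev_id, next_id):
    """Choose a counter value for a token inserted between two existing tokens.

    Word counters step by 10, so there is normally room between neighbours;
    if the gap is exhausted, a renumbering pass would be needed."""
    if next_id - prev_id < 2:
        raise ValueError("no room left between IDs; renumbering needed")
    return (prev_id + next_id) // 2

# 'beginwith' carries counter 10 and the next word counter 20;
# the newly split-off second half receives counter 15.
```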

Since the xml:ids are unique across the entire corpus they can be used to reference words in a document from other XML files, databases, or custom document types.

In addition to its xml:id MorphAdorner can generate a location ID as the 'n' attribute of <w> and <pc> elements. The purpose of this location ID is to facilitate alignment of the transcribed text with the page image, a key requirement for many forms of work with retro-digitized documents. The location ID is based on the page number of the digital scan, typically a double page. It is referenced in the SGML source text as the value of the REF attribute in <PB> elements and appears as the value of the 'facs' attribute in the P5 version. Page numbers of the printed source appear in the PB elements as the value of N attributes, but not all printed pages have running page numbers. The location ID uses 'a' and 'b' to distinguish the parts of a double-paged scan.

More precisely, the location ID takes the form facs-column-wordinpage, where facs comes from the attributes of the enclosing <pb> element, column is a letter starting with "a" and giving the column number on the printed page, and wordinpage is the ordinal of the word within the page, starting at 1, multiplied by the spacing. Subsequent location ID values have a wordinpage value incremented by the given spacing value, which is 10 by default. Optionally the work ID (usually the base file name) can be prepended to the location ID.

Here is a typical example of a location ID.

  • 2-a-0050

This refers to the first column, fifth word in page image 2 for the current work.
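The construction just described can be sketched as follows. This is a hypothetical helper, not MorphAdorner's code, and the four-digit zero padding is an assumption inferred from the example "2-a-0050" rather than a documented rule.

```python
def make_location_id(facs, column, ordinal, spacing=10, work_id=""):
    """Build a facs-column-wordinpage location ID.

    `column` is 0-based (0 -> 'a'); `ordinal` is the 1-based word
    position within the page, multiplied by `spacing` as described above."""
    col = chr(ord("a") + column)
    loc = "%s-%s-%04d" % (facs, col, ordinal * spacing)
    return "%s-%s" % (work_id, loc) if work_id else loc
```

With the defaults, the fifth word in the first column of page image 2 yields the example ID above.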

These can be long identifiers, but theoretically only the page-based counter needs to be recorded as an 'n' attribute. If page-based IDs are needed, they can be constructed on the fly or in a preprocessing step by concatenating the work ID, the attribute values of the <pb> element, and the page counter. It may also be practical to construct an xml:id for each page by concatenating the work ID with attribute values, as in <pb xml:id="A05137-025-051" facs="25" n="51"/>.

Tokenization and the apostrophe

The TCP corpora use the apostrophe character (Ascii 39) to represent both the apostrophe character proper and the single quote. The apostrophe symbol presents tricky problems for tokenization when it appears before or after a word. It may be an opening or closing quotation mark or it may be part of a contracted form like "'tis" or the possessive marker of a plural noun (sailors'). In the former cases it should be replaced by opening or closing quotation marks and be identified as a separate token. In the latter cases it should be counted as part of the word. It is possible to identify contracted forms with considerable precision. Apostrophes sometimes appear as leading quotation marks at the beginning of lines of verse.

Apostrophes are rare before the late seventeenth century, but their disambiguation is a non-trivial problem in texts from the late 17th and 18th centuries, especially play texts or texts that contain conversation or informal correspondence. The relative frequency of apostrophes is a pretty good guide to texts of this kind.

MorphAdorner uses a table of common occurrences of words beginning or ending with an apostrophe to determine when to split or retain initial or trailing apostrophes from words. This is not completely accurate but the most common occurrences of tokens with leading or trailing apostrophes are correctly handled.

Tokenization and the mdash

The mdash (\u2014) is another symbol that complicates tokenization. It is very rare in 16th and early 17th century texts. Like the apostrophe, it belongs to the world of conversation and informal correspondence. In the SGML texts the mdash character entity is used both as a punctuation mark and as the symbol for "polite elision", e.g. "d-mn" or "B―p," etc. In this second use the mdash does not mark a token boundary. You can with tolerable accuracy distinguish between these two uses through a combination of algorithms and exception lists.

The SGML texts never use the horizontal bar (\u2015), which is visually indistinguishable from the mdash. It is therefore possible to use the horizontal bar to mark polite elision. This removes existing ambiguity without creating new forms of ambiguity. The replacement of the mdash with a horizontal bar could be explicitly recorded in a change log, but this is not strictly necessary, since in the MorphAdorned TCP texts the presence of \u2015 would by definition involve its replacement of "&mdash;" in the SGML source files.

MorphAdorner attempts to distinguish the cases where the mdash is a word separator from those where it should not be (as in polite elision). This cannot be done with high accuracy and some tokenization errors remain.
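A toy version of the distinction is an mdash flanked by letters on both sides ("d—mn", "B—p"), which is treated as polite elision and rewritten with the horizontal bar; elsewhere the mdash stays a punctuation mark. This is only a sketch of the heuristic; the real tokenizer also consults exception lists, and some errors remain, as noted above.

```python
import re

MDASH, HBAR = "\u2014", "\u2015"

# An mdash with a letter immediately before and after it: polite elision.
ELISION = re.compile("(?<=[A-Za-z])" + MDASH + "(?=[A-Za-z])")

def mark_polite_elision(text):
    """Rewrite word-internal mdashes as horizontal bars; leave the rest alone."""
    return ELISION.sub(HBAR, text)
```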

Periods and abbreviations

MorphAdorner's sentence splitter uses the ICU4J BreakIterator class (from the International Components for Unicode library) along with a large set of heuristics for determining if two or more sentences generated by BreakIterator should be joined into one sentence. The heuristics include special treatment of sentence-ending brackets (right parenthesis, right bracket, and right brace), abbreviations, and interjections.

Abbreviations are a source of many tokenization errors. The TCP texts include a great many scientific, theological, and other learned texts with thousands of obscure and rarely consistent abbreviations.

MorphAdorner includes an implementation of the Punkt algorithm which treats abbreviations as a special form of collocation in which a character string habitually collocates with a final period. Running the Punkt abbreviation detection algorithm over an entire corpus provides an initial, somewhat conservative list of abbreviations. The abbreviations produced by Punkt have proved to be genuine abbreviations, or at least strings to which the trailing period should remain attached (e.g., Roman numerals). Punkt misses some abbreviations, so the initial list requires manual enhancement.

Many of the most commonly missed abbreviations are Biblical references. Punkt relies on relative occurrences of tokens with and without trailing periods in order to determine which strings are probable abbreviations. Especially in the earlier EEBO texts, an abbreviated Biblical book may appear both with and without a trailing period - e.g., Corinthians may appear as both Cor and Cor.
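The core idea behind Punkt's abbreviation detection can be illustrated with a toy frequency ratio: a string that appears far more often with a trailing period than without is a likely abbreviation. Punkt itself uses a proper log-likelihood statistic plus length and internal-period evidence; the threshold and function below are stand-ins for illustration only.

```python
from collections import Counter

def likely_abbreviations(tokens, min_ratio=0.9, min_count=3):
    """Return lower-cased strings that almost always carry a trailing period."""
    with_dot = Counter(t[:-1].lower() for t in tokens if t.endswith("."))
    bare = Counter(t.lower() for t in tokens if not t.endswith("."))
    out = set()
    for word, n in with_dot.items():
        total = n + bare.get(word, 0)
        if word and n >= min_count and n / total >= min_ratio:
            out.add(word)
    return out
```

Note how the Biblical-reference problem above shows up here: if "Cor" occurs often enough without its period, the ratio drops below the threshold and the abbreviation is missed, which is why the automatically derived list needs manual enhancement.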

It is important when tokenizing some kinds of texts to use different abbreviation lists for different parts of the text. For example, we used different abbreviation lists when adorning the main part of drama texts as opposed to the paratext (stage directions, speaker labels, etc.). MorphAdorner provides for using different abbreviation lists based upon tag classes.

Roman numerals

Roman numerals in older English language texts exhibit considerably more orthographic variation than contemporary usage allows. For example, the letters "j" and "u" are often substituted for "i" and "v". Runs of letters may exceed the nominal length, e.g., "iiiii" may be used where "v" would normally appear in current usage. Particularly in early modern texts, numerals may be preceded and/or followed by a period, as in ".XVI." Some Roman numerals are often followed by superscripted letters, as in "DCCXXV<sup>o</sup>," where the Latin inflection markers need to be stripped in order to retrieve the base form "DCCXXV". MorphAdorner attempts to recognize many of these variants.
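The variants described above suggest a normalize-then-validate approach: fold "j" to "i" and "u" to "v", strip flanking periods, and then test against the Roman letter set. This is a hedged sketch of that idea, not MorphAdorner's actual recognizer, which handles more cases (including over-long runs and stripped inflection markers).

```python
import re

# Letters permitted in a (normalized, lower-cased) Roman numeral.
ROMAN = re.compile(r"^[mdclxvi]+$")

def normalize_roman(token):
    """Return the normalized Roman numeral, or None if the token is not one."""
    t = token.strip(".").lower().replace("j", "i").replace("u", "v")
    return t if t and ROMAN.match(t) else None
```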

It is sometimes difficult to distinguish algorithmically between different uses of strings such as "I." which may be a Roman numeral, an initial, or a personal pronoun. "D." may be a Roman numeral or an initial. Many of these problems occur around abbreviations in Biblical references. Disambiguating the usage is important to achieve accurate part of speech tagging.

Back-tick characters

The back-tick character ` (Ascii 96) appears in a number of texts in different contexts. When it occurs in the middle of a word, it acts as an alternative to the apostrophe. At the start of a line of verse the back-tick (or two back-ticks in sequence) functions as a kind of opening quote with no corresponding closing quote.

MorphAdorner treats two back-ticks as a single punctuation mark and splits them from the token to which they are attached. A single back-tick followed by a capital letter (ignoring any intervening decorative tag such as <hi>) is treated as a separable token as well. This is correct more often than not. Other instances of the back-tick are left attached to the token in which they appear. The back-tick is regularized to an apostrophe when looking up spellings for purposes of lemmatization and part of speech tagging.

Edge cases of 'words' in MorphAdorned texts

The TEI Guidelines define the content of a <w> element as a "grammatical (not necessarily orthographic) word." While the blank space is the most common word boundary marker, a blank space does not always separate one word from another, and there are lexical items that may be spelled as a single word, two words, or hyphenated words. In the TCP texts there are three very common types of such lexical items: reflexive pronouns, British monetary terms, and the words 'today', 'tomorrow', and 'yesterday'.

MorphAdorner handles these cases using pattern matching during the tokenization phase. Occurrences such as "my self" are treated as split words, and the individual parts are marked with the "part=" attribute of the <w> element to indicate this. Some special cases are handled. For example, in the phrase "from day to day" the "to" and "day" are individual words, not parts of a split word.

Reflexive pronouns

The reflexive pronouns 'myself', 'herself' etc. occur as hyphenated or single word spellings from very early on. The frequency of two-word spellings declines over time, but it is probably a mistake to use orthographic difference as a sufficient reason for analyzing "my self" as the sequence of a possessive pronoun and a noun, while 'myself' or 'my-self' are analyzed as reflexive pronouns. Mapping the different spellings to the same description and treating them as single lexemes still allows an investigator to pursue the question of whether or how the decline of two-word spellings marks a change in the perception of these "words" as single or composite.

British monetary terms

The most common way of referring to pounds, shillings, and pence is to use the Latin abbreviations 'l' or 'lb', 's', and 'd' preceded by a numeral, typically Arabic. There may or may not be a space between the numeral and the abbreviation. The abbreviation is often, but not always, marked by a period. The abbreviation may also appear as a superscript (more common with 'l' or 'lb' than with 's' and 'd').

If you look for monetary terms across many texts in the corpus, it is probably helpful to treat these different spellings as single monetary expressions and contain them in a single <w> element.

MorphAdorner attempts to locate occurrences of monetary patterns and encodes the variants which contain blanks as split words. This allows recovery of the joined word as a single unit. Unfortunately a fair number of occurrences of "l" (pound) following a number are encoded as the numeral "1" instead, complicating the recognition of the monetary pattern.

Today, tomorrow, and yesterday

The spellings of 'today', 'tomorrow', and 'yesterday' are all over the map in Early Modern English. As with reflexive pronouns, there is a trend away from two-word spellings.

If 'to day' is treated as a single word, you need to watch out for the phrase 'from day to day,' where 'to day' is clearly not one word. MorphAdorner has a list of such exceptional cases.
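The exception handling can be sketched with two pattern lists: one for the split word, one for phrases inside which it must not be joined. The patterns and function name below are illustrative; MorphAdorner keeps a fuller table of such exceptional cases.

```python
import re

# Phrases in which 'to day' must NOT be joined (illustrative, not exhaustive).
EXCEPTIONS = [re.compile(r"\bfrom day to day\b", re.I)]
SPLIT_WORD = re.compile(r"\bto day\b", re.I)

def join_to_day(text):
    """Join 'to day' into one token unless it falls inside an exception phrase."""
    spans = [m.span() for p in EXCEPTIONS for m in p.finditer(text)]
    def repl(m):
        inside = any(a <= m.start() and m.end() <= b for a, b in spans)
        return m.group(0) if inside else m.group(0).replace(" ", "")
    return SPLIT_WORD.sub(repl, text)
```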

Changes in the tokenized file

The tokenized file as the basis for linguistic adornment

The tokenized file serves as the basis for linguistic adornment. Some features of the SGML source file are very unlikely to be of further use in the adorned file and make working with it harder. They are removed at this stage. Because each word in the tokenized file has a unique xml:id, it is easy to log all the changes and "park" them in a change log file. Think of it as a form of tacit stand-off markup. It is tacit in the sense that the tokenized file need not include an explicit pointer to a record in the change log. But you can ask whether a given xml:id in the tokenized file has a corresponding record in the change log.

The character of the change log

Just about anything archived in a change log could also be stored as elements or attributes in the XML file. It is not expensive to store everything in one file, but bloated files are cumbersome to manipulate. The cost of storing such files may be trivial, but the cost -- in terms of time or complexity -- of manipulating them is not. You may want to look for some kind of "off-site" storage for features that will be used rarely, if ever. It is critical that such features can be retrieved with precision and ease. It is not critical that they are retrievable "on the fly."

The changes in question always involve the content of <w> and <pc> elements, and possibly associated <c> elements which enclose word-separating whitespace.

MorphAdorner uses a simple XML format to contain a list of token-based changes. The format of this file is as follows.

<ChangeLog>
  <changeTime>The time the change file was created.</changeTime>
  <changeDescription>A description of the changes.</changeDescription>
  <changes>
   <change>
     <id>xml:id of token to be changed.</id>
     <changeType>addition, modification, or deletion.</changeType>
     <fieldType>Type of field to change: text or attribute.</fieldType>
     <oldValue>Old field value.</oldValue>
     <newValue>New field value.</newValue>
     <siblingID>xml:id of sibling word for a word being added.</siblingID>
     <blankPrecedes>true if blank precedes the token, else false.</blankPrecedes>
   </change>
            ...
(more <change> entries)
            ...
  </changes>
</ChangeLog>

This simple XML formatted change file allows a file to be transformed to a corrected file using a utility in the MorphAdorner suite. A file can be "untransformed" from the corrected version to the uncorrected version using the same change file. A likely use case for the change log is an edition that wants to use long 's' and other original spellings.

Here is an example of a change log entry which records the replacement of a long "s" with a plain "s".

<ChangeLog>
  <changeTime>2013-07-09 13:04:17.149 CDT</changeTime>
  <changeDescription>Changes from \tokenized\K000379.000.xml to \tokenized-no-word-breaks\K000379.000.xml as determined by CompareAdornedFiles.</changeDescription>
  <changes>
    <change>
      <id>K000379_000-00080</id>
      <changeType>modification</changeType>
      <fieldType>text</fieldType>
      <oldValue>Addreſs'd</oldValue>
      <newValue>Address'd</newValue>
      <blankPrecedes>true</blankPrecedes>
    </change>
        ...
  </changes>
</ChangeLog>
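Applying or reversing an entry of the kind shown above reduces to swapping the roles of the old and new values. The sketch below is a minimal illustration, not the MorphAdorner utility: a dict stands in for the token with the given xml:id, and only text modifications are handled.

```python
def apply_change(token, change, reverse=False):
    """Apply (or, with reverse=True, undo) a text-modification change entry."""
    src, dst = ("newValue", "oldValue") if reverse else ("oldValue", "newValue")
    if change["changeType"] == "modification" and change["fieldType"] == "text":
        # The token must match the change log's source value before editing.
        assert token["text"] == change[src], "token does not match change log"
        token["text"] = change[dst]

token = {"xml:id": "K000379_000-00080", "text": "Addre\u017fs'd"}
change = {"changeType": "modification", "fieldType": "text",
          "oldValue": "Addre\u017fs'd", "newValue": "Address'd"}
apply_change(token, change)               # forward: long s becomes plain s
apply_change(token, change, reverse=True) # untransform restores the original
```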

Long 's'

The Unicode character for long 's' is replaced at this stage with plain 's'. For almost any conceivable inquiry, the presence of two different forms of 's' complicates analysis without compensating advantage. The word occurrences with long 's' are logged in the change log.

Soft hyphens

The soft hyphens of the SGML files are treated according to the following protocol:

  1. If a spelling with a soft hyphen occurs elsewhere in the work or corpus as an unhyphenated spelling, the soft hyphen is removed.
  2. If a spelling with a soft hyphen occurs elsewhere with a hyphen, the soft hyphen is replaced with a true hyphen.
  3. If a spelling with a soft hyphen does not occur elsewhere either in a hyphenated or unhyphenated form and both word parts can serve as independent words the soft hyphen is replaced with a true hyphen.
  4. If a spelling with a soft hyphen does not occur elsewhere either in a hyphenated or unhyphenated form and the word parts are not independent words the soft hyphen is removed.

This replacement algorithm is implemented in a post-processing step after all the XML files are tokenized. This is necessary to get the complete list of tokens for determining how often a word appears with or without a real hyphen in the corpus.
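The four rules above translate directly into a cascade of tests. In this sketch, `seen` stands in for the set of spellings observed elsewhere in the work or corpus and `is_word` for a lexicon lookup; both names, and the function itself, are assumptions for illustration rather than MorphAdorner's implementation.

```python
def resolve_soft_hyphen(first, second, seen, is_word):
    """Resolve a soft hyphen between two word parts per the four-rule protocol."""
    joined, hyphenated = first + second, first + "-" + second
    if joined in seen:                       # rule 1: attested unhyphenated
        return joined
    if hyphenated in seen:                   # rule 2: attested with a hyphen
        return hyphenated
    if is_word(first) and is_word(second):   # rule 3: both parts independent words
        return hyphenated
    return joined                            # rule 4: otherwise drop the hyphen
```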

Character entities without corresponding utf-8 code points

In Michigan's display-oriented XML versions of the SGML texts, character entities without corresponding utf-8 code points are represented through various workarounds, often enclosing these in curly braces, as in "cum{que}", which represents the SGML transcription "cum&abque;", which in turn represents "cum" plus a brevigraph for "que". Where the braces can be dropped without creating ambiguity or illegibility -- which is true of most cases -- they should be dropped, with an appropriate record in the change log.

The horizontal bar as the marker of polite elision

As noted above, the SGML texts use the &mdash; character entity both as a punctuation mark and as a symbol for polite elision. Polite elision in the MorphAdorned files should be marked by the horizontal bar. We did not make this change in our initial conversion, but will consider doing it in the future.

Decorator characters

The underscore character identifying the initial character in a section or paragraph as decorated is removed, but logged. A rend attribute with the value initialchardecorated is added to the word element.

hi tags inside words

In the SGML texts <HI> tags sometimes begin or end in the middle of a word, reflecting common practices of Early Modern printing. In the possessive form of a name, the root is often italicized while the case ending is in plain type: "<hi>Caesar</hi>'s death". If the surrounding text is in italics and the name is emphasized through small caps, the SGML text is likely to represent that as "Caesar<hi>'s death</hi>". When the text is tokenized, the <w> element straddles different tags, requiring a complex procedure of splitting and joining word parts.

One way of avoiding this problem in the first place is to move the information from the <HI> tag of the SGML text "atomically" into the "rend" attribute of the <w> elements. That way all information about formatting is kept at the same, lowest level. If you want to ignore it you can, and it will never be in the way, because there is never any data below it whose treatment depends on what you do or do not do with the formatting data.

Because most <hi> elements consist of a single word or two words, recording information about the <hi> status of a <w> element "atomically" as the value of a rend attribute does not make the file significantly more verbose. But it gives you an opportunity to use "hybrid" rend attributes to describe words, parts of which are wrapped in <hi> tags.

Consider the most common case: <hi>Caesar</hi>'s death. This is tokenized as

<w xml:id="someid1" rend="plain_apostrophe">Caesar's</w>

<w xml:id="someid2">death</w>

The attribute value is part of a controlled vocabulary of about two dozen cases, and it means that a highlighted name is followed by a possessive in plain type. This preserves the formatting information as it appears in the SGML texts. The change can be performed in a post-processing step either before adorning the tokenized files, or after adornment is complete.

While we have not yet worked out all the details of hybrid hi tags, there are three basic cases with various subdivisions:

  1. Two parts of a word are in hi tags, but a middle connecting string is not (rare)
  2. The first part of a word is inside a <hi> tag
  3. The second part of a word is inside a <hi> tag

In addition, there are a small number of cases of nested hi tags that must be handled. In a future release of MorphAdorner we expect to include a utility which replaces most hi tags in adorned or tokenized texts with rend= attributes in the word elements.

Post-processing the tokenized file

The tokenized file is post-processed to mark words containing gaps and to replace soft hyphens with real hyphens or to remove them, as described above.

Adding type="unclear" to words containing gap characters

A type="unclear" attribute is added to any <w> element for a word or word part containing one or more gap characters ● (Unicode u25CF).
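A minimal sketch of this step, using only the Python standard library (an illustration, not MorphAdorner's actual code):

```python
# Sketch: add type="unclear" to <w> elements containing the gap character.
import xml.etree.ElementTree as ET

GAP = "\u25cf"  # ● (Unicode U+25CF)

def mark_unclear(root):
    """Flag every <w> element whose text contains a gap character."""
    for w in root.iter("w"):
        if w.text and GAP in w.text:
            w.set("type", "unclear")
    return root

root = ET.fromstring('<p><w xml:id="someid1">b\u25cfse</w><w xml:id="someid2">manner</w></p>')
mark_unclear(root)
```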

Other token-based changes

A post-tokenization program replaces the long "s" with plain "s" and removes the braces surrounding brace-enclosed entities. A change log can be created at this point to allow these changes to be undone.

The process of linguistic adornment

Following the tokenization post-processing phase, each TEI XML file is linguistically adorned.

The pivotal position of the tokenized but not yet adorned file

The tokenized, but not yet adorned, interim P5 version of an SGML text has a pivotal position in the workflow that leads from SGML texts to linguistically adorned files. This version does not shed any data from the SGML text, but it identifies textual features that will be removed or changed, with all changes logged in a manner that allows backtracking to the SGML file.

This interim file is linked to the linguistically adorned file with its enrichments and simplifications through the stable system of xml:ids for each <w> element. There will be some changes in some xml:ids as errors are discovered and fixed, and the maintenance of the link between the first tokenized version and its adorned derivatives cannot be taken for granted. It needs attention, but it is a realistic assumption that it can be maintained.

The separation of tokenization from adornment is a key feature of MorphAdorner 2.0. It allows for work flows that are more granular and iterative, supporting cumulative improvements over time. Many data errors or opportunities for data enrichment are discovered in the process of working with data. While MorphAdorner does not by itself create a collaborative curation environment, its data structure and basic work flows are useful building blocks for such an environment.

Linguistic adornment

MorphAdorner associates every word occurrence with a lemma and a part of speech tag. From this combination, which we call the 'lempos', you can derive a standard spelling: if 'louyth' is identified as an instance of 'love_vvz', you can algorithmically derive 'loveth' or 'loves' as the desired standardized spelling.
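The derivation can be illustrated with a toy sketch. The suffix rules below are invented for illustration only; the real standardization depends on the full NUPOS morphology.

```python
# Toy suffix rules keyed by POS tag (illustrative assumptions, not NUPOS rules).
SUFFIXES = {
    "vvz": lambda lemma: lemma + "s",                        # love_vvz -> loves
    "vvd": lambda lemma: lemma + ("d" if lemma.endswith("e") else "ed"),
}

def standard_spelling(lempos):
    """Derive a standardized spelling from a 'lemma_pos' combination."""
    lemma, _, pos = lempos.partition("_")
    rule = SUFFIXES.get(pos)
    return rule(lemma) if rule else lemma

# 'louyth', adorned as 'love_vvz', standardizes to 'loves'
```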

Errata divs

A number of TCP texts include div elements containing errata. The content of errata divs is generally not amenable to linguistic adornment. We mark all the non-punctuation tokens in errata divs with the NUPOS "zz" part of speech tag for "unrecognizable."

Output formats

Native output

MorphAdorner produces a variety of outputs for adorned and unadorned texts as well as textual derivatives.

MorphAdorner's basic or native output format stores all its adornments as attribute values of a <w> or <pc> element. The principal, nearly P5-compatible format uses the standard ana= and lemma= attributes to store the parts of speech and lemmata, respectively, and adds a non-standard reg= attribute to hold the standardized spelling. Using an attribute is preferable to a choice element because the attribute leaves the token sequence undisturbed, and the added attribute value can be stored in the standard MorphAdorner change log.

Tabular output

For the purpose of reviewing and correcting data, MorphAdorner's tabular output is very helpful. This output contains the following (among other items) as columns in a table:

  • The corpus-wide xml:id
  • The spelling
  • The lemma
  • The POS tag
  • The token before
  • The token after
  • Up to 80 characters before
  • Up to 80 characters after
  • The highest level differentiating element (front, body, back)
  • The parent element of <w>

Here is an example, with the spelling put between the before and after contexts:

  xml:id:          K133535.000-052790
  lemma:           base
  POS tag:         j
  token before:    so
  token after:     a
  before context:  the twelve thousand Hessians , sold in so
  spelling:        b●se
  after context:   a manner by their avaricious master to the
  element:         body
  parent of <w>:   p

The example shows one of the several million incompletely transcribed words. In this, as in many other cases, the correct reading can be supplied by a literate reader with complete confidence and without consulting the page image. It is relatively straightforward to populate a database with tabular output containing only incomplete readings and adding a data entry capability that lets users log in and provide corrections. See, for example, Annolex, which also supports easy consultation of the page images in the many cases where that is necessary.
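As an illustration of how such a database could be populated, here is a sketch that filters tabular rows for incomplete readings. It assumes tab-separated columns in the order listed above; this is not MorphAdorner's actual tooling.

```python
# Sketch: collect (xml:id, spelling) pairs for incompletely transcribed words
# from tab-separated tabular output.
import csv
import io

GAP = "\u25cf"  # ● gap character

def incomplete_readings(tabular_text):
    """Yield (xml:id, spelling) pairs for rows whose spelling contains a gap."""
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        xml_id, spelling = row[0], row[1]
        if GAP in spelling:
            yield xml_id, spelling

sample = "K133535.000-052790\tb\u25cfse\tbase\tj\tso\ta\n"
# list(incomplete_readings(sample)) -> [("K133535.000-052790", "b●se")]
```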

The main point here is that the MorphAdorner data structure provides a very robust foundation for collaborative improvement of the EEBO texts over time and by many hands. Central to this task is the maintenance of stable IDs, which are the bread crumbs through which user-generated corrections can be tracked back to their source texts.

TEI compliant output

The simplest out-of-the-box version of MorphAdorned and TEI P5 compliant texts follows a format very close to that of the British National Corpus: the word token is the content of a <w> element. The lemma and POS tag are respectively stored in 'lemma' and 'ana' attributes. In out-of-the-box P5 you cannot store a standardized spelling in a 'reg' attribute. On the other hand, you can use a combination of <choice>, <orig>, and <reg> elements to make each <w> element carry its part of a double stream of original and standardized spellings, as in this adorned encoding of "wylle anone" from an early 16th-century text:

<w xml:id="someid1" lemma="will" ana="#vmb">
<choice>
<orig>wylle</orig>
<reg>will</reg>
</choice>
</w>
<w xml:id="someid2" lemma="anon" ana="#av">
<choice>
<orig>anone</orig>
<reg>anon</reg>
</choice>
</w>

Alternately, you can customize P5 and restore a 'reg' attribute that would let you encode the same phenomena in a manner that programmers -- and in particular programmers with limited skills -- are likely to find more intuitive and economical:

<w xml:id="someid1" lemma="will" reg="will" ana="#vmb">wylle</w>
<w xml:id="someid2" lemma="anon" reg="anon" ana="#av">anone</w>

Either way the linguistic adornment of consistently encoded TEI texts provides users with rich opportunities for combining lexical and morphological features with broader discursive features in their analysis of texts. MorphAdorner can generate either style of output for regular spellings.

Other output formats

MorphAdorner can also generate other types of output from adorned files, including various types of plain text, summary tabular files, and input for the Corpus Workbench, Sketch Engine, and BlackLab search engine.

NUPOS interpGrp

The TEI P5 guidelines suggest including an interpGrp section to define the part of speech tags referenced by ana= attributes in word elements. Here is part of the interpGrp for the NUPOS part of speech tag set used by MorphAdorner.

<interpGrp type="NUPOS">
  <interp xml:id="a-acp">acp word as adverb</interp>
  <interp xml:id="av">adverb</interp>
  <interp xml:id="av-an">noun-adverb as adverb</interp>
  <interp xml:id="av-c">comparative adverb</interp>
  <interp xml:id="av-d">determiner/adverb as adverb</interp>
  <interp xml:id="av-dc">comparative determiner/adverb as adverb</interp>
  <interp xml:id="av-ds">superlative determiner as adverb</interp>
  <interp xml:id="av-dx">negative determiner as adverb</interp>
  <interp xml:id="av-j">adjective as adverb</interp>
    ...
  <interp xml:id="zz">unknown or unparsable token</interp>
</interpGrp>

We pondered how best to include this interpGrp in the adorned output files produced by MorphAdorner. A prolix approach adds the interpGrp to every adorned file in full. A sparer approach uses the xinclude facility to reference the same external copy from each adorned file. The question remained, where to put the full definition or include statement?

We considered placing the definition in the TEI header, or someplace in the body of the text. We finally decided to wait until the forthcoming TEI standoff tag becomes officially available. The standoff tag acts as a container for storing various kinds of standoff markup. This separates the standoff items from the actual text of the document. Hence the currently generated adorned output files do not include the NUPOS interpGrp.

Placement of notes

In the printed source texts, <note> elements never interrupt the reading order, because all the notes are either placed in the margin or at the bottom of the page. In the SGML texts, the <note> elements were encoded inline, because that was the most convenient thing to do.

For the purpose of maintaining continuity with the SGML source, the interim tokenized text needs to preserve the notes in their current position. The final output, however, can employ a variety of stand-off options and keep notes in special divs, whether at the end of each div or in a <back> element of each <text> element.

Consultation with several linguists suggests that there is some consensus about keeping notes out of the flow of the text and that, as Bryan Jurish put it in an email, there is much to be said for "the underlying intuition that a 'stupid' extraction of the raw text from an XML document (i.e. the concatenation of all text nodes in document order) ought to return a linguistically plausible serial representation of the data."

MorphAdorner internally moves the content of <note> elements and other jump tags out of the way during adornment. This allows for proper part of speech tagging within the main text, without the intrusive jump tag text getting in the way of a proper reading order for words.

MorphAdorner also provides utilities for extracting the plain text of words or sentences in proper reading order. This allows for extracting random sentences, or for generating input to programs such as Mallet to perform topic extraction.
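The difference between a "stupid" extraction and a note-aware one can be shown with a small sketch, using only the Python standard library (an illustration, not MorphAdorner's own utility):

```python
# Sketch: extract running text in document order while skipping jump tags
# such as <note>, so the result preserves a plausible reading order.
import xml.etree.ElementTree as ET

def running_text(elem, skip=("note",)):
    """Collect text in document order, omitting the content of skipped tags."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in skip:
            parts.append(running_text(child, skip))
        parts.append(child.tail or "")  # tail text follows the child element
    return "".join(parts)

p = ET.fromstring("<p>sold in so base<note>marginal note</note> a manner</p>")
# "".join(p.itertext()) pulls the note into the sentence:
#   "sold in so basemarginal note a manner"
# running_text(p) preserves the reading order:
#   "sold in so base a manner"
```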

MorphAdorner also contains a program which can reorganize adorned files so that notes are moved to a <div type="notes"> in the <back> section at the end of the main <div> in which they occur. Original instances of the notes are replaced by a <ptr> element which points to the location of the relocated <note> element. An example of such a <ptr> element is:

<ptr type="note" target="nd1e8415" xml:id="rd1e8415" n="1"/>

The target= attribute gives the xml:id of the transplanted <note>. The xml:id of the <ptr> provides the back link needed to restore the transplanted note to its original position.

Searching the corpora

Given a richly adorned corpus, the ability to search both the text and the adornments is an important basis for any research using that corpus. MorphAdorner itself does not provide such a search facility. Instead, the assumption is that the adorned texts, or a suitable transformation of them, will comprise the input to one or more corpus search engines.

As noted in previous sections, MorphAdorner can transform the base adorned files to the input format required by some corpus search engines such as the Corpus Workbench and the Sketch Engine. The PhiloLogic v4.0 search engine can index and search MorphAdorned XML files directly. Adorned files can also be used as input to generic XQuery search engines such as BaseX and eXist. In addition, during the course of this project, we discovered the availability of a new corpus search engine called BlackLab, under development by the Institute of Dutch Lexicology (INL). An experimental search site built using BlackLab hosts adorned versions of Shakespeare's dramas and the TCP ECCO texts at http://devadorner.northwestern.edu/corpussearch/ .

BlackLab is a corpus retrieval engine built on top of the popular open-source search library Apache Lucene. According to its authors, BlackLab "offers fast, complex searches with accurate hit highlighting on large, tagged and annotated, bodies of text." BlackLab extends Lucene with the ability to use a variant of the Corpus Workbench search syntax to search adorned corpora for attributes such as lemma, part of speech, main versus paratext, and most any other token-level attribute one can imagine.

We heavily modified and enhanced a basic TEI corpus indexer provided by the BlackLab development group. We used this to create searchable versions of the MorphAdorner generated adorned files for all of the project corpora. The speed of both the indexing process and the searches is impressive. The BlackLab searches have allowed us to locate and correct a variety of adornment problems that would otherwise have been more difficult to find.

MorphAdorner Server

The MorphAdorner Server wraps adornment processes as web services using REST-like interfaces. The web services can be accessed from any programming language or system that knows how to send and receive HTTP requests, or even from a plain web browser. The services are automatically parallelized because of the way HTTP servers work: many clients can access the same web service simultaneously. MorphAdorner Server uses the Restlet library to implement the web services.

Some of the TEI-based services currently provided by the MorphAdorner Server include:

  • Convert an adorned TEI XML to tabular format.
  • Adorn a TEI XML file.
  • Apply a change log to an adorned TEI XML file.
  • Compare two versions of an adorned TEI XML file and generate an XML format change log.
  • Extract text from a TEI XML file.
  • Extract sentences from an adorned TEI XML file.
  • Move notes to a special div in a TEI XML file.
  • Tokenize a TEI XML file.
  • Unadorn a TEI XML file.
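A hypothetical client sketch follows. The host, port, and service name below are placeholders, not documented MorphAdorner Server endpoints; consult the server documentation for the actual service URLs and parameters.

```python
# Sketch: build an HTTP POST request for a web service, using only the
# standard library. The base URL and service name are hypothetical.
from urllib.parse import urlencode
from urllib.request import Request

def build_request(base_url, service, params):
    """Build an HTTP POST request for a named web service."""
    data = urlencode(params).encode("utf-8")
    return Request(f"{base_url}/{service}", data=data, method="POST")

req = build_request(
    "http://localhost:8080/morphadorner",  # hypothetical host and path
    "tokenize",                            # hypothetical service name
    {"text": "louyth"},
)
# urllib.request.urlopen(req) would send the request to a running server
```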

Future directions

We hope the initial work we've done on the TCP texts will continue in subsequent projects. Aside from improving the tokenization and morphological adornment, we expect to merge metadata from the electronic version of the English Short Title Catalog into the metadata sections of the TEI XML files, perhaps into the <keywords> sections of the TEI header. This will improve the ability of researchers to search for and create corpus subsets of interest, as well as allow comparison with other corpora.

We also hope to be able to identify named entities of various types, including personal names and places. This is more difficult in literary texts than in other discursive writing because many of the names are entirely fictional. Many TCP texts contain Biblical references and references to classical authors. It would be useful to mark these using the xml:id of the tokens comprising the entities, perhaps saving them in stand-off form in auxiliary documents.

A longer term goal is the compilation of a comprehensive lexicon of spellings and variants with date information from the TCP texts. The lexicon would include frequencies of occurrence across centuries, broken down by genre and part of speech, as well as lemmata by parts of speech. Such a lexicon would allow morphological adornment processes to use a standardized lexicon ID for each word in a text.
