Northwestern University Information Technology
The Text Creation Partnership (TCP) transcriptions do not record line breaks in the printed originals. They do, however record "soft" hyphens where a word straddles two lines. The pipe character or vertical bar is used to mark such line breaks as in "wind|ing".
Word breaks at line endings are not always marked with a hyphen in the printed originals. Transcribers were asked to supply missing soft hyphens with a '+' sign. Sometimes they did, sometimes they didn't. Unmarked word breaks, especially in marginal notes, are a very common feature of the TCP texts.
The soft hyphens of the SGML transcriptions of the printed texts are treated according to the following protocol after conversion to TEI XML format.
This replacement algorithm is implemented by a sequence of utilities after all the XML files are tokenized. This is necessary to get the complete list of tokens for determining how often a word appears with or without a real hyphen in the corpus. These utilities are applied only for TCP texts and are not particularly useful in general.
|Announcements and News
|Announcements and news about changes to MorphAdorner
|Documentation for using MorphAdorner
|Downloading and installing the MorphAdorner client and server software
|Glossary of MorphAdorner terms
|Natural language processing references
|Licenses for MorphAdorner and Associated Software
|Online examples of MorphAdorner Server facilities.
|Slides from talks about MorphAdorner.
|Technical information for programmers using MorphAdorner
Academic Technologies and Research Services,
NU Library 2East, 1970 Campus Drive Evanston, IL 60208. |