NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
Processing Soft Hyphens

The Text Creation Partnership (TCP) transcriptions do not record line breaks in the printed originals. They do, however, record "soft" hyphens where a word straddles two lines. The pipe character (vertical bar) marks such a break, as in "wind|ing".

Word breaks at line endings are not always marked with a hyphen in the printed originals. Transcribers were asked to supply missing soft hyphens with a '+' sign, but they did not do so consistently. Unmarked word breaks, especially in marginal notes, are a very common feature of the TCP texts.

After the SGML transcriptions of the printed texts are converted to TEI XML format, their soft hyphens are treated according to the following protocol.

  1. If a spelling with a soft hyphen occurs elsewhere in the work or corpus as an unhyphenated spelling, the soft hyphen is removed.
  2. If a spelling with a soft hyphen occurs elsewhere with a hyphen, the soft hyphen is replaced with a true hyphen.
  3. If a spelling with a soft hyphen does not occur elsewhere in either a hyphenated or an unhyphenated form, and both word parts can serve as independent words, the soft hyphen is replaced with a true hyphen.
  4. If a spelling with a soft hyphen does not occur elsewhere in either a hyphenated or an unhyphenated form, and the word parts are not independent words, the soft hyphen is removed.
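
The four rules above can be sketched as a single decision function. This is a minimal illustration in Python, not the MorphAdorner code: the function name `resolve_soft_hyphen`, the set `known_forms` (spellings seen elsewhere in the work or corpus), and the set `lexicon` (strings accepted as independent words) are all hypothetical. The protocol does not say which rule wins when a word occurs elsewhere both with and without a hyphen, so the sketch assumes rule 1 is checked first.

```python
def resolve_soft_hyphen(token, known_forms, lexicon, marker="|"):
    """Resolve the soft hyphen in one token using the four rules.

    known_forms: spellings attested elsewhere (hypothetical input).
    lexicon: strings treated as independent words (hypothetical input).
    """
    if marker not in token:
        return token  # nothing to resolve

    parts = token.split(marker)
    joined = "".join(parts)        # e.g. "winding"
    hyphenated = "-".join(parts)   # e.g. "wind-ing"

    # Rule 1: the unhyphenated spelling occurs elsewhere.
    # Assumption: rule 1 takes precedence if both forms occur.
    if joined in known_forms:
        return joined
    # Rule 2: the hyphenated spelling occurs elsewhere.
    if hyphenated in known_forms:
        return hyphenated
    # Rule 3: unattested, but every part is an independent word.
    if all(part in lexicon for part in parts):
        return hyphenated
    # Rule 4: unattested, and the parts are not independent words.
    return joined


known = {"winding", "to-morrow"}
words = {"sea", "shore", "wind"}
for example in ["wind|ing", "to|morrow", "sea|shore", "trans|gression"]:
    print(example, "->", resolve_soft_hyphen(example, known, words))
```

Passing the attested forms and the word list in as plain sets keeps the decision logic separate from however those lists are actually built.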

This replacement algorithm is implemented by a sequence of utilities that run after all the XML files are tokenized. Tokenizing first is necessary to obtain the complete list of tokens, which is used to determine how often a word appears with or without a real hyphen in the corpus. These utilities are applied only to TCP texts and are not particularly useful in general.

  1. Count words with word breaks using edu.northwestern.at.morphadorner.tools.tcp.CountDividedWords.
  2. Determine which words should have word breaks using edu.northwestern.at.morphadorner.tools.tcp.FindSoftHyphens followed by edu.northwestern.at.morphadorner.tools.tcp.ExtractSoftHyphens.
  3. Substitute real hyphens for soft hyphens in words that should be hyphenated, and remove all other soft hyphens, using edu.northwestern.at.morphadorner.tools.tcp.FixWordBreaks.
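
The count-then-fix workflow can be illustrated with a two-pass sketch in Python. It does not reproduce the Java utilities named above, whose inputs and outputs are not described here. The names `count_forms`, `fix_word_breaks`, and `lexicon` are hypothetical, and so is the policy of choosing the more frequent form (ties go to the unhyphenated spelling) when a word occurs both with and without a real hyphen.

```python
from collections import Counter

MARKER = "|"  # soft hyphen marker used in the TCP transcriptions


def count_forms(tokens):
    """Pass 1: count every token that carries no soft hyphen marker."""
    return Counter(t for t in tokens if MARKER not in t)


def fix_word_breaks(tokens, counts, lexicon):
    """Pass 2: replace or remove each soft hyphen using the counts."""
    fixed = []
    for token in tokens:
        if MARKER not in token:
            fixed.append(token)
            continue
        parts = token.split(MARKER)
        joined, hyphenated = "".join(parts), "-".join(parts)
        n_joined, n_hyphenated = counts[joined], counts[hyphenated]
        if n_joined or n_hyphenated:
            # Assumption: prefer the more frequent attested form.
            fixed.append(hyphenated if n_hyphenated > n_joined else joined)
        elif all(part in lexicon for part in parts):
            fixed.append(hyphenated)  # both parts are independent words
        else:
            fixed.append(joined)      # parts are not independent words
    return fixed


corpus = ["the", "wind|ing", "road", "winding", "to-morrow",
          "to|morrow", "sea|shore", "ab|surd"]
counts = count_forms(corpus)
print(fix_word_breaks(corpus, counts, lexicon={"sea", "shore"}))
```

The counts must be complete before any token is rewritten, which is why the corpus is read once in full before the second pass begins.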