public class OrigFixer
extends java.lang.Object
<orig> tags are used in Wright archive documents to mark words split across a page break. A typical page break appears as follows:
<orig reg="larboard" TEIform="orig">lar-</orig> <pb TEIform="pb"/> board hand,
The "reg=" attribute of the <orig> tag provides the original unsplit spelling of the word that is split across the page boundary. The whitespace between the </orig> and the <pb>, as well as the whitespace following the <pb>, causes MorphAdorner to process the split word incorrectly as multiple words instead of a single word.
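OrigFixer modifies the XML for <orig> tags as follows: it removes the whitespace between the </orig> and the <pb>, removes the whitespace following the <pb>, and replaces the trailing hyphen in the <orig> text with a special substitute hyphen character.
For example, OrigFixer modifies the sample text above to read as follows, where "?" stands for the substitute hyphen character: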
<orig reg="larboard" TEIform="orig">lar?</orig><pb TEIform="pb"/>board hand,
These modifications allow MorphAdorner to process the split word correctly. The MorphAdorner tokenizers recognize the special substitute hyphen character, which is restored to a plain hyphen character by the XML output writers.
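The transformation described above can be illustrated with a small string-level sketch. This is not the MorphAdorner implementation, which works on a JDOM document through OrigFixer.OrigProcessor; the class name OrigFixSketch, the method joinSplitWord, and the choice of U+2010 as the substitute hyphen are all assumptions made for illustration only. The actual substitute character is not stated here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; hypothetical names, not part of MorphAdorner. */
final class OrigFixSketch {
    // Matches: <orig ...>text-</orig>, optional whitespace,
    // <pb .../>, optional whitespace.
    private static final Pattern SPLIT_WORD = Pattern.compile(
        "(<orig\\b[^>]*>[^<]*?)-(</orig>)\\s*(<pb\\b[^>]*/>)\\s*");

    /**
     * Replaces the trailing hyphen inside <orig> with the given substitute
     * character and drops the whitespace around the following <pb/>.
     */
    static String joinSplitWord(String xml, char substituteHyphen) {
        String replacement =
            "$1" + Matcher.quoteReplacement(String.valueOf(substituteHyphen)) + "$2$3";
        return SPLIT_WORD.matcher(xml).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String sample =
            "<orig reg=\"larboard\" TEIform=\"orig\">lar-</orig> "
            + "<pb TEIform=\"pb\"/> board hand,";
        // '\u2010' is a placeholder for the substitute hyphen (assumption).
        System.out.println(joinSplitWord(sample, '\u2010'));
    }
}
```

A regular expression is enough to show the before and after shapes of the text, but it would be fragile on real documents (nested markup, attributes containing ">"), which is presumably why the real class operates on parsed JDOM elements instead.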
Modifier and Type | Class and Description |
---|---|
static class | OrigFixer.OrigProcessor: JDOM element processor which fixes <orig> tags. |
Modifier | Constructor and Description |
---|---|
protected | OrigFixer(): Allow overrides but no instantiation. |
Modifier and Type | Method and Description |
---|---|
static void | fixOrigs(org.jdom2.Document document): Fix <orig> tags. |