The tcp package contains utilities for processing Text Creation Partnership (TCP) texts:
- AddUnclear adds a type="unclear" attribute to word tokens
that contain a character gap marker. A character gap is indicated by the presence
of the Unicode character U+25CF (black circle) in a token.
- CountDividedWords counts word tokens which contain
soft line break divider characters.
- FindSoftHyphens
- ExtractSoftHyphens
- FixWordBreaks
- RemoveCruft removes "cruft" such as the long "s", superscript markers,
and other TCP-specific markup.
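
The gap marker test described for AddUnclear, and the kind of counting described for CountDividedWords, can be sketched as follows. This is a minimal illustration and not the package's actual code: the class TcpTokenSketch, its method names, the <w> serialization, and the use of '|' as the soft line break divider are all assumptions made for the example. Only the U+25CF gap marker and the type="unclear" attribute come from the descriptions above.

```java
import java.util.List;

// Hypothetical sketch; none of these names are taken from the tcp package itself.
class TcpTokenSketch {
    // U+25CF BLACK CIRCLE, the character gap marker named in the description.
    static final char GAP_MARKER = '\u25CF';

    // Returns true if the token text contains at least one gap marker.
    static boolean hasCharacterGap(String token) {
        return token.indexOf(GAP_MARKER) >= 0;
    }

    // Wraps a token in a <w> element, adding type="unclear" when a gap is present.
    // Assumption: tokens are serialized as <w> elements; XML escaping is omitted.
    static String markUnclear(String token) {
        return hasCharacterGap(token)
            ? "<w type=\"unclear\">" + token + "</w>"
            : "<w>" + token + "</w>";
    }

    // Counts tokens containing the given divider character.
    // Assumption: the divider is passed in, since the description does not name it.
    static long countTokensContaining(List<String> tokens, char divider) {
        return tokens.stream().filter(t -> t.indexOf(divider) >= 0).count();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("lo\u25CFe", "love", "be|fore");
        for (String t : tokens) {
            System.out.println(markUnclear(t));
        }
        // '|' is a placeholder divider chosen for illustration only.
        System.out.println(countTokensContaining(tokens, '|'));
    }
}
```

Passing the divider as a parameter keeps the sketch from guessing which character the real utilities treat as a soft line break.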