public abstract class AbstractPreTokenizer extends IsCloseableObject implements PreTokenizer, UsesLogger
Modifier and Type | Field and Description |
---|---|
protected static java.lang.String | alwaysSeparators: Pattern to match characters that are always separators. |
protected static PatternReplacer | alwaysSeparatorsReplacer: Always-separators replacer pattern. |
protected static java.lang.String | asterisks: Pattern to match one or more asterisks. |
protected static java.lang.String | commaSeparator: Pattern to match a comma used as a separator. |
protected static PatternReplacer | commaSeparatorReplacer: Comma separator replacer pattern. |
protected static java.lang.String | hyphens: Pattern to match two or more hyphens in a row. |
protected Logger | logger: Logger used for output. |
protected static java.lang.String | periods: Pattern to match three or more periods. |
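The field summary says what the `periods`, `asterisks`, and `hyphens` patterns match, but this page does not show the pattern strings or how `pretokenize` applies them. The following is a minimal sketch, assuming plain `java.util.regex` expressions and assuming that each matched run is set off with spaces. The class `RunPatternsSketch`, its constants, and `isolateRuns` are hypothetical names and are not part of this API.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not part of AbstractPreTokenizer.
final class RunPatternsSketch {
    // Assumed regex for "three or more periods".
    static final Pattern PERIODS = Pattern.compile("\\.{3,}");
    // Assumed regex for "one or more asterisks".
    static final Pattern ASTERISKS = Pattern.compile("\\*+");
    // Assumed regex for "two or more hyphens in a row".
    static final Pattern HYPHENS = Pattern.compile("-{2,}");

    // Surround each matched run with spaces, then collapse repeated whitespace.
    static String isolateRuns(String line) {
        String result = PERIODS.matcher(line).replaceAll(" $0 ");
        result = ASTERISKS.matcher(result).replaceAll(" $0 ");
        result = HYPHENS.matcher(result).replaceAll(" $0 ");
        return result.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(isolateRuns("Wait...what--really***"));
    }
}
```

Under these assumptions a single hyphen (as in `well-known`) and isolated periods (as in `U.S.`) are left untouched, because neither reaches the minimum run length.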
Constructor and Description |
---|
AbstractPreTokenizer(): Create a pretokenizer. |
Modifier and Type | Method and Description |
---|---|
Logger | getLogger(): Get the logger. |
java.lang.String | pretokenize(java.lang.String line): Prepare text for tokenization. |
void | setLogger(Logger logger): Set the logger. |
close
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
close
protected static final java.lang.String periods
protected static final java.lang.String asterisks
protected static final java.lang.String hyphens
protected static final java.lang.String commaSeparator
protected Logger logger
protected static final java.lang.String alwaysSeparators
Unicode ● (BLACKCIRCLE) is the dot character that marks character lacunae. This is not a token separator. Neither is Unicode • (SOLIDCIRCLE), which was used in the old EEBO format TCP files to mark character lacunae.
Unicode ‑ (U+2011), the non-breaking hyphen, is not treated as a token separator.
Unicode ° (DEGREES_MARK) is the degrees quote symbol. Unicode ′ (MINUTES_MARK) is the minutes quote symbol. Unicode ″ (SECONDS_MARK) is the seconds quote symbol. These are not token separators.
Unicode ‘ (LSQUOTE) is the left single curly quote. Unicode ’ (RSQUOTE) is the right single curly quote. These may or may not be token separators; the word tokenizer decides.
Unicode “ (LDQUOTE) is the left double curly quote. Unicode ” (RDQUOTE) is the right double curly quote. These are token separators.
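The rules above can be illustrated with a small sketch. It covers only the double curly quotes, which the description states are separators, and it assumes a separator is handled by surrounding it with spaces. The real `alwaysSeparators` pattern is not shown on this page and very likely contains more characters. `SeparatorSketch`, `DOUBLE_QUOTES_SKETCH`, and `spaceOutSeparators` are hypothetical names.

```java
// Hypothetical stand-in for illustration; not the library's alwaysSeparators pattern.
final class SeparatorSketch {
    // Left (U+201C) and right (U+201D) double curly quotes, written as regex escapes.
    static final String DOUBLE_QUOTES_SKETCH = "[\\u201C\\u201D]";

    // Put spaces around each double curly quote, then collapse repeated whitespace.
    // Single curly quotes, the bullet (U+2022), and the non-breaking hyphen (U+2011)
    // are left alone, matching the description above.
    static String spaceOutSeparators(String line) {
        return line.replaceAll(DOUBLE_QUOTES_SKETCH, " $0 ")
                   .trim()
                   .replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(spaceOutSeparators("He said \u201Cgo\u201D to the co\u2011op"));
    }
}
```

In this sketch the single curly quote in a word such as `don’t` passes through unchanged, which leaves the decision to a later word tokenizer, as the description requires.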
protected static PatternReplacer alwaysSeparatorsReplacer
protected static PatternReplacer commaSeparatorReplacer
public Logger getLogger()
Specified by: getLogger in interface UsesLogger
public void setLogger(Logger logger)
Specified by: setLogger in interface UsesLogger
Parameters: logger - The logger.
public java.lang.String pretokenize(java.lang.String line)
Specified by: pretokenize in interface PreTokenizer
Parameters: line - The text to prepare for tokenization,