public class EEBOWordTokenizer extends DefaultWordTokenizer implements WordTokenizer
Do not use this when EEBO texts have been converted to TEIAnalytics format.
| Modifier and Type | Field and Description |
|---|---|
| protected static java.util.regex.Matcher | numberDotSpellingMatcher |
| protected static java.util.regex.Pattern | numberDotSpellingPattern: Pattern to match number.word |
| protected static java.util.regex.Matcher | underlineCapCapMatcher |
| protected static java.util.regex.Pattern | underlineCapCapPattern: Pattern to match _CapCap |
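
The page names the two patterns but does not show their regular expressions. The following is a minimal sketch of what such a pattern/matcher pair could look like, assuming "number.word" means digits, a period, then letters, and "_CapCap" means an underscore followed by two capital letters. The class name `EeboPatternSketch`, the constant names, and both regexes are assumptions for illustration, not the library's actual definitions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EeboPatternSketch {
    // Assumed regex for "number.word": digits, a period, then letters.
    static final Pattern NUMBER_DOT_WORD =
        Pattern.compile("^(\\d+)\\.(\\p{L}+)$");

    // Assumed regex for "_CapCap": underscore followed by two capitals.
    static final Pattern UNDERLINE_CAP_CAP =
        Pattern.compile("^_(\\p{Lu})(\\p{Lu})");

    // Matchers are created once and re-pointed at new input with reset().
    // A shared Matcher is not safe for concurrent use by multiple threads.
    static final Matcher NUMBER_DOT_WORD_MATCHER =
        NUMBER_DOT_WORD.matcher("");
    static final Matcher UNDERLINE_CAP_CAP_MATCHER =
        UNDERLINE_CAP_CAP.matcher("");

    // True when the whole token has the assumed number.word shape.
    static boolean isNumberDotWord(String token) {
        return NUMBER_DOT_WORD_MATCHER.reset(token).matches();
    }

    // True when the token begins with the assumed _CapCap shape.
    static boolean startsWithUnderlineCapCap(String token) {
        return UNDERLINE_CAP_CAP_MATCHER.reset(token).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(isNumberDotWord("12.chapter"));        // true
        System.out.println(isNumberDotWord("chapter"));           // false
        System.out.println(startsWithUnderlineCapCap("_THe"));    // true
        System.out.println(startsWithUnderlineCapCap("_The"));    // false
    }
}
```

Keeping a static `Matcher` next to its `Pattern` avoids allocating a new matcher per token, at the cost of making the helper methods unsuitable for concurrent callers.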
Inherited fields: abbreviations, aposTokens, apostropheCanBeQuote, coalesceAsterisks, coalesceHyphens, contractions, contractionsURL, hyphensMatcher, hyphensPattern, logger, preTokenizer

| Constructor and Description |
|---|
| EEBOWordTokenizer(): Create EEBO word tokenizer. |
| Modifier and Type | Method and Description |
|---|---|
| java.lang.String | preprocessToken(java.lang.String token, java.util.List<java.lang.String> tokenList): Preprocess a word token. |
Inherited methods, one group per line:
addWordToSentence, extractWords
findWordOffsets, getLogger, getPreTokenizer, isClosingQuote, isLetterOrSingleQuote, isMultipleHyphens, isSingleOpeningQuote, loadContractions, setAbbreviations, setAposTokens, setLogger, setPreTokenizer, splitToken
close
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
addWordToSentence, close, extractWords, findWordOffsets, getPreTokenizer, setAbbreviations, setAposTokens, setPreTokenizer
close

protected static final java.util.regex.Pattern numberDotSpellingPattern
protected static final java.util.regex.Matcher numberDotSpellingMatcher
protected static java.util.regex.Pattern underlineCapCapPattern
protected static final java.util.regex.Matcher underlineCapCapMatcher

public java.lang.String preprocessToken(java.lang.String token,
java.util.List<java.lang.String> tokenList)
Specified by: preprocessToken in interface WordTokenizer
Overrides: preprocessToken in class AbstractWordTokenizer
Parameters:
token - Token to preprocess.
tokenList - List of previous tokens already issued.
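
The signature above takes the current token plus the list of tokens already issued and returns a string. The page does not say what preprocessing is performed, so the sketch below only illustrates that calling shape with a stand-in method. The class `PreprocessTokenSketch`, the method `preprocessTokenSketch`, the regex, and the splitting rule (emit the "number." prefix to the list, return the remaining word) are all hypothetical and should not be read as the behavior of `EEBOWordTokenizer`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreprocessTokenSketch {
    // Assumed shape of a "number.word" token: digits and a period, then letters.
    private static final Pattern NUMBER_DOT_WORD =
        Pattern.compile("^(\\d+\\.)(\\p{L}+)$");

    // Hypothetical stand-in with the same parameter and return types as
    // preprocessToken: it may append to tokenList and returns the token
    // text that is left for further processing.
    static String preprocessTokenSketch(String token, List<String> tokenList) {
        Matcher m = NUMBER_DOT_WORD.matcher(token);
        if (m.matches()) {
            tokenList.add(m.group(1)); // issue the "number." part
            return m.group(2);         // hand back the word part
        }
        return token;                  // anything else passes through unchanged
    }

    public static void main(String[] args) {
        List<String> issued = new ArrayList<>();
        String rest = preprocessTokenSketch("12.chapter", issued);
        System.out.println(issued + " " + rest); // prints "[12.] chapter"
    }
}
```

Passing the list of issued tokens lets a preprocessing step emit leading pieces itself while returning only the remainder, which is one plausible reason for the two-argument form.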