public class EEBOWordTokenizer extends DefaultWordTokenizer implements WordTokenizer
Do not use this when EEBO texts have been converted to TEIAnalytics format.
Modifier and Type | Field and Description |
---|---|
protected static java.util.regex.Matcher | numberDotSpellingMatcher |
protected static java.util.regex.Pattern | numberDotSpellingPattern: Pattern to match number.word |
protected static java.util.regex.Matcher | underlineCapCapMatcher |
protected static java.util.regex.Pattern | underlineCapCapPattern: Pattern to match _CapCap |
abbreviations, aposTokens, apostropheCanBeQuote, coalesceAsterisks, coalesceHyphens, contractions, contractionsURL, hyphensMatcher, hyphensPattern, logger, preTokenizer
Constructor and Description |
---|
EEBOWordTokenizer(): Create EEBO word tokenizer. |
Modifier and Type | Method and Description |
---|---|
java.lang.String | preprocessToken(java.lang.String token, java.util.List<java.lang.String> tokenList): Preprocess a word token. |
addWordToSentence, extractWords
findWordOffsets, getLogger, getPreTokenizer, isClosingQuote, isLetterOrSingleQuote, isMultipleHyphens, isSingleOpeningQuote, loadContractions, setAbbreviations, setAposTokens, setLogger, setPreTokenizer, splitToken
close
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
addWordToSentence, close, extractWords, findWordOffsets, getPreTokenizer, setAbbreviations, setAposTokens, setPreTokenizer
close
protected static final java.util.regex.Pattern numberDotSpellingPattern
protected static final java.util.regex.Matcher numberDotSpellingMatcher
protected static java.util.regex.Pattern underlineCapCapPattern
protected static final java.util.regex.Matcher underlineCapCapMatcher
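The page describes the two patterns only as matching "number.word" and "_CapCap" and does not show the regular expressions themselves. The following is a minimal sketch of what such patterns might look like, using only `java.util.regex`. The class name `EeboPatternSketch`, both regular expressions, and the helper method names are assumptions made for illustration; they are not taken from the EEBOWordTokenizer source.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: the expressions below are guesses at the shapes
// "number.word" and "_CapCap", not the patterns used by EEBOWordTokenizer.
class EeboPatternSketch {
    // Assumed shape: one or more digits, a period, then one or more letters,
    // for example "3.times".
    static final Pattern NUMBER_DOT_WORD =
        Pattern.compile("^(\\d+)\\.(\\p{L}+)$");

    // Assumed shape: an underscore followed by two uppercase letters at the
    // start of the token, for example "_THe".
    static final Pattern UNDERLINE_CAP_CAP =
        Pattern.compile("^_(\\p{Lu})(\\p{Lu})");

    // True when the whole token has the assumed number.word shape.
    static boolean isNumberDotWord(String token) {
        return NUMBER_DOT_WORD.matcher(token).matches();
    }

    // True when the token begins with the assumed _CapCap shape.
    static boolean startsWithUnderlineCapCap(String token) {
        return UNDERLINE_CAP_CAP.matcher(token).find();
    }

    public static void main(String[] args) {
        for (String token : new String[] {"3.times", "3.14", "_THe", "_The"}) {
            System.out.println(token
                + " numberDotWord=" + isNumberDotWord(token)
                + " underlineCapCap=" + startsWithUnderlineCapCap(token));
        }
    }
}
```

The sketch creates a new `Matcher` on each call instead of keeping a shared static one. `Pattern` instances are immutable and safe to share, whereas `Matcher` instances are not safe for use by multiple concurrent threads, so a per-call matcher is the more conservative choice if the helpers might be called from several threads.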
public java.lang.String preprocessToken(java.lang.String token, java.util.List<java.lang.String> tokenList)
Specified by: preprocessToken in interface WordTokenizer
Overrides: preprocessToken in class AbstractWordTokenizer
Parameters:
token - Token to preprocess.
tokenList - List of previous tokens already issued.