public abstract class AbstractWordTokenizer extends IsCloseableObject implements WordTokenizer, IsCloseable, UsesLogger
Modifier and Type | Field and Description |
---|---|
protected Abbreviations | abbreviations - Abbreviations. |
protected AposTokens | aposTokens - Apostrophe tokens. |
protected boolean | apostropheCanBeQuote - True if apostrophes can be single quotes. |
protected boolean | coalesceAsterisks - True to coalesce adjacent asterisks. |
protected boolean | coalesceHyphens - True to coalesce adjacent hyphens. |
protected TaggedStrings | contractions - List of words starting with & or ' which should not be split. |
protected java.lang.String | contractionsURL - URL for the list of words starting with & or '. |
protected static java.util.regex.Matcher | hyphensMatcher |
protected static java.util.regex.Pattern | hyphensPattern - Pattern for 2 or more hyphens. |
protected Logger | logger - Logger used for output. |
protected PreTokenizer | preTokenizer - The preTokenizer used here. |
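The hyphensPattern field and the coalesceHyphens flag are only described in one line each, so the following is a minimal, standalone sketch of how a "2 or more hyphens" check could work. The regular expression, the class name HyphenRunSketch, and the policy of collapsing a run to "--" are assumptions made for illustration; they are not taken from this class's source.

```java
import java.util.regex.Pattern;

public class HyphenRunSketch {
    // Assumed pattern: a run of two or more ASCII hyphens.
    private static final Pattern HYPHENS = Pattern.compile("-{2,}");

    /** True if s consists only of two or more hyphens. */
    static boolean isMultipleHyphens(String s) {
        // matches() requires the whole string to match; a fresh Matcher
        // is created per call, so no mutable matcher state is shared.
        return s != null && HYPHENS.matcher(s).matches();
    }

    /** Hypothetical coalescing policy: collapse each run of hyphens to "--". */
    static String coalesceHyphens(String s) {
        return HYPHENS.matcher(s).replaceAll("--");
    }

    public static void main(String[] args) {
        System.out.println(isMultipleHyphens("--"));      // true
        System.out.println(isMultipleHyphens("-"));       // false
        System.out.println(isMultipleHyphens("a--b"));    // false
        System.out.println(coalesceHyphens("wait----what")); // wait--what
    }
}
```

The sketch creates a Matcher per call because java.util.regex.Matcher instances are not safe for concurrent use, which may matter when a matcher is held in a static field.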
Constructor and Description |
---|
AbstractWordTokenizer() - Create a word tokenizer. |
Modifier and Type | Method and Description |
---|---|
void | addWordToSentence(java.util.List<java.lang.String> sentence, java.lang.String word) - Add word to list of words in sentence. |
abstract java.util.List<java.lang.String> | extractWords(java.lang.String text) - Break text into word tokens. |
int[] | findWordOffsets(java.lang.String sentenceText, java.util.List<?> words) - Find starting offsets of words in a sentence. |
Logger | getLogger() - Get the logger. |
PreTokenizer | getPreTokenizer() - Get the preTokenizer. |
protected boolean | isClosingQuote(char ch) - Is character a closing quote? |
protected boolean | isLetterOrSingleQuote(char ch) - Is character a letter or a single quote? |
boolean | isMultipleHyphens(java.lang.String s) - True if string contains only 2 or more hyphens. |
boolean | isSingleOpeningQuote(char ch) - True if character is a single opening quote. |
protected void | loadContractions() - Load list of non-breakable words and contractions. |
java.lang.String | preprocessToken(java.lang.String token, java.util.List<java.lang.String> tokenList) - Preprocess a word token. |
void | setAbbreviations(Abbreviations abbreviations) - Set abbreviations. |
void | setAposTokens(AposTokens aposTokens) - Set apostrophe tokens. |
void | setLogger(Logger logger) - Set the logger. |
void | setPreTokenizer(PreTokenizer preTokenizer) - Set the preTokenizer. |
protected java.lang.String[] | splitToken(java.lang.String token) - Split a token if necessary. |
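The summary of findWordOffsets says only that it finds the starting offsets of words in a sentence. One plausible way to do that is a left-to-right scan that looks for each word at or after the end of the previous one. The sketch below illustrates that idea in a standalone class; the class name WordOffsetsSketch and the use of -1 for a word that cannot be found are assumptions, not documented behavior of this class.

```java
import java.util.Arrays;
import java.util.List;

public class WordOffsetsSketch {
    /**
     * Returns the starting offset of each word within sentenceText,
     * searching left to right so repeated words map to successive positions.
     * A word that is not found gets -1 (an assumption of this sketch).
     */
    static int[] findWordOffsets(String sentenceText, List<?> words) {
        int[] offsets = new int[words.size()];
        int from = 0; // search resumes after the previously matched word
        for (int i = 0; i < words.size(); i++) {
            String word = String.valueOf(words.get(i));
            int at = sentenceText.indexOf(word, from);
            offsets[i] = at;
            if (at >= 0) {
                from = at + word.length();
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "Well, it works.";
        List<String> words = Arrays.asList("Well", ",", "it", "works", ".");
        System.out.println(Arrays.toString(findWordOffsets(text, words)));
        // prints [0, 4, 6, 9, 14]
    }
}
```

The scan also shows why the words must come from the given sentence text: a word that does not occur there has no valid offset, so whatever value is reported for it carries no meaning.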
Inherited methods: close
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
protected PreTokenizer preTokenizer
protected TaggedStrings contractions
protected java.lang.String contractionsURL
protected Logger logger
protected Abbreviations abbreviations
protected AposTokens aposTokens
protected boolean coalesceHyphens
protected boolean coalesceAsterisks
protected boolean apostropheCanBeQuote
protected static final java.util.regex.Pattern hyphensPattern
protected static final java.util.regex.Matcher hyphensMatcher
public Logger getLogger()
Specified by: getLogger in interface UsesLogger

public void setLogger(Logger logger)
Specified by: setLogger in interface UsesLogger
Parameters:
logger - The logger.

public void setAbbreviations(Abbreviations abbreviations)
Specified by: setAbbreviations in interface WordTokenizer
Parameters:
abbreviations - Abbreviations.

public void setAposTokens(AposTokens aposTokens)
Specified by: setAposTokens in interface WordTokenizer
Parameters:
aposTokens - Apostrophe tokens.

public PreTokenizer getPreTokenizer()
Specified by: getPreTokenizer in interface WordTokenizer

public void setPreTokenizer(PreTokenizer preTokenizer)
Specified by: setPreTokenizer in interface WordTokenizer
Parameters:
preTokenizer - The preTokenizer.

protected void loadContractions()

public java.lang.String preprocessToken(java.lang.String token, java.util.List<java.lang.String> tokenList)
Specified by: preprocessToken in interface WordTokenizer
Parameters:
token - Token to preprocess.
tokenList - List of previous tokens already issued.

public boolean isSingleOpeningQuote(char ch)
Parameters:
ch - Character to check for being a single opening quote.

protected boolean isLetterOrSingleQuote(char ch)
Parameters:
ch - Character.

protected boolean isClosingQuote(char ch)
Parameters:
ch - Character.

protected java.lang.String[] splitToken(java.lang.String token)
Parameters:
token - Token to split.

public void addWordToSentence(java.util.List<java.lang.String> sentence, java.lang.String word)
Specified by: addWordToSentence in interface WordTokenizer
Parameters:
sentence - Result sentence.
word - Word to add.

public int[] findWordOffsets(java.lang.String sentenceText, java.util.List<?> words)
Specified by: findWordOffsets in interface WordTokenizer
Parameters:
sentenceText - Text from which tokens were extracted.
words - List of words extracted from sentence text. N.B. If the words aren't from the specified sentence text, the resulting offsets will be meaningless.

public boolean isMultipleHyphens(java.lang.String s)
Parameters:
s - String to check for hyphens.

public abstract java.util.List<java.lang.String> extractWords(java.lang.String text)
Specified by: extractWords in interface WordTokenizer
Parameters:
text - Text to break into word tokens.
Word tokens may be words, numbers, punctuation, etc.
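
Because extractWords is abstract, each concrete tokenizer supplies its own splitting rules. As a rough illustration of what "words, numbers, punctuation" can mean as token types, here is a standalone regex-based sketch. It does not extend AbstractWordTokenizer; the class name SimpleWordSplitSketch and the regular expression are hypothetical, and it ignores abbreviations, contractions lists, and quote handling entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleWordSplitSketch {
    // Assumed token shapes, tried in order:
    //   1. letters, optionally joined by internal apostrophes (isn't)
    //   2. digits, optionally with . or , between digit groups (3.5)
    //   3. any other single non-whitespace character (punctuation)
    private static final Pattern TOKEN =
        Pattern.compile("\\p{L}+(?:'\\p{L}+)*|\\d+(?:[.,]\\d+)*|\\S");

    /** Break text into word tokens, skipping whitespace. */
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extractWords("Dr. Who isn't 42!"));
        // prints [Dr, ., Who, isn't, 42, !]
    }
}
```

Note that this sketch splits "Dr." into two tokens, which is the kind of case the abbreviations and contractions fields of the real class appear intended to handle.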