public interface WordTokenizer
Modifier and Type | Method and Description |
---|---|
void | addWordToSentence(java.util.List<java.lang.String> sentence, java.lang.String word): Add word to list of words in sentence. |
void | close(): Close down the word tokenizer. |
java.util.List<java.lang.String> | extractWords(java.lang.String text): Break text into word tokens. |
int[] | findWordOffsets(java.lang.String sentenceText, java.util.List<?> words): Find starting offsets of words in a sentence. |
PreTokenizer | getPreTokenizer(): Get the preTokenizer. |
java.lang.String | preprocessToken(java.lang.String token, java.util.List<java.lang.String> tokenList): Preprocess a word token. |
void | setAbbreviations(Abbreviations abbreviations): Set abbreviations. |
void | setAposTokens(AposTokens aposTokens): Set apostrophe tokens. |
void | setPreTokenizer(PreTokenizer preTokenizer): Set the preTokenizer. |

PreTokenizer getPreTokenizer()

void setPreTokenizer(PreTokenizer preTokenizer)
preTokenizer - The preTokenizer.

void setAbbreviations(Abbreviations abbreviations)
abbreviations - Abbreviations.

void setAposTokens(AposTokens aposTokens)
aposTokens - Apostrophe tokens.

void addWordToSentence(java.util.List<java.lang.String> sentence, java.lang.String word)
sentence - Result sentence.
word - Word to add.

java.util.List<java.lang.String> extractWords(java.lang.String text)
text - Text to break into word tokens.
Word tokens may be words, numbers, punctuation, etc.
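
To make the extractWords contract concrete, here is a minimal sketch of one way text could be broken into word, number, and punctuation tokens. The class name SimpleWordExtractor and the regular expression are assumptions made for illustration. They are not the behaviour of any actual WordTokenizer implementation, which may also consult the configured PreTokenizer, Abbreviations, and AposTokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the documented API.
final class SimpleWordExtractor {
    // Assumed token shapes: letters with optional internal apostrophes,
    // a run of digits, or any single non-whitespace character (punctuation).
    private static final Pattern TOKEN =
        Pattern.compile("\\p{L}+(?:'\\p{L}+)*|\\p{N}+|\\S");

    // Returns the tokens of text in the order they appear.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Tokens: It's | 42 | , | isn't | it | ?
        System.out.println(extractWords("It's 42, isn't it?"));
    }
}
```

A real implementation would likely treat abbreviations such as "Dr." and multi-character punctuation differently; this sketch only shows the general shape of the result.
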
int[] findWordOffsets(java.lang.String sentenceText, java.util.List<?> words)
sentenceText - Text from which tokens were extracted.
words - List of words extracted from sentence text.
N.B. If the words aren't from the specified sentence text, the resulting offsets will be meaningless.

java.lang.String preprocessToken(java.lang.String token, java.util.List<java.lang.String> tokenList)
token - Token to preprocess.
tokenList - List of previous tokens already issued.

void close()
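
The findWordOffsets contract above (starting offsets of words within the sentence text they were extracted from) can be sketched as a left-to-right scan. The class name WordOffsetSketch is hypothetical, and reporting -1 for a word that cannot be found is an assumption of this sketch only; the interface merely states that offsets are meaningless when the words do not come from the given text.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the documented API.
final class WordOffsetSketch {
    // Finds each word at or after the end of the previous match, so that
    // repeated words map to successive positions in sentenceText.
    static int[] findWordOffsets(String sentenceText, List<?> words) {
        int[] offsets = new int[words.size()];
        int from = 0;
        for (int i = 0; i < words.size(); i++) {
            String word = String.valueOf(words.get(i));
            int at = sentenceText.indexOf(word, from);
            offsets[i] = at; // -1 when not found (sketch-only convention)
            if (at >= 0) {
                from = at + word.length();
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String sentence = "The cat sat on the mat.";
        List<String> words =
            Arrays.asList("The", "cat", "sat", "on", "the", "mat", ".");
        // prints [0, 4, 8, 12, 15, 19, 22]
        System.out.println(Arrays.toString(findWordOffsets(sentence, words)));
    }
}
```

Scanning forward from the previous match keeps the offsets monotonic, which is why the word list must come from the same sentence text in the same order.
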