public class ICU4JBreakIteratorWordTokenizer extends AbstractWordTokenizer implements WordTokenizer, CanTokenizeWhitespace, CanSplitAroundPeriods
Modifier and Type | Field and Description |
---|---|
protected java.util.Locale | locale - Locale to use for tokenization. |
protected boolean | mergeWhitespaceTokens - Whether to merge whitespace tokens. |
protected boolean | splitAroundPeriods - Whether to check for potential splitting of tokens around periods. |
protected boolean | storeWhitespaceTokens - Whether to store whitespace tokens. |
protected java.lang.String | wordBreakRulesFileName - Name of the word break rules template file. |
protected com.ibm.icu.text.BreakIterator | wordIterator - The word-based break iterator. |
Fields inherited from class AbstractWordTokenizer: abbreviations, aposTokens, apostropheCanBeQuote, coalesceAsterisks, coalesceHyphens, contractions, contractionsURL, hyphensMatcher, hyphensPattern, logger, preTokenizer
Constructor and Description |
---|
ICU4JBreakIteratorWordTokenizer() - Create a word tokenizer that uses the ICU4J word break iterator. |
ICU4JBreakIteratorWordTokenizer(java.util.Locale locale) - Create a word tokenizer that uses the ICU4J word break iterator for the given locale. |
Modifier and Type | Method and Description |
---|---|
protected void | createWordIterator() - Create the word-based break iterator. |
java.util.List<java.lang.String> | extractWords(java.lang.String text) - Break text into word tokens. |
boolean | getMergeWhitespaceTokens() - Get whether whitespace tokens are merged. |
boolean | getSplitAroundPeriods() - Get whether tokens are checked for potential splitting around periods. |
boolean | getStoreWhitespaceTokens() - Get whether whitespace tokens are stored. |
void | setMergeWhitespaceTokens(boolean mergeWhitespaceTokens) - Set whether whitespace tokens are merged. |
void | setSplitAroundPeriods(boolean splitAroundPeriods) - Set whether tokens are checked for potential splitting around periods. |
void | setStoreWhitespaceTokens(boolean storeWhitespaceTokens) - Set whether whitespace tokens are stored. |
Methods inherited from class AbstractWordTokenizer: addWordToSentence, findWordOffsets, getLogger, getPreTokenizer, isClosingQuote, isLetterOrSingleQuote, isMultipleHyphens, isSingleOpeningQuote, loadContractions, preprocessToken, setAbbreviations, setAposTokens, setLogger, setPreTokenizer, splitToken
close
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface WordTokenizer: addWordToSentence, close, findWordOffsets, getPreTokenizer, preprocessToken, setAbbreviations, setAposTokens, setPreTokenizer
close
protected java.util.Locale locale
protected boolean storeWhitespaceTokens
protected boolean mergeWhitespaceTokens
protected boolean splitAroundPeriods
protected com.ibm.icu.text.BreakIterator wordIterator
protected java.lang.String wordBreakRulesFileName
public ICU4JBreakIteratorWordTokenizer()
public ICU4JBreakIteratorWordTokenizer(java.util.Locale locale)
Parameters: locale - Locale to use for tokenization.
public boolean getStoreWhitespaceTokens()
Specified by: getStoreWhitespaceTokens in interface CanTokenizeWhitespace
public void setStoreWhitespaceTokens(boolean storeWhitespaceTokens)
Specified by: setStoreWhitespaceTokens in interface CanTokenizeWhitespace
public boolean getMergeWhitespaceTokens()
Specified by: getMergeWhitespaceTokens in interface CanTokenizeWhitespace
public void setMergeWhitespaceTokens(boolean mergeWhitespaceTokens)
Specified by: setMergeWhitespaceTokens in interface CanTokenizeWhitespace
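This page does not spell out how storeWhitespaceTokens and mergeWhitespaceTokens interact. The following is a minimal sketch of one plausible reading, not the class's actual implementation: whitespace tokens are dropped unless storing is enabled, and when merging is also enabled, runs of adjacent whitespace tokens are joined into one. The class WhitespaceTokenSketch and its methods are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of ICU4JBreakIteratorWordTokenizer.
class WhitespaceTokenSketch {
    // True when the token is non-empty and consists only of whitespace.
    static boolean isWhitespace(String token) {
        return !token.isEmpty()
            && token.chars().allMatch(Character::isWhitespace);
    }

    // Assumed semantics: drop whitespace tokens unless storing is on;
    // when merging is also on, join adjacent whitespace tokens.
    static List<String> apply(List<String> rawTokens,
                              boolean storeWhitespaceTokens,
                              boolean mergeWhitespaceTokens) {
        List<String> result = new ArrayList<>();
        for (String token : rawTokens) {
            if (!isWhitespace(token)) {
                result.add(token);
                continue;
            }
            if (!storeWhitespaceTokens) {
                continue; // whitespace is discarded
            }
            int last = result.size() - 1;
            if (mergeWhitespaceTokens && last >= 0
                    && isWhitespace(result.get(last))) {
                // Extend the previous whitespace token.
                result.set(last, result.get(last) + token);
            } else {
                result.add(token);
            }
        }
        return result;
    }
}
```

Under these assumptions, the raw tokens "a", " ", " ", "b" become "a", "b" with storing off, stay unchanged with storing on and merging off, and become "a", "  ", "b" with both on.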
public boolean getSplitAroundPeriods()
Specified by: getSplitAroundPeriods in interface CanSplitAroundPeriods
public void setSplitAroundPeriods(boolean splitAroundPeriods)
Specified by: setSplitAroundPeriods in interface CanSplitAroundPeriods
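Unicode word break rules generally keep a period that sits between two letters inside a single token, so text such as "end.The" (a missing space after a sentence) may come back as one word. That is presumably what the splitAroundPeriods check addresses. The sketch below shows one simple way such a split could work. It is an assumption for illustration, not the method this class uses, and PeriodSplitSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of ICU4JBreakIteratorWordTokenizer.
class PeriodSplitSketch {
    // Split a token at every period that has a letter on both sides,
    // keeping the period as its own token.
    static List<String> splitAroundPeriods(String token) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < token.length() - 1; i++) {
            if (token.charAt(i) == '.'
                    && Character.isLetter(token.charAt(i - 1))
                    && Character.isLetter(token.charAt(i + 1))) {
                parts.add(token.substring(start, i));
                parts.add(".");
                start = i + 1;
            }
        }
        parts.add(token.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitAroundPeriods("end.The"));
    }
}
```

A naive rule like this would also break up abbreviations such as "U.S.A", which is why the split is only "potential". A fuller implementation would likely consult an abbreviation list before splitting, and the inherited abbreviations field suggests one is available.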
protected void createWordIterator()
public java.util.List<java.lang.String> extractWords(java.lang.String text)
Specified by: extractWords in interface WordTokenizer
extractWords in class AbstractWordTokenizer
Parameters: text - Text to break into word tokens.
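To illustrate the general technique behind extractWords, here is a minimal sketch of word extraction with a locale-specific word break iterator. It uses the JDK's java.text.BreakIterator as a stand-in for com.ibm.icu.text.BreakIterator so that it runs without ICU4J on the classpath. The class BreakIteratorWordSketch is hypothetical, and the real method performs further processing that this page does not describe.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative stand-in only: uses the JDK break iterator, not the
// ICU4J iterator held in the wordIterator field.
class BreakIteratorWordSketch {
    // Cut the text at each word boundary and drop segments that
    // consist only of whitespace.
    static List<String> extractWords(String text, Locale locale) {
        BreakIterator wordIterator = BreakIterator.getWordInstance(locale);
        wordIterator.setText(text);
        List<String> words = new ArrayList<>();
        int start = wordIterator.first();
        for (int end = wordIterator.next();
                end != BreakIterator.DONE;
                start = end, end = wordIterator.next()) {
            String segment = text.substring(start, end);
            if (!segment.trim().isEmpty()) {
                words.add(segment);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extractWords("Hello, world!", Locale.ENGLISH));
    }
}
```

With the actual class, the equivalent call would presumably be new ICU4JBreakIteratorWordTokenizer(locale).extractWords(text), using the constructor and method signatures listed above.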