public class PennTreebankTokenizer extends AbstractWordTokenizer implements WordTokenizer
Based upon the sed script written by Robert McIntyre at http://www.cis.upenn.edu/~treebank/tokenizer.sed .
| Modifier and Type | Field and Description |
|---|---|
| protected static java.util.List<PatternReplacer> | pennPatterns: Replacement patterns for transforming original text. |
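To make the idea of an ordered list of replacement patterns concrete, here is a minimal, self-contained sketch. It does not use the library's `PatternReplacer` class or reproduce the actual contents of `pennPatterns`; the class `ReplacementSketch`, its nested `Rule` type, and the five rules are all hypothetical, loosely modelled on Penn Treebank conventions such as splitting `n't` and setting off punctuation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch; not the library's PatternReplacer or pennPatterns. */
final class ReplacementSketch {

    /** One regular expression plus the text substituted for each match. */
    static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    // Illustrative rules only (an assumption), applied from top to bottom.
    static final List<Rule> RULES = Arrays.asList(
        // Surround common punctuation marks with spaces.
        new Rule("([,;:?!])", " $1 "),
        // Detach a single period at the very end of the text.
        new Rule("([^.])\\.\\s*$", "$1 ."),
        // Split the contraction n't from its host word: don't -> do n't.
        new Rule("(?i)(\\w)(n't)\\b", "$1 $2"),
        // Split clitics such as 's and 'll: John's -> John 's.
        new Rule("(?i)(\\w)('s|'re|'ve|'ll|'d|'m)\\b", "$1 $2"),
        // Collapse runs of whitespace introduced by the rules above.
        new Rule("\\s+", " ")
    );

    /** Applies every rule in order and returns space-separated text. */
    static String prepare(String text) {
        String result = text;
        for (Rule rule : RULES) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result.trim();
    }
}
```

With these illustrative rules, `ReplacementSketch.prepare("They don't know, do they?")` returns `They do n't know , do they ?`, which can then be split on spaces to obtain tokens. Rule order matters: the whitespace-collapsing rule runs last so that it cleans up after the others.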
Fields inherited from class AbstractWordTokenizer: abbreviations, aposTokens, apostropheCanBeQuote, coalesceAsterisks, coalesceHyphens, contractions, contractionsURL, hyphensMatcher, hyphensPattern, logger, preTokenizer

| Constructor and Description |
|---|
| PennTreebankTokenizer(): Create a simple word tokenizer. |
| Modifier and Type | Method and Description |
|---|---|
| java.util.List<java.lang.String> | extractWords(java.lang.String text): Break text into word tokens. |
| static java.lang.String | prepareTextForTokenization(java.lang.String s) |
Methods inherited from class AbstractWordTokenizer: addWordToSentence, findWordOffsets, getLogger, getPreTokenizer, isClosingQuote, isLetterOrSingleQuote, isMultipleHyphens, isSingleOpeningQuote, loadContractions, preprocessToken, setAbbreviations, setAposTokens, setLogger, setPreTokenizer, splitToken

Methods inherited from a further superclass: close

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface WordTokenizer: addWordToSentence, close, findWordOffsets, getPreTokenizer, preprocessToken, setAbbreviations, setAposTokens, setPreTokenizer

Methods inherited from a further interface: close

protected static java.util.List<PatternReplacer> pennPatterns
public PennTreebankTokenizer()
public static java.lang.String prepareTextForTokenization(java.lang.String s)
public java.util.List<java.lang.String> extractWords(java.lang.String text)
Specified by: extractWords in interface WordTokenizer
Also declared by: extractWords in class AbstractWordTokenizer
Parameters: text - Text to break into word tokens.
Returns: The list of word tokens. Word tokens may be words, numbers, punctuation, etc.
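Because the returned list mixes words, numbers, and punctuation, callers often need to tell the kinds apart. The following is a minimal sketch of one way to do that on a `List<String>` such as the one `extractWords` returns. The class `TokenKinds`, its `Kind` enum, and the regular expressions are hypothetical helpers, not part of the tokenizer's API, and the classification rules are a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for sorting tokens by kind; not part of the library. */
final class TokenKinds {

    enum Kind { WORD, NUMBER, PUNCTUATION, OTHER }

    /** Classifies one token using simple, assumed patterns. */
    static Kind classify(String token) {
        // Digits with optional sign and optional . or , separators: 42, 3.14, 1,000
        if (token.matches("[+-]?\\d+([.,]\\d+)*")) {
            return Kind.NUMBER;
        }
        // One or more ASCII punctuation characters: "," "--" "``"
        if (token.matches("\\p{Punct}+")) {
            return Kind.PUNCTUATION;
        }
        // At least one letter, with apostrophes or hyphens allowed: book, n't, 's
        if (token.matches("[\\p{L}'-]*\\p{L}[\\p{L}'-]*")) {
            return Kind.WORD;
        }
        return Kind.OTHER;
    }

    /** Returns only the tokens classified as words, in their original order. */
    static List<String> wordsOnly(List<String> tokens) {
        List<String> words = new ArrayList<>();
        for (String token : tokens) {
            if (classify(token) == Kind.WORD) {
                words.add(token);
            }
        }
        return words;
    }
}
```

Checking `NUMBER` and `PUNCTUATION` before `WORD` keeps a lone apostrophe or hyphen from being counted as a word, while split clitics such as `n't` and `'s` still count as words because they contain a letter.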