public abstract class AbstractSentenceSplitter extends IsCloseableObject implements SentenceSplitter, IsCloseable, UsesLogger
The base class for sentence splitters.
Modifier and Type | Field and Description |
---|---|
protected Abbreviations | abbreviations: Abbreviations. |
protected static java.lang.String | disallowedSentenceStarters: Characters not allowed to start a sentence. |
protected Logger | logger: Logger used for output. |
protected Names | names: Name recognizer. |
protected PartOfSpeechGuesser | partOfSpeechGuesser: Part of speech guesser used by some sentence splitters. |
protected SentenceSplitterIterator | sentenceSplitterIterator: Sentence iterator. |
protected WordTokenizer | wordTokenizer: Default word tokenizer used if none specified. |
Constructor and Description |
---|
AbstractSentenceSplitter() |
Modifier and Type | Method and Description |
---|---|
protected void | addSentence(java.util.List<java.lang.String> sentence, java.util.List<java.util.List<java.lang.String>> sentenceList): Add sentence to sentence list. |
protected void | addSentenceBad(java.util.List<java.lang.String> sentence, java.util.List<java.util.List<java.lang.String>> sentenceList): Add sentence to sentence list. |
java.util.List<java.util.List<java.lang.String>> | extractSentences(java.lang.String text): Break text into sentences and tokens. |
java.util.List<java.util.List<java.lang.String>> | extractSentences(java.lang.String text, WordTokenizer tokenizer): Break text into sentences and tokens. |
int[] | findSentenceOffsets(java.lang.String text, java.util.List<java.util.List<java.lang.String>> sentences): Find starting offsets of sentences extracted from a text. |
protected boolean | fixUpSentence(java.util.List<java.lang.String> sentenceWords, java.util.List<java.lang.String> previousSentenceWords): Fix up a sentence. |
Logger | getLogger(): Get the logger. |
boolean | isClosingPunctuationOnly(java.util.List<java.lang.String> sentenceWords): Check if sentence contains only closing punctuation. |
protected boolean | isNoun(java.lang.String word): Check if word is a possible noun. |
protected boolean | isPronoun(java.lang.String word): Check if word is a possible pronoun. |
protected boolean | isProperNoun(java.lang.String word): Check if word is a possible proper noun. |
protected boolean | isVerb(java.lang.String word): Check if word is a possible verb. |
boolean | quoteOnlySentence(java.util.List<java.lang.String> sentenceWords): Check if sentence contains only a double quote. |
void | setAbbreviations(Abbreviations abbreviations): Set abbreviations. |
void | setLogger(Logger logger): Set the logger. |
void | setPartOfSpeechGuesser(PartOfSpeechGuesser partOfSpeechGuesser): Set the part of speech guesser. |
void | setSentenceSplitterIterator(SentenceSplitterIterator sentenceSplitterIterator): Set sentence splitter iterator. |
protected java.util.List<java.util.List<java.lang.String>> | splitSentenceWordList(java.util.List<java.lang.String> sentenceWords): Break sentence word list into subsentences on special marker. |
protected boolean | verbSeen(java.util.List<java.lang.String> tokenList): See if potential verb found in token list. |
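Methods inherited from class IsCloseableObject: close
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface IsCloseable: close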
protected WordTokenizer wordTokenizer
protected PartOfSpeechGuesser partOfSpeechGuesser
protected SentenceSplitterIterator sentenceSplitterIterator
protected Names names
protected Abbreviations abbreviations
protected Logger logger
protected static final java.lang.String disallowedSentenceStarters
public void setPartOfSpeechGuesser(PartOfSpeechGuesser partOfSpeechGuesser)
setPartOfSpeechGuesser
in interface SentenceSplitter
partOfSpeechGuesser - The part of speech guesser.
public void setAbbreviations(Abbreviations abbreviations)
setAbbreviations
in interface SentenceSplitter
abbreviations - Abbreviations.
public void setSentenceSplitterIterator(SentenceSplitterIterator sentenceSplitterIterator)
setSentenceSplitterIterator
in interface SentenceSplitter
sentenceSplitterIterator - Sentence splitter iterator.
protected boolean fixUpSentence(java.util.List<java.lang.String> sentenceWords, java.util.List<java.lang.String> previousSentenceWords)
sentenceWords - Sentence to fix up.
previousSentenceWords - Previous sentence.
public boolean isClosingPunctuationOnly(java.util.List<java.lang.String> sentenceWords)
sentenceWords - Words in sentence.
public java.util.List<java.util.List<java.lang.String>> extractSentences(java.lang.String text, WordTokenizer tokenizer)
extractSentences
in interface SentenceSplitter
text - Text to break into sentences and tokens.
tokenizer - Word tokenizer to use for breaking sentences into words.
Word tokens may be words, numbers, punctuation, etc.
public boolean quoteOnlySentence(java.util.List<java.lang.String> sentenceWords)
sentenceWords - List of sentence words.
public java.util.List<java.util.List<java.lang.String>> extractSentences(java.lang.String text)
extractSentences
in interface SentenceSplitter
text - Text to break into sentences and tokens.
Word tokens may be words, numbers, punctuation, etc. The default word tokenizer is used.
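The nested list returned by extractSentences can be easier to picture with a toy example. The sketch below is not MorphAdorner's algorithm: the class name ExtractSentencesShape and its naive rules (split words on non-alphanumeric characters, end a sentence at ".", "!" or "?") are assumptions made only to show the shape of the result, one inner list of word tokens per sentence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of MorphAdorner.
class ExtractSentencesShape {

    /** Naive stand-in that returns one list of tokens per sentence. */
    static List<List<String>> extractSentences(String text) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        StringBuilder word = new StringBuilder();

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            // Any other character ends the word being collected.
            if (word.length() > 0) {
                current.add(word.toString());
                word.setLength(0);
            }
            if (!Character.isWhitespace(c)) {
                // Punctuation becomes its own token.
                current.add(String.valueOf(c));
                // Assumption: these three marks always end a sentence.
                if (c == '.' || c == '!' || c == '?') {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        if (word.length() > 0) {
            current.add(word.toString());
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (List<String> sentence : extractSentences("It rained. We stayed in!")) {
            System.out.println(sentence);
        }
    }
}
```

Running main prints [It, rained, .] and then [We, stayed, in, !]. A real splitter also has to deal with abbreviations, names, and quotations, which is what the abbreviations, names, and partOfSpeechGuesser fields are for.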
public int[] findSentenceOffsets(java.lang.String text, java.util.List<java.util.List<java.lang.String>> sentences)
findSentenceOffsets
in interface SentenceSplitter
text - Text from which sentences were extracted.
sentences - List of sentences (each a list of words) extracted from text. N.B. If the sentences aren't from the specified text, the resulting offsets will be meaningless.
protected void addSentenceBad(java.util.List<java.lang.String> sentence, java.util.List<java.util.List<java.lang.String>> sentenceList)
sentence - List of words in sentence.
sentenceList - List of sentences.
The sentence is added to the sentence list after performing any further sentence splitting.
protected void addSentence(java.util.List<java.lang.String> sentence, java.util.List<java.util.List<java.lang.String>> sentenceList)
sentence - List of words in sentence.
sentenceList - List of sentences.
protected boolean isVerb(java.lang.String word)
word - The word to check.
The check uses the part of speech guesser to get the parts of speech for word. The list of parts of speech is checked for an entry with a major word class of verb.
protected boolean isProperNoun(java.lang.String word)
word - The word to check.
The check first looks up the word in the MorphAdorner lists of proper names and places. If the word is not found there, the part of speech guesser is used to get the parts of speech for the word. If the word can be a proper noun, or at least a noun and it begins with a capital letter, the word is assumed to be a possible proper noun.
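The order of checks described above can be sketched as follows. This is an illustration under stated assumptions, not the MorphAdorner source: the class ProperNounCheckSketch is hypothetical, a plain Set stands in for the proper name and place lists, and a Function returning major word class labels stands in for the part of speech guesser. The sketch also assumes one reading of the description: a word that can be a proper noun always qualifies, while a word that can only be a noun qualifies if it starts with a capital letter.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration only; the real Names and PartOfSpeechGuesser
// types are replaced by a Set and a Function.
class ProperNounCheckSketch {

    private final Set<String> knownNames;
    private final Function<String, List<String>> guessWordClasses;

    ProperNounCheckSketch(Set<String> knownNames,
                          Function<String, List<String>> guessWordClasses) {
        this.knownNames = knownNames;
        this.guessWordClasses = guessWordClasses;
    }

    boolean isProperNoun(String word) {
        if (word == null || word.isEmpty()) {
            return false;
        }
        // Step 1: look the word up in the name and place lists.
        if (knownNames.contains(word)) {
            return true;
        }
        // Step 2: fall back to the guessed word classes.
        List<String> classes = guessWordClasses.apply(word);
        if (classes.contains("proper noun")) {
            return true;
        }
        // Step 3: a capitalized word that can be a noun is a possible proper noun.
        return classes.contains("noun") && Character.isUpperCase(word.charAt(0));
    }

    public static void main(String[] args) {
        // Tiny made-up lexicon, keyed by lowercase word.
        Map<String, List<String>> lexicon = Map.of(
            "smith", List.of("noun"),
            "run", List.of("verb", "noun"));
        ProperNounCheckSketch check = new ProperNounCheckSketch(
            Set.of("Chicago"),
            w -> lexicon.getOrDefault(w.toLowerCase(), List.of()));

        System.out.println(check.isProperNoun("Chicago"));
        System.out.println(check.isProperNoun("Smith"));
        System.out.println(check.isProperNoun("smith"));
    }
}
```

Because the last step accepts any capitalized noun, sentence-initial common nouns also pass, which fits the method's stated result of a "possible" proper noun rather than a definite one.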
protected boolean isPronoun(java.lang.String word)
word - The word to check.
protected boolean isNoun(java.lang.String word)
word - The word to check.
protected java.util.List<java.util.List<java.lang.String>> splitSentenceWordList(java.util.List<java.lang.String> sentenceWords)
sentenceWords - Sentence words as a list.
Breaks the word list for a sentence into subsentences based upon the occurrence of the XML surround text marker character. This special character always indicates where a list of words should be split. The list is also split at a bare period, on the assumption that it ends a sentence. Some of the split subsentences may later be rejoined into longer sentences.
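A minimal sketch of this kind of split is shown below. It is not the MorphAdorner implementation: the class SplitWordListSketch is hypothetical, the MARKER value is a placeholder because the actual XML surround marker character is not given here, and dropping the marker token from the output (while keeping the period with the subsentence it ends) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of MorphAdorner.
class SplitWordListSketch {

    // Placeholder marker token (a private-use code point); the real
    // marker character is an unknown here.
    static final String MARKER = "\uE000";

    static List<List<String>> splitSentenceWordList(List<String> sentenceWords) {
        List<List<String>> subsentences = new ArrayList<>();
        List<String> current = new ArrayList<>();

        for (String word : sentenceWords) {
            if (word.equals(MARKER)) {
                // The marker always forces a split; assumption: it is not kept.
                if (!current.isEmpty()) {
                    subsentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(word);
                // A bare period is assumed to end the subsentence.
                if (word.equals(".")) {
                    subsentences.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        if (!current.isEmpty()) {
            subsentences.add(current);
        }
        return subsentences;
    }

    public static void main(String[] args) {
        List<String> words = List.of("A", "b", ".", MARKER, "C", "d");
        System.out.println(splitSentenceWordList(words));
    }
}
```

Running main prints [[A, b, .], [C, d]]. The sketch does not attempt the later rejoining step mentioned above.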
protected boolean verbSeen(java.util.List<java.lang.String> tokenList)
tokenList - The token list.
public Logger getLogger()
getLogger
in interface UsesLogger
public void setLogger(Logger logger)
setLogger
in interface UsesLogger
logger - The logger.