AbstractSentenceSplitter (MorphAdorner)

java.lang.Object
- edu.northwestern.at.utils.IsCloseableObject
- - edu.northwestern.at.morphadorner.corpuslinguistics.sentencesplitter.AbstractSentenceSplitter

All Implemented Interfaces:

SentenceSplitter, IsCloseable, UsesLogger

Direct Known Subclasses:

ICU4JBreakIteratorSentenceSplitter
```
public abstract class AbstractSentenceSplitter
extends IsCloseableObject
implements SentenceSplitter, IsCloseable, UsesLogger
```
Abstract sentence splitter.
The base class for sentence splitters.

Field Summary

Fields
Modifier and Type	Field and Description
`protected Abbreviations`	`abbreviations` Abbreviations.
`protected static java.lang.String`	`disallowedSentenceStarters` Characters not allowed to start a sentence.
`protected Logger`	`logger` Logger used for output.
`protected Names`	`names` Name recognizer.
`protected PartOfSpeechGuesser`	`partOfSpeechGuesser` Part of speech guesser used by some sentence splitters.
`protected SentenceSplitterIterator`	`sentenceSplitterIterator` Sentence iterator.
`protected WordTokenizer`	`wordTokenizer` Default word tokenizer used if none specified.

Constructor Summary

Constructors
Constructor and Description
`AbstractSentenceSplitter()`

Method Summary

Methods
Modifier and Type	Method and Description
`protected void`	`addSentence(java.util.List<java.lang.String> sentence, java.util.List<java.util.List<java.lang.String>> sentenceList)` Add sentence to sentence list.
`protected void`	`addSentenceBad(java.util.List<java.lang.String> sentence, java.util.List<java.util.List<java.lang.String>> sentenceList)` Add sentence to sentence list.
`java.util.List<java.util.List<java.lang.String>>`	`extractSentences(java.lang.String text)` Break text into sentences and tokens.
`java.util.List<java.util.List<java.lang.String>>`	`extractSentences(java.lang.String text, WordTokenizer tokenizer)` Break text into sentences and tokens.
`int[]`	`findSentenceOffsets(java.lang.String text, java.util.List<java.util.List<java.lang.String>> sentences)` Find starting offsets of sentences extracted from a text.
`protected boolean`	`fixUpSentence(java.util.List<java.lang.String> sentenceWords, java.util.List<java.lang.String> previousSentenceWords)` Fix up a sentence.
`Logger`	`getLogger()` Get the logger.
`boolean`	`isClosingPunctuationOnly(java.util.List<java.lang.String> sentenceWords)` Check if sentence contains only closing punctuation.
`protected boolean`	`isNoun(java.lang.String word)` Check if word is a possible noun.
`protected boolean`	`isPronoun(java.lang.String word)` Check if word is a possible pronoun.
`protected boolean`	`isProperNoun(java.lang.String word)` Check if word is a possible proper noun.
`protected boolean`	`isVerb(java.lang.String word)` Check if word is a possible verb.
`boolean`	`quoteOnlySentence(java.util.List<java.lang.String> sentenceWords)` Check if sentence contains only a double quote.
`void`	`setAbbreviations(Abbreviations abbreviations)` Set abbreviations.
`void`	`setLogger(Logger logger)` Set the logger.
`void`	`setPartOfSpeechGuesser(PartOfSpeechGuesser partOfSpeechGuesser)` Set the part of speech guesser.
`void`	`setSentenceSplitterIterator(SentenceSplitterIterator sentenceSplitterIterator)` Set sentence splitter iterator.
`protected java.util.List<java.util.List<java.lang.String>>`	`splitSentenceWordList(java.util.List<java.lang.String> sentenceWords)` Break sentence word list into subsentences on special marker.
`protected boolean`	`verbSeen(java.util.List<java.lang.String> tokenList)` See if potential verb found in token list.

Methods inherited from class edu.northwestern.at.utils.IsCloseableObject
close

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface edu.northwestern.at.utils.IsCloseable
close

- Field Detail
  - wordTokenizer
```
protected WordTokenizer wordTokenizer
```
    Default word tokenizer used if none specified.
  - partOfSpeechGuesser
```
protected PartOfSpeechGuesser partOfSpeechGuesser
```
    Part of speech guesser used by some sentence splitters.
  - sentenceSplitterIterator
```
protected SentenceSplitterIterator sentenceSplitterIterator
```
    Sentence iterator.
  - names
```
protected Names names
```
    Name recognizer.
  - abbreviations
```
protected Abbreviations abbreviations
```
    Abbreviations.
  - logger
```
protected Logger logger
```
    Logger used for output.
  - disallowedSentenceStarters
```
protected static final java.lang.String disallowedSentenceStarters
```
    Characters not allowed to start a sentence.
    
    See Also:
    Constant Field Values
- Constructor Detail
  - AbstractSentenceSplitter
```
public AbstractSentenceSplitter()
```
- Method Detail
  - setPartOfSpeechGuesser
```
public void setPartOfSpeechGuesser(PartOfSpeechGuesser partOfSpeechGuesser)
```
    Set the part of speech guesser.
    
    Specified by:
    
    setPartOfSpeechGuesser in interface SentenceSplitter
    
    Parameters:
    partOfSpeechGuesser - The part of speech guesser.
  - setAbbreviations
```
public void setAbbreviations(Abbreviations abbreviations)
```
    Set abbreviations.
    
    Specified by:
    
    setAbbreviations in interface SentenceSplitter
    
    Parameters:
    abbreviations - Abbreviations.
  - setSentenceSplitterIterator
```
public void setSentenceSplitterIterator(SentenceSplitterIterator sentenceSplitterIterator)
```
    Set sentence splitter iterator.
    
    Specified by:
    
    setSentenceSplitterIterator in interface SentenceSplitter
    
    Parameters:
    sentenceSplitterIterator - Sentence splitter iterator.
  - fixUpSentence
```
protected boolean fixUpSentence(java.util.List<java.lang.String> sentenceWords,
                    java.util.List<java.lang.String> previousSentenceWords)
```
    Fix up a sentence.
    
    Parameters:
    sentenceWords - Sentence to fix up.
    previousSentenceWords - Previous sentence.
    
    Returns:
    true if end of sentence found.
  - isClosingPunctuationOnly
```
public boolean isClosingPunctuationOnly(java.util.List<java.lang.String> sentenceWords)
```
    Check if sentence contains only closing punctuation.
    
    Parameters:
    sentenceWords - Words in sentence.
    
    Returns:
    true if all words are closing punctuation, false otherwise.
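    The check amounts to an "all tokens match" test. A minimal sketch, assuming an illustrative set of closing punctuation tokens (the set MorphAdorner actually uses is not listed here, and the class name is hypothetical):

```java
import java.util.List;
import java.util.Set;

// Hypothetical stand-in, not part of MorphAdorner.
final class ClosingPunctuationSketch {

    // Assumption: an illustrative set of closing punctuation tokens.
    static final Set<String> CLOSING =
        Set.of(".", "!", "?", ")", "]", "}", "\"", "'");

    // True when every token in the sentence is closing punctuation.
    // Note: in this sketch an empty list also yields true (vacuous match).
    static boolean isClosingPunctuationOnly(List<String> sentenceWords) {
        return sentenceWords.stream().allMatch(CLOSING::contains);
    }

    public static void main(String[] args) {
        System.out.println(isClosingPunctuationOnly(List.of(".", "\"")));
        System.out.println(isClosingPunctuationOnly(List.of("home", ".")));
    }
}
```

    A splitter might use such a check to decide whether a fragment belongs to the preceding sentence instead of standing alone.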
  - extractSentences
```
public java.util.List<java.util.List<java.lang.String>> extractSentences(java.lang.String text,
                                                                WordTokenizer tokenizer)
```
    Break text into sentences and tokens.
    
    Specified by:
    
    extractSentences in interface SentenceSplitter
    
    Parameters:
    text - Text to break into sentences and tokens.
    tokenizer - Word tokenizer to use for breaking sentences into words.
    
    Returns:
    List of sentences. Each sentence is itself a list of word tokens.
    Word tokens may be words, numbers, punctuation, etc.
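    To illustrate the shape of the returned value (a list of sentences, each a list of tokens), here is a minimal sketch that uses the JDK's java.text.BreakIterator as a stand-in. It is not MorphAdorner's implementation; the class and helper names are hypothetical, and a real splitter would also consult abbreviations, names, and the part of speech guesser.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in, not part of MorphAdorner: mimics only the shape
// of the value returned by extractSentences.
final class ExtractSentencesShapeSketch {

    // Split text into sentences, then each sentence into non-blank tokens.
    static List<List<String>> extractSentences(String text) {
        List<List<String>> sentences = new ArrayList<>();
        BreakIterator sentenceBreaks = BreakIterator.getSentenceInstance(Locale.US);
        sentenceBreaks.setText(text);
        int start = sentenceBreaks.first();
        for (int end = sentenceBreaks.next(); end != BreakIterator.DONE;
                start = end, end = sentenceBreaks.next()) {
            List<String> tokens = tokenize(text.substring(start, end));
            if (!tokens.isEmpty()) {
                sentences.add(tokens);
            }
        }
        return sentences;
    }

    // Word-level pass: keep words, numbers, and punctuation; drop whitespace.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        BreakIterator wordBreaks = BreakIterator.getWordInstance(Locale.US);
        wordBreaks.setText(sentence);
        int start = wordBreaks.first();
        for (int end = wordBreaks.next(); end != BreakIterator.DONE;
                start = end, end = wordBreaks.next()) {
            String token = sentence.substring(start, end);
            if (!token.trim().isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(extractSentences("It rained. We stayed home."));
    }
}
```

    Punctuation appears as separate tokens in the inner lists, which matches the note that word tokens may be words, numbers, punctuation, etc.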
  - quoteOnlySentence
```
public boolean quoteOnlySentence(java.util.List<java.lang.String> sentenceWords)
```
    Check if sentence contains only a double quote.
    
    Parameters:
    sentenceWords - List of sentence words.
    
    Returns:
    true if sentence starts with quote and contains only XML section marker characters.
  - extractSentences
```
public java.util.List<java.util.List<java.lang.String>> extractSentences(java.lang.String text)
```
    Break text into sentences and tokens.
    
    Specified by:
    
    extractSentences in interface SentenceSplitter
    
    Parameters:
    text - Text to break into sentences and tokens.
    
    Returns:
    List of sentences. Each sentence is itself a list of word tokens.
    Word tokens may be words, numbers, punctuation, etc. The default word tokenizer is used.
  - findSentenceOffsets
```
public int[] findSentenceOffsets(java.lang.String text,
                        java.util.List<java.util.List<java.lang.String>> sentences)
```
    Find starting offsets of sentences extracted from a text.
    
    Specified by:
    
    findSentenceOffsets in interface SentenceSplitter
    
    Parameters:
    text - Text from which sentences were extracted.
    sentences - List of sentences (each a list of words) extracted from text. N.B. If the sentences aren't from the specified text, the resulting offsets will be meaningless.
    
    Returns:
    int array of starting offsets in text for each sentence. The first offset starts at 0. There is one more offset than the number of sentences -- the last offset is where the sentence after the last sentence would start.
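    The offset layout can be sketched as follows. This is one plausible way to compute such offsets, not the library's code: it assumes each token appears verbatim in the text and that whitespace between sentences belongs to the preceding sentence. The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, not part of MorphAdorner.
final class SentenceOffsetsSketch {

    // Returns sentences.size() + 1 offsets: offsets[i] is where sentence i
    // starts, and the final entry is where a following sentence would start.
    static int[] findSentenceOffsets(String text, List<List<String>> sentences) {
        int[] offsets = new int[sentences.size() + 1];
        int cursor = 0;
        offsets[0] = 0; // the first offset is always 0
        for (int i = 0; i < sentences.size(); i++) {
            // Walk the tokens in order, advancing past each one in the text.
            for (String token : sentences.get(i)) {
                int found = text.indexOf(token, cursor);
                if (found >= 0) {
                    cursor = found + token.length();
                }
            }
            // Skip whitespace so the next offset lands on the next sentence.
            while (cursor < text.length()
                    && Character.isWhitespace(text.charAt(cursor))) {
                cursor++;
            }
            offsets[i + 1] = cursor;
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "It rained. We stayed home.";
        List<List<String>> sentences = List.of(
            List.of("It", "rained", "."),
            List.of("We", "stayed", "home", "."));
        System.out.println(Arrays.toString(findSentenceOffsets(text, sentences)));
    }
}
```

    With the extra trailing offset, text.substring(offsets[i], offsets[i + 1]) recovers the source span of sentence i without special-casing the last sentence.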
  - addSentenceBad
```
protected void addSentenceBad(java.util.List<java.lang.String> sentence,
                  java.util.List<java.util.List<java.lang.String>> sentenceList)
```
    Add sentence to sentence list.
    The sentence is added to the sentence list after performing any further sentence splitting.
    
    Parameters:
    sentence - List of words in sentence.
    sentenceList - List of sentences.
  - addSentence
```
protected void addSentence(java.util.List<java.lang.String> sentence,
               java.util.List<java.util.List<java.lang.String>> sentenceList)
```
    Add sentence to sentence list.
    
    Parameters:
    sentence - List of words in sentence.
    sentenceList - List of sentences.
  - isVerb
```
protected boolean isVerb(java.lang.String word)
```
    Check if word is a possible verb.
    The check uses the part of speech guesser to get the parts of speech for word. The list of parts of speech is checked for an entry with a major word class of verb.
    
    Parameters:
    word - The word to check.
  - isProperNoun
```
protected boolean isProperNoun(java.lang.String word)
```
    Check if word is a possible proper noun.
    The check first looks up the word in the MorphAdorner lists of proper names and places. If the word is not found there, the part of speech guesser is used to get the parts of speech for the word. If the word can be a proper noun, or at least a noun and it begins with a capital letter, the word is assumed to be a possible proper noun.
    
    Parameters:
    word - The word to check.
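    The lookup order described for isProperNoun can be sketched as below. The name list, the guesser output, and the word class labels are hypothetical stand-ins for MorphAdorner's Names and PartOfSpeechGuesser, whose APIs are not shown on this page. The sketch reads the last condition as "proper noun, or (noun and capitalized)".

```java
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in, not part of MorphAdorner.
final class ProperNounCheckSketch {

    // Assumption: a tiny list standing in for the proper name and place lists.
    static final Set<String> KNOWN_NAMES = Set.of("Chicago", "Evanston");

    // Assumption: fixed guesser output, word -> possible word classes.
    static final Map<String, Set<String>> GUESSED_CLASSES = Map.of(
        "Walt", Set.of("proper noun"),
        "River", Set.of("noun"),
        "river", Set.of("noun"),
        "Running", Set.of("verb"));

    static boolean isProperNoun(String word) {
        if (word == null || word.isEmpty()) {
            return false;
        }
        // Step 1: known names and places win outright.
        if (KNOWN_NAMES.contains(word)) {
            return true;
        }
        // Step 2: fall back to the guessed parts of speech.
        Set<String> classes = GUESSED_CLASSES.getOrDefault(word, Set.of());
        if (classes.contains("proper noun")) {
            return true;
        }
        // Step 3: a plain noun counts only when capitalized.
        return classes.contains("noun") && Character.isUpperCase(word.charAt(0));
    }

    public static void main(String[] args) {
        for (String w : new String[] {"Chicago", "Walt", "River", "river", "Running"}) {
            System.out.println(w + " -> " + isProperNoun(w));
        }
    }
}
```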
  - isPronoun
```
protected boolean isPronoun(java.lang.String word)
```
    Check if word is a possible pronoun.
    
    Parameters:
    word - The word to check.
  - isNoun
```
protected boolean isNoun(java.lang.String word)
```
    Check if word is a possible noun.
    
    Parameters:
    word - The word to check.
  - splitSentenceWordList
```
protected java.util.List<java.util.List<java.lang.String>> splitSentenceWordList(java.util.List<java.lang.String> sentenceWords)
```
    Break sentence word list into subsentences on special marker.
    Breaks the word list for a sentence into subsentences based upon the occurrence of the XML surround text marker character. This special character always indicates where a list of words should be split. Also split at a bare period assuming it ends a sentence. We may rejoin some of the split subsentences later into longer sentences.
    
    Parameters:
    sentenceWords - Sentence words as a list.
    
    Returns:
    List of lists of sentence words.
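    A minimal sketch of the splitting rule for splitSentenceWordList, under stated assumptions: the marker value below is a placeholder (the actual XML surround text marker character is not given on this page), and the sketch keeps the marker or period as the last token of the subsentence it closes, which may differ from the library's behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not part of MorphAdorner.
final class SplitWordListSketch {

    // Assumption: placeholder marker; the real marker character is not documented here.
    static final String MARKER = "\uE000";

    // Close the current subsentence at every marker token or bare period.
    static List<List<String>> splitSentenceWordList(List<String> sentenceWords) {
        List<List<String>> result = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String word : sentenceWords) {
            current.add(word);
            if (word.equals(".") || word.equals(MARKER)) {
                result.add(current);
                current = new ArrayList<>();
            }
        }
        // Keep any trailing words that were not closed by a marker or period.
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitSentenceWordList(List.of("A", "b", ".", "C", "d")));
    }
}
```

    Because the split is deliberately aggressive, a later pass is needed to rejoin pieces that turn out not to be complete sentences.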
  - verbSeen
```
protected boolean verbSeen(java.util.List<java.lang.String> tokenList)
```
    See if potential verb found in token list.
    
    Parameters:
    tokenList - The token list.
    
    Returns:
    true if at least one of the tokens in the token list is a potential verb.
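    The relationship between isVerb and verbSeen can be sketched as an "any token matches" test. The guesser output and word class labels below are hypothetical; the real check goes through the configured PartOfSpeechGuesser.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in, not part of MorphAdorner.
final class VerbSeenSketch {

    // Assumption: fixed guesser output, word -> possible major word classes.
    static final Map<String, Set<String>> GUESSED_CLASSES = Map.of(
        "run", Set.of("noun", "verb"),
        "dog", Set.of("noun"),
        "quickly", Set.of("adverb"));

    // A word is a possible verb if any guessed class is "verb".
    static boolean isVerb(String word) {
        return GUESSED_CLASSES.getOrDefault(word, Set.of()).contains("verb");
    }

    // True as soon as one token could be a verb.
    static boolean verbSeen(List<String> tokenList) {
        return tokenList.stream().anyMatch(VerbSeenSketch::isVerb);
    }

    public static void main(String[] args) {
        System.out.println(verbSeen(List.of("dog", "run")));
        System.out.println(verbSeen(List.of("dog", "quickly")));
    }
}
```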
  - getLogger
```
public Logger getLogger()
```
    Get the logger.
    
    Specified by:
    
    getLogger in interface UsesLogger
    
    Returns:
    The logger.
  - setLogger
```
public void setLogger(Logger logger)
```
    Set the logger.
    
    Specified by:
    
    setLogger in interface UsesLogger
    
    Parameters:
    logger - The logger.
