edu.northwestern.at.utils.corpuslinguistics.sentencesplitter
Interface SentenceSplitter

All Known Implementing Classes:
AbstractSentenceSplitter, BreakIteratorSentenceSplitter, DefaultSentenceSplitter, ICU4JBreakIteratorSentenceSplitter

public interface SentenceSplitter

Interface for splitting text into sentences.


Method Summary
 java.util.List<java.util.List<java.lang.String>> extractSentences(java.lang.String text)
          Break text into sentences and tokens using the default word tokenizer.
 java.util.List<java.util.List<java.lang.String>> extractSentences(java.lang.String text, WordTokenizer tokenizer)
          Break text into sentences and tokens using the specified word tokenizer.
 int[] findSentenceOffsets(java.lang.String text, java.util.List<java.util.List<java.lang.String>> sentences)
          Find starting offsets of sentences extracted from a text.
 void setPartOfSpeechGuesser(PartOfSpeechGuesser partOfSpeechGuesser)
          Set part of speech guesser.
 void setSentenceSplitterIterator(SentenceSplitterIterator sentenceSplitterIterator)
          Set sentence splitter iterator.
 

Method Detail

setPartOfSpeechGuesser

void setPartOfSpeechGuesser(PartOfSpeechGuesser partOfSpeechGuesser)
Set part of speech guesser.

Parameters:
partOfSpeechGuesser - Part of speech guesser.

A sentence splitter may use part of speech information to resolve ambiguous sentence boundaries. The part of speech guesser provides access to the lexicons and guessing algorithms needed to determine the possible parts of speech for a word without performing a full part of speech tagging operation.


setSentenceSplitterIterator

void setSentenceSplitterIterator(SentenceSplitterIterator sentenceSplitterIterator)
Set sentence splitter iterator.

Parameters:
sentenceSplitterIterator - Sentence splitter iterator.

extractSentences

java.util.List<java.util.List<java.lang.String>> extractSentences(java.lang.String text,
                                                                  WordTokenizer tokenizer)
Break text into sentences and tokens.

Parameters:
text - Text to break into sentences and tokens.
tokenizer - Tokenizer to use for breaking sentences into words.
Returns:
List of sentences. Each sentence is itself a list of word tokens.

Word tokens may be words, numbers, punctuation, etc.
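
The shape of the returned value (a list of sentences, each a list of token strings) can be illustrated with a small self-contained sketch. Everything in it is a stand-in rather than this library's code: the class name ExtractSentencesSketch, the java.util.function.Function parameter used in place of WordTokenizer, and the regex-based naiveTokens helper are all hypothetical, and sentence boundaries come from the JDK's java.text.BreakIterator, not from any of the implementing classes listed above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractSentencesSketch {
    // A token is a run of word characters or a single punctuation character.
    static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    /**
     * Toy stand-in with the same return shape as extractSentences(text, tokenizer).
     * The Function parameter is a hypothetical substitute for WordTokenizer.
     */
    static List<List<String>> extractSentences(
            String text, Function<String, List<String>> tokenizer) {
        List<List<String>> sentences = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US);
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next();
                end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                // Each sentence becomes its own list of tokens.
                sentences.add(tokenizer.apply(sentence));
            }
        }
        return sentences;
    }

    /** Naive tokenizer for illustration only; not the library's default tokenizer. */
    static List<String> naiveTokens(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<List<String>> sentences =
                extractSentences("It rained. We stayed in.", ExtractSentencesSketch::naiveTokens);
        for (List<String> sentence : sentences) {
            System.out.println(sentence.size() + " tokens: " + sentence);
        }
    }
}
```

Passing the tokenizer as an argument, as the two-argument method does, lets the same splitting logic be reused with different tokenization rules.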


extractSentences

java.util.List<java.util.List<java.lang.String>> extractSentences(java.lang.String text)
Break text into sentences and tokens.

Parameters:
text - Text to break into sentences and tokens.
Returns:
List of sentences. Each sentence is itself a list of word tokens.

Word tokens may be words, numbers, punctuation, etc. The default word tokenizer is used.


findSentenceOffsets

int[] findSentenceOffsets(java.lang.String text,
                          java.util.List<java.util.List<java.lang.String>> sentences)
Find starting offsets of sentences extracted from a text.

Parameters:
text - Text from which sentences were extracted.
sentences - List of sentences (each a list of words) extracted from text. N.B. If the sentences aren't from the specified text, the resulting offsets will be meaningless.
Returns:
int array of starting offsets in text, one for each sentence. The first offset is 0. The array contains one more offset than there are sentences: the last offset is where a sentence following the last one would start.
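
Because the array carries one extra trailing offset, each sentence's span in the original text can be recovered from a pair of adjacent entries. The sketch below shows that pattern. The class and method names are hypothetical, and the offsets are written by hand for illustration, on the assumption that each offset points at a sentence's first character and that the trailing offset equals the text length when the text ends with the last sentence. In practice the array would come from findSentenceOffsets.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceOffsetsDemo {
    /** Slices text into sentence spans using an offsets array with one trailing entry. */
    static List<String> sliceByOffsets(String text, int[] offsets) {
        List<String> spans = new ArrayList<>();
        // offsets.length is (number of sentences + 1), so stop one short of the end.
        for (int i = 0; i + 1 < offsets.length; i++) {
            spans.add(text.substring(offsets[i], offsets[i + 1]));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "It rained. We stayed in.";
        // Hand-written offsets (an assumption, not library output):
        // sentence 1 starts at 0, sentence 2 at 11, trailing offset at 24.
        int[] offsets = {0, 11, 24};
        for (String span : sliceByOffsets(text, offsets)) {
            System.out.println("[" + span + "]");
        }
    }
}
```

Each span runs up to the start of the next sentence, so it includes any whitespace between sentences. Trim the span if only the sentence text is wanted.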