Package edu.northwestern.at.morphadorner.corpuslinguistics.sentencesplitter

Splits text into sentences.

Package edu.northwestern.at.morphadorner.corpuslinguistics.sentencesplitter Description

Extracting words and sentences from a text is a fundamental operation required by other language processing functions. Word tokenization splits a text into words and punctuation marks. Sentence splitting assembles the tokenized text into sentences.

Recognizing the end of a sentence is not an easy task for a computer. In English, punctuation marks that usually end a sentence do not always do so. The period is the worst offender. A period can end a sentence, but it can also be part of an abbreviation or acronym, an ellipsis, or a decimal number, or one of a pair of periods bracketing a Roman numeral. A period can even mark the end of an abbreviation and the end of a sentence at the same time. On the other hand, some poems contain no sentence punctuation at all.
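The ambiguity of the period can be seen with a deliberately naive splitter. The following is a minimal sketch, not MorphAdorner code; the class name NaivePeriodSplitter and its split method are hypothetical. It ends a sentence at every period, exclamation mark, or question mark that is followed by whitespace or by the end of the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of MorphAdorner.
class NaivePeriodSplitter {

    /** Ends a sentence at every '.', '!' or '?' followed by whitespace or end of text. */
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean terminator = c == '.' || c == '!' || c == '?';
            boolean atBoundary = i + 1 == text.length()
                    || Character.isWhitespace(text.charAt(i + 1));
            if (terminator && atBoundary) {
                String sentence = text.substring(start, i + 1).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
                start = i + 1;
            }
        }
        // Keep any trailing text that has no closing punctuation.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }
}
```

Calling NaivePeriodSplitter.split("Dr. Smith paid $3.50 for it. He left.") returns three pieces: "Dr.", "Smith paid $3.50 for it.", and "He left.". The decimal point survives because it is not followed by whitespace, but the abbreviation is wrongly treated as a complete sentence.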

Another problematic punctuation mark is the single quote, which can open a quotation or begin a contraction such as 'tis. Leading-quote contractions are uncommon in contemporary English texts, but appear frequently in Early Modern English texts.
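One simple way to tell the two uses of a leading single quote apart is a lexicon lookup. The sketch below assumes a small, hand-picked list of contractions and a hypothetical class named LeadingQuoteCheck; it is not how MorphAdorner itself makes the decision, and a real lexicon for Early Modern English would be far larger.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of MorphAdorner.
class LeadingQuoteCheck {

    // Illustrative entries only (assumption); not a complete lexicon.
    private static final Set<String> LEADING_QUOTE_CONTRACTIONS =
            new HashSet<>(Arrays.asList("'tis", "'twas", "'twere", "'twill", "'twould"));

    /**
     * Returns true when the single quote at quoteIndex, together with the
     * letters that follow it, forms a known leading-quote contraction.
     */
    static boolean isContractionAt(String text, int quoteIndex) {
        if (quoteIndex < 0 || quoteIndex >= text.length()
                || text.charAt(quoteIndex) != '\'') {
            return false;
        }
        int end = quoteIndex + 1;
        while (end < text.length() && Character.isLetter(text.charAt(end))) {
            end++;
        }
        String candidate = text.substring(quoteIndex, end).toLowerCase(Locale.ROOT);
        return LEADING_QUOTE_CONTRACTIONS.contains(candidate);
    }
}
```

With this list, isContractionAt("'Tis a pity.", 0) is true, while isContractionAt("'This is fine,' she said.", 0) is false, so the second quote mark would be treated as opening a quotation.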

Few literary texts that have already been marked up using SGML or XML recognize sentences in the markup. (The Chadwyck-Healey archive of eighteenth-century novels is a notable counterexample.) Sentences often cross other element boundaries. Texts without sentence markup require preprocessing to add it without disturbing the existing markup. The added sentence markup enables further processing of the texts, in particular part-of-speech tagging and name recognition. MorphAdorner allows pluggable input and output processors to handle reification of texts and addition of extra markup as needed.

MorphAdorner's default sentence splitter uses the standard Java BreakIterator class along with a set of heuristics for deciding whether two or more sentences generated by BreakIterator should be joined into one sentence. The heuristics include special treatment of sentence-ending brackets (right parenthesis, right bracket, and right brace) and abbreviations. The resulting sentence extraction is not perfect, but it is better than BreakIterator's splitting alone and much better than naive splitting methods.
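The general shape of this approach can be sketched as follows. This is a simplified illustration under stated assumptions, not MorphAdorner's implementation: the class name JoiningSentenceSplitter and the five-entry abbreviation list are hypothetical, only the abbreviation heuristic is shown, and the bracket handling is omitted. Candidate sentences come from java.text.BreakIterator, and a candidate that ends in a listed abbreviation is joined to the one that follows.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not MorphAdorner's default splitter.
class JoiningSentenceSplitter {

    // Illustrative abbreviation list (assumption); a real list would be much longer.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("dr.", "mr.", "mrs.", "st.", "etc."));

    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            pending.append(text, start, end);
            // Hold the candidate back if it ends in an abbreviation, so that
            // the next candidate is appended to it instead of starting anew.
            if (!endsWithAbbreviation(pending.toString())) {
                flush(pending, sentences);
            }
        }
        flush(pending, sentences); // whatever is still held at end of text
        return sentences;
    }

    private static boolean endsWithAbbreviation(String candidate) {
        String[] tokens = candidate.trim().split("\\s+");
        String last = tokens[tokens.length - 1].toLowerCase(Locale.ROOT);
        return ABBREVIATIONS.contains(last);
    }

    private static void flush(StringBuilder pending, List<String> sentences) {
        String sentence = pending.toString().trim();
        if (!sentence.isEmpty()) {
            sentences.add(sentence);
        }
        pending.setLength(0);
    }
}
```

For the input "Dr. Smith arrived. He sat down." this yields "Dr. Smith arrived." and "He sat down.", whether or not BreakIterator places a boundary after "Dr.". The sketch also shows why such heuristics stay imperfect: a sentence that genuinely ends in "etc." would be joined to its successor.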

All MorphAdorner sentence splitters must implement the SentenceSplitter interface. The SentenceSplitterFactory creates either the default SentenceSplitter implementation or a specified one. The AbstractSentenceSplitter serves as a base class for deriving concrete implementations of sentence splitters.
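The interface, abstract base class, and factory arrangement described above follows a common pattern, sketched below with hypothetical types. DemoSplitter, AbstractDemoSplitter, BreakIteratorDemoSplitter, and DemoSplitterFactory are invented for illustration, and their method signatures are assumptions; they are not the signatures of SentenceSplitter, AbstractSentenceSplitter, or SentenceSplitterFactory, which should be taken from those types' own documentation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// All types below are hypothetical stand-ins, not MorphAdorner classes.
interface DemoSplitter {
    List<String> split(String text);
}

// Base class: shared clean-up lives here, boundary detection in subclasses.
abstract class AbstractDemoSplitter implements DemoSplitter {
    @Override
    public final List<String> split(String text) {
        List<String> result = new ArrayList<>();
        for (String candidate : findSentences(text == null ? "" : text)) {
            String trimmed = candidate.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    protected abstract List<String> findSentences(String text);
}

// One concrete implementation, based on the standard BreakIterator.
class BreakIteratorDemoSplitter extends AbstractDemoSplitter {
    @Override
    protected List<String> findSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }
}

// Factory: returns the default implementation, or one named by class.
class DemoSplitterFactory {
    static DemoSplitter newSplitter() {
        return new BreakIteratorDemoSplitter();
    }

    static DemoSplitter newSplitter(String className) {
        try {
            return (DemoSplitter) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Design choice in this sketch: fall back to the default.
            return newSplitter();
        }
    }
}
```

Callers depend only on the interface, for example DemoSplitterFactory.newSplitter().split("One. Two."), so a different splitter can be plugged in by class name without changing the calling code.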