Extracting words and sentences from a text is a fundamental operation
required by other language processing functions.
Word tokenization splits a text into
words and punctuation marks.
Sentence splitting assembles the tokenized text into sentences.
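As a rough illustration of word tokenization, the following Java sketch splits a string into words and punctuation marks with a regular expression. It is not MorphAdorner's tokenizer: the class name `TokenizeSketch`, the method `tokenize`, and the pattern itself are assumptions made for this example, and the pattern deliberately ignores harder cases such as abbreviations and leading-quote contractions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not part of MorphAdorner.
class TokenizeSketch {
    // A word is a run of letters, optionally with internal apostrophes (Don't);
    // a number is a run of digits; any other non-whitespace character
    // becomes a one-character punctuation token.
    private static final Pattern TOKEN =
        Pattern.compile("\\p{L}+(?:'\\p{L}+)*|\\p{N}+|[^\\s\\p{L}\\p{N}]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Tokens: Don't | stop | , | please | .
        System.out.println(tokenize("Don't stop, please."));
    }
}
```

Note that this simple pattern breaks a decimal such as 3.50 into three tokens, which is one reason a production tokenizer needs more than a single regular expression.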
Recognizing the end of a sentence is not an easy task for a computer.
In English, punctuation marks that usually appear at the end of a sentence
may not indicate the end of a sentence. The period is the worst offender.
A period can end a sentence but it can also be part of an abbreviation or
acronym, an ellipsis, a decimal number, or part of a bracket of periods
surrounding a Roman numeral. A period can even act both as the end of an
abbreviation and the end of a sentence at the same time.
On the other hand, some poems may not contain any sentence
punctuation at all.
Another problematic punctuation mark is the single quote, which can introduce
a quote or start a contraction such as 'tis.
Leading-quote contractions are uncommon in contemporary English texts,
but appear frequently in Early Modern English texts.
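To make the period problem concrete, here is a minimal Java sketch of a naive splitter that breaks after every period, question mark, or exclamation point followed by whitespace. The class and method names are hypothetical, and the rule is intentionally simplistic: it shows the failure mode, not a recommended approach.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not part of MorphAdorner.
class NaiveSplitDemo {
    static List<String> naiveSplit(String text) {
        // The lookbehind keeps the punctuation mark attached to the preceding piece;
        // the whitespace run after it is consumed as the separator.
        return Arrays.asList(text.split("(?<=[.!?])\\s+"));
    }

    public static void main(String[] args) {
        for (String piece : naiveSplit("Dr. Smith paid $3.50. He left.")) {
            System.out.println(piece);
        }
    }
}
```

For this input the naive rule produces three pieces, "Dr.", "Smith paid $3.50.", and "He left.", so the abbreviation is wrongly treated as a complete sentence. The decimal point in 3.50 survives only because no whitespace follows it.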
Few literary texts that have already been marked up using SGML or XML
identify sentences in the markup. (The Chadwyck-Healey archive of
eighteenth-century novels is a notable exception.)
Sentences often cross other element boundaries. Texts without
sentence markup require preprocessing to add it
without disturbing the existing markup.
This allows further processing of the texts, in particular
part-of-speech tagging and name recognition.
MorphAdorner allows pluggable input and output processors to handle
reification of texts and addition of extra markup as needed.
MorphAdorner's default sentence splitter uses the
ICU4JBreakIterator class along with a set of
heuristics for determining
if two or more sentences generated by ICU4JBreakIterator should be joined
into one sentence. The heuristics include special treatment of
sentence-ending brackets (right parenthesis, right bracket, and
right brace), abbreviations, and interjections. The resulting sentence
extraction is not perfect but is better than ICU4JBreakIterator's splitting
and much better than naive splitting methods.
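The split-then-join idea can be sketched in Java as follows. This is not MorphAdorner's code. It uses the JDK's `java.text.BreakIterator` as a stand-in for ICU4JBreakIterator, whose API is not shown here, and it implements only a simplified abbreviation rule, leaving out brackets and interjections. The class name, method names, and the abbreviation list are all assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; not MorphAdorner's implementation.
class SentenceJoinSketch {
    // Hypothetical abbreviation list; a real splitter would need a much larger one.
    static final Set<String> ABBREVIATIONS = Set.of("Dr.", "Mr.", "Mrs.", "etc.");

    // Step 1: initial segmentation with the JDK break iterator (stand-in).
    static List<String> initialSplit(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> segments = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            segments.add(text.substring(start, end));
        }
        return segments;
    }

    // True if the last whitespace-delimited token of the segment is a known abbreviation.
    static boolean endsWithAbbreviation(String segment) {
        String trimmed = segment.trim();
        String lastToken = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
        return ABBREVIATIONS.contains(lastToken);
    }

    // Step 2: join a segment to its successor when it ends with an abbreviation.
    static List<String> joinSegments(List<String> segments) {
        List<String> sentences = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String segment : segments) {
            pending.append(segment);
            if (!endsWithAbbreviation(segment)) {
                sentences.add(pending.toString().trim());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            sentences.add(pending.toString().trim()); // text ended on an abbreviation
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Dr. Smith arrived late. He apologized.";
        for (String sentence : joinSegments(initialSplit(text))) {
            System.out.println(sentence);
        }
    }
}
```

Because the join rule always merges after a listed abbreviation, it gets the wrong answer when a period serves as both the end of an abbreviation and the end of a sentence. Handling that case requires looking at what follows, which is where additional heuristics come in.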
You can try MorphAdorner's
default sentence splitter online.
The online example demonstrates sentence splitting for plain text only.