Northwestern University Information Technology
The article "Finding text boundaries in Java" by Rich Gillam describes the Java BreakIterator, which underlies the ICU4JBreakIterator class that MorphAdorner uses to obtain an initial division of text into sentences. MorphAdorner uses ICU4JBreakIterator only to provide initial sentence boundaries; its word tokenizer uses its own methods to determine token boundaries within a sentence.
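As a rough illustration of what an initial sentence pass looks like, the sketch below uses the JDK's own java.text.BreakIterator rather than ICU4J or MorphAdorner's ICU4JBreakIterator wrapper. The class name InitialSentenceSplit and its split method are hypothetical names chosen for this example, not part of MorphAdorner.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: lists the candidate sentences reported by the
// JDK sentence BreakIterator (not MorphAdorner's ICU4J-based class).
class InitialSentenceSplit {
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) span is one candidate.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("She arrived early. He came later.")) {
            System.out.println(s);
        }
    }
}
```

The boundaries produced this way are only a starting point; the rules described in the rest of this page adjust them.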
The period ending an abbreviation may act as both a part of the abbreviation and the end of a sentence. MorphAdorner maintains a list of common abbreviations along with a flag indicating if the abbreviation usually can end a sentence. MorphAdorner will not split a sentence after an abbreviation which is not designated as a potential sentence ender.
For example, the abbreviation Mrs. rarely ends a sentence, so MorphAdorner does not issue sentence splits following Mrs. Thus
Mrs. Smith was here earlier.
is correctly considered a single sentence, while
I will leave it up to the Mrs. She will know what to do.
which should be two sentences (with a split after Mrs.) is also treated as a single sentence by MorphAdorner. This could be handled by recognizing that Mrs. can end a sentence when followed by something other than a proper name.
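The abbreviation rule can be sketched as follows, assuming the text has already been cut into candidate sentences by an initial boundary pass. The table contents, the class name AbbreviationMerge, and the method names are illustrative assumptions; MorphAdorner's actual abbreviation list and data structures are not shown in this document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class AbbreviationMerge {
    // Hypothetical table: abbreviation -> "can usually end a sentence".
    static final Map<String, Boolean> CAN_END = Map.of(
        "Mrs.", false, "Mr.", false, "Dr.", false,
        "a.m.", true, "p.m.", true, "etc.", true);

    // Last whitespace-delimited token of a candidate sentence.
    static String lastToken(String s) {
        String t = s.trim();
        return t.substring(t.lastIndexOf(' ') + 1);
    }

    // Joins a candidate to the one that follows it whenever the candidate
    // ends in an abbreviation flagged as unable to end a sentence.
    static List<String> merge(List<String> candidates) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String c : candidates) {
            if (pending.length() > 0) {
                pending.append(' ');
            }
            pending.append(c.trim());
            Boolean canEnd = CAN_END.get(lastToken(c));
            boolean blocksSplit = canEnd != null && !canEnd;
            if (!blocksSplit) {
                out.add(pending.toString());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            out.add(pending.toString());
        }
        return out;
    }
}
```

With this table, the candidates "I will leave it up to the Mrs." and "She will know what to do." are merged into one sentence, which reproduces the limitation described above.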
When an abbreviation can end a sentence, MorphAdorner tries to determine if a particular use ends a sentence or not by looking for possible verbs before and after the abbreviation. MorphAdorner does not split the sentence after the abbreviation unless it has found a possible verb in the sentence preceding the abbreviation. MorphAdorner does not use detailed part of speech information during sentence splitting. However, the parts of speech for any word can be looked up in the word lexicon or determined using a part of speech guesser. That is sufficient to guide the sentence splitting algorithm in many but not all cases.
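A minimal sketch of the verb check is given below. It assumes a tiny hypothetical set of possible verbs standing in for the word lexicon and part of speech guesser, and it simplifies the rule to "split only when a possible verb occurs on both sides of the abbreviation". The class VerbCheck, its word list, and the sample phrases in the comments are assumptions made for illustration only.

```java
import java.util.Locale;
import java.util.Set;

class VerbCheck {
    // Hypothetical stand-in for a lexicon lookup of words that can be verbs.
    static final Set<String> POSSIBLE_VERBS =
        Set.of("is", "was", "left", "arrived", "will", "starts");

    // True if any word in the text is listed as a possible verb.
    static boolean hasPossibleVerb(String text) {
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (POSSIBLE_VERBS.contains(w)) {
                return true;
            }
        }
        return false;
    }

    // before: text preceding the abbreviation; after: text following it.
    // Example: shouldSplit("The train left at 6", "It was late.") is true,
    // while shouldSplit("At 6", "the train left.") is false because no
    // possible verb precedes the abbreviation.
    static boolean shouldSplit(String before, String after) {
        return hasPossibleVerb(before) && hasPossibleVerb(after);
    }
}
```

Because the check looks only at whether a word can be a verb, not at its form in context, it can be fooled in the same way the text above describes.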
MorphAdorner splits the text
correctly into two sentences following a.m., while
is left unsplit.
MorphAdorner correctly leaves unsplit the following sentences.
MorphAdorner correctly splits the following text into two sentences following p.m.:
should be left as a single sentence, but MorphAdorner splits it into two sentences with the split occurring after p.m. While both get and fixed can be verbs, neither appears in context as the right kind of verb form to allow the text following p.m. to be considered a sentence.
MorphAdorner does not recognize abbreviations containing blanks, such as "U. S." for United States. However, "U.S." without the blank is recognized.
MorphAdorner does not allow a sentence to start with a comma, a period, or a percent sign. These characters will be attached to the previous token and/or sentence, if any. Dashes and hyphens are joined preferentially to the end of a sentence rather than the start of a sentence.
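The rule for leading punctuation can be sketched as a post-processing step over candidate sentences. The class LeadingPunctuation and its attach method are hypothetical names, and the sketch covers only the comma, period, and percent sign case, not the preference for dashes and hyphens.

```java
import java.util.ArrayList;
import java.util.List;

class LeadingPunctuation {
    // Moves any leading run of ',', '.' or '%' onto the end of the
    // previous sentence, if there is one.
    static List<String> attach(List<String> candidates) {
        List<String> out = new ArrayList<>();
        for (String c : candidates) {
            String s = c.trim();
            int i = 0;
            while (i < s.length() && ",.%".indexOf(s.charAt(i)) >= 0) {
                i++;
            }
            if (i > 0 && !out.isEmpty()) {
                int last = out.size() - 1;
                out.set(last, out.get(last) + s.substring(0, i));
                s = s.substring(i).trim();
            }
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }
}
```

When there is no previous sentence, the sketch leaves the text unchanged, matching the "if any" qualification above.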
MorphAdorner maintains a list of common interjections. These are words typically used for emphasis and are generally followed by an exclamation mark or question mark. MorphAdorner does not split the sentence following the interjection, and it leaves the question mark or exclamation mark attached to the interjection word. The situation can become ambiguous when quotation marks are involved.
MorphAdorner treats the following lines as single sentences.
On the other hand, the following line is treated as two sentences.
"What!" is the first sentence and "That's bad!" is the second sentence.
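The interjection rule can be sketched in the same style, assuming a small hypothetical interjection list and candidate sentences from an initial boundary pass. The class InterjectionMerge and the words in its list are assumptions; "what" is deliberately left out of the list so that the sketch agrees with the two-sentence result described above. Quotation marks are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class InterjectionMerge {
    // Hypothetical interjection list; the real list is not shown here.
    static final Set<String> INTERJECTIONS = Set.of("oh", "ah", "alas");

    // True if the candidate ends in "<interjection>!" or "<interjection>?".
    static boolean endsWithInterjection(String s) {
        String t = s.trim();
        if (!(t.endsWith("!") || t.endsWith("?"))) {
            return false;
        }
        String body = t.substring(0, t.length() - 1);
        String last = body.substring(body.lastIndexOf(' ') + 1);
        return INTERJECTIONS.contains(last.toLowerCase(Locale.ROOT));
    }

    // Keeps a candidate joined to the following one when it ends in an
    // interjection; the '!' or '?' stays attached to the interjection.
    static List<String> merge(List<String> candidates) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String c : candidates) {
            if (pending.length() > 0) {
                pending.append(' ');
            }
            pending.append(c.trim());
            if (!endsWithInterjection(c)) {
                out.add(pending.toString());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            out.add(pending.toString());
        }
        return out;
    }
}
```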
A period following a number may act as both a decimal point and the end of a sentence (in English). In general, MorphAdorner ends a sentence following a number ending in a period when the next word begins with a capital letter. The following text is considered one sentence by MorphAdorner.
MorphAdorner splits each of the following two lines into two sentences following 12.
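The number rule can be approximated with a regular expression: split at whitespace that follows a digit and a period and precedes an uppercase letter. This is a simplified sketch, not MorphAdorner's implementation; the class NumberPeriodSplit and the sample sentences in the comments are made up for illustration.

```java
import java.util.regex.Pattern;

class NumberPeriodSplit {
    // Whitespace preceded by "<digit>." and followed by an uppercase letter.
    private static final Pattern BOUNDARY =
        Pattern.compile("(?<=\\d\\.)\\s+(?=\\p{Lu})");

    // "The count was 12. They left."   -> two parts, split after "12."
    // "It rose to 12. and then fell."  -> one part, next word is lowercase
    static String[] split(String text) {
        return BOUNDARY.split(text);
    }
}
```

A decimal such as 3.5 is never split by this pattern, because the period is followed directly by a digit rather than by whitespace.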