Northwestern University Information Technology
MorphAdorner Northwestern
 
Sentence Splitter Heuristics

The article "Finding text boundaries in Java" by Rich Gillam describes the Java BreakIterator, which underlies the ICU4JBreakIterator class used by MorphAdorner to obtain an initial deconstruction of text into sentences. MorphAdorner uses ICU4JBreakIterator only to provide initial sentence boundaries. MorphAdorner's word tokenizer uses its own methods for determining token boundaries within a sentence.

Abbreviations

The period ending an abbreviation may act as both part of the abbreviation and the end of a sentence. MorphAdorner maintains a list of common abbreviations, each with a flag indicating whether the abbreviation can usually end a sentence. MorphAdorner will not split a sentence after an abbreviation that is not designated as a potential sentence ender.

For example, the abbreviation Mrs. rarely ends a sentence, so MorphAdorner does not issue sentence splits following Mrs. Thus

Mrs. Smith was here earlier.

is correctly considered a single sentence, while

I will leave it up to the Mrs. She will know what to do.

which should be two sentences (with a split after Mrs.), is also treated as a single sentence by MorphAdorner. This case could be handled by recognizing that Mrs. can end a sentence when followed by something other than a proper name.
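The abbreviation list and its flag can be pictured as a simple lookup table. The following is only a minimal sketch, not MorphAdorner's actual code: the names ABBREVIATIONS and may_split_after are hypothetical, and the three entries are just the abbreviations discussed on this page.

```python
# Hypothetical abbreviation table (illustration only): maps each abbreviation
# to a flag saying whether it can usually end a sentence.
ABBREVIATIONS = {
    "Mrs.": False,  # rarely ends a sentence, so never split after it
    "a.m.": True,   # may end a sentence
    "p.m.": True,   # may end a sentence
}


def may_split_after(token: str) -> bool:
    """Return True if a sentence split is permitted after this token."""
    if token in ABBREVIATIONS:
        # Known abbreviation: obey its sentence-ender flag.
        return ABBREVIATIONS[token]
    # Not a listed abbreviation: . ! or ? at the end may close a sentence.
    return token.endswith((".", "!", "?"))
```

With such a table, no split is permitted after Mrs. in either example above, which is why the second example stays a single sentence.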

When an abbreviation can end a sentence, MorphAdorner tries to determine whether a particular use actually ends a sentence by looking for possible verbs before and after the abbreviation. MorphAdorner does not split the sentence after the abbreviation unless it has found a possible verb in the sentence preceding the abbreviation. MorphAdorner does not use detailed part-of-speech information during sentence splitting. However, the possible parts of speech for any word can be looked up in the word lexicon or determined using a part-of-speech guesser. That is sufficient to guide the sentence splitting algorithm in many, but not all, cases.

MorphAdorner splits the text

I mailed the letter early in the a.m. The next step is to wait for a reply.

correctly into two sentences following a.m., while

I mailed the letter early in the a.m. the next day too.

is left unsplit.
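The verb check can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not MorphAdorner's implementation: POSSIBLE_VERBS is a hypothetical toy stand-in for the word lexicon, split_after_abbreviation is an invented name, and the sketch only asks whether some possible verb appears on each side of the abbreviation, ignoring verb form.

```python
# Hypothetical stand-in for a lexicon lookup: words that can be verbs.
POSSIBLE_VERBS = {"mailed", "is", "wait", "was", "had", "go", "would", "be"}


def has_possible_verb(words):
    """Return True if any word could be a verb according to the toy lexicon."""
    return any(w.strip('.,;:!?"').lower() in POSSIBLE_VERBS for w in words)


def split_after_abbreviation(text: str, abbreviation: str):
    """Split text after the first use of abbreviation if both sides contain a possible verb."""
    before, sep, after = text.partition(abbreviation + " ")
    if not sep:
        # Abbreviation not found in mid-text: nothing to decide.
        return [text]
    if has_possible_verb(before.split()) and has_possible_verb(after.split()):
        return [before + abbreviation, after]
    return [text]


two = split_after_abbreviation(
    "I mailed the letter early in the a.m. The next step is to wait for a reply.",
    "a.m.",
)
one = split_after_abbreviation(
    "I mailed the letter early in the a.m. the next day too.", "a.m."
)
```

Here two holds two sentences, because mailed precedes a.m. and is follows it, while one holds the original text unchanged, because nothing after a.m. is a possible verb in the toy lexicon.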

MorphAdorner correctly leaves each of the following sentences unsplit.

She needs her car by 5 p.m. Saturday evening.

At 5 p.m. I had to go to the bank.

She has an appointment at 5 p.m. Saturday afternoon.

By 5 p.m. Sunday I have to be at home.

MorphAdorner correctly splits the following text into two sentences following p.m.:

It was due Friday at 5 p.m. Saturday afternoon would be too late.

The text

She has an appointment at 5 p.m. Saturday afternoon to get her car fixed.

should be left as a single sentence, but MorphAdorner splits it into two sentences, with the split occurring after p.m. While both get and fixed can be verbs, neither appears in context as the right kind of verb form to allow the text following p.m. to be considered a sentence.

MorphAdorner does not recognize abbreviations containing blanks, such as "U. S." for United States. However, "U.S." without the blank is recognized.

Characters not allowed to start a sentence

MorphAdorner does not allow a sentence to start with a comma, a period, or a percent sign. These characters will be attached to the previous token and/or sentence, if any. Dashes and hyphens are joined preferentially to the end of a sentence rather than the start of a sentence.
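As a rough illustration of the first rule, the sketch below moves a leading comma, period, or percent sign from a candidate sentence onto the end of the previous one. The function name reattach_leading_punctuation and the list-of-strings representation are assumptions made for this example, and the preference for dashes and hyphens is not modeled.

```python
# Characters that may not begin a sentence, per the rule described above.
FORBIDDEN_STARTS = (",", ".", "%")


def reattach_leading_punctuation(sentences):
    """Move forbidden leading characters from each sentence onto the previous one."""
    result = []
    for sentence in sentences:
        text = sentence.lstrip()
        # Peel off forbidden characters only when there is a previous sentence.
        while result and text.startswith(FORBIDDEN_STARTS):
            result[-1] += text[0]
            text = text[1:].lstrip()
        if text:
            result.append(text)
    return result


fixed = reattach_leading_punctuation(["He left", ". She stayed."])
```

After this call, fixed is ["He left.", "She stayed."]. A forbidden character at the very start of the text is left in place, since there is no previous sentence to attach it to.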

Interjections

MorphAdorner maintains a list of common interjections. These are words typically used for emphasis and generally followed by an exclamation mark or question mark. MorphAdorner does not split the sentence following the interjection, and it leaves the question mark or exclamation mark attached to the interjection word. The situation can become ambiguous when quote marks are involved.

MorphAdorner treats the following lines as single sentences.

What! That's bad!

"What! That's bad!"

On the other hand, the following line is treated as two sentences.

"What!" "That's bad!"

"What!" is the first sentence and "That's bad!" is the second sentence.
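The behavior in these examples can be approximated by rejoining candidate sentences that an initial splitter broke right after an interjection. The following is a sketch under assumptions, not MorphAdorner's code: INTERJECTIONS is a hypothetical two-word list, merge_interjections is an invented name, and a closing quote directly after the interjection is taken as the signal that the sentence really ends there.

```python
# Hypothetical interjection list (illustration only).
INTERJECTIONS = {"what", "oh"}


def merge_interjections(pieces):
    """Rejoin candidate sentences that were split right after an interjection."""
    merged = []
    pending = ""  # interjection text waiting to be joined to the next piece
    for piece in pieces:
        candidate = (pending + " " + piece) if pending else piece
        word = piece.lstrip('"').rstrip("!?").lower()
        # A bare interjection ending in ! or ? (no closing quote) does not
        # end the sentence, so hold it back and join it to what follows.
        if piece.endswith(("!", "?")) and word in INTERJECTIONS:
            pending = candidate
        else:
            merged.append(candidate)
            pending = ""
    if pending:
        merged.append(pending)
    return merged


plain = merge_interjections(["What!", "That's bad!"])
quoted = merge_interjections(['"What!"', '"That\'s bad!"'])
```

Here plain is the single sentence What! That's bad!, while quoted keeps its two separately quoted sentences.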

Numbers

A period following a number may act as both a decimal point and the end of a sentence (in English). In general, MorphAdorner ends a sentence following a number ending in a period when the next word begins with a capital letter. The following text is considered one sentence by MorphAdorner.

There are 12. of them.

MorphAdorner splits each of the following two lines into two sentences following 12.

There are 12. More would be unnecessary.

There are 12. "More would be unnecessary."
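The capital-letter rule for numbers can be sketched with a regular expression. This is an illustrative approximation only: the names NUMBER_BOUNDARY and split_after_numbers are hypothetical, the pattern recognizes only ASCII capitals with an optional opening double quote, and it handles no sentence boundaries other than the one after a number.

```python
import re

# Whitespace that follows "digit + period" and precedes an optional opening
# quote plus a capital letter is treated as a sentence boundary.
NUMBER_BOUNDARY = re.compile(r'(?<=\d\.)\s+(?="?[A-Z])')


def split_after_numbers(text: str):
    """Split text after 'number.' when the next word starts with a capital letter."""
    return NUMBER_BOUNDARY.split(text)


unsplit = split_after_numbers("There are 12. of them.")
split_plain = split_after_numbers("There are 12. More would be unnecessary.")
split_quoted = split_after_numbers('There are 12. "More would be unnecessary."')
```

In this sketch unsplit stays a single sentence because of is lowercase, while split_plain and split_quoted each break into two sentences after 12.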
