NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
Finding Sentence and Token Offsets

Finding sentence and token offsets in plain text

You may want to locate word and sentence boundaries as a first step in text processing. Here we develop a program called SentenceAndTokenOffsets that finds these boundaries and reports the character offsets of each sentence and word.

First you need to break up the text into sentences and words. In MorphAdorner you use a sentence splitter and a word tokenizer to perform these tasks. You can use MorphAdorner's default sentence splitter and default word tokenizer by creating an instance of each as follows.

        WordTokenizer wordTokenizer = new DefaultWordTokenizer();
        SentenceSplitter sentenceSplitter   =
            new DefaultSentenceSplitter();

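Note that the sentence splitter requires the word tokenizer as a parameter when it extracts sentences. The tokenizer is passed to the extractSentences call shown later, not to the constructor.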

To improve the accuracy of the sentence splitter you can create a part of speech guesser using the default word lexicon and default suffix lexicon.

                                //  Create part of speech guesser
                                //  for use by splitter.
        PartOfSpeechGuesser partOfSpeechGuesser =
            new DefaultPartOfSpeechGuesser();
                                //  Get default word lexicon for
                                //  use by part of speech guesser.
        Lexicon lexicon = new DefaultWordLexicon();
                                //  Set lexicon into guesser.
        partOfSpeechGuesser.setWordLexicon( lexicon );
                                //  Get default suffix lexicon for
                                //  use by part of speech guesser.
        Lexicon suffixLexicon       = new DefaultSuffixLexicon();
                                //  Set suffix lexicon into guesser.
        partOfSpeechGuesser.setSuffixLexicon( suffixLexicon );
                                //  Set guesser into sentence splitter.
        sentenceSplitter.setPartOfSpeechGuesser( partOfSpeechGuesser );

Sample text: Lincoln's Gettysburg Address

Let's use Abraham Lincoln's "Gettysburg Address" as a sample text.

Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation, so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.

Sample output

Place that text into a UTF-8 text file called gettysburg.txt. You can use a MorphAdorner utility method to read the text. You may want to convert each whitespace character into a blank for legibility and to avoid problems with platform-specific end-of-line characters.

                                //  Load text to split into
                                //  sentences and tokens.
        String sampleText   =
            FileUtils.readTextFile( inputFileName , "utf-8" );
                                //  Convert all whitespace characters
                                //  into blanks.  (Not necessary,
                                //  but makes the display cleaner below.)
        sampleText  = sampleText.replaceAll( "\\s" , " " );
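Because the pattern "\\s" matches one whitespace character at a time, each such character is swapped for exactly one blank. The text keeps its length, so character offsets computed afterwards still line up with the original file. The following is a minimal JDK-only sketch of that property; the class and method names are made up for illustration and are not part of MorphAdorner.

```java
final class WhitespaceNormalizationDemo
{
    /** Replaces every single whitespace character with one blank. */
    static String normalize( String text )
    {
        return text.replaceAll( "\\s" , " " );
    }

    public static void main( String[] args )
    {
        //  A Windows-style paragraph break: four whitespace characters.
        String raw   = "created equal.\r\n\r\nNow we are engaged";
        String clean = normalize( raw );

        //  Same length, so every offset is unchanged.
        System.out.println( raw.length() == clean.length() );
        System.out.println( raw.indexOf( "Now" ) == clean.indexOf( "Now" ) );
    }
}
```

A pattern such as "\\s+" would instead collapse runs of whitespace and shift every later offset, so it is not a drop-in substitute if the offsets must refer back to the original file.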

Use the sentence splitter and word tokenizer to split the text into a java.util.List of sentences, each of which is in turn a java.util.List of word and punctuation tokens.

        List<List<String>> sentences    =
            sentenceSplitter.extractSentences(
                sampleText , wordTokenizer );

Next use the findSentenceOffsets method provided by the sentence splitter to get the array of sentence offsets. You can use these offsets to find the end of each sentence as well.

                                //  Get sentence start and end
                                //  offsets in input text.
        int[] sentenceOffsets   =
            sentenceSplitter.findSentenceOffsets( sampleText , sentences );
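The sample output later on this page shows sentence 0 spanning [0,174] and sentence 1 spanning [175,308], which suggests that each sentence ends one character before the next one starts. The sketch below derives end offsets on that assumption. It is a JDK-only illustration: the helper name, the explicit text length parameter, and the length value 400 are hypothetical, and the third start offset 309 is inferred from sentence 1 ending at 308.

```java
final class SentenceEndOffsetsDemo
{
    /**
     * Derives inclusive end offsets from sentence start offsets.
     * Assumes each sentence runs up to the character just before
     * the next sentence starts, and the last runs to the end of the text.
     */
    static int[] endOffsets( int[] startOffsets , int textLength )
    {
        int[] ends = new int[ startOffsets.length ];

        for ( int i = 0 ; i < startOffsets.length ; i++ )
        {
            int nextStart =
                ( i + 1 < startOffsets.length ) ?
                    startOffsets[ i + 1 ] : textLength;

            ends[ i ] = nextStart - 1;
        }

        return ends;
    }

    public static void main( String[] args )
    {
        int[] starts = { 0 , 175 , 309 };
                                //  400 is a made-up text length.
        int[] ends   = endOffsets( starts , 400 );

        for ( int i = 0 ; i < starts.length ; i++ )
        {
            System.out.println(
                i + " [" + starts[ i ] + "," + ends[ i ] + "]" );
        }
    }
}
```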

Within each sentence you can use the tokenizer method findWordOffsets to locate the start of each token relative to the start of the sentence. Here sentence is the text of one sentence and words is the list of tokens for that sentence.

                                //  Get offsets for each word token
                                //  relative to this sentence.
        int[] wordOffsets   =
            wordTokenizer.findWordOffsets( sentence , words );
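To make the idea of sentence-relative token offsets concrete, here is a naive JDK-only stand-in that scans the sentence left to right with String.indexOf. It is not MorphAdorner's implementation of findWordOffsets, and the class and method names are hypothetical. It only illustrates what the returned array means.

```java
import java.util.Arrays;
import java.util.List;

final class NaiveWordOffsetsDemo
{
    /**
     * Finds the start of each token relative to the sentence start
     * by searching forward from the end of the previous token.
     */
    static int[] findOffsets( String sentence , List<String> tokens )
    {
        int[] offsets = new int[ tokens.size() ];
        int cursor    = 0;

        for ( int i = 0 ; i < tokens.size() ; i++ )
        {
            String token = tokens.get( i );
            int found    = sentence.indexOf( token , cursor );

            if ( found < 0 )
            {
                throw new IllegalArgumentException(
                    "Token not found in sentence: " + token );
            }

            offsets[ i ] = found;
                                //  Continue after this token.
            cursor       = found + token.length();
        }

        return offsets;
    }

    public static void main( String[] args )
    {
        //  Opening of the second sentence, with its two leading blanks.
        String sentence    = "  Now we are engaged in a great civil war,";
        List<String> words =
            Arrays.asList(
                "Now" , "we" , "are" , "engaged" , "in" , "a" ,
                "great" , "civil" , "war" , "," );

        System.out.println(
            Arrays.toString( findOffsets( sentence , words ) ) );
    }
}
```

A forward-only scan like this works when the tokens appear verbatim in the sentence text. A tokenizer that rewrites tokens would need a more forgiving match.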

Putting it all together

You can peruse the Java source code for SentenceAndTokenOffsets which puts all the above code together in a runnable sample program. You will also find the source code in the src/edu/northwestern/at/morphadorner/examples/ directory in the MorphAdorner release.

Running the program

Executing SentenceAndTokenOffsets with the Gettysburg Address text as input produces the output below. Only the first two sentences are shown. Long output lines have been folded.

Each sentence and word token is preceded by an ordinal starting at 0, followed by its starting and ending character offsets in brackets. The ending offset is inclusive. For example:

  • Sentence ordinal 0 starts at character 0 and ends at character 174.
  • Word ordinal 0 starts at character 0 and ends at character 3.

The word offsets are relative to the start of the sentence. Consider the word at ordinal 1 in sentence ordinal 1, "we", which starts at character position 6 relative to the start of the sentence. Its absolute character offset is 175 (the offset of sentence 1) + 6 = 181.
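The same arithmetic can be written as a small helper. This is a JDK-only sketch with hypothetical class and method names. It assumes, as the output format suggests, that ending offsets are inclusive.

```java
final class AbsoluteOffsetDemo
{
    /** Absolute start = sentence start + word start within sentence. */
    static int absoluteStart( int sentenceStart , int relativeWordStart )
    {
        return sentenceStart + relativeWordStart;
    }

    /** Absolute inclusive end = absolute start + token length - 1. */
    static int absoluteEnd
    (
        int sentenceStart ,
        int relativeWordStart ,
        String token
    )
    {
        return absoluteStart( sentenceStart , relativeWordStart ) +
            token.length() - 1;
    }

    public static void main( String[] args )
    {
        //  "we" is word 1 of sentence 1: sentence starts at 175,
        //  word starts at 6 relative to the sentence.
        System.out.println( absoluteStart( 175 , 6 ) );
        System.out.println( absoluteEnd( 175 , 6 , "we" ) );
    }
}
```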

0 [0,174]: Four score and seven years ago our fathers brought forth
 on this continent a new nation, conceived in Liberty, and dedicated 
 to the proposition that all men are created equal.
          0 [0,3]: Four
          1 [5,9]: score
          2 [11,13]: and
          3 [15,19]: seven
          4 [21,25]: years
          5 [27,29]: ago
          6 [31,33]: our
          7 [35,41]: fathers
          8 [43,49]: brought
          9 [51,55]: forth
          10 [57,58]: on
          11 [60,63]: this
          12 [65,73]: continent
          13 [75,75]: a
          14 [77,79]: new
          15 [81,86]: nation
          16 [87,87]: ,
          17 [89,97]: conceived
          18 [99,100]: in
          19 [102,108]: Liberty
          20 [109,109]: ,
          21 [111,113]: and
          22 [115,123]: dedicated
          23 [125,126]: to
          24 [128,130]: the
          25 [132,142]: proposition
          26 [144,147]: that
          27 [149,151]: all
          28 [153,155]: men
          29 [157,159]: are
          30 [161,167]: created
          31 [169,173]: equal
          32 [174,174]: .
1 [175,308]:   Now we are engaged in a great civil war,
 testing whether that nation, or any nation, so conceived and 
so dedicated, can long endure.
          0 [2,4]: Now
          1 [6,7]: we
          2 [9,11]: are
          3 [13,19]: engaged
          4 [21,22]: in
          5 [24,24]: a
          6 [26,30]: great
          7 [32,36]: civil
          8 [38,40]: war
          9 [41,41]: ,
          10 [43,49]: testing
          11 [51,57]: whether
          12 [59,62]: that
          13 [64,69]: nation
          14 [70,70]: ,
          15 [72,73]: or
          16 [75,77]: any
          17 [79,84]: nation
          18 [85,85]: ,
          19 [87,88]: so
          20 [90,98]: conceived
          21 [100,102]: and
          22 [104,105]: so
          23 [107,115]: dedicated
          24 [116,116]: ,
          25 [118,120]: can
          26 [122,125]: long
          27 [127,132]: endure
          28 [133,133]: .
