MorphAdorner: Finding Sentence and Token Offsets

Finding sentence and token offsets in plain text

Locating word and sentence boundaries is often a first step in text processing. Here we build a program called SentenceAndTokenOffsets that finds these boundaries and reports the character offsets of each sentence and word.

First you need to break up the text into sentences and words. In MorphAdorner you use a sentence splitter and a word tokenizer to perform these tasks. You can use MorphAdorner's default sentence splitter and default word tokenizer by creating an instance of each as follows.

        WordTokenizer wordTokenizer = new DefaultWordTokenizer();
        SentenceSplitter sentenceSplitter   =
            new DefaultSentenceSplitter();

Note that the sentence splitter takes the word tokenizer as a parameter when it extracts sentences.

To improve the accuracy of the sentence splitter you can create a part of speech guesser using the default word lexicon and default suffix lexicon.

                                //  Create part of speech guesser
                                //  for use by splitter.
        PartOfSpeechGuesser partOfSpeechGuesser =
            new DefaultPartOfSpeechGuesser();
                                //  Get default word lexicon for
                                //  use by part of speech guesser.
        Lexicon lexicon = new DefaultWordLexicon();
                                //  Set lexicon into guesser.
        partOfSpeechGuesser.setWordLexicon( lexicon );
                                //  Get default suffix lexicon for
                                //  use by part of speech guesser.
        Lexicon suffixLexicon       = new DefaultSuffixLexicon();
                                //  Set suffix lexicon into guesser.
        partOfSpeechGuesser.setSuffixLexicon( suffixLexicon );
                                //  Set guesser into sentence splitter.
        sentenceSplitter.setPartOfSpeechGuesser( partOfSpeechGuesser );

Sample text: Lincoln's Gettysburg Address

Let's use Abraham Lincoln's "Gettysburg Address" as a sample text.

Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation, so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government : of the people, by the people, for the people, shall not perish from the earth.

Sample output

Place that text into a UTF-8 text file called gettysburg.txt. You can use a MorphAdorner utility method to read the text. You may want to convert each whitespace character into a blank for legibility and to avoid problems with platform-specific end-of-line characters.

                                //  Load text to split into
                                //  sentences and tokens.
        String sampleText   =
            FileUtils.readTextFile( inputFileName , "utf-8" );
                                //  Convert all whitespace characters
                                //  into blanks.  (Not necessary,
                                //  but makes the display cleaner below.)
        sampleText  = sampleText.replaceAll( "\\s" , " " );
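
One property of this step is worth spelling out: because each whitespace character is replaced by exactly one blank, the text keeps its length, so character offsets computed on the cleaned string still line up with the string as it was read. The sketch below uses only the standard Java library; the class name and the sample string are made up for illustration.

```java
class WhitespaceNormalization {

    // Replace every single whitespace character (space, tab, CR, LF, ...)
    // with one blank. The pattern is "\\s", not "\\s+", so nothing is
    // collapsed and the string length does not change.
    static String normalize(String text) {
        return text.replaceAll("\\s", " ");
    }

    public static void main(String[] args) {
        // Hypothetical fragment with Windows-style line endings.
        String raw = "created equal.\r\n\r\nNow we are engaged";
        String cleaned = normalize(raw);

        System.out.println("raw length     = " + raw.length());
        System.out.println("cleaned length = " + cleaned.length());
        // The word "Now" sits at the same index in both strings.
        System.out.println(raw.indexOf("Now") == cleaned.indexOf("Now"));
    }
}
```

Using "\\s+" instead would collapse runs of whitespace and shift every later offset, which is why the one-for-one replacement is the safer choice when offsets matter.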

Use the sentence splitter and word tokenizer to split the text into a java.util.List of sentences, each of which is in turn a java.util.List of word and punctuation tokens.

        List<List<String>> sentences    =
            sentenceSplitter.extractSentences(
                sampleText , wordTokenizer );

Next use the findSentenceOffsets method provided by the sentence splitter to get the list of sentence offsets. You can use these to find the end of each sentence as well.

                                //  Get sentence start and end
                                //  offsets in input text.
        int[] sentenceOffsets   =
            sentenceSplitter.findSentenceOffsets( sampleText , sentences );
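
To make the "end of each sentence" remark concrete, here is a minimal sketch of deriving inclusive end offsets from an array of sentence start offsets. It assumes the array holds one start offset per sentence followed by a final entry marking the end of the text; that layout is an assumption made for illustration, so check the MorphAdorner javadoc before relying on it. The class and method names are hypothetical.

```java
class SentenceSpans {

    // Given start offsets plus one trailing sentinel (ASSUMED layout),
    // return the inclusive end offset of each sentence: the character
    // just before the next sentence starts.
    static int[] inclusiveEnds(int[] offsets) {
        if (offsets.length < 2) {
            return new int[0];
        }
        int[] ends = new int[offsets.length - 1];
        for (int i = 0; i < ends.length; i++) {
            ends[i] = offsets[i + 1] - 1;
        }
        return ends;
    }

    public static void main(String[] args) {
        // Starts of the first two Gettysburg sentences, then the position
        // where the third sentence would begin.
        int[] offsets = { 0, 175, 309 };
        int[] ends = inclusiveEnds(offsets);
        for (int i = 0; i < ends.length; i++) {
            System.out.println(i + " [" + offsets[i] + "," + ends[i] + "]");
        }
    }
}
```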

Within each sentence you can use the tokenizer method findWordOffsets to locate the start of each token relative to the start of the sentence. In the snippet below, sentence is the text of one sentence and words is the list of tokens for that sentence.

                                //  Get offsets for each word token
                                //  relative to this sentence.
        int[] wordOffsets   =
            wordTokenizer.findWordOffsets( sentence , words );
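
To show what sentence-relative token offsets look like, here is a plain-Java sketch that locates each token by scanning forward through the sentence text. This is not MorphAdorner's implementation of findWordOffsets; it is an illustrative stand-in that assumes every token appears verbatim in the sentence, in order. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class RelativeTokenOffsets {

    // Find the start offset of each token relative to the sentence start,
    // searching left to right so repeated tokens map to successive matches.
    static int[] findOffsets(String sentence, List<String> tokens) {
        int[] offsets = new int[tokens.size()];
        int searchFrom = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            int start = sentence.indexOf(token, searchFrom);
            if (start < 0) {
                throw new IllegalArgumentException(
                    "Token not found in sentence: " + token);
            }
            offsets[i] = start;
            searchFrom = start + token.length();
        }
        return offsets;
    }

    public static void main(String[] args) {
        // The two leading blanks stand for the line break before the
        // second paragraph, after whitespace has been turned into blanks.
        String sentence = "  Now we are engaged";
        List<String> tokens = Arrays.asList("Now", "we", "are", "engaged");
        int[] offsets = findOffsets(sentence, tokens);
        for (int i = 0; i < offsets.length; i++) {
            int end = offsets[i] + tokens.get(i).length() - 1;
            System.out.println(
                i + " [" + offsets[i] + "," + end + "]: " + tokens.get(i));
        }
    }
}
```

The leading blanks explain why the first token of a sentence does not always start at relative offset 0.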

Putting it all together

You can peruse the Java source code for SentenceAndTokenOffsets which puts all the above code together in a runnable sample program. You will also find the source code in the src/edu/northwestern/at/examples/ directory in the MorphAdorner release.

Running the program

Executing SentenceAndTokenOffsets with the Gettysburg Address text as input produces the output below. Only the first two sentences are shown. Long output lines have been folded.

Each sentence and word token is preceded by an ordinal starting at 0, followed by its starting and ending character offsets in brackets. For example:

Sentence ordinal 0 starts at character 0 and ends at character 174.
Word ordinal 0 starts at character 0 and ends at character 3.

The word offsets are relative to the start of the sentence. Consider the word at ordinal 1 in sentence ordinal 1, "we", which starts at character position 6 relative to the start of the sentence. Its absolute character offset is 175 (the offset of sentence 1) + 6 or 181.
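
The arithmetic above can be written as a tiny helper. This is a sketch in plain Java with hypothetical class and method names; the numbers are the ones quoted in the paragraph above.

```java
class AbsoluteOffsets {

    // Convert a sentence-relative offset to an offset in the whole text
    // by adding the sentence's own starting offset.
    static int toAbsolute(int sentenceStart, int relativeOffset) {
        return sentenceStart + relativeOffset;
    }

    public static void main(String[] args) {
        int sentenceStart = 175;  // start of sentence ordinal 1
        int relativeStart = 6;    // "we" starts here within the sentence
        int relativeEnd   = 7;    // and ends here (inclusive)

        System.out.println("we: ["
            + toAbsolute(sentenceStart, relativeStart) + ","
            + toAbsolute(sentenceStart, relativeEnd) + "]");
        // prints: we: [181,182]
    }
}
```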

        0 [0,174]: Four score and seven years ago our fathers brought forth
            on this continent a new nation, conceived in Liberty, and
            dedicated to the proposition that all men are created equal.
            0 [0,3]: Four
            1 [5,9]: score
            2 [11,13]: and
            3 [15,19]: seven
            4 [21,25]: years
            5 [27,29]: ago
            6 [31,33]: our
            7 [35,41]: fathers
            8 [43,49]: brought
            9 [51,55]: forth
            10 [57,58]: on
            11 [60,63]: this
            12 [65,73]: continent
            13 [75,75]: a
            14 [77,79]: new
            15 [81,86]: nation
            16 [87,87]: ,
            17 [89,97]: conceived
            18 [99,100]: in
            19 [102,108]: Liberty
            20 [109,109]: ,
            21 [111,113]: and
            22 [115,123]: dedicated
            23 [125,126]: to
            24 [128,130]: the
            25 [132,142]: proposition
            26 [144,147]: that
            27 [149,151]: all
            28 [153,155]: men
            29 [157,159]: are
            30 [161,167]: created
            31 [169,173]: equal
            32 [174,174]: .
        1 [175,308]: Now we are engaged in a great civil war, testing whether
            that nation, or any nation, so conceived and so dedicated,
            can long endure.
            0 [2,4]: Now
            1 [6,7]: we
            2 [9,11]: are
            3 [13,19]: engaged
            4 [21,22]: in
            5 [24,24]: a
            6 [26,30]: great
            7 [32,36]: civil
            8 [38,40]: war
            9 [41,41]: ,
            10 [43,49]: testing
            11 [51,57]: whether
            12 [59,62]: that
            13 [64,69]: nation
            14 [70,70]: ,
            15 [72,73]: or
            16 [75,77]: any
            17 [79,84]: nation
            18 [85,85]: ,
            19 [87,88]: so
            20 [90,98]: conceived
            21 [100,102]: and
            22 [104,105]: so
            23 [107,115]: dedicated
            24 [116,116]: ,
            25 [118,120]: can
            26 [122,125]: long
            27 [127,132]: endure
            28 [133,133]: .
