Northwestern University Information Technology | MorphAdorner V2.0
You may want to locate word and sentence boundaries as a first step in text processing. Here we build a sample program called SentenceAndTokenOffsets that finds these boundaries and reports the character offsets of each sentence and word.
First you need to break up the text into sentences and words. In MorphAdorner you use a sentence splitter and a word tokenizer to perform these tasks. You can use MorphAdorner's default sentence splitter and default word tokenizer by creating an instance of each as follows.
WordTokenizer wordTokenizer =
new DefaultWordTokenizer();
SentenceSplitter sentenceSplitter =
new DefaultSentenceSplitter();
Note that the sentence splitter takes the word tokenizer as a parameter when it extracts sentences.
To improve the accuracy of the sentence splitter you can create a part of speech guesser using the default word lexicon and default suffix lexicon.
// Create part of speech guesser
// for use by splitter.
PartOfSpeechGuesser partOfSpeechGuesser =
new DefaultPartOfSpeechGuesser();
// Get default word lexicon for
// use by part of speech guesser.
Lexicon lexicon = new DefaultWordLexicon();
// Set lexicon into guesser.
partOfSpeechGuesser.setWordLexicon( lexicon );
// Get default suffix lexicon for
// use by part of speech guesser.
Lexicon suffixLexicon = new DefaultSuffixLexicon();
// Set suffix lexicon into guesser.
partOfSpeechGuesser.setSuffixLexicon( suffixLexicon );
// Set guesser into sentence splitter.
sentenceSplitter.setPartOfSpeechGuesser( partOfSpeechGuesser );
Let's use Abraham Lincoln's "Gettysburg Address" as a sample text.
Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any nation, so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.
But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government : of the people, by the people, for the people, shall not perish from the earth.
Place that text into a UTF-8 text file called gettysburg.txt. You can use a MorphAdorner utility method to read the text. You may want to convert all the whitespace characters into blanks for legibility and to avoid problems with platform-specific end-of-line characters.
// Load text to split into
// sentences and tokens.
String sampleText =
FileUtils.readTextFile( inputFileName , "utf-8" );
// Convert all whitespace characters
// into blanks. (Not necessary,
// but makes the display cleaner below.)
sampleText = sampleText.replaceAll( "\\s" , " " );
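One detail of the replacement above is worth spelling out: the pattern is "\\s", not "\\s+", so each whitespace character becomes exactly one blank and the text keeps its length. Offsets computed on the cleaned text therefore still point at the same characters in the file contents. The sketch below illustrates this with plain Java only; the class and method names are made up for the illustration and are not part of MorphAdorner.

```java
// Standalone illustration; no MorphAdorner classes are used here.
final class WhitespaceNormalizationDemo {

    /** Replaces every single whitespace character with one blank. */
    static String normalize(String text) {
        return text.replaceAll("\\s", " ");
    }

    public static void main(String[] args) {
        // A Windows-style line break between two sentences.
        String raw = "equal.\r\nNow we";
        String clean = normalize(raw);

        // The two lengths are equal, so character offsets
        // are unaffected by the clean-up.
        System.out.println(raw.length() + " " + clean.length());
        System.out.println("[" + clean + "]");
    }
}
```

A two-character line break turns into two blanks here, which may be why a sentence that follows a line break can begin with leading blanks.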
Use the sentence splitter and word tokenizer to split the text into a java.util.List of sentences, each of which is in turn a java.util.List of word and punctuation tokens.
List<List<String>> sentences =
sentenceSplitter.extractSentences( sampleText , wordTokenizer );
Next use the findSentenceOffsets method provided by the sentence splitter to get the list of sentence offsets. You can use these to find the end of each sentence as well.
// Get sentence start and end
// offsets in input text.
int[] sentenceOffsets =
sentenceSplitter.findSentenceOffsets( sampleText , sentences );
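To make the remark about sentence ends concrete, here is a minimal sketch that derives inclusive end offsets from a list of start offsets, assuming the sentences are contiguous so that each one ends just before the next begins. The helper name inclusiveEnds is hypothetical. The starts 0 and 175 match the sample output on this page, while the third start (309) and the text length (400) are placeholders. The exact layout of the array returned by findSentenceOffsets is not described here, so the helper takes the text length as an explicit argument.

```java
import java.util.Arrays;

// Standalone illustration; no MorphAdorner classes are used here.
final class SentenceEndOffsetsDemo {

    /**
     * Computes the inclusive end offset of each sentence, assuming
     * sentence i ends one character before sentence i + 1 starts and
     * the last sentence runs to the end of the text.
     */
    static int[] inclusiveEnds(int[] starts, int textLength) {
        int[] ends = new int[starts.length];
        for (int i = 0; i < starts.length; i++) {
            int nextStart =
                (i + 1 < starts.length) ? starts[i + 1] : textLength;
            ends[i] = nextStart - 1;
        }
        return ends;
    }

    public static void main(String[] args) {
        int[] starts = {0, 175, 309}; // 309 is an assumed value
        int textLength = 400;         // placeholder, not the real length
        System.out.println(
            Arrays.toString(inclusiveEnds(starts, textLength)));
    }
}
```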
Within each sentence you can use the tokenizer method findWordOffsets to locate the start of each token in a sentence relative to the start of the sentence.
// Get offsets for each word token
// relative to this sentence.
int[] wordOffsets =
wordTokenizer.findWordOffsets( sentence , words );
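To show what sentence-relative offsets mean, the sketch below locates each token by scanning the sentence text from left to right with String.indexOf. It is a simplified stand-in written for this page, not MorphAdorner's implementation of findWordOffsets, and the class and method names are hypothetical. It assumes every token appears verbatim in the sentence text, in order.

```java
import java.util.Arrays;
import java.util.List;

// Standalone illustration; no MorphAdorner classes are used here.
final class RelativeWordOffsetsDemo {

    /** Returns the start offset of each token relative to the sentence. */
    static int[] findOffsets(String sentenceText, List<String> tokens) {
        int[] offsets = new int[tokens.size()];
        int searchFrom = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            int start = sentenceText.indexOf(token, searchFrom);
            if (start < 0) {
                throw new IllegalArgumentException(
                    "Token not found in sentence: " + token);
            }
            offsets[i] = start;
            // Resume the search just past this token so repeated
            // words are matched in order.
            searchFrom = start + token.length();
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Start of the second sentence, including its two leading blanks.
        String sentenceText = "  Now we are engaged in a great civil war,";
        List<String> tokens = Arrays.asList(
            "Now", "we", "are", "engaged", "in",
            "a", "great", "civil", "war", ",");
        System.out.println(
            Arrays.toString(findOffsets(sentenceText, tokens)));
    }
}
```

The end offset of a token is its start offset plus its length minus one, which is how bracketed pairs such as [6,7] for "we" can be derived.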
You can peruse the Java source code for SentenceAndTokenOffsets, which puts all the above code together in a runnable sample program. You will also find the source code in the src/edu/northwestern/at/examples/ directory in the MorphAdorner release.
Executing SentenceAndTokenOffsets with the Gettysburg Address text as input produces the output below. Only the first two sentences are shown. Long output lines have been folded.
Each sentence and word token is preceded by an ordinal starting at 0, followed by its starting and ending character offsets in brackets. Sentence offsets are relative to the start of the text.
The word offsets are relative to the start of the sentence. For example, consider the word at ordinal 1 in sentence ordinal 1, "we", which starts at character position 6 relative to the start of the sentence. Its absolute character offset is 175 (the offset of sentence 1) + 6, or 181.
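The arithmetic above can be checked with a few lines of plain Java. This sketch rebuilds the start of the sample text, with the first sentence followed by two blanks and the opening words of the second, and uses the offsets from the output to pull "we" back out of the text. The class and method names are hypothetical, and the two blanks between the sentences are an assumption inferred from "Now" sitting at relative offset 2.

```java
// Standalone illustration; no MorphAdorner classes are used here.
final class AbsoluteOffsetDemo {

    /** Converts a sentence-relative offset to an absolute text offset. */
    static int toAbsolute(int sentenceStart, int relativeOffset) {
        return sentenceStart + relativeOffset;
    }

    /** Extracts a token given inclusive sentence-relative offsets. */
    static String tokenAt(String text, int sentenceStart,
                          int relativeStart, int relativeEnd) {
        return text.substring(
            toAbsolute(sentenceStart, relativeStart),
            toAbsolute(sentenceStart, relativeEnd) + 1);
    }

    public static void main(String[] args) {
        String text =
            "Four score and seven years ago our fathers brought forth "
            + "on this continent a new nation, conceived in Liberty, "
            + "and dedicated to the proposition that all men are "
            + "created equal."
            + "  Now we are engaged in a great civil war,";

        int sentenceStart = 175; // offset of sentence 1 in the output
        System.out.println(toAbsolute(sentenceStart, 6));
        System.out.println(tokenAt(text, sentenceStart, 6, 7));
    }
}
```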
0 [0,174]: Four score and seven years ago our fathers brought forth
on this continent a new nation, conceived in Liberty, and dedicated
to the proposition that all men are created equal.
0 [0,3]: Four
1 [5,9]: score
2 [11,13]: and
3 [15,19]: seven
4 [21,25]: years
5 [27,29]: ago
6 [31,33]: our
7 [35,41]: fathers
8 [43,49]: brought
9 [51,55]: forth
10 [57,58]: on
11 [60,63]: this
12 [65,73]: continent
13 [75,75]: a
14 [77,79]: new
15 [81,86]: nation
16 [87,87]: ,
17 [89,97]: conceived
18 [99,100]: in
19 [102,108]: Liberty
20 [109,109]: ,
21 [111,113]: and
22 [115,123]: dedicated
23 [125,126]: to
24 [128,130]: the
25 [132,142]: proposition
26 [144,147]: that
27 [149,151]: all
28 [153,155]: men
29 [157,159]: are
30 [161,167]: created
31 [169,173]: equal
32 [174,174]: .
1 [175,308]: Now we are engaged in a great civil war,
testing whether that nation, or any nation, so conceived and
so dedicated, can long endure.
0 [2,4]: Now
1 [6,7]: we
2 [9,11]: are
3 [13,19]: engaged
4 [21,22]: in
5 [24,24]: a
6 [26,30]: great
7 [32,36]: civil
8 [38,40]: war
9 [41,41]: ,
10 [43,49]: testing
11 [51,57]: whether
12 [59,62]: that
13 [64,69]: nation
14 [70,70]: ,
15 [72,73]: or
16 [75,77]: any
17 [79,84]: nation
18 [85,85]: ,
19 [87,88]: so
20 [90,98]: conceived
21 [100,102]: and
22 [104,105]: so
23 [107,115]: dedicated
24 [116,116]: ,
25 [118,120]: can
26 [122,125]: long
27 [127,132]: endure
28 [133,133]: .
Academic Technologies and Research Services, NU Library 2East, 1970 Campus Drive, Evanston, IL 60208.