Finding sentence and token offsets in plain text
You may want to locate word and sentence boundaries
as a first step in text processing. Here we produce a program called
SentenceAndTokenOffsets to find such boundaries
and locate the character offsets of each sentence and word as well.
First you need to break up the text into sentences and words. In MorphAdorner
you use a
sentence splitter
and a
word tokenizer
to perform these tasks.
You can use MorphAdorner's
default sentence splitter and
default word tokenizer
by creating an instance of each as follows.
WordTokenizer wordTokenizer = new DefaultWordTokenizer();
SentenceSplitter sentenceSplitter =
    new DefaultSentenceSplitter();
Note that the sentence splitter takes the word tokenizer as a
parameter when it extracts sentences from the text.
To improve the accuracy of the sentence splitter you can create
a
part of speech guesser
using the
default word lexicon
and
default suffix lexicon.
// Create part of speech guesser
// for use by splitter.
PartOfSpeechGuesser partOfSpeechGuesser =
    new DefaultPartOfSpeechGuesser();
// Get default word lexicon for
// use by part of speech guesser.
Lexicon lexicon = new DefaultWordLexicon();
// Set lexicon into guesser.
partOfSpeechGuesser.setWordLexicon( lexicon );
// Get default suffix lexicon for
// use by part of speech guesser.
Lexicon suffixLexicon = new DefaultSuffixLexicon();
// Set suffix lexicon into guesser.
partOfSpeechGuesser.setSuffixLexicon( suffixLexicon );
// Set guesser into sentence splitter.
sentenceSplitter.setPartOfSpeechGuesser( partOfSpeechGuesser );
Sample text: Lincoln's Gettysburg Address
Let's use Abraham Lincoln's "Gettysburg Address" as a sample text.
Four score and seven years ago our fathers brought forth on this
continent a new nation, conceived in Liberty, and dedicated to
the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that
nation, or any nation, so conceived and so dedicated, can long
endure. We are met on a great battle-field of that war. We have
come to dedicate a portion of that field, as a final resting
place for those who here gave their lives that that nation might
live. It is altogether fitting and proper that we should do
this.
But, in a larger sense, we can not dedicate—we can not
consecrate—we can not hallow—this ground. The brave men, living
and dead, who struggled here, have consecrated it, far above our
poor power to add or detract. The world will little note, nor
long remember what we say here, but it can never forget what
they did here. It is for us the living, rather, to be dedicated
here to the unfinished work which they who fought here have thus
far so nobly advanced. It is rather for us to be here dedicated
to the great task remaining before us—that from these honored
dead we take increased devotion to that cause for which they
gave the last full measure of devotion—that we here highly
resolve that these dead shall not have died in vain—that this
nation, under God, shall have a new birth of freedom—and that
government of the people, by the people, for the people, shall
not perish from the earth.
Sample output
Place that text into a UTF-8 text file called
gettysburg.txt.
You can use the MorphAdorner utility method FileUtils.readTextFile
to read the text. You may want to convert each whitespace character
into a blank for legibility and to avoid problems with
platform-specific end-of-line characters.
// Load text to split into
// sentences and tokens.
String sampleText =
    FileUtils.readTextFile( inputFileName , "utf-8" );
// Convert all whitespace characters
// into blanks. (Not necessary,
// but makes the display cleaner below.)
sampleText = sampleText.replaceAll( "\\s" , " " );
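Replacing each whitespace character with exactly one blank keeps the text the same length, so every character offset still lines up with the original file. The minimal sketch below uses only the standard Java library; the class name WhitespaceDemo and its two helper methods are made up for illustration. It contrasts the one-for-one replacement with collapsing runs of whitespace, which would shift all later offsets.

```java
class WhitespaceDemo
{
    /** Replaces each whitespace character with one blank; length is preserved. */
    static String blankOut( String text )
    {
        return text.replaceAll( "\\s" , " " );
    }

    /** Collapses each run of whitespace into one blank; length may shrink. */
    static String collapse( String text )
    {
        return text.replaceAll( "\\s+" , " " );
    }

    public static void main( String[] args )
    {
        // Two Windows-style line breaks between sentences.
        String raw = "equal.\r\n\r\nNow we";

        String blanked = blankOut( raw );
        String collapsed = collapse( raw );

        // Lengths of the raw, blanked, and collapsed text.
        System.out.println(
            raw.length() + " " + blanked.length() + " " +
            collapsed.length() );

        // Position of "Now" in each version of the text.
        System.out.println(
            raw.indexOf( "Now" ) + " " + blanked.indexOf( "Now" ) + " " +
            collapsed.indexOf( "Now" ) );
    }
}
```

If you collapse whitespace instead, offsets computed on the cleaned text no longer point at the same characters in the original file.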
Use the sentence splitter and word tokenizer to split
the text into a java.util.List of sentences, each of which is in turn
a java.util.List of word and punctuation tokens.
List<List<String>> sentences =
    sentenceSplitter.extractSentences(
        sampleText , wordTokenizer );
Next use the findSentenceOffsets method provided by the
sentence splitter to get the starting offset of each sentence in
the text. Each sentence ends just before the next one starts, so you
can use these offsets to find the end of each sentence as well.
// Get sentence start and end
// offsets in input text.
int[] sentenceOffsets =
    sentenceSplitter.findSentenceOffsets( sampleText , sentences );
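In the sample output further down, each sentence ends one character before the next one starts. The following minimal sketch shows that arithmetic using only the standard Java library. The class SentenceEnds and the helper endOffsets are hypothetical names, not part of MorphAdorner, and the sketch assumes an array holding one starting offset per sentence; check the findSentenceOffsets documentation for the exact layout of the array it returns.

```java
import java.util.Arrays;

class SentenceEnds
{
    /**
     * Derives inclusive end offsets from sentence start offsets.
     * Assumes starts[ i ] is the start of sentence i and textLength
     * is the length of the whole text (both are assumptions).
     */
    static int[] endOffsets( int[] starts , int textLength )
    {
        int[] ends = new int[ starts.length ];

        for ( int i = 0 ; i < starts.length ; i++ )
        {
            // The last sentence runs to the end of the text.
            int nextStart =
                ( i + 1 < starts.length ) ? starts[ i + 1 ] : textLength;

            ends[ i ] = nextStart - 1;
        }

        return ends;
    }

    public static void main( String[] args )
    {
        // Starts of the first two sentences in the sample output.
        // 309 stands in for the offset at which the third sentence starts.
        int[] ends = endOffsets( new int[]{ 0 , 175 } , 309 );

        System.out.println( Arrays.toString( ends ) );
    }
}
```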
Within each sentence you can use the tokenizer method
findWordOffsets to locate the start of each token
in a sentence relative to the start of the sentence.
// Get offsets for each word token
// relative to this sentence.
int[] wordOffsets =
    wordTokenizer.findWordOffsets( sentence , words );
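To make the idea of sentence-relative token offsets concrete, here is a minimal sketch that scans a sentence from left to right with String.indexOf. It is not how MorphAdorner implements findWordOffsets. The class TokenOffsets and the method relativeOffsets are hypothetical, and the sketch assumes every token appears verbatim in the sentence, in order.

```java
import java.util.Arrays;
import java.util.List;

class TokenOffsets
{
    /** Finds the start of each token relative to the start of the sentence. */
    static int[] relativeOffsets( String sentence , List<String> tokens )
    {
        int[] offsets = new int[ tokens.size() ];
        int searchFrom = 0;

        for ( int i = 0 ; i < tokens.size() ; i++ )
        {
            String token = tokens.get( i );

            // Search only past the previous token so that repeated
            // tokens (such as two commas) get distinct offsets.
            int start = sentence.indexOf( token , searchFrom );

            if ( start < 0 )
            {
                throw new IllegalArgumentException(
                    "Token not found in order: " + token );
            }

            offsets[ i ] = start;
            searchFrom = start + token.length();
        }

        return offsets;
    }

    public static void main( String[] args )
    {
        // Start of the second sentence, including its two leading blanks.
        String sentence = "  Now we are engaged in a great civil war,";

        List<String> tokens =
            Arrays.asList(
                "Now" , "we" , "are" , "engaged" , "in" , "a" ,
                "great" , "civil" , "war" , "," );

        System.out.println(
            Arrays.toString( relativeOffsets( sentence , tokens ) ) );
    }
}
```

A real tokenizer may normalize or rewrite tokens, in which case a plain indexOf search would not find them; that is one reason to prefer the library method.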
Putting it all together
You can peruse the
Java source code for
SentenceAndTokenOffsets which puts all the above code
together in a runnable sample program. You will also find the source code
in the src/edu/northwestern/at/morphadorner/examples/
directory in the MorphAdorner release.
Running the program
Executing SentenceAndTokenOffsets
with the Gettysburg Address text as input produces the output below.
Only the first two sentences are shown. Long output lines have been
folded.
Each sentence and word token is preceded by
an ordinal starting at 0, followed by its inclusive starting and ending
character offsets in brackets. For example:
- Sentence ordinal 0 starts at character
0 and ends at character 174.
- Word ordinal 0 starts at character 0 and
ends at character 3.
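Because the bracketed offsets are inclusive, the end offset of a token is its start plus its length minus one. The small sketch below produces one output line in that shape. The class OffsetLine and the method format are hypothetical names, and the real program may build its output differently.

```java
class OffsetLine
{
    /** Formats a token as "ordinal [start,end]: token" with an inclusive end. */
    static String format( int ordinal , int start , String token )
    {
        int end = start + token.length() - 1;

        return ordinal + " [" + start + "," + end + "]: " + token;
    }

    public static void main( String[] args )
    {
        // First two word tokens of the first sentence.
        System.out.println( format( 0 , 0 , "Four" ) );
        System.out.println( format( 1 , 5 , "score" ) );
    }
}
```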
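The word offsets are relative to the start
of the sentence. Consider the word at ordinal 1 in sentence ordinal
1, "we", which starts at character position 6 relative to the start
of the sentence. Its absolute character offset is 175 (the offset
of sentence 1) + 6, or 181. Sentence 1 starts immediately after the
period that ends sentence 0, so it begins with the blanks that separate
the two sentences. That is why its first word, "Now", starts at
relative position 2 rather than 0.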
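Converting a sentence-relative offset to an absolute one is a single addition, sketched below with only the standard Java library. The class AbsoluteOffset, the method absolute, and the short sample string are made up for illustration. The sample keeps the convention seen in the output, where a sentence starts with the blanks that follow the previous sentence.

```java
class AbsoluteOffset
{
    /** Adds the sentence start to a sentence-relative offset. */
    static int absolute( int sentenceStart , int relativeOffset )
    {
        return sentenceStart + relativeOffset;
    }

    public static void main( String[] args )
    {
        // The arithmetic from the Gettysburg example: 175 + 6.
        System.out.println( absolute( 175 , 6 ) );

        // A short made-up text. Sentence 1 starts at offset 8,
        // on the first blank after the period.
        String text = "One two.  Now we are.";
        int sentenceStart = 8;
        int relativeOffset = 6;

        int start = absolute( sentenceStart , relativeOffset );

        // Extract the two-character word at the absolute offset.
        System.out.println( text.substring( start , start + 2 ) );
    }
}
```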
0 [0,174]: Four score and seven years ago our fathers brought forth
on this continent a new nation, conceived in Liberty, and dedicated
to the proposition that all men are created equal.
0 [0,3]: Four
1 [5,9]: score
2 [11,13]: and
3 [15,19]: seven
4 [21,25]: years
5 [27,29]: ago
6 [31,33]: our
7 [35,41]: fathers
8 [43,49]: brought
9 [51,55]: forth
10 [57,58]: on
11 [60,63]: this
12 [65,73]: continent
13 [75,75]: a
14 [77,79]: new
15 [81,86]: nation
16 [87,87]: ,
17 [89,97]: conceived
18 [99,100]: in
19 [102,108]: Liberty
20 [109,109]: ,
21 [111,113]: and
22 [115,123]: dedicated
23 [125,126]: to
24 [128,130]: the
25 [132,142]: proposition
26 [144,147]: that
27 [149,151]: all
28 [153,155]: men
29 [157,159]: are
30 [161,167]: created
31 [169,173]: equal
32 [174,174]: .
1 [175,308]: Now we are engaged in a great civil war,
testing whether that nation, or any nation, so conceived and
so dedicated, can long endure.
0 [2,4]: Now
1 [6,7]: we
2 [9,11]: are
3 [13,19]: engaged
4 [21,22]: in
5 [24,24]: a
6 [26,30]: great
7 [32,36]: civil
8 [38,40]: war
9 [41,41]: ,
10 [43,49]: testing
11 [51,57]: whether
12 [59,62]: that
13 [64,69]: nation
14 [70,70]: ,
15 [72,73]: or
16 [75,77]: any
17 [79,84]: nation
18 [85,85]: ,
19 [87,88]: so
20 [90,98]: conceived
21 [100,102]: and
22 [104,105]: so
23 [107,115]: dedicated
24 [116,116]: ,
25 [118,120]: can
26 [122,125]: long
27 [127,132]: endure
28 [133,133]: .