Adorning a String with Parts of Speech
Suppose you have a string of text containing one or more sentences.
How do you use MorphAdorner to assign part of speech tags to
each word in the text?
Creating a default tokenizer and sentence splitter
First you need to break up the text into sentences and words. In MorphAdorner
you use a sentence splitter and a word tokenizer to perform these tasks.
You can use MorphAdorner's default sentence splitter and default word
tokenizer by creating an instance of each as follows.
WordTokenizer wordTokenizer = new DefaultWordTokenizer();
SentenceSplitter sentenceSplitter =
new DefaultSentenceSplitter();
Use the sentence splitter and word tokenizer to split
the text into a java.util.List of sentences, each of which is in turn
a java.util.List of word and punctuation tokens. (Whitespace is not
captured as part of the token list.) The text to split is stored in
textToAdorn.
List<List<String>> sentences =
sentenceSplitter.extractSentences(
textToAdorn , wordTokenizer );
Note that the sentence splitter requires the word tokenizer as a
parameter.
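To make the nested list structure concrete, here is a minimal stand-alone sketch that builds the same shape of result (a list of sentences, each a list of word and punctuation tokens) without MorphAdorner. The class name NestedTokenListSketch, the regular expression, and the rule "a sentence ends at '.', '!' or '?'" are assumptions made for illustration only; MorphAdorner's default splitter and tokenizer are considerably more sophisticated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a naive stand-in, NOT MorphAdorner's splitter or tokenizer.
public class NestedTokenListSketch {
    // A token is a run of letters, digits, or apostrophes, or a single
    // non-whitespace punctuation character. Whitespace is never captured.
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{N}']+|[^\\s\\p{L}\\p{N}']");

    static List<List<String>> split(String text) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            String token = matcher.group();
            current.add(token);
            // Naive assumption: terminal punctuation always ends a sentence.
            if (token.equals(".") || token.equals("!") || token.equals("?")) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);   // trailing text with no final punctuation
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (List<String> sentence : split("The cat sat. Did it purr?")) {
            System.out.println(sentence);
        }
        // prints:
        // [The, cat, sat, .]
        // [Did, it, purr, ?]
    }
}
```

Note that punctuation marks appear as tokens of their own, which matches the description above of "word and punctuation tokens".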
Getting the parts of speech
Next, create an instance of MorphAdorner's default part of speech tagger.
The default tagger is a trigram tagger using a hidden Markov model and a
beam search variant of the Viterbi algorithm. The default lexicon is a
combination of an extensive English name list and words found in 19th
century British fiction. The default part of speech tag set is the NUPOS
tag set.
PartOfSpeechTagger partOfSpeechTagger =
new DefaultPartOfSpeechTagger();
Now invoke the part of speech tagger to assign
parts of speech to each word in the extracted sentences.
List<List<AdornedWord>> taggedSentences =
partOfSpeechTagger.tagSentences( sentences );
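For readers curious about what "a hidden Markov model with a beam search variant of the Viterbi algorithm" means in practice, the following is a toy sketch of the general technique. It is not MorphAdorner's implementation: it uses a bigram model rather than a trigram model, three made-up tags instead of NUPOS, and probabilities invented purely for illustration. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy bigram HMM tagger with beam-pruned Viterbi decoding (illustration only).
public class BeamViterbiSketch {
    static final String[] TAGS = {"DET", "NOUN", "VERB"};
    // All probabilities below are invented for this example.
    static final double[] START = {0.6, 0.3, 0.1};
    static final double[][] TRANS = {
        {0.05, 0.90, 0.05},   // from DET
        {0.10, 0.30, 0.60},   // from NOUN
        {0.50, 0.40, 0.10}    // from VERB
    };
    static final Map<String, double[]> EMIT = new HashMap<>();
    static {
        EMIT.put("the",   new double[] {0.90, 0.00, 0.00});
        EMIT.put("dog",   new double[] {0.00, 0.50, 0.05});
        EMIT.put("barks", new double[] {0.00, 0.10, 0.40});
    }
    static final double UNSEEN = 1e-6;   // floor for unseen word/tag pairs

    static double logEmit(String word, int tag) {
        double[] p = EMIT.get(word);
        return Math.log(p == null || p[tag] == 0.0 ? UNSEEN : p[tag]);
    }

    // Indices of the beamWidth highest-scoring states, best first.
    static List<Integer> topStates(double[] scores, int beamWidth) {
        List<Integer> states = new ArrayList<>();
        for (int t = 0; t < scores.length; t++) states.add(t);
        states.sort((a, b) -> Double.compare(scores[b], scores[a]));
        int keep = Math.max(1, Math.min(beamWidth, scores.length));
        return new ArrayList<>(states.subList(0, keep));
    }

    static List<String> tag(List<String> words, int beamWidth) {
        int n = words.size();
        if (n == 0) return new ArrayList<>();
        int k = TAGS.length;
        double[][] score = new double[n][k];   // best log probability per state
        int[][] back = new int[n][k];          // best predecessor per state
        for (int t = 0; t < k; t++) {
            score[0][t] = Math.log(START[t]) + logEmit(words.get(0), t);
        }
        List<Integer> beam = topStates(score[0], beamWidth);
        for (int i = 1; i < n; i++) {
            Arrays.fill(score[i], Double.NEGATIVE_INFINITY);
            for (int t = 0; t < k; t++) {
                // Beam pruning: only states kept at position i - 1 are extended.
                for (int prev : beam) {
                    double candidate = score[i - 1][prev]
                        + Math.log(TRANS[prev][t]) + logEmit(words.get(i), t);
                    if (candidate > score[i][t]) {
                        score[i][t] = candidate;
                        back[i][t] = prev;
                    }
                }
            }
            beam = topStates(score[i], beamWidth);
        }
        // Follow back-pointers from the best final state.
        int best = beam.get(0);
        String[] tags = new String[n];
        for (int i = n - 1; i >= 0; i--) {
            tags[i] = TAGS[best];
            best = back[i][best];
        }
        return new ArrayList<>(Arrays.asList(tags));
    }

    public static void main(String[] args) {
        System.out.println(tag(Arrays.asList("the", "dog", "barks"), 2));
        // prints: [DET, NOUN, VERB]
    }
}
```

A smaller beam width trades accuracy for speed: fewer candidate states are carried forward at each word, so an unlikely-looking prefix that would have led to the best overall path can be discarded early.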
Displaying the results
The part of speech tagger returns a java.util.List of
java.util.List entries. Each inner java.util.List is a list of
AdornedWord entries, one per word or punctuation token in the sentence.
Only the spelling and part of speech fields in each AdornedWord entry
are guaranteed to be defined upon return from the part of speech tagger.
You can display the results by extracting and printing the spelling
and associated part of speech for each word.
for ( int i = 0 ; i < taggedSentences.size() ; i++ )
{
    // Get the next adorned sentence.
    // This contains a list of adorned
    // words. Only the spellings
    // and part of speech tags are
    // guaranteed to be defined.
    List<AdornedWord> sentence = taggedSentences.get( i );
    System.out.println
    (
        "---------- Sentence " + ( i + 1 ) + " ----------"
    );
    // Print out the spelling and part(s)
    // of speech for each word in the
    // sentence. Punctuation is treated
    // as a word too.
    for ( int j = 0 ; j < sentence.size() ; j++ )
    {
        AdornedWord adornedWord = sentence.get( j );
        System.out.println
        (
            StringUtils.rpad( ( j + 1 ) + "" , 3 ) + ": " +
            StringUtils.rpad( adornedWord.getSpelling() , 20 ) +
            adornedWord.getPartsOfSpeech()
        );
    }
}
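If you want the same column layout without MorphAdorner's StringUtils helper, the standard String.format method can do the padding. The sketch below assumes that rpad pads a string on the right with spaces to the given width; the class name, the formatRow helper, and the spelling/tag pairs are placeholders for illustration, not actual tagger output.

```java
// Illustrative only: column-aligned output using java.lang.String.format.
public class ColumnOutputSketch {
    static String formatRow(int index, String spelling, String tag) {
        // %-3d left-justifies the index in 3 columns;
        // %-20s left-justifies the spelling in 20 columns.
        return String.format("%-3d: %-20s%s", index, spelling, tag);
    }

    public static void main(String[] args) {
        // Placeholder spelling/tag pairs, not real tagger results.
        String[][] words = { {"The", "dt"}, {"dog", "n1"}, {"barks", "vvz"}, {".", "."} };
        for (int j = 0; j < words.length; j++) {
            System.out.println(formatRow(j + 1, words[j][0], words[j][1]));
        }
    }
}
```

Unlike a truncating pad, the %-20s conversion never shortens its argument, so a spelling longer than 20 characters simply pushes the tag column to the right on that line.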
Putting it all together
You can peruse the Java source code for PosTagString, which puts all the
above code together in a runnable sample program. You will also find the
source code in the src/edu/northwestern/at/morphadorner/examples/
directory in the MorphAdorner release.