NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
Adorning A String With Parts Of Speech

Suppose you have a string of text containing one or more sentences. How do you use MorphAdorner to assign part of speech tags to each word in the text?

Creating a default tokenizer and sentence splitter

First you need to break up the text into sentences and words. In MorphAdorner you use a sentence splitter and a word tokenizer to perform these tasks. You can use MorphAdorner's default sentence splitter and default word tokenizer by creating an instance of each as follows.

        WordTokenizer wordTokenizer = new DefaultWordTokenizer();
        SentenceSplitter sentenceSplitter   =
            new DefaultSentenceSplitter();

Use the sentence splitter and word tokenizer to split the text into a java.util.List of sentences, each of which is in turn a java.util.List of word and punctuation tokens. (Whitespace is not captured as part of the token list.) The text to split is stored in textToAdorn.

        List<List<String>> sentences    =
            sentenceSplitter.extractSentences(
                textToAdorn , wordTokenizer );

Note that the sentence splitter requires the word tokenizer as a parameter.
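To make the shape of the result concrete, here is a minimal sketch that builds the same kind of nested list (a list of sentences, each a list of word and punctuation tokens with whitespace dropped) using only the JDK. It uses java.text.BreakIterator as a stand-in; it is not MorphAdorner's splitter or tokenizer and will not agree with them on every boundary. The class name NestedTokenListSketch and its two helper methods are made up for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NestedTokenListSketch
{
    /** Splits text into sentences, each a list of word and punctuation tokens. */
    static List<List<String>> extractSentences( String text )
    {
        List<List<String>> sentences = new ArrayList<>();
        BreakIterator sentenceIterator =
            BreakIterator.getSentenceInstance( Locale.ENGLISH );
        sentenceIterator.setText( text );
        int start = sentenceIterator.first();
        for ( int end = sentenceIterator.next() ;
              end != BreakIterator.DONE ;
              start = end , end = sentenceIterator.next() )
        {
            List<String> tokens = tokenize( text.substring( start , end ) );
                                //  Skip spans that hold only whitespace.
            if ( !tokens.isEmpty() ) sentences.add( tokens );
        }
        return sentences;
    }

    /** Splits one sentence into tokens, discarding whitespace. */
    static List<String> tokenize( String sentence )
    {
        List<String> tokens = new ArrayList<>();
        BreakIterator wordIterator =
            BreakIterator.getWordInstance( Locale.ENGLISH );
        wordIterator.setText( sentence );
        int start = wordIterator.first();
        for ( int end = wordIterator.next() ;
              end != BreakIterator.DONE ;
              start = end , end = wordIterator.next() )
        {
            String token = sentence.substring( start , end ).trim();
            if ( !token.isEmpty() ) tokens.add( token );
        }
        return tokens;
    }

    public static void main( String[] args )
    {
        String textToAdorn = "The cat sat. It purred.";
                                //  Print one line per sentence, showing
                                //  its token list.
        for ( List<String> sentence : extractSentences( textToAdorn ) )
        {
            System.out.println( sentence );
        }
    }
}
```

The point of the sketch is the data structure: the outer list is indexed by sentence and each inner list by token position, which is the form the part of speech tagger consumes in the next step.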

Getting the parts of speech

Next, create an instance of MorphAdorner's default part of speech tagger. The default tagger is a trigram tagger using a hidden Markov model and a beam search variant of the Viterbi algorithm. The default lexicon is a combination of an extensive English name list and words found in 19th century British fiction. The default part of speech tag set is the NUPOS tag set.

        PartOfSpeechTagger partOfSpeechTagger   =
            new DefaultPartOfSpeechTagger();

Now invoke the part of speech tagger to assign parts of speech to each word in the extracted sentences.

        List<List<AdornedWord>> taggedSentences =
            partOfSpeechTagger.tagSentences( sentences );
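For readers unfamiliar with the terms, the following toy sketch shows the general idea of Viterbi decoding with beam pruning over a hidden Markov model. It is deliberately simplified and is not MorphAdorner's implementation: it conditions on one previous tag (a bigram model) rather than two, and the tag names, probabilities, class name BeamViterbiSketch, and all method names are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BeamViterbiSketch
{
    /** A partial tag sequence and its log probability. */
    static final class Path
    {
        final List<String> tags;
        final double logProb;
        Path( List<String> tags , double logProb )
        {
            this.tags = tags;
            this.logProb = logProb;
        }
    }

    /** Log probability used for events missing from the toy tables. */
    static final double UNSEEN = Math.log( 1e-6 );

    static double logOf( Map<String, Double> probs , String key )
    {
        Double p = probs.get( key );
        return ( p == null ) ? UNSEEN : Math.log( p );
    }

    /** Transition keys are "prevTag tag"; emission keys are "tag word". */
    static List<String> tag( List<String> words , List<String> tagSet ,
        Map<String, Double> transition , Map<String, Double> emission ,
        int beamWidth )
    {
                                //  Best path ending in each tag so far.
        Map<String, Path> best = new HashMap<>();
        best.put( "<s>" , new Path( new ArrayList<>() , 0.0 ) );
        for ( String word : words )
        {
            Map<String, Path> next = new HashMap<>();
            for ( Map.Entry<String, Path> entry : best.entrySet() )
            {
                for ( String tag : tagSet )
                {
                    double score = entry.getValue().logProb
                        + logOf( transition , entry.getKey() + " " + tag )
                        + logOf( emission , tag + " " + word );
                    Path current = next.get( tag );
                                //  Viterbi step: keep only the best
                                //  path into each tag.
                    if ( current == null || score > current.logProb )
                    {
                        List<String> tags =
                            new ArrayList<>( entry.getValue().tags );
                        tags.add( tag );
                        next.put( tag , new Path( tags , score ) );
                    }
                }
            }
                                //  Beam step: keep the beamWidth
                                //  highest-scoring end tags.
            List<Map.Entry<String, Path>> ranked =
                new ArrayList<>( next.entrySet() );
            ranked.sort( Comparator.comparingDouble(
                ( Map.Entry<String, Path> e ) -> -e.getValue().logProb ) );
            best = new HashMap<>();
            int keep = Math.min( beamWidth , ranked.size() );
            for ( Map.Entry<String, Path> e : ranked.subList( 0 , keep ) )
            {
                best.put( e.getKey() , e.getValue() );
            }
        }
        return Collections.max( best.values() ,
            Comparator.comparingDouble( ( Path p ) -> p.logProb ) ).tags;
    }

    /** Tags words with a tiny hand-made model (invented numbers). */
    static List<String> tagToyExample( List<String> words , int beamWidth )
    {
        Map<String, Double> transition = new HashMap<>();
        transition.put( "<s> DET" , 0.8 );
        transition.put( "<s> NOUN" , 0.2 );
        transition.put( "DET NOUN" , 0.9 );
        transition.put( "NOUN NOUN" , 0.2 );
        transition.put( "NOUN VERB" , 0.7 );
        Map<String, Double> emission = new HashMap<>();
        emission.put( "DET the" , 0.9 );
        emission.put( "NOUN dog" , 0.5 );
        emission.put( "VERB dog" , 0.05 );
        emission.put( "NOUN barks" , 0.1 );
        emission.put( "VERB barks" , 0.4 );
        List<String> tagSet = Arrays.asList( "DET" , "NOUN" , "VERB" );
        return tag( words , tagSet , transition , emission , beamWidth );
    }

    public static void main( String[] args )
    {
        System.out.println(
            tagToyExample( Arrays.asList( "the" , "dog" , "barks" ) , 2 ) );
    }
}
```

With a beam width of 2, only the two most probable end tags survive each word, which bounds the work per word at the cost of possibly discarding a path that would have won later. A trigram tagger follows the same pattern but tracks the last two tags as its state.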

Displaying the results

The part of speech tagger returns a java.util.List of java.util.List entries, one per sentence. Each inner java.util.List holds the AdornedWord entries for that sentence. Only the spelling and part of speech fields in each AdornedWord entry are guaranteed to be defined upon return from the part of speech tagger. You can display the results by extracting and printing the spelling and associated part of speech for each word.

        for ( int i = 0 ; i < taggedSentences.size() ; i++ )
        {
                                //  Get the next adorned sentence.
                                //  This contains a list of adorned
                                //  words.  Only the spellings
                                //  and part of speech tags are
                                //  guaranteed to be defined.
            List<AdornedWord> sentence  = taggedSentences.get( i );
            System.out.println
            (
                "---------- Sentence " + ( i + 1 ) + " ----------"
            );
                                //  Print out the spelling and part(s)
                                //  of speech for each word in the
                                //  sentence.  Punctuation is treated
                                //  as a word too.
            for ( int j = 0 ; j < sentence.size() ; j++ )
            {
                AdornedWord adornedWord = sentence.get( j );
                System.out.println
                (
                    StringUtils.rpad( ( j + 1 ) + "" , 3  ) + ": " +
                    StringUtils.rpad( adornedWord.getSpelling() , 20 ) +
                    adornedWord.getPartsOfSpeech()
                );
            }
        }
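If you want the same column layout without MorphAdorner's StringUtils helper, the sketch below uses String.format from the JDK. It assumes that rpad pads its argument with trailing blanks to the given width; behavior for strings longer than the width may differ. The class name TaggedOutputSketch and the method formatLine are hypothetical, and the spelling and tag pairs in main are hand-written placeholders rather than output produced by the tagger.

```java
public class TaggedOutputSketch
{
    /** Formats one word as: position (width 3), ": ", spelling (width 20), tag. */
    static String formatLine( int position , String spelling ,
        String partOfSpeech )
    {
                                //  "%-3s" and "%-20s" left-justify and
                                //  pad with trailing blanks.
        return String.format(
            "%-3s: %-20s%s" , position , spelling , partOfSpeech );
    }

    public static void main( String[] args )
    {
                                //  Placeholder spelling/tag pairs, not
                                //  real tagger output.
        String[][] sentence =
        {
            { "The" , "dt" } ,
            { "dog" , "n1" } ,
            { "barks" , "vvz" } ,
            { "." , "." }
        };
        System.out.println( "---------- Sentence 1 ----------" );
        for ( int j = 0 ; j < sentence.length ; j++ )
        {
            System.out.println(
                formatLine( j + 1 , sentence[ j ][ 0 ] , sentence[ j ][ 1 ] ) );
        }
    }
}
```

In the real loop you would pass adornedWord.getSpelling() and adornedWord.getPartsOfSpeech() as the second and third arguments.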

Putting it all together

You can peruse the Java source code for PosTagString, which puts all of the above code together in a runnable sample program. You will also find the source code in the src/edu/northwestern/at/morphadorner/examples/ directory in the MorphAdorner release.