Northwestern University Information Technology
MorphAdorner V2.0
Suppose you have a string of text containing one or more sentences. How do you use MorphAdorner to assign part of speech tags to each word in the text?
First you need to break up the text into sentences and words. In MorphAdorner you use a sentence splitter and a word tokenizer to perform these tasks. You can use MorphAdorner's default sentence splitter and default word tokenizer by creating an instance of each as follows.
    WordTokenizer wordTokenizer = new DefaultWordTokenizer();
    SentenceSplitter sentenceSplitter = new DefaultSentenceSplitter();
Use the sentence splitter and word tokenizer to split the text into a java.util.List of sentences, each of which is in turn a java.util.List of word and punctuation tokens. (Whitespace is not captured as part of the token list.) The text to split is stored in textToAdorn.
    List<List<String>> sentences =
        sentenceSplitter.extractSentences( textToAdorn , wordTokenizer );
Note that the sentence splitter requires the word tokenizer as a parameter.
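To make the nested list shape concrete, here is a minimal sketch that builds the same kind of List<List<String>> using only the JDK's java.text.BreakIterator. It is an illustrative stand-in, not MorphAdorner's splitter or tokenizer: the class name NestedTokenListSketch and its methods are hypothetical, and the JDK's boundary rules will differ from MorphAdorner's in many cases.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for illustration only; NOT MorphAdorner code.
final class NestedTokenListSketch {

    // Split text into sentences, then split each sentence into tokens.
    static List<List<String>> extractSentences(String text) {
        List<List<String>> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            List<String> tokens = tokenize(text.substring(start, end));
            if (!tokens.isEmpty()) {
                sentences.add(tokens);
            }
        }
        return sentences;
    }

    // Keep word and punctuation tokens; drop whitespace-only segments.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(sentence);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String token = sentence.substring(start, end);
            if (!token.trim().isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Each inner list is one sentence; punctuation is its own token.
        for (List<String> sentence : extractSentences("The cat sat. It purred!")) {
            System.out.println(sentence);
        }
    }
}
```

As in the MorphAdorner result, punctuation marks appear as separate tokens and whitespace is not kept.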
Next, create an instance of MorphAdorner's default part of speech tagger. The default tagger is a trigram tagger using a hidden Markov model and a beam search variant of the Viterbi algorithm. The default lexicon is a combination of an extensive English name list and words found in 19th century British fiction. The default part of speech tag set is the NUPOS tag set.
PartOfSpeechTagger partOfSpeechTagger = new DefaultPartOfSpeechTagger();
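For readers unfamiliar with hidden Markov model tagging, the following sketch shows the core idea of Viterbi decoding on a toy model. It is deliberately simplified and is not MorphAdorner's implementation: it uses bigram (not trigram) transitions, an exhaustive search with no beam, three placeholder tags instead of NUPOS, and probabilities invented purely for the example. The class name ToyViterbiTagger is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy bigram HMM tagger for illustration; all numbers are made up.
final class ToyViterbiTagger {

    static final String[] TAGS = {"DET", "NOUN", "VERB"};

    // P(tag at sentence start), indexed like TAGS.
    static final double[] START = {0.6, 0.3, 0.1};

    // TRANSITION[prev][cur] = P(cur tag | prev tag).
    static final double[][] TRANSITION = {
        {0.05, 0.90, 0.05},
        {0.10, 0.30, 0.60},
        {0.50, 0.40, 0.10},
    };

    // EMISSION.get(word)[tag] = P(word | tag) for the tiny toy lexicon.
    static final Map<String, double[]> EMISSION = new HashMap<>();
    static {
        EMISSION.put("the", new double[] {0.90, 0.05, 0.05});
        EMISSION.put("dog", new double[] {0.01, 0.80, 0.19});
        EMISSION.put("barks", new double[] {0.01, 0.30, 0.69});
    }

    // Unknown words get a uniform emission score across tags.
    static double emission(String word, int tag) {
        double[] p = EMISSION.get(word.toLowerCase());
        return p == null ? 1.0 / TAGS.length : p[tag];
    }

    // Return the most probable tag sequence under the toy model.
    static List<String> tag(List<String> words) {
        int n = words.size();
        if (n == 0) {
            return new ArrayList<>();
        }
        int k = TAGS.length;
        double[][] score = new double[n][k]; // best log probability so far
        int[][] back = new int[n][k];        // best previous tag index
        for (int t = 0; t < k; t++) {
            score[0][t] = Math.log(START[t]) + Math.log(emission(words.get(0), t));
        }
        for (int i = 1; i < n; i++) {
            for (int cur = 0; cur < k; cur++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int prev = 0; prev < k; prev++) {
                    double s = score[i - 1][prev] + Math.log(TRANSITION[prev][cur]);
                    if (s > best) {
                        best = s;
                        back[i][cur] = prev;
                    }
                }
                score[i][cur] = best + Math.log(emission(words.get(i), cur));
            }
        }
        // Pick the best final tag, then follow back pointers to the start.
        int last = 0;
        for (int t = 1; t < k; t++) {
            if (score[n - 1][t] > score[n - 1][last]) {
                last = t;
            }
        }
        String[] result = new String[n];
        for (int i = n - 1; i >= 0; i--) {
            result[i] = TAGS[last];
            last = back[i][last];
        }
        return Arrays.asList(result);
    }

    public static void main(String[] args) {
        System.out.println(tag(Arrays.asList("the", "dog", "barks")));
    }
}
```

A beam search variant would keep only the highest-scoring few tags at each position instead of all of them, trading a small risk of missing the best path for speed on large tag sets.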
Now invoke the part of speech tagger to assign parts of speech to each word in the extracted sentences.
List<List<AdornedWord>> taggedSentences = partOfSpeechTagger.tagSentences( sentences );
The part of speech tagger returns a java.util.List of java.util.List entries, one per sentence. Each inner java.util.List is a list of AdornedWord entries. Only the spelling and part of speech fields in each AdornedWord entry are guaranteed to be defined upon return from the part of speech tagger. You can display the results by extracting and printing the spelling and associated part of speech for each word.
    for ( int i = 0 ; i < taggedSentences.size() ; i++ )
    {
                                //  Get the next adorned sentence.
                                //  This contains a list of adorned
                                //  words.  Only the spellings
                                //  and part of speech tags are
                                //  guaranteed to be defined.

        List<AdornedWord> sentence = taggedSentences.get( i );

        System.out.println
        (
            "---------- Sentence " + ( i + 1 ) + " ----------"
        );
                                //  Print out the spelling and part(s)
                                //  of speech for each word in the
                                //  sentence.  Punctuation is treated
                                //  as a word too.

        for ( int j = 0 ; j < sentence.size() ; j++ )
        {
            AdornedWord adornedWord = sentence.get( j );

            System.out.println
            (
                StringUtils.rpad( ( j + 1 ) + "" , 3 ) + ": " +
                StringUtils.rpad( adornedWord.getSpelling() , 20 ) +
                adornedWord.getPartsOfSpeech()
            );
        }
    }
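The column layout produced by the loop above can be tried out without MorphAdorner on the classpath. The sketch below assumes that StringUtils.rpad right-pads a string with blanks to the given width; that behavior is an assumption, and the helper padRight, the class TaggedOutputSketch, and the placeholder tags are all hypothetical names introduced for this example.

```java
// Hypothetical formatting sketch; does not use MorphAdorner classes.
final class TaggedOutputSketch {

    // Right-pad s with spaces to at least width characters (width > 0).
    // Strings already longer than width are returned unchanged.
    static String padRight(String s, int width) {
        return String.format("%-" + width + "s", s);
    }

    // Build one output row: position, spelling, then the tag.
    static String formatRow(int position, String spelling, String tag) {
        return padRight(String.valueOf(position), 3) + ": "
            + padRight(spelling, 20) + tag;
    }

    public static void main(String[] args) {
        // Placeholder spellings and tags, not real NUPOS output.
        String[][] tagged = {
            {"The", "DET"}, {"dog", "NOUN"}, {"barks", "VERB"}, {".", "PUNCT"},
        };
        System.out.println("---------- Sentence 1 ----------");
        for (int j = 0; j < tagged.length; j++) {
            System.out.println(formatRow(j + 1, tagged[j][0], tagged[j][1]));
        }
    }
}
```

Fixed-width padding keeps the tag column aligned as long as no spelling exceeds 20 characters.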
You can peruse the Java source code for PosTagString, which puts all the above code together in a runnable sample program. You will also find the source code in the src/edu/northwestern/at/examples/ directory in the MorphAdorner release.
Academic Technologies and Research Services,
NU Library 2East, 1970 Campus Drive, Evanston, IL 60208.