Northwestern University Information Technology
MorphAdorner V2.0
Let's extend the example of adorning a string with parts of speech to add lemma forms and standardized spellings for each word in the string.
We will use the default English lemmatizer and the default spelling standardizer.
    // Get the default English lemmatizer.
    Lemmatizer lemmatizer = new DefaultLemmatizer();

    // Get the default spelling standardizer.
    SpellingStandardizer standardizer = new DefaultSpellingStandardizer();
The process of adding parts of speech is the same as in PosTagString. We call two new auxiliary methods to determine the lemmata and the standard spelling for each part-of-speech tagged spelling.
    for ( int j = 0 ; j < sentence.size() ; j++ )
    {
        AdornedWord adornedWord = sentence.get( j );

        // Get the standard spelling given the original
        // spelling and part of speech.
        setStandardSpelling
        (
            adornedWord ,
            standardizer ,
            partOfSpeechTags
        );

        // Set the lemma.
        setLemma
        (
            adornedWord ,
            wordLexicon ,
            lemmatizer ,
            partOfSpeechTags ,
            spellingTokenizer
        );

        // Display the adornments.
        System.out.println
        (
            StringUtils.rpad( ( j + 1 ) + "" , 3 ) + ": " +
            StringUtils.rpad( adornedWord.getSpelling() , 20 ) +
            StringUtils.rpad( adornedWord.getPartsOfSpeech() , 8 ) +
            StringUtils.rpad( adornedWord.getStandardSpelling() , 20 ) +
            adornedWord.getLemmata()
        );
    }
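The display step in the loop above lines the adornments up in fixed-width columns. The following is a minimal, self-contained sketch of that layout which does not use the MorphAdorner classes. The local `rpad` helper is a stand-in that assumes `StringUtils.rpad` pads on the right with blanks up to the given width, and the sample values passed to `formatRow` are made up for illustration.

```java
// Illustrative sketch only; these names are not part of MorphAdorner.
class AdornmentRowSketch {
    // Stand-in for StringUtils.rpad: pad with blanks on the right up to width.
    // Strings already at or beyond the width are returned unchanged (assumption).
    static String rpad(String s, int width) {
        StringBuilder sb = new StringBuilder(s);
        while (sb.length() < width) {
            sb.append(' ');
        }
        return sb.toString();
    }

    // Build one output row using the same column widths as the tutorial loop.
    static String formatRow(int index, String spelling, String partOfSpeech,
                            String standardSpelling, String lemmata) {
        return rpad(index + "", 3) + ": "
            + rpad(spelling, 20)
            + rpad(partOfSpeech, 8)
            + rpad(standardSpelling, 20)
            + lemmata;
    }

    public static void main(String[] args) {
        // Sample values are hypothetical, chosen only to show the layout.
        System.out.println(formatRow(1, "loue", "n1", "love", "love"));
    }
}
```

Because the last column is not padded, each row ends with the lemma and carries no trailing blanks.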
We start by setting the lemma form to the spelling. If the spelling belongs to a word class which should not be further lemmatized, we do nothing further. We test for this by checking if the lemmatization class for the spelling's associated part of speech tag is "none" or if the language-specific lemmatizer indicates that the spelling should not be lemmatized.
If the spelling should be lemmatized, we next check if there are multiple parts of speech in the spelling. If so, we try to find the lemma form for each part separately, and join them into a compound lemma, separating the individual pieces with the lemma form separator character. If the spelling has only a single part of speech, we find the lemma form that best fits the combination of spelling and part of speech.
    /** Get lemma for a word.
     *
     *  @param  adornedWord        The adorned word.
     *  @param  lexicon            The word lexicon.
     *  @param  lemmatizer         The lemmatizer.
     *  @param  partOfSpeechTags   The part of speech tags.
     *  @param  spellingTokenizer  Tokenizer for spelling.
     *
     *  <p>
     *  On output, sets the lemma field of the adorned word.
     *  We look in the word lexicon first for the lemma.
     *  If the lexicon does not contain the lemma, we
     *  use the lemmatizer.
     *  </p>
     */
    public static void setLemma
    (
        AdornedWord adornedWord ,
        Lexicon lexicon ,
        Lemmatizer lemmatizer ,
        PartOfSpeechTags partOfSpeechTags ,
        WordTokenizer spellingTokenizer
    )
    {
        String spelling     = adornedWord.getSpelling();
        String partOfSpeech = adornedWord.getPartsOfSpeech();
        String lemmata      = spelling;

        // Get lemmatization word class for part of speech.
        String lemmaClass =
            partOfSpeechTags.getLemmaWordClass( partOfSpeech );

        // Do not lemmatize words which should not be
        // lemmatized, including proper names.
        if ( lemmatizer.cantLemmatize( spelling ) ||
             lemmaClass.equals( "none" ) )
        {
        }
        else
        {
            // Try compound word exceptions list first.
            lemmata = lemmatizer.lemmatize( spelling , "compound" );

            // If lemma not found, keep trying.
            if ( lemmata.equals( spelling ) )
            {
                // Extract individual word parts.
                // May be more than one for a contraction.
                List wordList =
                    spellingTokenizer.extractWords( spelling );

                // If just one word part, get its lemma.
                if ( !partOfSpeechTags.isCompoundTag( partOfSpeech ) ||
                     ( wordList.size() == 1 ) )
                {
                    if ( lemmaClass.length() == 0 )
                    {
                        lemmata = lemmatizer.lemmatize( spelling );
                    }
                    else
                    {
                        lemmata =
                            lemmatizer.lemmatize( spelling , lemmaClass );
                    }
                }
                // More than one word part.
                // Get lemma for each part and concatenate them
                // with the lemma separator to form a compound lemma.
                else
                {
                    lemmata = "";
                    String lemmaPiece = "";
                    String[] posTags =
                        partOfSpeechTags.splitTag( partOfSpeech );

                    if ( posTags.length == wordList.size() )
                    {
                        for ( int i = 0 ; i < wordList.size() ; i++ )
                        {
                            String wordPiece = (String)wordList.get( i );

                            // lemmaSeparator is declared elsewhere in
                            // the enclosing class (not shown here).
                            if ( i > 0 )
                            {
                                lemmata = lemmata + lemmaSeparator;
                            }

                            lemmaClass =
                                partOfSpeechTags.getLemmaWordClass
                                (
                                    posTags[ i ]
                                );

                            lemmaPiece =
                                lemmatizer.lemmatize
                                (
                                    wordPiece ,
                                    lemmaClass
                                );

                            lemmata = lemmata + lemmaPiece;
                        }
                    }
                }
            }
        }

        adornedWord.setLemmata( lemmata );
    }
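The compound branch of `setLemma` can be tried out in isolation. The sketch below is a minimal, self-contained illustration that does not use the MorphAdorner classes. The toy lemma table, the `|` separator value, the word parts, and the tag labels are all assumptions made for demonstration, not values taken from MorphAdorner.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; these names are not part of MorphAdorner.
class CompoundLemmaSketch {
    // Assumed separator for this sketch; the tutorial code uses lemmaSeparator.
    static final String LEMMA_SEPARATOR = "|";

    // Toy lemma table standing in for a real lemmatizer (hypothetical data).
    static final Map<String, String> TOY_LEMMAS = new HashMap<>();
    static {
        TOY_LEMMAS.put("n't", "not");
        TOY_LEMMAS.put("does", "do");
    }

    // Look up one word part; fall back to the part itself when unknown.
    static String lemmatizePiece(String piece) {
        String lemma = TOY_LEMMAS.get(piece.toLowerCase());
        return lemma != null ? lemma : piece;
    }

    // Join the lemma of each word part with the separator. As in the
    // tutorial code, the result stays empty when the number of tags
    // does not match the number of word parts.
    static String compoundLemma(List<String> wordParts, String[] posTags) {
        if (posTags.length != wordParts.size()) {
            return "";
        }
        StringBuilder lemmata = new StringBuilder();
        for (int i = 0; i < wordParts.size(); i++) {
            if (i > 0) {
                lemmata.append(LEMMA_SEPARATOR);
            }
            lemmata.append(lemmatizePiece(wordParts.get(i)));
        }
        return lemmata.toString();
    }

    public static void main(String[] args) {
        // Tag labels here are placeholders, not real tagset entries.
        String[] tags = { "verb", "negation" };
        System.out.println(compoundLemma(Arrays.asList("does", "n't"), tags));
        // prints "do|not"
    }
}
```

Note that a mismatch between the number of tags and the number of word parts yields an empty lemma in this sketch, mirroring the tutorial code; a caller may prefer to fall back to the original spelling in that case.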
We start by setting the standardized form to the original spelling. If the spelling belongs to a word class which should not be standardized, we do nothing further. This includes spellings that are tagged as numbers, proper nouns, and foreign words, as well as nouns with internal capital letters.
If the spelling can be standardized, we ask the spelling standardizer to give us the best standardized form it can. We try to match the case of the original spelling in the standardized form: when the two differ only in case, we keep the original spelling. Alternatively, we could always set the standardized form to a lowercase version, except possibly for proper nouns and adjectives, and the pronoun "I".
    /** Get standard spelling for a word.
     *
     *  @param  adornedWord       The adorned word.
     *  @param  standardizer      The spelling standardizer.
     *  @param  partOfSpeechTags  The part of speech tags.
     *
     *  <p>
     *  On output, sets the standard spelling field of the adorned word.
     *  </p>
     */
    public static void setStandardSpelling
    (
        AdornedWord adornedWord ,
        SpellingStandardizer standardizer ,
        PartOfSpeechTags partOfSpeechTags
    )
    {
        // Get the spelling.
        String spelling         = adornedWord.getSpelling();
        String standardSpelling = spelling;
        String partOfSpeech     = adornedWord.getPartsOfSpeech();

        // Leave proper nouns alone.
        if ( partOfSpeechTags.isProperNounTag( partOfSpeech ) )
        {
        }
        // Leave nouns with internal capitals alone.
        else if ( partOfSpeechTags.isNounTag( partOfSpeech ) &&
                  CharUtils.hasInternalCaps( spelling ) )
        {
        }
        // Leave foreign words alone.
        else if ( partOfSpeechTags.isForeignWordTag( partOfSpeech ) )
        {
        }
        // Leave numbers alone.
        else if ( partOfSpeechTags.isNumberTag( partOfSpeech ) )
        {
        }
        // Anything else -- call the standardizer on the
        // spelling to get the standard spelling.
        else
        {
            standardSpelling =
                standardizer.standardizeSpelling
                (
                    adornedWord.getSpelling() ,
                    partOfSpeechTags.getMajorWordClass
                    (
                        adornedWord.getPartsOfSpeech()
                    )
                );

            // If the standard spelling is the same as the
            // original spelling except for case, use the
            // original spelling.
            if ( standardSpelling.equalsIgnoreCase( spelling ) )
            {
                standardSpelling = spelling;
            }
        }

        // Set the standard spelling.
        adornedWord.setStandardSpelling( standardSpelling );
    }
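To see what the final case check in `setStandardSpelling` does, here is a minimal, self-contained sketch that does not use the MorphAdorner classes. The toy standardizer and its spelling table are hypothetical stand-ins, assumed here to always return lowercase forms; only the `equalsIgnoreCase` rule is taken from the code above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; these names are not part of MorphAdorner.
class StandardSpellingSketch {
    // Toy table of variant -> standard spellings (hypothetical data).
    static final Map<String, String> TOY_STANDARD = new HashMap<>();
    static {
        TOY_STANDARD.put("loue", "love");
        TOY_STANDARD.put("vnto", "unto");
    }

    // Stand-in standardizer: returns a lowercase standard form, or the
    // lowercased input when the table has no entry for it.
    static String toyStandardize(String spelling) {
        String lower = spelling.toLowerCase(Locale.ROOT);
        String standard = TOY_STANDARD.get(lower);
        return standard != null ? standard : lower;
    }

    // Rule from the tutorial: if the standard spelling matches the
    // original except for case, keep the original spelling.
    static String keepOriginalCase(String spelling, String standardSpelling) {
        return standardSpelling.equalsIgnoreCase(spelling)
            ? spelling
            : standardSpelling;
    }

    static String standardize(String spelling) {
        return keepOriginalCase(spelling, toyStandardize(spelling));
    }

    public static void main(String[] args) {
        System.out.println(standardize("The"));   // prints "The"
        System.out.println(standardize("Loue"));  // prints "love"
        System.out.println(standardize("vnto"));  // prints "unto"
    }
}
```

In this sketch the original capitalization survives only when the standardizer changed nothing but case. A spelling that was actually altered, such as "Loue", comes back in whatever case the standardizer produced.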
You can peruse the Java source code for AdornAString, which puts all the above code together in a runnable sample program. You will also find the source code in the src/edu/northwestern/at/examples/ directory in the MorphAdorner release.
Academic Technologies and Research Services, NU Library 2East, 1970 Campus Drive, Evanston, IL 60208.