Adorning a string

Let's extend the example of adorning a string with parts of speech to add lemma forms and standardized spellings for each word in the string.

Creating a default lemmatizer and spelling standardizer

We will use the default English lemmatizer and the default spelling standardizer.

                                //  Get the default English
                                //  lemmatizer.
        Lemmatizer lemmatizer       =  new DefaultLemmatizer();
                                //  Get the default spelling
                                //  standardizer.
        SpellingStandardizer standardizer   =
            new DefaultSpellingStandardizer();

Adding lemmata and standardized spellings to the output

The process of adding parts of speech is the same as in PosTagString. We call two new auxiliary methods to determine the standard spelling and the lemmata for each part-of-speech tagged spelling.

            for ( int j = 0 ; j < sentence.size() ; j++ )
            {
                AdornedWord adornedWord = sentence.get( j );
                                //  Get the standard spelling
                                //  given the original spelling
                                //  and part of speech.
                setStandardSpelling
                (
                    adornedWord ,
                    standardizer ,
                    partOfSpeechTags
                );
                                //  Set the lemma.
                setLemma
                (
                    adornedWord ,
                    wordLexicon ,
                    lemmatizer ,
                    partOfSpeechTags ,
                    spellingTokenizer
                );
                                //  Display the adornments.
                System.out.println
                (
                    StringUtils.rpad( ( j + 1 ) + "" , 3  ) + ": " +
                    StringUtils.rpad( adornedWord.getSpelling() , 20 ) +
                    StringUtils.rpad(
                        adornedWord.getPartsOfSpeech() , 8 ) +
                    StringUtils.rpad(
                        adornedWord.getStandardSpelling() , 20 ) +
                    adornedWord.getLemmata()
                );
            }
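
The column layout of the display step can be tried without MorphAdorner on the class path. The sketch below is only an illustration: its rpad method is a hypothetical stand-in that pads on the right with blanks, and we assume, without having checked, that StringUtils.rpad behaves similarly. The sample word values are made up for the example and are not actual MorphAdorner output.

```java
/** Illustrative stand-in for the column formatting used in the display step. */
final class ColumnPadSketch
{
    /** Hypothetical helper: pad s on the right with blanks to at least
     *  width characters.  Longer strings are returned unchanged.
     */
    static String rpad( String s , int width )
    {
        StringBuilder sb = new StringBuilder( s );
        while ( sb.length() < width )
        {
            sb.append( ' ' );
        }
        return sb.toString();
    }

    /** Format one row: index, spelling, part of speech,
     *  standard spelling, and lemmata.
     */
    static String formatRow
    (
        int index ,
        String spelling ,
        String partOfSpeech ,
        String standardSpelling ,
        String lemmata
    )
    {
        return
            rpad( index + "" , 3 ) + ": " +
            rpad( spelling , 20 ) +
            rpad( partOfSpeech , 8 ) +
            rpad( standardSpelling , 20 ) +
            lemmata;
    }

    public static void main( String[] args )
    {
                                //  Made-up sample values.
        System.out.println(
            formatRow( 1 , "louing" , "vvg" , "loving" , "love" ) );
    }
}
```

Padding each field to a fixed width keeps the columns aligned as long as no field exceeds its width.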

Getting the lemma form

We start by setting the lemma form to the spelling. If the spelling belongs to a word class which should not be lemmatized, we do nothing further. We test for this by checking whether the lemmatization class for the spelling's part-of-speech tag is "none", or whether the language-specific lemmatizer reports that the spelling cannot be lemmatized.

If the spelling should be lemmatized, we first look it up in the lemmatizer's compound word exceptions list. If that lookup does not produce a lemma, we check whether the spelling carries multiple parts of speech, as a contraction may. If so, we try to find the lemma form for each part separately and join them into a compound lemma, separating the individual pieces with the lemma form separator character. If the spelling has only a single part of speech, or consists of only one word part, we find the lemma form that best fits the combination of spelling and part of speech.

    /** Get lemma for a word.
     *
     *  @param  adornedWord         The adorned word.
     *  @param  lexicon             The word lexicon.
     *  @param  lemmatizer          The lemmatizer.
     *  @param  partOfSpeechTags    The part of speech tags.
     *  @param  spellingTokenizer   Tokenizer for spelling.
     *
     *  <p>
     *  On output, sets the lemma field of the adorned word.
     *  We look in the word lexicon first for the lemma.
     *  If the lexicon does not contain the lemma, we
     *  use the lemmatizer.
     *  </p>
     */
    public static void setLemma
    (
        AdornedWord adornedWord  ,
        Lexicon lexicon ,
        Lemmatizer lemmatizer ,
        PartOfSpeechTags partOfSpeechTags ,
        WordTokenizer spellingTokenizer
    )
    {
        String spelling     = adornedWord.getSpelling();
        String partOfSpeech = adornedWord.getPartsOfSpeech();
        String lemmata      = spelling;
                                //  Get lemmatization word class
                                //  for part of speech.
        String lemmaClass   =
            partOfSpeechTags.getLemmaWordClass( partOfSpeech );
                                //  Do not lemmatize words which
                                //  should not be lemmatized,
                                //  including proper names.
        if  (   lemmatizer.cantLemmatize( spelling ) ||
                lemmaClass.equals( "none" )
            )
        {
        }
        else
        {
                                //  Try compound word exceptions
                                //  list first.
            lemmata = lemmatizer.lemmatize( spelling , "compound" );
                                //  If lemma not found, keep trying.
            if ( lemmata.equals( spelling ) )
            {
                                //  Extract individual word parts.
                                //  May be more than one for a
                                //  contraction.
                List wordList   =
                    spellingTokenizer.extractWords( spelling );
                                //  If just one word part,
                                //  get its lemma.
                if  (   !partOfSpeechTags.isCompoundTag( partOfSpeech ) ||
                        ( wordList.size() == 1 )
                    )
                {
                    if ( lemmaClass.length() == 0 )
                    {
                        lemmata = lemmatizer.lemmatize( spelling );
                    }
                    else
                    {
                        lemmata =
                            lemmatizer.lemmatize( spelling , lemmaClass );
                    }
                }
                                //  More than one word part.
                                //  Get lemma for each part and
                                //  concatenate them with the
                                //  lemma separator to form a
                                //  compound lemma.
                else
                {
                    lemmata             = "";
                    String lemmaPiece   = "";
                    String[] posTags    =
                        partOfSpeechTags.splitTag( partOfSpeech );
                    if ( posTags.length == wordList.size() )
                    {
                        for ( int i = 0 ; i < wordList.size() ; i++ )
                        {
                            String wordPiece    = (String)wordList.get( i );
                            if ( i > 0 )
                            {
                                lemmata = lemmata + lemmaSeparator;
                            }
                            lemmaClass  =
                                partOfSpeechTags.getLemmaWordClass
                                (
                                    posTags[ i ]
                                );
                            lemmaPiece  =
                                lemmatizer.lemmatize
                                (
                                    wordPiece ,
                                    lemmaClass
                                );
                            lemmata = lemmata + lemmaPiece;
                        }
                    }
                }
            }
        }
        adornedWord.setLemmata( lemmata );
    }
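
To see the compound lemma step in isolation, here is a minimal sketch that does not use the MorphAdorner classes. Everything in it is an assumption made for illustration: the toy lemma table stands in for a real lemmatizer, the "|" separator and the tag names are hypothetical, and the part-of-speech tags are used only to check that there is one tag per word part. Unlike the method above, which leaves the lemma empty when the counts differ, this sketch falls back to the original spelling in that case.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of joining per-part lemmata into a compound lemma. */
final class CompoundLemmaSketch
{
    /** Hypothetical separator, chosen only for this example. */
    static final String LEMMA_SEPARATOR = "|";

    /** Toy lemma table standing in for a real lemmatizer. */
    static final Map<String, String> TOY_LEMMATA = new HashMap<>();

    static
    {
        TOY_LEMMATA.put( "'s" , "be" );
        TOY_LEMMATA.put( "'ll" , "will" );
    }

    /** Look up one word part; unknown parts map to their lowercase form. */
    static String lemmatizePiece( String piece )
    {
        String key = piece.toLowerCase();
        return TOY_LEMMATA.getOrDefault( key , key );
    }

    /** Build a compound lemma from word parts and their tags. */
    static String compoundLemma
    (
        List<String> pieces ,
        String[] posTags ,
        String spelling
    )
    {
                                //  One tag per word part is required.
                                //  Otherwise keep the spelling
                                //  (a choice made for this sketch).
        if ( pieces.size() != posTags.length )
        {
            return spelling;
        }

        StringBuilder lemmata = new StringBuilder();

        for ( int i = 0 ; i < pieces.size() ; i++ )
        {
            if ( i > 0 )
            {
                lemmata.append( LEMMA_SEPARATOR );
            }
            lemmata.append( lemmatizePiece( pieces.get( i ) ) );
        }
        return lemmata.toString();
    }

    public static void main( String[] args )
    {
        System.out.println(
            compoundLemma(
                Arrays.asList( "she" , "'s" ) ,
                new String[]{ "pronoun" , "verb" } ,
                "she's" ) );
    }
}
```

With the toy table above, the contraction "she's" split into "she" and "'s" yields the compound lemma "she|be".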

Getting the standardized spelling

We start by setting the standardized form to the original spelling. If the spelling belongs to a word class which should not be standardized, we do nothing further. This includes spellings that are tagged as proper nouns, numbers, or foreign words, as well as nouns that contain internal capital letters.

If the spelling can be standardized, we ask the spelling standardizer to give us the best standardized form it can. We try to preserve the case of the original spelling: if the standardized form differs from the original only in case, we keep the original spelling. Alternatively, we could always set the standardized form to a lowercase version, except possibly for proper nouns and adjectives, and the pronoun "I".

    /** Get standard spelling for a word.
     *
     *  @param  adornedWord         The adorned word.
     *  @param  standardizer        The spelling standardizer.
     *  @param  partOfSpeechTags    The part of speech tags.
     *
     *  <p>
     *  On output, sets the standard spelling field of the adorned word.
     *  </p>
     */
    public static void setStandardSpelling
    (
        AdornedWord adornedWord  ,
        SpellingStandardizer standardizer ,
        PartOfSpeechTags partOfSpeechTags
    )
    {
                                //  Get the spelling.
        String spelling         = adornedWord.getSpelling();
        String standardSpelling = spelling;
        String partOfSpeech     = adornedWord.getPartsOfSpeech();
                                //  Leave proper nouns alone.
        if ( partOfSpeechTags.isProperNounTag( partOfSpeech ) )
        {
        }
                                //  Leave nouns with internal
                                //  capitals alone.
        else if (   partOfSpeechTags.isNounTag( partOfSpeech )  &&
                    CharUtils.hasInternalCaps( spelling ) )
        {
        }
                                //  Leave foreign words alone.
        else if ( partOfSpeechTags.isForeignWordTag( partOfSpeech ) )
        {
        }
                                //  Leave numbers alone.
        else if ( partOfSpeechTags.isNumberTag( partOfSpeech ) )
        {
        }
                                //  Anything else -- call the
                                //  standardizer on the spelling
                                //  to get the standard spelling.
        else
        {
            standardSpelling    =
                standardizer.standardizeSpelling
                (
                    adornedWord.getSpelling() ,
                    partOfSpeechTags.getMajorWordClass
                    (
                        adornedWord.getPartsOfSpeech()
                    )
                );
                                //  If the standard spelling
                                //  is the same as the original
                                //  spelling except for case,
                                //  use the original spelling.
            if ( standardSpelling.equalsIgnoreCase( spelling ) )
            {
                standardSpelling    = spelling;
            }
        }
                                //  Set the standard spelling.
        adornedWord.setStandardSpelling( standardSpelling );
    }
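
The case rule at the end of the method can be illustrated with a small self-contained sketch. It makes several assumptions: the toy table stands in for a real spelling standardizer and always returns lowercase forms, the leaveAlone flag stands in for the part-of-speech checks (proper nouns, numbers, and so on), and the sample spellings are chosen only for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of the "keep the original case" rule. */
final class StandardSpellingSketch
{
    /** Toy table standing in for a real spelling standardizer. */
    static final Map<String, String> TOY_STANDARD = new HashMap<>();

    static
    {
        TOY_STANDARD.put( "loue" , "love" );
        TOY_STANDARD.put( "vertue" , "virtue" );
    }

    /** Toy standardizer: always answers in lowercase. */
    static String standardize( String spelling )
    {
        String key = spelling.toLowerCase();
        return TOY_STANDARD.getOrDefault( key , key );
    }

    /** Choose the standard spelling for one word.
     *
     *  @param  spelling    The original spelling.
     *  @param  leaveAlone  True if the word class should not
     *                      be standardized.
     */
    static String standardSpelling( String spelling , boolean leaveAlone )
    {
        if ( leaveAlone )
        {
            return spelling;
        }

        String standard = standardize( spelling );

                                //  Same word except for case:
                                //  keep the original spelling.
        if ( standard.equalsIgnoreCase( spelling ) )
        {
            return spelling;
        }
        return standard;
    }

    public static void main( String[] args )
    {
        System.out.println( standardSpelling( "Loue" , false ) );
        System.out.println( standardSpelling( "The" , false ) );
    }
}
```

Here "Loue" becomes "love" because the toy standardizer changes the letters, while "The" is kept as written because the lowercase answer "the" differs from it only in case.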

Putting it all together

You can peruse the Java source code for AdornAString, which puts all of the above code together in a runnable sample program. You will also find the source code in the src/edu/northwestern/at/examples/ directory in the MorphAdorner release.
