Poets that lasting marble seek,
Must carve in Latin or in Greek.
We write in sand, our language grows,
And like the tide, our work o'erflows.

-- Edmund Waller




Adorning a string

Let's extend the example of adorning a string with parts of speech to add lemma forms and standardized spellings for each word in the string.

Creating a default lemmatizer and spelling standardizer

We will use the default English lemmatizer and the default spelling standardizer.

                                //  Get the default English
                                //  lemmatizer.

        Lemmatizer lemmatizer       =  new DefaultLemmatizer();

                                //  Get the default spelling
                                //  standardizer.

        SpellingStandardizer standardizer   =
            new DefaultSpellingStandardizer();

Adding lemmata and standardized spellings to the output

The process of adding parts of speech is the same as in PosTagString. We call two new auxiliary methods to determine the lemma and the standard spelling for each part-of-speech tagged spelling.

            for ( int j = 0 ; j < sentence.size() ; j++ )
            {
                AdornedWord adornedWord = sentence.get( j );

                                //  Get the standard spelling
                                //  given the original spelling
                                //  and part of speech.

                setStandardSpelling
                (
                    adornedWord ,
                    standardizer ,
                    partOfSpeechTags
                );
                                //  Set the lemma.

                setLemma
                (
                    adornedWord ,
                    wordLexicon ,
                    lemmatizer ,
                    partOfSpeechTags ,
                    spellingTokenizer
                );

                                //  Display the adornments.

                System.out.println
                (
                    StringUtils.rpad( ( j + 1 ) + "" , 3  ) + ": " +
                    StringUtils.rpad( adornedWord.getSpelling() , 20 ) +
                    StringUtils.rpad(
                        adornedWord.getPartsOfSpeech() , 8 ) +
                    StringUtils.rpad(
                        adornedWord.getStandardSpelling() , 20 ) +
                    adornedWord.getLemmata()
                );
            }
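
To see what one line of this output looks like without running MorphAdorner, here is a minimal, self-contained sketch of the column layout. The rpad method below is a hypothetical stand-in for StringUtils.rpad (we assume it pads on the right with blanks and never truncates), and the sample spelling, tag, and lemma values are made up for illustration rather than taken from real MorphAdorner output.

```java
class AdornmentRowSketch
{
    /** Hypothetical stand-in for StringUtils.rpad:
     *  pads a string on the right with blanks up to a given width.
     */
    static String rpad( String s , int width )
    {
        StringBuilder sb = new StringBuilder( s );

        while ( sb.length() < width )
        {
            sb.append( ' ' );
        }

        return sb.toString();
    }

    /** Builds one output row using the same column widths as the loop above. */
    static String formatRow
    (
        int index ,
        String spelling ,
        String partOfSpeech ,
        String standardSpelling ,
        String lemmata
    )
    {
        return
            rpad( ( index + 1 ) + "" , 3 ) + ": " +
            rpad( spelling , 20 ) +
            rpad( partOfSpeech , 8 ) +
            rpad( standardSpelling , 20 ) +
            lemmata;
    }

    public static void main( String[] args )
    {
                                //  Illustrative values only, not
                                //  actual MorphAdorner output.

        System.out.println(
            formatRow( 0 , "loues" , "vvz" , "loves" , "love" ) );
    }
}
```

Each row starts with the one-based word number, followed by the original spelling, the part of speech, the standard spelling, and finally the lemma, which is left unpadded because it is the last column.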

Getting the lemma form

We start by setting the lemma form to the spelling. If the spelling belongs to a word class which should not be further lemmatized, we do nothing further. We test for this by checking if the lemmatization class for the spelling's associated part of speech tag is "none" or if the language-specific lemmatizer indicates that the spelling cannot be lemmatized.

If the spelling should be lemmatized, we first look it up in the lemmatizer's compound word exceptions list. If that does not produce a lemma, we check whether the part of speech is a compound tag and the spelling splits into more than one word part. If so, we try to find the lemma form for each part separately, and join them into a compound lemma, separating the individual pieces with the lemma form separator character. If the spelling has only a single part of speech, we find the lemma form that best fits the combination of spelling and part of speech.

    /** Get lemma for a word.
     *
     *  @param  adornedWord         The adorned word.
     *  @param  lexicon             The word lexicon.
     *  @param  lemmatizer          The lemmatizer.
     *  @param  partOfSpeechTags    The part of speech tags.
     *  @param  spellingTokenizer   Tokenizer for spelling.
     *
     *  <p>
     *  On output, sets the lemma field of the adorned word.
     *  We look in the word lexicon first for the lemma.
     *  If the lexicon does not contain the lemma, we
     *  use the lemmatizer.
     *  </p>
     */

    public static void setLemma
    (
        AdornedWord adornedWord  ,
        Lexicon lexicon ,
        Lemmatizer lemmatizer ,
        PartOfSpeechTags partOfSpeechTags ,
        WordTokenizer spellingTokenizer
    )
    {
        String spelling     = adornedWord.getSpelling();
        String partOfSpeech = adornedWord.getPartsOfSpeech();
        String lemmata      = spelling;

                                //  Get lemmatization word class
                                //  for part of speech.
        String lemmaClass   =
            partOfSpeechTags.getLemmaWordClass( partOfSpeech );

                                //  Do not lemmatize words which
                                //  should not be lemmatized,
                                //  including proper names.

        if  (   lemmatizer.cantLemmatize( spelling ) ||
                lemmaClass.equals( "none" )
            )
        {
        }
        else
        {
                                //  Try compound word exceptions
                                //  list first.

            lemmata = lemmatizer.lemmatize( spelling , "compound" );

                                //  If lemma not found, keep trying.

            if ( lemmata.equals( spelling ) )
            {
                                //  Extract individual word parts.
                                //  May be more than one for a
                                //  contraction.

                List wordList   =
                    spellingTokenizer.extractWords( spelling );

                                //  If just one word part,
                                //  get its lemma.

                if  (   !partOfSpeechTags.isCompoundTag( partOfSpeech ) ||
                        ( wordList.size() == 1 )
                    )
                {
                    if ( lemmaClass.length() == 0 )
                    {
                        lemmata = lemmatizer.lemmatize( spelling );
                    }
                    else
                    {
                        lemmata =
                            lemmatizer.lemmatize( spelling , lemmaClass );
                    }
                }
                                //  More than one word part.
                                //  Get lemma for each part and
                                //  concatenate them with the
                                //  lemma separator to form a
                                //  compound lemma.
                else
                {
                    lemmata             = "";
                    String lemmaPiece   = "";
                    String[] posTags    =
                        partOfSpeechTags.splitTag( partOfSpeech );

                    if ( posTags.length == wordList.size() )
                    {
                        for ( int i = 0 ; i < wordList.size() ; i++ )
                        {
                            String wordPiece    = (String)wordList.get( i );

                            if ( i > 0 )
                            {
                                lemmata = lemmata + lemmaSeparator;
                            }

                            lemmaClass  =
                                partOfSpeechTags.getLemmaWordClass
                                (
                                    posTags[ i ]
                                );

                            lemmaPiece  =
                                lemmatizer.lemmatize
                                (
                                    wordPiece ,
                                    lemmaClass
                                );

                            lemmata = lemmata + lemmaPiece;
                        }
                    }
                }
            }
        }

        adornedWord.setLemmata( lemmata );
    }
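
The compound lemma step can be tried in isolation with the following sketch. Everything in it is an assumption made for illustration: the "|" separator (the actual lemmaSeparator value is defined outside the code shown here), the word class names, and the toy lemma table that stands in for a real Lemmatizer. Unlike the listing above, which leaves the lemma empty when the number of tags and word parts differ, the sketch falls back to the original spelling in that case.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CompoundLemmaSketch
{
    /** Assumed separator; the real value is not shown in this article. */
    static final String LEMMA_SEPARATOR = "|";

    /** Toy lemmatizer: looks up "wordPiece/wordClass" in a table and
     *  returns the word piece unchanged when there is no entry.
     */
    static String toyLemmatize
    (
        Map<String, String> toyLemmata ,
        String wordPiece ,
        String wordClass
    )
    {
        String lemma = toyLemmata.get( wordPiece + "/" + wordClass );

        return ( lemma == null ) ? wordPiece : lemma;
    }

    /** Joins the lemma of each word part with the lemma separator. */
    static String compoundLemma
    (
        String spelling ,
        List<String> wordPieces ,
        String[] wordClasses ,
        Map<String, String> toyLemmata
    )
    {
                                //  Sketch-only choice: keep the
                                //  original spelling when the parts
                                //  and word classes do not line up.

        if ( wordClasses.length != wordPieces.size() )
        {
            return spelling;
        }

        StringBuilder lemmata = new StringBuilder();

        for ( int i = 0 ; i < wordPieces.size() ; i++ )
        {
            if ( i > 0 )
            {
                lemmata.append( LEMMA_SEPARATOR );
            }

            lemmata.append(
                toyLemmatize(
                    toyLemmata , wordPieces.get( i ) , wordClasses[ i ] ) );
        }

        return lemmata.toString();
    }

    public static void main( String[] args )
    {
        Map<String, String> toyLemmata = new HashMap<>();
        toyLemmata.put( "'s/verb" , "be" );

                                //  Prints it|be

        System.out.println(
            compoundLemma(
                "it's" ,
                Arrays.asList( "it" , "'s" ) ,
                new String[]{ "pronoun" , "verb" } ,
                toyLemmata ) );
    }
}
```

The sketch uses a StringBuilder instead of repeated string concatenation, but the loop otherwise follows the same order as setLemma: add the separator before every part except the first, then append that part's lemma.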

Getting the standardized spelling

We start by setting the standardized form to the original spelling. If the spelling belongs to a word class which should not be standardized, we do nothing further. This includes spellings that are tagged as proper nouns, foreign words, and numbers, as well as nouns that contain internal capital letters.

If the spelling can be standardized, we ask the spelling standardizer to give us the best standardized form it can. We try to match the case of the original spelling in the standardized form: when the standardized form differs from the original only in case, we keep the original spelling. Alternatively, we could always set the standardized form to a lower case version, except possibly for proper nouns and adjectives, and the pronoun "I".

    /** Get standard spelling for a word.
     *
     *  @param  adornedWord         The adorned word.
     *  @param  standardizer        The spelling standardizer.
     *  @param  partOfSpeechTags    The part of speech tags.
     *
     *  <p>
     *  On output, sets the standard spelling field of the adorned word.
     *  </p>
     */

    public static void setStandardSpelling
    (
        AdornedWord adornedWord  ,
        SpellingStandardizer standardizer ,
        PartOfSpeechTags partOfSpeechTags
    )
    {
                                //  Get the spelling.

        String spelling         = adornedWord.getSpelling();
        String standardSpelling = spelling;
        String partOfSpeech     = adornedWord.getPartsOfSpeech();

                                //  Leave proper nouns alone.

        if ( partOfSpeechTags.isProperNounTag( partOfSpeech ) )
        {
        }
                                //  Leave nouns with internal
                                //  capitals alone.

        else if (   partOfSpeechTags.isNounTag( partOfSpeech )  &&
                    CharUtils.hasInternalCaps( spelling ) )
        {
        }
                                //  Leave foreign words alone.

        else if ( partOfSpeechTags.isForeignWordTag( partOfSpeech ) )
        {
        }
                                //  Leave numbers alone.

        else if ( partOfSpeechTags.isNumberTag( partOfSpeech ) )
        {
        }
                                //  Anything else -- call the
                                //  standardizer on the spelling
                                //  to get the standard spelling.
        else
        {
            standardSpelling    =
                standardizer.standardizeSpelling
                (
                    adornedWord.getSpelling() ,
                    partOfSpeechTags.getMajorWordClass
                    (
                        adornedWord.getPartsOfSpeech()
                    )
                );

                                //  If the standard spelling
                                //  is the same as the original
                                //  spelling except for case,
                                //  use the original spelling.

            if ( standardSpelling.equalsIgnoreCase( spelling ) )
            {
                standardSpelling    = spelling;
            }
        }
                                //  Set the standard spelling.

        adornedWord.setStandardSpelling( standardSpelling );
    }
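
The following sketch mimics the same decision chain without any MorphAdorner classes. It is only an approximation under stated assumptions: the WordClass enum replaces the part of speech tag tests, a small map replaces the SpellingStandardizer, the check for nouns with internal capitals is omitted, and the matchCase helper is a hypothetical illustration of how the case of the original spelling might be carried over to the standardized form.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class StandardSpellingSketch
{
    /** Toy word classes standing in for the part of speech tag tests. */
    enum WordClass { PROPER_NOUN , FOREIGN_WORD , NUMBER , OTHER }

    /** Hypothetical helper: copies all-caps or initial-cap
     *  style from the original spelling to the standardized one.
     */
    static String matchCase( String original , String standardized )
    {
        if ( original.isEmpty() || standardized.isEmpty() )
        {
            return standardized;
        }

        boolean allCaps =
            ( original.length() > 1 ) &&
            original.equals( original.toUpperCase( Locale.ROOT ) ) &&
            !original.equals( original.toLowerCase( Locale.ROOT ) );

        if ( allCaps )
        {
            return standardized.toUpperCase( Locale.ROOT );
        }

        if ( Character.isUpperCase( original.charAt( 0 ) ) )
        {
            return Character.toUpperCase( standardized.charAt( 0 ) ) +
                standardized.substring( 1 );
        }

        return standardized;
    }

    /** Returns the standard spelling, or the original spelling
     *  when the word should be left alone.
     */
    static String standardize
    (
        String spelling ,
        WordClass wordClass ,
        Map<String, String> toySpellings
    )
    {
                                //  Leave proper nouns, foreign
                                //  words, and numbers alone.

        if ( wordClass != WordClass.OTHER )
        {
            return spelling;
        }

        String standardSpelling =
            toySpellings.get( spelling.toLowerCase( Locale.ROOT ) );

                                //  Keep the original when nothing is
                                //  found or only the case differs.

        if  (   ( standardSpelling == null ) ||
                standardSpelling.equalsIgnoreCase( spelling )
            )
        {
            return spelling;
        }

        return matchCase( spelling , standardSpelling );
    }

    public static void main( String[] args )
    {
        Map<String, String> toySpellings = new HashMap<>();
        toySpellings.put( "loue" , "love" );

                                //  Prints Love, then Loue.

        System.out.println(
            standardize( "Loue" , WordClass.OTHER , toySpellings ) );
        System.out.println(
            standardize( "Loue" , WordClass.PROPER_NOUN , toySpellings ) );
    }
}
```

Returning early for the word classes that are left alone avoids the chain of empty if blocks, at the cost of no longer having a single exit point where the standard spelling is set.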

Putting it all together

You can peruse the Java source code for AdornAString, which puts all the above code together in a runnable sample program. You will also find the source code in the src/edu/northwestern/at/morphadorner/examples/ directory in the MorphAdorner release.

 

Academic Technologies  NU Library 2East  1970 Campus Drive  Evanston, IL 60208
E-mail: pib@northwestern.edu
Last updated Sun Mar 15 05:52:58 2009   World Wide Web Disclaimer and University Policy Statements   © 2007, 2008 Northwestern University