Using An Adorned Text

Introduction

Once you have MorphAdorned a text, you will probably want to do something with it. The AdornedXMLReader class reads an adorned file and extracts a list of ExtendedAdornedWord entries. In addition to the morphological information encoded in adorned files, each ExtendedAdornedWord provides extra information, including the word and sentence numbers, whether the word occurs in main text or paratext, whether it occurs in verse, and its XML tag path, among other details. AdornedXMLReader also makes it easy to extract sentences.

Sample text

We will use Nathaniel Hawthorne's short story "The Shaker Bridal" from Twice Told Tales as a sample text. The adorned XML text is found in eaf434.zip.

To load the word information from an adorned file, create an AdornedXMLReader and pass the name of the adorned file to read as a parameter. Here we load eaf434.xml which contains the adorned XML for "The Shaker Bridal."

    AdornedXMLReader xmlReader = new AdornedXMLReader( "eaf434.xml" );

To extract the list of word IDs, use the getAdornedWordIDs method of AdornedXMLReader.

    List<String> wordIDs   =
        xmlReader.getAdornedWordIDs();

Given a word ID you can use the getExtendedAdornedWord method of AdornedXMLReader to obtain the word information as an ExtendedAdornedWord.

To extract the list of sentences, use the getSentences method of AdornedXMLReader.

    List<List<ExtendedAdornedWord>> sentences   =
        xmlReader.getSentences();

Generating displayable sentences

You can regenerate displayable sentences using the SentenceMelder class, which only requires a list of ExtendedAdornedWord entries. Here we print the first five sentences of an adorned file.

    PrintStream printStream =
            new PrintStream
            (
                new BufferedOutputStream( System.out ) ,
                true ,
                "utf-8"
            );
    printStream.println();
    printStream.println
    (
        "The first five sentences are:"
    );
    printStream.println();
    printStream.println( StringUtils.dupl( "-" , 70 ) );
    SentenceMelder melder   = new SentenceMelder();
    for ( int i = 0 ;
        i < Math.min( 5 , sentences.size() ) ; i++ )
    {
                    //  Get text for this sentence.
        String sentenceText =
            melder.reconstituteSentence( sentences.get( i ) );
                    //  Wrap the sentence text at column 70.
        sentenceText    =
            StringUtils.wrapText(
                sentenceText, Env.LINE_SEPARATOR , 70 );
                    //  Print wrapped sentence text.
        printStream.println
        (
            ( i + 1 ) + ": " +
            sentenceText
        );
    }

Extracting individual word information

Each sentence is a list of ExtendedAdornedWord entries. For example, we can extract word information for each word in the third sentence of a text (index 2, since list indices start at 0) as follows.

    List<ExtendedAdornedWord> sentence  = sentences.get( 2 );
    for ( int i = 0 ; i < sentence.size() ; i++ )
    {
        ExtendedAdornedWord adornedWord = sentence.get( i );
        printStream.println( "Word " + ( i + 1 ) );
        printStream.println(
            "  Word ID          : " + adornedWord.getID() );
        printStream.println(
            "  Token            : " + adornedWord.getToken() );
        printStream.println(
            "  Spelling         : " + adornedWord.getSpelling() );
        printStream.println(
            "  Lemmata          : " + adornedWord.getLemmata() );
        printStream.println(
            "  Pos tags         : " +
            adornedWord.getPartsOfSpeech() );
        printStream.println(
            "  Standard spelling: " +
            adornedWord.getStandardSpelling() );
        printStream.println(
            "  Sentence number  : " +
            adornedWord.getSentenceNumber() );
        printStream.println(
            "  Word number      : " +
            adornedWord.getWordNumber() );
        printStream.println(
            "  XML path         : " +
            adornedWord.getPath() );
        printStream.println(
            "  is EOS           : " +
            adornedWord.getEOS() );
        printStream.println(
            "  word part flag   : " +
            adornedWord.getPart() );
        printStream.println(
            "  word ordinal     : " +
            adornedWord.getOrd() );
        printStream.println(
            "  page number      : " +
            adornedWord.getPageNumber() );
        printStream.println(
            "  Main or side text: " +
            adornedWord.getMainSide() );
        printStream.println(
            "  is spoken        : " +
            adornedWord.getSpoken() );
        printStream.println(
            "  is verse         : " +
            adornedWord.getVerse() );
        printStream.println(
            "  in jump tag      : " +
            adornedWord.getInJumpTag() );
        printStream.println(
            "  is a gap         : " +
            adornedWord.getGap() );
    }

For example, the word information for the ninth word in the third sentence of "The Shaker Bridal" is:

    Word ID          : eaf434-00440
    Token            : Father
    Spelling         : Father
    Lemmata          : father
    Pos tags         : n1
    Standard spelling: Father
    Sentence number  : 3
    Word number      : 9
    XML path         : \eaf434\body[1]\div[1]\p[1]\w[8]
    is EOS           : false
    word part flag   : N
    word ordinal     : 21
    page number      : 8
    Main or side text: MAIN
    is spoken        : false
    is verse         : false
    in jump tag      : false
    is a gap         : false

Word Paths

The XML word path takes the form

    \document\struct[i]\struct2[j]\struct3[k]...\w[n]

where "document" is the document name (e.g., eaf434 for "The Shaker Bridal"), the "struct[]" elements are the XML tag names with numbers assigned in order of appearance in a given document subtree, and "w[]" is the word number within the current parent structural element. The path gives a flattened version of the XML ancestry for each word.

The structure numbers start at 1 (not 0) and start over for each document subtree. This means, for example, that paragraph numbers (i.e., "p" element numbers) start over for each "div".
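The numbering rule can be illustrated with a small, self-contained sketch. It is not MorphAdorner's own code: the class PathNumberingSketch and its methods are hypothetical, and it assumes one particular reading of the rule, namely that numbers are counted separately for each tag name among the children of each parent element. It also assumes the root element is replaced by the document name in the path.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

//  Hypothetical illustration of the numbering rule; not part of MorphAdorner.
final class PathNumberingSketch
{
    //  Returns one path per <w> element, in document order.
    static List<String> wordPaths( String documentName , String xml )
    {
        try
        {
            Element root =
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(
                        new ByteArrayInputStream(
                            xml.getBytes( StandardCharsets.UTF_8 ) ) )
                    .getDocumentElement();
            List<String> paths = new ArrayList<String>();
            collect( root , "\\" + documentName , paths );
            return paths;
        }
        catch ( Exception e )
        {
            throw new IllegalArgumentException( "Could not parse XML" , e );
        }
    }

    //  Assumption: counters are kept per tag name and restart for each parent.
    private static void collect(
        Element parent , String prefix , List<String> paths )
    {
        Map<String, Integer> counters = new HashMap<String, Integer>();
        for ( Node child = parent.getFirstChild() ;
            child != null ; child = child.getNextSibling() )
        {
            if ( !( child instanceof Element ) ) continue;
            String name = child.getNodeName();
                    //  Numbers start at 1, not 0.
            int number  = counters.merge( name , 1 , Integer::sum );
            String path = prefix + "\\" + name + "[" + number + "]";
            if ( name.equals( "w" ) )
            {
                paths.add( path );
            }
            else
            {
                collect( (Element)child , path , paths );
            }
        }
    }

    public static void main( String[] args )
    {
        String xml =
            "<text><body>" +
            "<div><p><w>a</w><w>b</w></p><p><w>c</w></p></div>" +
            "<div><p><w>d</w></p></div>" +
            "</body></text>";
        for ( String path : wordPaths( "doc" , xml ) )
        {
            System.out.println( path );
        }
    }
}
```

In the sample document, the single paragraph of the second "div" is numbered p[1] again, because the paragraph counter restarts for each "div".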

Here is a typical word path ID:

    \eaf434\body[1]\div[1]\p[1]\w[26]

In this example "eaf434" is the document name, and "body[1]" is the first (and usually only) body element. "div[1]" corresponds to the first text division of "The Shaker Bridal" (but could be something else for another document). "p[1]" is paragraph 1, and "w[26]" is the twenty-sixth word in paragraph 1.
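If you want to work with the pieces of a word path in your own code, the path string can be split on the backslash separator. The following is a minimal sketch using only the standard Java library. The class WordPathParser and its Step type are hypothetical helpers written for this example and are not part of MorphAdorner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

//  Hypothetical helper, not part of MorphAdorner: splits a word path
//  such as \eaf434\body[1]\div[1]\p[1]\w[26] into its components.
final class WordPathParser
{
    //  One path step: an element name and its 1-based number.
    static final class Step
    {
        final String name;
        final int number;

        Step( String name , int number )
        {
            this.name   = name;
            this.number = number;
        }

        @Override
        public String toString()
        {
            return name + "[" + number + "]";
        }
    }

    //  Matches a single step of the form name[n].
    private static final Pattern STEP_PATTERN =
        Pattern.compile( "(.+)\\[(\\d+)\\]" );

    //  Returns the document name, the first component after the
    //  leading backslash.
    static String documentName( String path )
    {
        return path.split( "\\\\" )[ 1 ];
    }

    //  Returns the numbered steps that follow the document name.
    static List<Step> steps( String path )
    {
        String[] parts   = path.split( "\\\\" );
        List<Step> steps = new ArrayList<Step>();
                    //  parts[0] is empty because of the leading
                    //  backslash; parts[1] is the document name.
        for ( int i = 2 ; i < parts.length ; i++ )
        {
            Matcher matcher = STEP_PATTERN.matcher( parts[ i ] );
            if ( !matcher.matches() )
            {
                throw new IllegalArgumentException(
                    "Not a numbered path step: " + parts[ i ] );
            }
            steps.add(
                new Step(
                    matcher.group( 1 ) ,
                    Integer.parseInt( matcher.group( 2 ) ) ) );
        }
        return steps;
    }

    public static void main( String[] args )
    {
        String path = "\\eaf434\\body[1]\\div[1]\\p[1]\\w[26]";
        System.out.println( "Document: " + documentName( path ) );
        for ( Step step : steps( path ) )
        {
            System.out.println(
                "  " + step.name + " number " + step.number );
        }
    }
}
```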

Generating XML

Given a list of adjacent adorned words, we can use their word paths to reconstitute an XML representation of the text for those words. We do this by using an XML element stack and pushing and popping XML elements as needed to represent the structural changes indicated by the succession of word path IDs. The XML will not match the original exactly, but is good enough for display purposes. The word range need not be confined to any specific structural element -- we can easily generate well-formed XML even when the range of words spans structural elements and indeed even if the word range does not correspond to complete sentences. This would not be true if we extracted the actual original XML corresponding to the span of word IDs.
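The element-stack idea can be sketched in a few lines of plain Java. This is an illustration of the general technique only, not the actual implementation inside MorphAdorner. The class PathXmlSketch and its methods are hypothetical. The sketch assumes each word arrives as a well-formed word path plus its token, and it writes one token per line rather than melding tokens into displayable sentences.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

//  Illustrative sketch only; this is not MorphAdorner's generateXML.
final class PathXmlSketch
{
    //  Strip the "[n]" suffix from a path step to get the tag name.
    static String tagName( String step )
    {
        int bracket = step.indexOf( '[' );
        return ( bracket < 0 ) ? step : step.substring( 0 , bracket );
    }

    //  Escape the characters that are unsafe in XML text.
    static String escape( String token )
    {
        return token.replace( "&" , "&amp;" )
                    .replace( "<" , "&lt;" )
                    .replace( ">" , "&gt;" );
    }

    //  Pop elements off the stack, emitting close tags, until only
    //  the first "keep" elements remain open.
    private static void closeDownTo(
        List<String> open , int keep , StringBuilder xml )
    {
        while ( open.size() > keep )
        {
            String step = open.remove( open.size() - 1 );
            xml.append( "</" ).append( tagName( step ) ).append( ">\n" );
        }
    }

    //  paths.get( i ) is the word path of tokens.get( i ).
    static String generate( List<String> paths , List<String> tokens )
    {
        StringBuilder xml = new StringBuilder();
        List<String> open = new ArrayList<String>();

        for ( int i = 0 ; i < paths.size() ; i++ )
        {
                    //  Drop the empty leading part, the document
                    //  name, and the final w[n] step.
            String[] parts = paths.get( i ).split( "\\\\" );
            List<String> ancestors =
                Arrays.asList( parts ).subList( 2 , parts.length - 1 );

                    //  Count how many open elements this word shares.
            int common = 0;
            while ( ( common < open.size() ) &&
                    ( common < ancestors.size() ) &&
                    open.get( common ).equals( ancestors.get( common ) ) )
            {
                common++;
            }
                    //  Close elements we have left, open new ones.
            closeDownTo( open , common , xml );
            for ( int j = common ; j < ancestors.size() ; j++ )
            {
                open.add( ancestors.get( j ) );
                xml.append( "<" )
                   .append( tagName( ancestors.get( j ) ) )
                   .append( ">\n" );
            }
            xml.append( escape( tokens.get( i ) ) ).append( "\n" );
        }
                    //  Close whatever is still open.
        closeDownTo( open , 0 , xml );
        return xml.toString();
    }

    public static void main( String[] args )
    {
        List<String> paths = Arrays.asList(
            "\\eaf434\\body[1]\\div[1]\\p[1]\\w[40]" ,
            "\\eaf434\\body[1]\\div[1]\\p[2]\\w[1]" ,
            "\\eaf434\\body[1]\\div[1]\\p[2]\\w[2]" );
        List<String> tokens =
            Arrays.asList( "aright" , "Accordingly" , "each" );
        System.out.print( generate( paths , tokens ) );
    }
}
```

The paths in main are made up for illustration and are not taken from the adorned file. Because every element that is opened is pushed on the stack and closed when it is popped, the output is well-formed no matter where the word range begins or ends.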

To get the XML representation, call the generateXML method of AdornedXMLReader, passing the starting and ending word IDs of the span for which you want the XML. The generateXML method uses the approach described above to generate well-formed XML even when the range of text specified by the word IDs spans XML structural boundaries.

    String xml  =
        xmlReader.generateXML( firstWordID , secondWordID );

Consider the span of word IDs "eaf434-02040" through "eaf434-02780". This is a "nice" range which is wholly contained within interior structural elements. The reconstituted XML follows.

    <body>
    <div>
    <p>
    His brethren of the north had now courteously
    invited him to be present on an occasion, when the concurrence of
    every eminent member of their community was peculiarly desirable.
    </p>
    <p>
    The venerable Father Ephraim sat in his easychair, not
    only hoary-headed and infirm with age, but worn down by a
    lingering disease, which, it was evident, would very soon
    transfer his patriarchal staff to other hands.
    </p>
    </div>
    </body>

Now consider the span of word IDs "eaf434-03630" through "eaf434-05250". This span begins and ends in mid-paragraph and crosses paragraph boundaries (marked by the XML <p> tag). The reconstituted XML follows.

    <body>
    <div>
    <p>
    guided my choice aright.’
    </p>
    <p>
    Accordingly, each elder looked at the two candidates
    with a most scrutinizing gaze.
    The man, whose name was Adam Colburn, had a face sunburnt with
    labor in the fields, yet intelligent, thoughtful, and traced with
    cares enough for a whole lifetime, though he had barely reached
    middle age.
    There was something severe in his aspect, and a rigidity
    throughout his person, characteristics that caused him generally
    to be taken for a schoolmaster; which vocation, in fact, he had
    formerly exercised for several years.
    The woman, Martha Pierson, was somewhat above thirty, thin and
    pale, as a Shaker sister almost invariably is, and not entirely
    free from that corpselike appearance, which the garb of the
    sisterhood is so well calculated to impart.
    </p>
    <p>
    ‘This pair are still in the summer
    </p>
    </div>
    </body>

Searching word paths

The word paths can be searched using regular expression pattern matches to do things like count words that appear in particular XML nesting structures, find all sibling words in a given paragraph, and so on.
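As a minimal sketch of such a search, the following uses java.util.regex on a list of word path strings. The class WordPathSearch and the sample paths are hypothetical and made up for illustration. In practice you would collect the paths by calling getPath on each ExtendedAdornedWord.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

//  Hypothetical helper, not part of MorphAdorner.
final class WordPathSearch
{
    //  Returns the paths that match the regular expression in full.
    static List<String> matching( List<String> paths , String regex )
    {
        Pattern pattern      = Pattern.compile( regex );
        List<String> matches = new ArrayList<String>();
        for ( String path : paths )
        {
            if ( pattern.matcher( path ).matches() )
            {
                matches.add( path );
            }
        }
        return matches;
    }

    //  Words whose immediate parent is any "p" element.
    static final String WORDS_IN_PARAGRAPHS =
        ".*\\\\p\\[\\d+\\]\\\\w\\[\\d+\\]";

    //  Sibling words under one specific parent path. Pattern.quote
    //  keeps the backslashes and brackets in the parent literal.
    static String siblingsOf( String parentPath )
    {
        return Pattern.quote( parentPath ) + "\\\\w\\[\\d+\\]";
    }

    public static void main( String[] args )
    {
                    //  Made-up sample paths, for illustration only.
        List<String> paths = Arrays.asList(
            "\\eaf434\\body[1]\\div[1]\\head[1]\\w[1]" ,
            "\\eaf434\\body[1]\\div[1]\\head[1]\\w[2]" ,
            "\\eaf434\\body[1]\\div[1]\\p[1]\\w[1]" ,
            "\\eaf434\\body[1]\\div[1]\\p[2]\\w[1]" ,
            "\\eaf434\\body[1]\\div[1]\\p[2]\\w[2]" );

        System.out.println(
            "Words in paragraphs: " +
            matching( paths , WORDS_IN_PARAGRAPHS ).size() );
        System.out.println(
            "Words in div[1], p[2]: " +
            matching(
                paths ,
                siblingsOf( "\\eaf434\\body[1]\\div[1]\\p[2]" ) ).size() );
    }
}
```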

Putting it all together

You can peruse the Java source code for UsingAnAdornedText, which puts all of the above code together in a runnable sample program. You will also find the source code in the src/edu/northwestern/at/examples/ directory of the MorphAdorner release.
