Introduction
Once you have MorphAdorned a text you probably want to do something
with it. The AdornedXMLReader class allows you to read an adorned file
and extract a list of ExtendedAdornedWord entries. In addition to the
morphological information encoded in adorned files, each
ExtendedAdornedWord provides extra information, including the word and
sentence number, whether a word occurs in main text or paratext,
whether a word occurs in verse, and the XML tag path. AdornedXMLReader
also allows you to extract sentences easily.
Sample text
We will use Nathaniel Hawthorne's short story "The Shaker Bridal" from
Twice Told Tales as a sample text. The adorned XML
text is found in eaf434.zip.
To load the word information from an adorned file, create an
AdornedXMLReader and pass the name of the adorned file to read
as a parameter. Here we load eaf434.xml which contains
the adorned XML for "The Shaker Bridal."
AdornedXMLReader xmlReader = new AdornedXMLReader( "eaf434.xml" );
To extract the list of word IDs, use the getAdornedWordIDs
method of AdornedXMLReader.
List<String> wordIDs =
xmlReader.getAdornedWordIDs();
Given a word ID you can use the getExtendedAdornedWord
method of AdornedXMLReader to obtain the word
information as an ExtendedAdornedWord.
To extract the list of sentences, use the getSentences
method of AdornedXMLReader.
List<List<ExtendedAdornedWord>> sentences =
xmlReader.getSentences();
Generating displayable sentences
You can regenerate displayable sentences using the
SentenceMelder class, which only requires a list of
ExtendedAdornedWord entries. Here we print the first
five sentences of an adorned file.
PrintStream printStream =
new PrintStream
(
new BufferedOutputStream( System.out ) ,
true ,
"utf-8"
);
printStream.println();
printStream.println
(
"The first five sentences are:"
);
printStream.println();
printStream.println( StringUtils.dupl( "-" , 70 ) );
SentenceMelder melder = new SentenceMelder();
for ( int i = 0 ;
i < Math.min( 5 , sentences.size() ) ; i++ )
{
// Get text for this sentence.
String sentenceText =
melder.reconstituteSentence( sentences.get( i ) );
// Wrap the sentence text at column 70.
sentenceText =
StringUtils.wrapText(
sentenceText, Env.LINE_SEPARATOR , 70 );
// Print wrapped sentence text.
printStream.println
(
( i + 1 ) + ": " +
sentenceText
);
}
Extracting individual word information
Each sentence is a list of ExtendedAdornedWord
entries. For example, we can extract word information for each word
in the third sentence of a text (index 2, since list indices start at 0)
as follows.
List<ExtendedAdornedWord> sentence = sentences.get( 2 );
for ( int i = 0 ; i < sentence.size() ; i++ )
{
ExtendedAdornedWord adornedWord = sentence.get( i );
printStream.println( "Word " + ( i + 1 ) );
printStream.println(
" Word ID : " + adornedWord.getID() );
printStream.println(
" Token : " + adornedWord.getToken() );
printStream.println(
" Spelling : " + adornedWord.getSpelling() );
printStream.println(
" Lemmata : " + adornedWord.getLemmata() );
printStream.println(
" Pos tags : " +
adornedWord.getPartsOfSpeech() );
printStream.println(
" Standard spelling: " +
adornedWord.getStandardSpelling() );
printStream.println(
" Sentence number : " +
adornedWord.getSentenceNumber() );
printStream.println(
" Word number : " +
adornedWord.getWordNumber() );
printStream.println(
" XML path : " +
adornedWord.getPath() );
printStream.println(
" is EOS : " +
adornedWord.getEOS() );
printStream.println(
" word part flag : " +
adornedWord.getPart() );
printStream.println(
" word ordinal : " +
adornedWord.getOrd() );
printStream.println(
" page number : " +
adornedWord.getPageNumber() );
printStream.println(
" Main or side text: " +
adornedWord.getMainSide() );
printStream.println(
" is spoken : " +
adornedWord.getSpoken() );
printStream.println(
" is verse : " +
adornedWord.getVerse() );
printStream.println(
" in jump tag : " +
adornedWord.getInJumpTag() );
printStream.println(
" is a gap : " +
adornedWord.getGap() );
}
For example, the word information for the ninth word in the third
sentence of "The Shaker Bridal" is:
Word ID : eaf434-00440
Token : Father
Spelling : Father
Lemmata : father
Pos tags : n1
Standard spelling: Father
Sentence number : 3
Word number : 9
XML path : \eaf434\body[1]\div[1]\p[1]\w[8]
is EOS : false
word part flag : N
word ordinal : 21
page number : 8
Main or side text: MAIN
is spoken : false
is verse : false
in jump tag : false
is a gap : false
Word Paths
The XML word path takes the form
\document\struct[i]\struct2[j]\struct3[k]...\w[n]
where "document" is the document name (e.g., eaf434 for "The Shaker Bridal"),
the "struct[]" elements are the XML tag names with numbers assigned
in order of appearance in a given document subtree, and "w[]" is the
word number within the current parent structural element. The path
gives a flattened version of the XML ancestry for each word.
The structure numbers start at 1 (not 0) and start over for each document
subtree. For example, this means that paragraph numbers (e.g., "p"
element numbers) start over for each "div".
Here is a typical word path ID:
\eaf434\body[1]\div[1]\p[1]\w[26]
In this example "eaf434" is the document name. "body[1]" is the
first (and usually only) body element. div[1] corresponds to the first
text division of "The Shaker Bridal" (but could be something else for
another document). p[1] is paragraph 1, and w[26] is
the twenty-sixth word in paragraph 1.
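To make the path layout concrete, here is a minimal sketch that splits a word path into its document name, element names, and final word number using only the standard Java library. The WordPathSteps class and its method names are hypothetical helpers written for this illustration; they are not part of MorphAdorner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not a MorphAdorner class.
class WordPathSteps
{
    // Matches one "name[number]" step, e.g. p[1] or w[26].
    private static final Pattern STEP =
        Pattern.compile( "([^\\\\\\[\\]]+)\\[(\\d+)\\]" );

    // The document name is the first segment after the leading backslash.
    static String documentName( String path )
    {
        String[] parts = path.split( "\\\\" );
        return ( parts.length > 1 ) ? parts[ 1 ] : "";
    }

    // Element names in nesting order, ending with "w".
    static List<String> elementNames( String path )
    {
        List<String> names = new ArrayList<String>();
        Matcher matcher = STEP.matcher( path );
        while ( matcher.find() )
        {
            names.add( matcher.group( 1 ) );
        }
        return names;
    }

    // Number of the final step, i.e. the n in w[n]; -1 if no step is found.
    static int lastIndex( String path )
    {
        int result = -1;
        Matcher matcher = STEP.matcher( path );
        while ( matcher.find() )
        {
            result = Integer.parseInt( matcher.group( 2 ) );
        }
        return result;
    }

    public static void main( String[] args )
    {
        String path = "\\eaf434\\body[1]\\div[1]\\p[1]\\w[26]";
        System.out.println( documentName( path ) );
        System.out.println( elementNames( path ) );
        System.out.println( lastIndex( path ) );
    }
}
```

In a real program the path string would come from the getPath method of an ExtendedAdornedWord rather than a literal.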
Generating XML
Given a list of adjacent adorned words, we can use their word paths to
reconstitute an XML representation of the text for those words. We do
this by using an XML element stack and pushing and popping XML elements
as needed to represent the structural changes indicated by the succession
of word path IDs. The XML will not match the original exactly, but is
good enough for display purposes. The word range need not be confined
to any specific structural element -- we can easily generate
well-formed XML even when the range of words spans structural elements and
indeed even if the word range does not correspond to complete sentences.
This would not be true if we extracted the actual original XML
corresponding to the span of word IDs.
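The element-stack idea can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not MorphAdorner's actual implementation: the PathXmlSketch class is hypothetical, each word is supplied as a { path , token } pair, paths are assumed to be well formed, and tokens are simply joined with single spaces rather than passed through a sentence melder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not MorphAdorner's generateXML implementation.
class PathXmlSketch
{
    // Each entry is { wordPath , token }. Paths are assumed well formed:
    // \document\step[i]\...\w[n]
    static String generate( List<String[]> words )
    {
        StringBuilder xml = new StringBuilder();
        // Steps of the currently open elements, outermost first.
        List<String> open = new ArrayList<String>();
        boolean needSpace = false;

        for ( String[] word : words )
        {
            String[] parts = word[ 0 ].split( "\\\\" );
            // Drop the empty leading part, the document name, and w[n].
            List<String> ancestors =
                Arrays.asList( parts ).subList( 2 , parts.length - 1 );

            // Count the leading steps shared with the open elements.
            int common = 0;
            while ( common < open.size() && common < ancestors.size() &&
                    open.get( common ).equals( ancestors.get( common ) ) )
            {
                common++;
            }
            // Pop: close elements this word is no longer inside.
            while ( open.size() > common )
            {
                String step = open.remove( open.size() - 1 );
                xml.append( "</" ).append( tagName( step ) ).append( ">" );
                needSpace = false;
            }
            // Push: open elements this word newly enters.
            for ( int i = common ; i < ancestors.size() ; i++ )
            {
                open.add( ancestors.get( i ) );
                xml.append( "<" ).append( tagName( ancestors.get( i ) ) ).append( ">" );
                needSpace = false;
            }
            if ( needSpace ) xml.append( ' ' );
            xml.append( escape( word[ 1 ] ) );
            needSpace = true;
        }
        // Close whatever is still open so the result is well formed.
        while ( !open.isEmpty() )
        {
            String step = open.remove( open.size() - 1 );
            xml.append( "</" ).append( tagName( step ) ).append( ">" );
        }
        return xml.toString();
    }

    // "p[2]" becomes "p".
    static String tagName( String step )
    {
        int bracket = step.indexOf( '[' );
        return ( bracket < 0 ) ? step : step.substring( 0 , bracket );
    }

    // Minimal escaping of XML special characters in a token.
    static String escape( String token )
    {
        return token.replace( "&" , "&amp;" )
                    .replace( "<" , "&lt;" )
                    .replace( ">" , "&gt;" );
    }

    public static void main( String[] args )
    {
        List<String[]> words = new ArrayList<String[]>();
        words.add( new String[]{ "\\doc\\body[1]\\div[1]\\p[1]\\w[5]" , "choice" } );
        words.add( new String[]{ "\\doc\\body[1]\\div[1]\\p[1]\\w[6]" , "aright" } );
        words.add( new String[]{ "\\doc\\body[1]\\div[1]\\p[2]\\w[1]" , "Accordingly" } );
        System.out.println( generate( words ) );
    }
}
```

Because every element that is opened is eventually closed, the output stays well formed even when the word range starts or ends in the middle of a paragraph.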
To get the XML representation, call the
generateXML method of AdornedXMLReader,
passing the starting and ending word IDs for which we want the XML.
The generateXML method uses the method just described
to generate well-formed XML even when the range of text specified by the
word IDs spans XML structural boundaries.
String xml =
xmlReader.generateXML( firstWordID , secondWordID );
Consider the span of word IDs "eaf434-02040"
through "eaf434-02780". This is a "nice" range which
is wholly contained within interior structural elements.
The reconstituted XML follows.
<body>
<div>
<p>
His brethren of the north had now courteously
invited him to be present on an occasion, when the concurrence of
every eminent member of their community was peculiarly desirable.
</p>
<p>
The venerable Father Ephraim sat in his easychair, not
only hoary-headed and infirm with age, but worn down by a
lingering disease, which, it was evident, would very soon
transfer his patriarchal staff to other hands.
</p>
</div>
</body>
Now consider the span of word IDs "eaf434-03630" through
"eaf434-05250". These word IDs run over a paragraph
boundary (marked by the XML <p> tag).
The reconstituted XML follows.
<body>
<div>
<p>
guided my choice aright.’
</p>
<p>
Accordingly, each elder looked at the two candidates
with a most scrutinizing gaze.
The man, whose name was Adam Colburn, had a face sunburnt with
labor in the fields, yet intelligent, thoughtful, and traced with
cares enough for a whole lifetime, though he had barely reached
middle age.
There was something severe in his aspect, and a rigidity
throughout his person, characteristics that caused him generally
to be taken for a schoolmaster; which vocation, in fact, he had
formerly exercised for several years.
The woman, Martha Pierson, was somewhat above thirty, thin and
pale, as a Shaker sister almost invariably is, and not entirely
free from that corpselike appearance, which the garb of the
sisterhood is so well calculated to impart.
</p>
<p>
‘This pair are still in the summer
</p>
</div>
</body>
Searching word paths
The word paths can be searched using regular expression pattern matches
to do things like count words that appear in particular XML nesting
structures, find all sibling words in a given paragraph, and so on.
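As a rough sketch of such searches, the following uses java.util.regex over a plain list of path strings. The WordPathSearch class and its methods are hypothetical helpers for illustration, not MorphAdorner APIs; in practice the paths would be collected by calling getPath on each ExtendedAdornedWord.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not a MorphAdorner class.
class WordPathSearch
{
    // Returns the paths that match the regular expression in full.
    static List<String> matching( List<String> paths , String regex )
    {
        Pattern pattern = Pattern.compile( regex );
        return paths.stream()
                    .filter( path -> pattern.matcher( path ).matches() )
                    .collect( Collectors.toList() );
    }

    // Returns all word paths with the same parent element as the given
    // path, including the given path itself.
    static List<String> siblingsOf( List<String> paths , String path )
    {
        String parent = path.substring( 0 , path.lastIndexOf( '\\' ) );
        // Quote the parent so its brackets and backslashes are literal,
        // then allow any word number under it.
        String regex = Pattern.quote( parent ) + "\\\\w\\[\\d+\\]";
        return matching( paths , regex );
    }

    public static void main( String[] args )
    {
        List<String> paths = List.of(
            "\\eaf434\\body[1]\\div[1]\\p[1]\\w[1]" ,
            "\\eaf434\\body[1]\\div[1]\\p[1]\\w[2]" ,
            "\\eaf434\\body[1]\\div[1]\\p[2]\\w[1]" ,
            "\\eaf434\\body[1]\\div[1]\\head[1]\\w[1]" );

        // Words nested anywhere inside a "p" element.
        System.out.println(
            matching( paths , ".*\\\\p\\[\\d+\\]\\\\.*" ).size() );

        // Words in the same paragraph as the first word.
        System.out.println( siblingsOf( paths , paths.get( 0 ) ) );
    }
}
```

Pattern.quote keeps the brackets and backslashes in a path from being interpreted as regular expression syntax, which avoids most escaping mistakes when a pattern is built from an existing path.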
Putting it all together
You can peruse the
Java source code for
UsingAnAdornedText, which puts all the above code
together in a runnable sample program. You will also find the source code
in the src/edu/northwestern/at/morphadorner/examples/
directory in the MorphAdorner release.