Northwestern University Information Technology | MorphAdorner V2.0
Once you have MorphAdorned a text you probably want to do something with it. The AdornedXMLReader allows you to read an adorned file and extract a list of ExtendedAdornedWord entries. In addition to the morphological information encoded in adorned files, each ExtendedAdornedWord also provides extra information including the word and sentence number, whether a word occurs in main or paratext, whether a word occurs in verse, the XML tag path, and other things. AdornedXMLReader also allows you to extract sentences easily.
We will use Nathaniel Hawthorne's short story "The Shaker Bridal" from Twice Told Tales as a sample text. The adorned XML text is found in eaf434.zip.
To load the word information from an adorned file, create an AdornedXMLReader and pass the name of the adorned file to read as a parameter. Here we load eaf434.xml which contains the adorned XML for "The Shaker Bridal."
AdornedXMLReader xmlReader = new AdornedXMLReader( "eaf434.xml" );
To extract the list of word IDs, use the getAdornedWordIDs method of AdornedXMLReader.
List<String> wordIDs =
    xmlReader.getAdornedWordIDs();
Given a word ID you can use the getExtendedAdornedWord method of AdornedXMLReader to obtain the word information as an ExtendedAdornedWord.
To extract the list of sentences, use the getSentences method of AdornedXMLReader.
List<List<ExtendedAdornedWord>> sentences =
    xmlReader.getSentences();
You can regenerate displayable sentences using the SentenceMelder class, which only requires a list of ExtendedAdornedWord entries. Here we print the first five sentences of an adorned file.
PrintStream printStream =
    new PrintStream
    (
        new BufferedOutputStream( System.out ) ,
        true ,
        "utf-8"
    );

printStream.println();
printStream.println
(
    "The first five sentences are:"
);
printStream.println();
printStream.println( StringUtils.dupl( "-" , 70 ) );

SentenceMelder melder = new SentenceMelder();

for ( int i = 0 ;
      i < Math.min( 5 , sentences.size() ) ; i++ )
{
    // Get text for this sentence.
    String sentenceText =
        melder.reconstituteSentence( sentences.get( i ) );

    // Wrap the sentence text at column 70.
    sentenceText =
        StringUtils.wrapText(
            sentenceText, Env.LINE_SEPARATOR , 70 );

    // Print wrapped sentence text.
    printStream.println
    (
        ( i + 1 ) + ": " +
        sentenceText
    );
}
Each sentence is a list of ExtendedAdornedWord entries. For example, we can extract word information for each word in the third sentence of a text (index 2, since list indices start at 0) as follows.
// Index 2 selects the third sentence.
List<ExtendedAdornedWord> sentence = sentences.get( 2 );

for ( int i = 0 ; i < sentence.size() ; i++ )
{
    ExtendedAdornedWord adornedWord = sentence.get( i );

    printStream.println( "Word " + ( i + 1 ) );
    printStream.println(
        " Word ID : " + adornedWord.getID() );
    printStream.println(
        " Token : " + adornedWord.getToken() );
    printStream.println(
        " Spelling : " + adornedWord.getSpelling() );
    printStream.println(
        " Lemmata : " + adornedWord.getLemmata() );
    printStream.println(
        " Pos tags : " +
        adornedWord.getPartsOfSpeech() );
    printStream.println(
        " Standard spelling: " +
        adornedWord.getStandardSpelling() );
    printStream.println(
        " Sentence number : " +
        adornedWord.getSentenceNumber() );
    printStream.println(
        " Word number : " +
        adornedWord.getWordNumber() );
    printStream.println(
        " XML path : " +
        adornedWord.getPath() );
    printStream.println(
        " is EOS : " +
        adornedWord.getEOS() );
    printStream.println(
        " word part flag : " +
        adornedWord.getPart() );
    printStream.println(
        " word ordinal : " +
        adornedWord.getOrd() );
    printStream.println(
        " page number : " +
        adornedWord.getPageNumber() );
    printStream.println(
        " Main or side text: " +
        adornedWord.getMainSide() );
    printStream.println(
        " is spoken : " +
        adornedWord.getSpoken() );
    printStream.println(
        " is verse : " +
        adornedWord.getVerse() );
    printStream.println(
        " in jump tag : " +
        adornedWord.getInJumpTag() );
    printStream.println(
        " is a gap : " +
        adornedWord.getGap() );
}
For example, the word information for the ninth word in the third sentence of "The Shaker Bridal" is:
 Word ID : eaf434-00440
 Token : Father
 Spelling : Father
 Lemmata : father
 Pos tags : n1
 Standard spelling: Father
 Sentence number : 3
 Word number : 9
 XML path : \eaf434\body[1]\div[1]\p[1]\w[8]
 is EOS : false
 word part flag : N
 word ordinal : 21
 page number : 8
 Main or side text: MAIN
 is spoken : false
 is verse : false
 in jump tag : false
 is a gap : false
The XML word path takes the form
\document\struct[i]\struct2[j]\struct3[k]...\w[n]
where "document" is the document name (e.g., eaf434 for "The Shaker Bridal"), the "struct[]" elements are the XML tag names with numbers assigned in order of appearance in a given document subtree, and "w[]" is the word number within the current parent structural element. The path gives a flattened version of the XML ancestry for each word.
The structure numbers start at 1 (not 0) and start over for each document subtree. For example, this means that paragraph numbers (e.g., "p" element numbers) start over for each "div".
Here is a typical word path ID:
\eaf434\body[1]\div[1]\p[1]\w[26]
In this example "eaf434" is the document name. "body[1]" is the first (and usually only) body element. div[1] corresponds to the first text division of "The Shaker Bridal" (but could be something else for another document). p[1] is paragraph 1, and w[26] is the twenty-sixth word in paragraph 1.
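As a minimal sketch of how such a path breaks down, the following standalone Java class splits a word path into the document name and its numbered steps. The class WordPathParser and its Step type are hypothetical helpers written for this illustration; they are not part of MorphAdorner, and the sketch assumes every step after the document name has the form tag[n].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a MorphAdorner class.
public class WordPathParser
{
    /** One numbered step of a word path, e.g. "p[1]". */
    public static final class Step
    {
        public final String tag;
        public final int index;

        Step( String tag , int index )
        {
            this.tag = tag;
            this.index = index;
        }
    }

    // A step is a tag name followed by a number in square brackets.
    private static final Pattern STEP =
        Pattern.compile( "(.+)\\[(\\d+)\\]" );

    /** Returns the document name, the first component of the path. */
    public static String documentName( String path )
    {
        // The path starts with a backslash, so element 0 is empty.
        return path.split( "\\\\" )[ 1 ];
    }

    /** Returns the numbered steps that follow the document name. */
    public static List<Step> steps( String path )
    {
        String[] parts = path.split( "\\\\" );
        List<Step> result = new ArrayList<Step>();

        for ( int i = 2 ; i < parts.length ; i++ )
        {
            Matcher matcher = STEP.matcher( parts[ i ] );

            if ( !matcher.matches() )
            {
                throw new IllegalArgumentException(
                    "Unexpected path step: " + parts[ i ] );
            }

            result.add(
                new Step(
                    matcher.group( 1 ) ,
                    Integer.parseInt( matcher.group( 2 ) ) ) );
        }

        return result;
    }

    public static void main( String[] args )
    {
        String path = "\\eaf434\\body[1]\\div[1]\\p[1]\\w[26]";

        System.out.println( "Document: " + documentName( path ) );

        for ( Step step : steps( path ) )
        {
            System.out.println( "  " + step.tag + " number " + step.index );
        }
    }
}
```

In a real program the path string would come from the getPath method of an ExtendedAdornedWord rather than from a literal.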
Given a list of adjacent adorned words, we can use their word paths to reconstitute an XML representation of the text for those words. We do this by using an XML element stack and pushing and popping XML elements as needed to represent the structural changes indicated by the succession of word path IDs. The XML will not match the original exactly, but is good enough for display purposes. The word range need not be confined to any specific structural element -- we can easily generate well-formed XML even when the range of words spans structural elements and indeed even if the word range does not correspond to complete sentences. This would not be true if we extracted the actual original XML corresponding to the span of word IDs.
To get the XML representation we use the generateXML method of AdornedXMLReader, passing the starting and ending word IDs for which we want the XML. The generateXML method uses the method just described to generate well-formed XML even when the range of text specified by the word IDs spans XML structural boundaries.
String xml =
    xmlReader.generateXML( firstWordID , secondWordID );
Consider the span of word IDs "eaf434-02040" through "eaf434-02780". This is a "nice" range which is wholly contained within interior structural elements. The reconstituted XML follows.
<body>
  <div>
    <p>
      His brethren of the north had now courteously invited him to be present on an occasion, when the concurrence of every eminent member of their community was peculiarly desirable.
    </p>
    <p>
      The venerable Father Ephraim sat in his easychair, not only hoary-headed and infirm with age, but worn down by a lingering disease, which, it was evident, would very soon transfer his patriarchal staff to other hands.
    </p>
  </div>
</body>
Now consider the span of word IDs "eaf434-03630" through "eaf434-05250". These word IDs run over a paragraph boundary (marked by the XML <p> tag). The reconstituted XML follows.
<body>
  <div>
    <p>
      guided my choice aright.’
    </p>
    <p>
      Accordingly, each elder looked at the two candidates with a most scrutinizing gaze. The man, whose name was Adam Colburn, had a face sunburnt with labor in the fields, yet intelligent, thoughtful, and traced with cares enough for a whole lifetime, though he had barely reached middle age. There was something severe in his aspect, and a rigidity throughout his person, characteristics that caused him generally to be taken for a schoolmaster; which vocation, in fact, he had formerly exercised for several years. The woman, Martha Pierson, was somewhat above thirty, thin and pale, as a Shaker sister almost invariably is, and not entirely free from that corpselike appearance, which the garb of the sisterhood is so well calculated to impart.
    </p>
    <p>
      ‘This pair are still in the summer
    </p>
  </div>
</body>
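The element-stack idea described earlier can be sketched in a few lines of standalone Java. This is an illustration under stated assumptions, not the actual generateXML implementation: the class WordPathXmlSketch, its method names, and the sample paths and tokens are all made up for the example, and tokens are simply joined with single spaces rather than melded into properly spaced sentences.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of stack-based XML regeneration; not MorphAdorner code.
public class WordPathXmlSketch
{
    /** Strips the "[n]" suffix from a path step such as "p[2]". */
    private static String tagName( String step )
    {
        int bracket = step.indexOf( '[' );
        return ( bracket < 0 ) ? step : step.substring( 0 , bracket );
    }

    /** Escapes the characters that are special in XML text. */
    private static String escape( String text )
    {
        return text.replace( "&" , "&amp;" )
                   .replace( "<" , "&lt;" )
                   .replace( ">" , "&gt;" );
    }

    /** Builds well-formed XML for adjacent words given their word paths. */
    public static String generateXml( String[] paths , String[] tokens )
    {
        // Currently open elements, outermost first, kept with their numbers.
        List<String> open = new ArrayList<String>();
        StringBuilder xml = new StringBuilder();

        for ( int i = 0 ; i < paths.length ; i++ )
        {
            // parts[0] is empty, parts[1] is the document name,
            // and the last part is the w[n] step for the word itself.
            String[] parts = paths[ i ].split( "\\\\" );
            int depth = parts.length - 3;

            // Count how many open elements this word still shares.
            int common = 0;
            while ( common < open.size() && common < depth &&
                    open.get( common ).equals( parts[ 2 + common ] ) )
            {
                common++;
            }

            // Pop and close elements the new word is not inside.
            for ( int j = open.size() - 1 ; j >= common ; j-- )
            {
                xml.append( "</" ).append( tagName( open.remove( j ) ) )
                   .append( "> " );
            }

            // Push and open the elements the new word requires.
            for ( int j = common ; j < depth ; j++ )
            {
                open.add( parts[ 2 + j ] );
                xml.append( "<" ).append( tagName( parts[ 2 + j ] ) )
                   .append( "> " );
            }

            xml.append( escape( tokens[ i ] ) ).append( ' ' );
        }

        // Close whatever is still open after the last word.
        for ( int j = open.size() - 1 ; j >= 0 ; j-- )
        {
            xml.append( "</" ).append( tagName( open.remove( j ) ) )
               .append( "> " );
        }

        return xml.toString().trim();
    }

    public static void main( String[] args )
    {
        // Made-up paths and tokens that cross a paragraph boundary.
        String[] paths =
        {
            "\\sample\\body[1]\\div[1]\\p[1]\\w[1]" ,
            "\\sample\\body[1]\\div[1]\\p[1]\\w[2]" ,
            "\\sample\\body[1]\\div[1]\\p[2]\\w[1]"
        };
        String[] tokens = { "Hello" , "world" , "Again" };

        System.out.println( generateXml( paths , tokens ) );
    }
}
```

Because the stack keeps the numbered steps, a change from p[1] to p[2] is detected as a new paragraph even though both map to the same p tag in the output.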
The word paths can be searched using regular expression pattern matches to do things like count words that appear in particular XML nesting structures, find all sibling words in a given paragraph, and so on.
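A minimal sketch of such searches, using only java.util.regex, is shown here. The class WordPathSearch, its methods, and the sample paths are hypothetical and exist only for this illustration; in practice the paths would be collected from the getPath method of each ExtendedAdornedWord.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a MorphAdorner class.
public class WordPathSearch
{
    /** Returns the paths that match the regular expression in full. */
    public static List<String> matching( List<String> paths , String regex )
    {
        Pattern pattern = Pattern.compile( regex );
        List<String> result = new ArrayList<String>();

        for ( String path : paths )
        {
            if ( pattern.matcher( path ).matches() )
            {
                result.add( path );
            }
        }

        return result;
    }

    /** Builds a regex matching every word with the same parent element. */
    public static String siblingRegex( String path )
    {
        // Everything before the final \w[n] step identifies the parent.
        String parent = path.substring( 0 , path.lastIndexOf( '\\' ) );

        // Quote the parent literally, then allow any word number.
        return Pattern.quote( parent ) + "\\\\w\\[\\d+\\]";
    }

    public static void main( String[] args )
    {
        // Made-up word paths for a small two-division document.
        List<String> paths = Arrays.asList(
            "\\sample\\body[1]\\div[1]\\p[1]\\w[1]" ,
            "\\sample\\body[1]\\div[1]\\p[1]\\w[2]" ,
            "\\sample\\body[1]\\div[1]\\p[2]\\w[1]" ,
            "\\sample\\body[1]\\div[2]\\p[1]\\w[1]" );

        // Words that sit directly inside any paragraph of the first div.
        String inFirstDiv =
            ".*\\\\div\\[1\\]\\\\p\\[\\d+\\]\\\\w\\[\\d+\\]";

        System.out.println(
            "Words in div[1] paragraphs: " +
            matching( paths , inFirstDiv ).size() );

        // Words that share a paragraph with the first word.
        System.out.println(
            "Siblings of first word: " +
            matching( paths , siblingRegex( paths.get( 0 ) ) ) );
    }
}
```

Backslashes and square brackets are both special in regular expressions, so each must be escaped in the pattern; Pattern.quote takes care of that for the literal parent portion.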
You can peruse the Java source code for UsingAnAdornedText, which puts all the above code together in a runnable sample program. You will also find the source code in the src/edu/northwestern/at/examples/ directory in the MorphAdorner release.
Academic Technologies and Research Services, NU Library 2East, 1970 Campus Drive, Evanston, IL 60208.