Poets that lasting marble seek,
Must carve in Latin or in Greek.
We write in sand, our language grows,
And like the tide, our work o'erflows.

-- Edmund Waller



Northwestern
MorphAdorner
  Stripping Word Attributes
 
 

StripWordAttributes creates a derived MorphAdorner XML file with word elements stripped of attributes.

Usage:

stripwordattributes input.xml output.xml output.tab [/[no]id] [/[no]trim]

where

input.xml Input MorphAdorned XML file.
output.xml Derived adorned file with word element attributes stripped.
output.tab Tab-delimited file of word element attribute values.
/id or /noid Optional parameter indicating whether the xml:id attribute should be left attached to each word (<w>) element. Default is /noid, which removes the xml:id attribute and value.
/trim or /notrim Optional parameter indicating whether whitespace should be trimmed from the start and end of each XML text line. Default is /notrim, which leaves the original whitespace intact.
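
As a minimal sketch of how the usage line above fits together, the following Python helper assembles the argument list from the three file names and the two optional switches. The helper name build_strip_command and its keyword arguments are hypothetical; only the command name, the argument order, and the /id, /noid, /trim, and /notrim switches come from the usage description.

```python
def build_strip_command(input_xml, output_xml, output_tab,
                        keep_id=False, trim=False):
    """Return the argument list for one stripwordattributes run.

    keep_id and trim are hypothetical names for the documented
    /[no]id and /[no]trim switches; both default to the documented
    defaults (/noid and /notrim).
    """
    return [
        "stripwordattributes",
        input_xml,
        output_xml,
        output_tab,
        "/id" if keep_id else "/noid",
        "/trim" if trim else "/notrim",
    ]


cmd = build_strip_command("input.xml", "output.xml", "output.tab", keep_id=True)
print(" ".join(cmd))
# prints: stripwordattributes input.xml output.xml output.tab /id /notrim
```

Assuming the stripwordattributes script is on the PATH, the list could be passed to subprocess.run(cmd, check=True). The defaults are spelled out explicitly so the chosen behavior is visible in logs.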

The derived adorned output file output.xml has all attributes stripped from each <w> element, except for xml:id when /id is specified.

The attribute values for each <w> element in input.xml are extracted and written to the tab-separated output.tab file. The attribute lines appear in the same order as the <w> elements in the XML output file. When /id is specified, the xml:id value of each <w> element in output.xml can be matched with the corresponding xml:id value in output.tab.
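
The xml:id matching described above can be sketched in Python with the standard library. This is an illustration under stated assumptions, not MorphAdorner code: the function names are hypothetical, the sample data is invented, and output.tab is assumed to be plain tab-separated text without quoting. The <w> elements are matched by local name so that the sketch works whether or not the file declares a default namespace.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def words_by_id(xml_source):
    """Map xml:id -> word text for every <w> element that carries an id."""
    result = {}
    for elem in ET.parse(xml_source).iter():
        # Compare the local name only, ignoring any namespace prefix.
        if elem.tag.rsplit("}", 1)[-1] == "w" and XML_ID in elem.attrib:
            result[elem.attrib[XML_ID]] = elem.text or ""
    return result


def attributes_by_id(tab_source):
    """Map xml:id -> row of attribute values, keyed by the header line."""
    reader = csv.DictReader(tab_source, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["xml:id"]: row for row in reader}


# Invented sample data; pos values are placeholders, not real tags.
sample_xml = '<text><s><w xml:id="w1">Poets</w> <w xml:id="w2">write</w></s></text>'
sample_tab = (
    "xml:id\teos\tlem\tord\tpart\tpos\treg\tspe\ttok\n"
    "w1\t0\tpoet\t1\tN\tx\tPoets\tPoets\tPoets\n"
    "w2\t1\twrite\t2\tN\tx\twrite\twrite\twrite\n"
)

words = words_by_id(io.StringIO(sample_xml))
attrs = attributes_by_id(io.StringIO(sample_tab))
for word_id, text in words.items():
    print(word_id, text, attrs[word_id]["lem"])
```

With real files, pass file names or open file objects instead of the StringIO wrappers; opening output.tab with newline="" and an explicit encoding is the usual practice with the csv module.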

The first line in output.tab contains the attribute names for each column. Each subsequent line describes a single word (<w>) element and contains at least the following columns. Some adorned files may add extra word attributes, resulting in more columns.

  1. xml:id -- the permanent word ID.
  2. eos -- the end of sentence flag (1 if the word ends a sentence, 0 otherwise).
  3. lem -- the lemma.
  4. ord -- the word ordinal within the text (starts at 1).
  5. part -- the word part flag: "N" for a word which is not split; "I" for the first part of a split word; "M" for the middle parts of a split word; and "F" for the final part of a split word.
  6. pos -- the part of speech.
  7. reg -- the standard spelling.
  8. spe -- the corrected original spelling.
  9. tok -- the original token.
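
As a rough illustration of working with these columns, the Python sketch below reads a tab file by header name and groups the rows into sentences using the eos flag. The function name read_sentences and the sample rows are invented for the example, and the file is assumed to be plain tab-separated text without quoting. Because rows are read by column name rather than position, extra attribute columns in some adorned files should not disturb it.

```python
import csv
import io


def read_sentences(tab_source):
    """Group rows of a word-attribute tab file into sentences via eos."""
    reader = csv.DictReader(tab_source, delimiter="\t", quoting=csv.QUOTE_NONE)
    sentences, current = [], []
    for row in reader:
        current.append(row)
        if row["eos"] == "1":  # this word ends a sentence
            sentences.append(current)
            current = []
    if current:  # keep trailing words that lack a closing eos flag
        sentences.append(current)
    return sentences


# Invented sample rows; pos values are placeholders, not real tags.
header = ["xml:id", "eos", "lem", "ord", "part", "pos", "reg", "spe", "tok"]
rows = [
    ["w1", "0", "we", "1", "N", "x", "We", "We", "We"],
    ["w2", "1", "write", "2", "N", "x", "write", "write", "write"],
    ["w3", "0", "in", "3", "N", "x", "in", "in", "in"],
    ["w4", "0", "sand", "4", "N", "x", "sand", "sand", "sand"],
]
sample = "\n".join("\t".join(r) for r in [header] + rows) + "\n"

sentences = read_sentences(io.StringIO(sample))
for sentence in sentences:
    print(" ".join(row["tok"] for row in sentence))
```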
 

Academic Technologies  NU Library 2East  1970 Campus Drive  Evanston, IL 60208
E-mail: pib@northwestern.edu
Last updated Thu Apr 02 00:04:54 2009   © 2007, 2008 Northwestern University