public class StripWordAttributes
extends java.lang.Object
Usage:
java edu.northwestern.at.morphadorner.tools.stripwordattributes.StripWordAttributes input.xml output.xml output.tab [/[no]id] [/[no]trim]
| Parameter | Description |
|---|---|
| input.xml | Input MorphAdorner-adorned XML file. |
| output.xml | Derived adorned file with word element attributes stripped. |
| output.tab | Tab-delimited file of word element attribute values. |
| /id or /noid | Optional parameter indicating whether xml:id should be left attached to each word (<w>) element. Default is /noid, which removes the xml:id attribute and value. |
| /trim or /notrim | Optional parameter indicating whether whitespace should be trimmed from the start and end of each XML text line. Default is /notrim, which leaves the original whitespace intact. |
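The optional switches and their defaults can be pictured with a short sketch. The class `UsageFlagsSketch` and its parsing logic are hypothetical and written only for illustration; they are not MorphAdorner's actual argument handling, and the choice to reject unknown switches is an assumption.

```java
// Hypothetical illustration only; not the actual MorphAdorner source.
class UsageFlagsSketch {
    final boolean leaveID;
    final boolean trimWhitespace;

    UsageFlagsSketch(boolean leaveID, boolean trimWhitespace) {
        this.leaveID = leaveID;
        this.trimWhitespace = trimWhitespace;
    }

    // Interprets the optional switches that follow the three file names.
    static UsageFlagsSketch parse(String[] args) {
        if (args.length < 3) {
            throw new IllegalArgumentException(
                "Expected: input.xml output.xml output.tab [/[no]id] [/[no]trim]");
        }
        boolean leaveID = false;        // documented default is /noid
        boolean trimWhitespace = false; // documented default is /notrim
        for (int i = 3; i < args.length; i++) {
            switch (args[i]) {
                case "/id":     leaveID = true;         break;
                case "/noid":   leaveID = false;        break;
                case "/trim":   trimWhitespace = true;  break;
                case "/notrim": trimWhitespace = false; break;
                default:
                    // Assumption: unknown switches are treated as errors.
                    throw new IllegalArgumentException("Unknown switch: " + args[i]);
            }
        }
        return new UsageFlagsSketch(leaveID, trimWhitespace);
    }

    public static void main(String[] args) {
        UsageFlagsSketch flags = parse(
            new String[] {"input.xml", "output.xml", "output.tab", "/id"});
        System.out.println("leaveID=" + flags.leaveID
            + " trimWhitespace=" + flags.trimWhitespace);
    }
}
```

The two booleans correspond to the `leaveID` and `trimWhitespace` arguments of the constructor documented on this page.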
The derived adorned output file "output.xml" has all attributes stripped from each <w> tag, except xml:id when /id is specified.
The attribute values for each <w> element in input.xml are extracted and written to the tab-separated output.tab file. The order of the attribute lines matches the order in which the <w> elements appear in the XML output file. When /id is used, the xml:id value of each <w> element in output.xml can be matched with the corresponding xml:id value in output.tab.
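As a rough illustration of the stripping and extraction described here, the following sketch processes a small XML string with regular expressions. `StripSketch`, its patterns, and the sample attribute names are assumptions made for this example; the actual tool may parse the file differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the actual MorphAdorner source.
class StripSketch {
    // Matches an opening <w ...> tag and captures its attribute text, if any.
    private static final Pattern W_TAG = Pattern.compile("<w(\\s[^>]*)?>");
    // Matches one name="value" pair inside the attribute text.
    private static final Pattern ATTR =
        Pattern.compile("([\\w:.-]+)\\s*=\\s*\"([^\"]*)\"");

    // Collects one attribute map per <w> element, in document order.
    static List<Map<String, String>> extract(String xml) {
        List<Map<String, String>> rows = new ArrayList<>();
        Matcher tag = W_TAG.matcher(xml);
        while (tag.find()) {
            Map<String, String> attrs = new LinkedHashMap<>();
            if (tag.group(1) != null) {
                Matcher attr = ATTR.matcher(tag.group(1));
                while (attr.find()) {
                    attrs.put(attr.group(1), attr.group(2));
                }
            }
            rows.add(attrs);
        }
        return rows;
    }

    // Removes the attributes of every <w> tag; keeps xml:id when leaveID is true.
    static String strip(String xml, boolean leaveID) {
        Matcher tag = W_TAG.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (tag.find()) {
            String replacement = "<w>";
            if (leaveID && tag.group(1) != null) {
                Matcher attr = ATTR.matcher(tag.group(1));
                while (attr.find()) {
                    if (attr.group(1).equals("xml:id")) {
                        replacement = "<w xml:id=\"" + attr.group(2) + "\">";
                    }
                }
            }
            tag.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        tag.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input; the attribute names are illustrative.
        String xml = "<w xml:id=\"a-1\" lem=\"the\">The</w> "
                   + "<w xml:id=\"a-2\" lem=\"cat\">cat</w>";
        System.out.println(strip(xml, true));
        System.out.println(extract(xml));
    }
}
```

Because both methods walk the <w> tags in the same order, the n-th extracted row belongs to the n-th <w> element of the stripped output, which is the correspondence the paragraph above relies on.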
The first line in output.tab contains the attribute names for each column. Each subsequent line contains the attribute values for a single <w> element. Some adorned files may add extra word attributes, resulting in more columns.
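Since the column set can vary between adorned files, a consumer of output.tab is probably safest looking columns up by the names in the header line rather than by position. The sketch below assumes exactly that layout; `TabReaderSketch` and the sample column names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of MorphAdorner.
class TabReaderSketch {
    // Parses tab-separated text whose first line holds the column names.
    static List<Map<String, String>> parse(String tabText) {
        String[] lines = tabText.split("\\r?\\n");
        String[] names = lines[0].split("\t", -1);
        List<Map<String, String>> rows = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                continue; // skip blank lines
            }
            // The -1 limit keeps trailing empty fields.
            String[] values = lines[i].split("\t", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < names.length; c++) {
                // Assumption: a short line is padded with empty values.
                row.put(names[c], c < values.length ? values[c] : "");
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample data; the column names are illustrative.
        String tab = "xml:id\tlem\tpos\n"
                   + "a-1\tthe\tdt\n"
                   + "a-2\tcat\tn1\n";
        for (Map<String, String> row : parse(tab)) {
            System.out.println(row.get("xml:id") + " -> " + row.get("lem"));
        }
    }
}
```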
| Modifier and Type | Field and Description |
|---|---|
| protected static java.util.Set<java.lang.String> | attrsToOmit - Attributes to omit from output attributes file. |
| protected static java.util.Map<java.lang.String,java.lang.String> | entitiesMap - Standard entities generated by MorphAdorner. |
| protected static java.util.regex.Matcher | entitiesMatcher - Entities pattern matcher. |
| protected static java.util.regex.Pattern | entitiesPattern - Entities pattern. |
| protected static java.lang.String | LINE_SEPARATOR - Line separator. |
| Constructor and Description |
|---|
| StripWordAttributes(java.lang.String inputXMLFileName, java.lang.String outputXMLFileName, java.lang.String outputTabFileName, boolean leaveID, boolean trimWhitespace) - Create a derived adorned file with word element attributes stripped and a tab-delimited file of the attribute values. |
| Modifier and Type | Method and Description |
|---|---|
| protected static java.lang.String | cleanAttributeValue(java.lang.String attrValue) - Cleans an attribute value of enclosing quotes and internal entities. |
| protected static void | displayUsage() - Display program usage. |
| protected static java.util.Map<java.lang.String,java.lang.String> | fillInMissingAttributes(java.util.Map<java.lang.String,java.lang.String> attrMap, java.lang.String wordText) - Fill in missing attribute values. |
| protected static java.util.Map<java.lang.String,java.lang.String> | getAttributes(java.lang.String attrsText, java.lang.String wordText) - Get map of attribute values for a <w> element. |
| static void | main(java.lang.String[] args) - Main program. |
| protected static void | setMissingValue(java.util.Map<java.lang.String,java.lang.String> attrMap, java.lang.String attrName, java.lang.String defaultAttrValue) - Set missing attribute value. |
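To make the role of entitiesMap, entitiesPattern, and cleanAttributeValue more concrete, here is a minimal sketch of quote stripping and entity replacement. It assumes the "standard entities" are the five predefined XML entities, which this page does not confirm; `CleanValueSketch` and its member names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the actual MorphAdorner source.
class CleanValueSketch {
    // Assumption: the entity set is the five predefined XML entities.
    static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("&amp;", "&");
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        ENTITIES.put("&apos;", "'");
    }

    static final Pattern ENTITY_PATTERN =
        Pattern.compile("&(amp|lt|gt|quot|apos);");

    // Strips one pair of enclosing quotes, then replaces known entities.
    static String clean(String attrValue) {
        String value = attrValue.trim();
        boolean doubleQuoted = value.startsWith("\"") && value.endsWith("\"");
        boolean singleQuoted = value.startsWith("'") && value.endsWith("'");
        if (value.length() >= 2 && (doubleQuoted || singleQuoted)) {
            value = value.substring(1, value.length() - 1);
        }
        // A fresh Matcher per call; Matcher instances are not thread-safe.
        Matcher m = ENTITY_PATTERN.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out,
                Matcher.quoteReplacement(ENTITIES.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("\"Tom &amp; Jerry\""));
    }
}
```

The replacement runs in a single pass, so text such as `&amp;lt;` becomes `&lt;` and is not decoded a second time.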
protected static final java.lang.String LINE_SEPARATOR
protected static java.util.Set<java.lang.String> attrsToOmit
protected static java.util.Map<java.lang.String,java.lang.String> entitiesMap
protected static java.util.regex.Pattern entitiesPattern
protected static java.util.regex.Matcher entitiesMatcher
public StripWordAttributes(java.lang.String inputXMLFileName,
java.lang.String outputXMLFileName,
java.lang.String outputTabFileName,
boolean leaveID,
boolean trimWhitespace)
inputXMLFileName - Input adorned XML file name.
outputXMLFileName - Output modified XML file name.
outputTabFileName - Output attribute values file name.
leaveID - true to leave xml:id in output.
trimWhitespace - true to trim whitespace from start and end of input XML lines.
public static void main(java.lang.String[] args)
protected static void displayUsage()
protected static java.util.Map<java.lang.String,java.lang.String> getAttributes(java.lang.String attrsText,
java.lang.String wordText)
attrsText - String containing attribute values in attr="value" form as extracted from a <w> tag.
wordText - Word text.
protected static java.lang.String cleanAttributeValue(java.lang.String attrValue)
attrValue - Input attribute value with possible enclosing quotes and internal entities.
protected static java.util.Map<java.lang.String,java.lang.String> fillInMissingAttributes(java.util.Map<java.lang.String,java.lang.String> attrMap,
java.lang.String wordText)
attrMap - Map with attribute name -> attribute value entries.
wordText - Word text.
protected static void setMissingValue(java.util.Map<java.lang.String,java.lang.String> attrMap,
java.lang.String attrName,
java.lang.String defaultAttrValue)
attrMap - Map with attribute name -> attribute value entries.
attrName - Attribute name.
defaultAttrValue - Default attribute value for attribute attrName if not present in attrMap.
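The parameter descriptions for setMissingValue and fillInMissingAttributes suggest a "put if absent" pattern. The sketch below shows one plausible reading; `MissingValueSketch` is hypothetical, and the attribute names and the choice of the word text as the default are assumptions, since this page does not say which attributes are filled in or with what.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the actual MorphAdorner source.
class MissingValueSketch {
    // Adds attrName with the default value only when it is not already present.
    static void setMissingValue(Map<String, String> attrMap,
                                String attrName,
                                String defaultAttrValue) {
        if (!attrMap.containsKey(attrName)) {
            attrMap.put(attrName, defaultAttrValue);
        }
    }

    // Assumption: absent attributes default to the word text itself.
    // The attribute names "tok" and "spe" are illustrative.
    static Map<String, String> fillInMissing(Map<String, String> attrMap,
                                             String wordText) {
        setMissingValue(attrMap, "tok", wordText);
        setMissingValue(attrMap, "spe", wordText);
        return attrMap;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("tok", "Cat");
        // "tok" is kept as-is; "spe" is added with the word text.
        System.out.println(fillInMissing(attrs, "cat"));
    }
}
```

Existing values are never overwritten, which is what keeps every row of the attribute file at the same column count without altering attributes the input already supplied.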